smell-pittsburgh-dataset-v2

Smell Pittsburgh is a web-based application that crowdsources smell reports so we all can better track how odors from pollutants travel through the air across the Pittsburgh region. More information is on the Smell Pittsburgh website.

This is the second version of the Smell Pittsburgh dataset, covering 10/31/2016 (month/day/year) to 1/23/2022 and the following zipcodes in the Pittsburgh region of Pennsylvania, USA:

  • 15006, 15007, 15014, 15015, 15017, 15018, 15020, 15024, 15025, 15028, 15030, 15031, 15032, 15034, 15035, 15037, 15044, 15045, 15046, 15047, 15049, 15051, 15056, 15064, 15065, 15071, 15075, 15076, 15082, 15084, 15086, 15088, 15090, 15091, 15095, 15096, 15101, 15102, 15104, 15106, 15108, 15110, 15112, 15116, 15120, 15122, 15123, 15126, 15127, 15129, 15131, 15132, 15133, 15134, 15135, 15136, 15137, 15139, 15140, 15142, 15143, 15144, 15145, 15146, 15147, 15148, 15201, 15202, 15203, 15204, 15205, 15206, 15207, 15208, 15209, 15210, 15211, 15212, 15213, 15214, 15215, 15216, 15217, 15218, 15219, 15220, 15221, 15222, 15223, 15224, 15225, 15226, 15227, 15228, 15229, 15230, 15231, 15232, 15233, 15234, 15235, 15236, 15237, 15238, 15239, 15240, 15241, 15242, 15243, 15244, 15250, 15251, 15252, 15253, 15254, 15255, 15257, 15258, 15259, 15260, 15261, 15262, 15264, 15265, 15267, 15268, 15270, 15272, 15274, 15275, 15276, 15277, 15278, 15279, 15281, 15282, 15283, 15286, 15289, 15290, 15295

This dataset is released under the Creative Commons Zero (CC0) license. Please feel free to use this dataset for your own research. If you find this dataset and the code useful, we would greatly appreciate it if you could cite our paper below.

Yen-Chia Hsu, Jennifer Cross, Paul Dille, Michael Tasota, Beatrice Dias, Randy Sargent, Ting-Hao (Kenneth) Huang, and Illah Nourbakhsh. 2020. Smell Pittsburgh: Engaging Community Citizen Science for Air Quality. ACM Transactions on Interactive Intelligent Systems. 10, 4, Article 32. DOI:https://doi.org/10.1145/3369397. Preprint:https://arxiv.org/pdf/1912.11936.pdf.

Keep in mind that the above paper uses only a subset of the zipcodes, listed below:

  • 15221, 15218, 15222, 15219, 15201, 15224, 15213, 15232, 15206, 15208, 15217, 15207, 15260, 15104

A similar earlier dataset (v1), covering fewer zipcodes and a shorter time range, was used for the data analysis in the above paper. The v2 dataset has not yet been analyzed, and doing so remains an open challenge.

Below are descriptions of each column in the file that contains the smell reports ("smell_raw.csv"):

  • EpochTime: the Epoch timestamp when the smell is experienced
  • skewed_latitude: the skewed latitude of the location where the smell is experienced
  • skewed_longitude: the skewed longitude of the location where the smell is experienced
  • smell_value: the self-reported rating of the smell (described on the Smell Pittsburgh website)
  • smell_description: the self-reported description of the smell (e.g., woodsmoke)
  • feelings_symptoms: the self-reported symptoms that may be caused by the source of the smell (e.g., eye irritation)
  • additional_comments: the self-provided comment to the agency that receives the smell report
  • zipcode: the zipcode of the location where the smell is experienced
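
The columns above can be read with Python's standard library. The following is a minimal sketch, not the project's own loading code. It assumes that EpochTime is a Unix timestamp in seconds, that smell_value is an integer, and that zipcodes can be compared as strings. The helper names (read_smell_reports, filter_by_zipcode) and the two sample rows are made up for illustration. The zipcode set is copied from the list of zipcodes used in the paper.

```python
import csv
import io
from datetime import datetime, timezone

# Zipcodes analyzed in the paper (copied from the list earlier in this README).
PAPER_ZIPCODES = {
    "15221", "15218", "15222", "15219", "15201", "15224", "15213",
    "15232", "15206", "15208", "15217", "15207", "15260", "15104",
}


def read_smell_reports(csv_file):
    """Parse smell reports from an open CSV file into a list of dicts."""
    reports = []
    for row in csv.DictReader(csv_file):
        # ASSUMPTION: EpochTime is a Unix timestamp in seconds.
        row["time_utc"] = datetime.fromtimestamp(
            int(row["EpochTime"]), tz=timezone.utc
        )
        # ASSUMPTION: smell_value is stored as an integer rating.
        row["smell_value"] = int(row["smell_value"])
        row["zipcode"] = row["zipcode"].strip()
        reports.append(row)
    return reports


def filter_by_zipcode(reports, zipcodes=PAPER_ZIPCODES):
    """Keep only the reports whose zipcode is in the given set."""
    return [r for r in reports if r["zipcode"] in zipcodes]


# Two made-up rows for illustration. For the real data, replace `sample` with
# open("smell_raw.csv", newline="", encoding="utf-8").
sample = io.StringIO(
    "EpochTime,skewed_latitude,skewed_longitude,smell_value,"
    "smell_description,feelings_symptoms,additional_comments,zipcode\n"
    "1477935134,40.43,-79.92,3,woodsmoke,eye irritation,,15217\n"
    "1477935200,40.35,-79.88,5,sulfur,headache,,15025\n"
)
reports = read_smell_reports(sample)
paper_subset = filter_by_zipcode(reports)
print(len(reports), len(paper_subset))
print(paper_subset[0]["time_utc"].isoformat())
```

Keeping the timestamps timezone-aware (UTC) avoids ambiguity when the reports are later aligned with sensor data. Convert to local time only for display.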

Metadata (e.g., latitude, longitude, feed ID, channel name) for the sensor monitoring stations used in this dataset (all files in the "esdr_raw" folder) can be found on the ESDR data visualization page. ESDR stands for the Environmental Sensor Data Repository, a service for hosting environmental data. The feed ID and channel name in the code for getting the sensor data correspond to the metadata on the visualization page. The sensor data is described further in the next section.

Description of the air quality sensor data

The files in the "esdr_raw" folder contain tables of air quality data from multiple monitoring stations. Every air quality monitoring station has a unique feed ID. Some stations are operated by the municipality (ACHD, the Allegheny County Health Department), and others are operated by local citizens. Every feed has several channels, for example, H2S. To find the metadata of an air quality monitoring station, go to the following website and search using the feed ID.

The above-mentioned website is a service that collects and visualizes environmental sensor measurements. The following screenshot shows the search result for feed ID 28, a monitoring station south of Pittsburgh. This station is near a major pollution source, the Clairton Mill Works, which belongs to the United States Steel Corporation. The raw data from the monitoring station is regularly published by the ACHD.

[Screenshot: module-3-data.png]

The following list shows the URLs with metadata for the available air quality and weather variables in the dataset. The variable names (i.e., column names) are provided under the corresponding feed. Note that some monitoring stations were replaced by others at some point, so some variables in the dataset combine multiple channels or feeds, as explained in the comments in the Python script for getting the data. Here is a link to the locations of all the sensor stations that are listed below. Archived location metadata can be found in the esdr_metadata.json file.

Below are explanations of the suffixes of the variable names in the above list. Online documentation of the air quality data is also available.

  • SO2_PPM: sulfur dioxide in ppm (parts per million)
  • SO2_PPB: sulfur dioxide in ppb (parts per billion)
  • H2S_PPM: hydrogen sulfide in ppm
  • SIGTHETA_DEG: standard deviation of the wind direction in degrees
  • SONICWD_DEG: wind direction (the direction from which it originates) in degrees
  • SONICWS_MPH: wind speed in mph (miles per hour)
  • CO_PPM: carbon monoxide in ppm
  • CO_PPB: carbon monoxide in ppb
  • PM10_UG_M3: particulate matter (PM10) in micrograms per cubic meter
  • PM10B_UG_M3: same as PM10_UG_M3
  • PM25_UG_M3: fine particulate matter (PM2.5) in micrograms per cubic meter
  • PM25T_UG_M3: same as PM25_UG_M3
  • PM25_640_UG_M3: same as PM25_UG_M3
  • PM2_5: same as PM25_UG_M3
  • PM25B_UG_M3: same as PM25_UG_M3
  • NO_PPB: nitric oxide in ppb
  • NO2_PPB: nitrogen dioxide in ppb
  • NOX_PPB: sum of NO and NO2 in ppb
  • NOY_PPB: sum of all oxidized atmospheric odd-nitrogen species in ppb
  • OZONE_PPM: ozone (or trioxygen) in ppm
  • OZONE: same as OZONE_PPM
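
Because several suffixes above name the same quantity, sometimes in different units, it can help to map them onto one canonical suffix before comparing stations. The sketch below is one possible way to do that and is not part of the dataset's own scripts. The rule table is derived from the list above, with the choice of ppb as the canonical unit for SO2 and CO being an assumption of this example. The function names and the variable names used in the demonstration (such as "example_feed.SO2_PPM") are hypothetical. The second function splits the wind into east and north components, using the fact that SONICWD_DEG is the direction the wind comes from.

```python
import math

# Suffix -> (canonical suffix, factor that converts to the canonical unit).
# Aliases and units follow the list above; picking ppb as the canonical unit
# for SO2 and CO is a choice made for this sketch.
SUFFIX_RULES = {
    "SO2_PPM": ("SO2_PPB", 1000.0),  # 1 ppm = 1000 ppb
    "SO2_PPB": ("SO2_PPB", 1.0),
    "CO_PPM": ("CO_PPB", 1000.0),
    "CO_PPB": ("CO_PPB", 1.0),
    "PM10_UG_M3": ("PM10_UG_M3", 1.0),
    "PM10B_UG_M3": ("PM10_UG_M3", 1.0),
    "PM25_UG_M3": ("PM25_UG_M3", 1.0),
    "PM25T_UG_M3": ("PM25_UG_M3", 1.0),
    "PM25_640_UG_M3": ("PM25_UG_M3", 1.0),
    "PM2_5": ("PM25_UG_M3", 1.0),
    "PM25B_UG_M3": ("PM25_UG_M3", 1.0),
    "OZONE_PPM": ("OZONE_PPM", 1.0),
    "OZONE": ("OZONE_PPM", 1.0),
}


def canonicalize(variable_name, value):
    """Return (name with canonical suffix, value in the canonical unit)."""
    # Try longer suffixes first so a short suffix never shadows a longer one.
    for suffix in sorted(SUFFIX_RULES, key=len, reverse=True):
        if variable_name.endswith(suffix):
            canonical, factor = SUFFIX_RULES[suffix]
            prefix = variable_name[: -len(suffix)]
            return prefix + canonical, value * factor
    # Unknown suffix (e.g., H2S_PPM or NO_PPB): leave the pair unchanged.
    return variable_name, value


def wind_components(direction_deg, speed_mph):
    """Split wind into eastward (u) and northward (v) components in mph.

    direction_deg is the direction the wind blows FROM, so the air itself
    moves the opposite way; hence the minus signs.
    """
    theta = math.radians(direction_deg)
    u = -speed_mph * math.sin(theta)
    v = -speed_mph * math.cos(theta)
    return u, v


# Hypothetical variable names, used only to demonstrate the helpers.
print(canonicalize("example_feed.SO2_PPM", 0.5))
print(canonicalize("example_feed.PM25T_UG_M3", 12.0))
print(wind_components(270.0, 10.0))  # wind from the west blows eastward
```

Averaging wind directions directly breaks down near north (for example, 359 and 1 degrees average to 180), so averaging the u and v components and converting back is usually the safer route.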