readme.txt - readme file for “Spatially interpolated PM2.5 concentrations for the US from 2006-2018, version 2” dataset by Katelyn O'Dell.

############################################################################
General information
############################################################################

TITLE: Spatially interpolated PM2.5 concentrations for the US from 2006-2018, version 2.

CONTACT: Katelyn O'Dell (katelyn.odell@colostate.edu) and Jeffrey Pierce (jeffrey.pierce@colostate.edu)

RECOMMENDED CITATION: Katelyn O’Dell, Bonne Ford, Emily V. Fischer, and Jeffrey R. Pierce. Environmental Science & Technology 2019 53 (4), 1797-1804 DOI: 10.1021/acs.est.8b05430

############################################################################
Data source and structure
############################################################################

DATA SOURCE: The raw PM2.5 in situ measurement data are downloaded from the Environmental Protection Agency's (EPA) Air Quality System (https://aqs.epa.gov/aqsweb/documents/data_mart_welcome.html). We download the annual files of daily 24-hr average PM2.5 observations. We use data files 88502 and 88101, which relate to Federal Reference Method monitors and monitors that reasonably agree with the federal reference method. Note that the EPA continually updates these files with quality controls, so the data used here could differ slightly from current versions. File access date: 01.15.2021

SPATIAL DOMAIN: Contiguous United States of America

SPATIAL RESOLUTION: 15 x 15 kilometer grid

TEMPORAL DOMAIN: 2006-01-01 through 2018-12-31

TEMPORAL RESOLUTION: daily

DATA FORMAT: The data are available in netCDF format. Supplemental files displaying the data grid and leave-one-out cross validation results are in pdf format.

FILE INFORMATION: This dataset contains 18 files.
13 datafiles, 3 grid shapefiles, 1 visualization and description of the data grid, 1 document containing figures of results from the leave-one-out cross validation, and 1 readme (this document).

netCDF datafiles contain the kriged PM2.5, estimated seasonal background PM2.5, gridded smoke plume flag from the National Oceanic and Atmospheric Administration Hazard Mapping System Smoke Product, performance statistics (R2, mean bias, mean absolute error, slope, number of observations used for validation at each site), day of year, and latitudes and longitudes for the grid boxes. (Note that each year is on the same grid.) netCDF datafiles use the naming convention krigedPM25_YYYY_v2.nc.

grid-fullUS.dbf/.shp/.shx are the shapefiles for the data grid.

kriging_LOOCV_stats_figures.pdf shows maps of the leave-one-out cross validation results (R2, mean bias, mean absolute error, and slope) for each year for each site with at least 60 observations across the year.

grid_description.pdf is the visualization of the grid for the data and contains a brief description of the gridpoints available in the datafiles.

readme.txt (this document) describes the dataset.

DEFINITION OF ACRONYMS:
PM2.5 - the mass concentration of particulate matter with a diameter smaller than 2.5 micrometers

VARIABLE INFORMATION: datafiles contain the following variables:
doy - day number for the observations (1-365 or 366)
lon - longitude of center point of grid, [degrees]
lat - latitude of center point of grid, [degrees]
we_lon - longitude of middle points of west-east borders of grid cells, [degrees]
we_lat - latitude of middle points of west-east borders of grid cells, [degrees]
ns_lon - longitude of middle points of north-south borders of grid cells, [degrees]
ns_lat - latitude of middle points of north-south borders of grid cells, [degrees]
PM25 - 24hr average PM2.5 for each grid cell on each day, [ug/m3]
Background_PM25 - no smoke seasonal background PM2.5 for each grid cell on each day, [ug/m3]
HMS_Smoke - Smoke flag from the Hazard Mapping System (HMS) smoke product regridded to the kriging grid. 1 = smoke plume, 0 = no smoke plume
testing_sites_longitudes - longitudes of the EPA AQS sites used in the leave-one-out cross validation
testing_sites_latitudes - latitudes of the EPA AQS sites used in the leave-one-out cross validation
r_squared - r2 for the leave-one-out cross validation of kriging against surface observations at each monitor
mean_bias - mean bias for the leave-one-out cross validation of kriging against surface observations at each monitor, [ug/m3]
mean_absolute_error - mean absolute error for the leave-one-out cross validation of kriging against surface observations at each monitor, [ug/m3]
slope - slope of a linear regression between the leave-one-out cross validation kriging estimate at the removed monitor location and the PM2.5 values observed by the monitor
nobs - number of observations available for the kriging validation of each monitor

############################################################################
Methods and software
############################################################################

METHODS: We krige in situ 24-hour average PM2.5 observations from the EPA’s Air Quality System (AQS) monitoring network (US EPA). The AQS network contains air pollution data collected by air quality monitors across the US maintained by the EPA, state, local, and tribal agencies. These data are collected from sites using both the gravimetric and beta-attenuation techniques and both 24-hr and 1-hr sample durations. We use PM2.5 data from two sets of PM2.5 monitors that exist in the AQS database: 1) federal reference method (FRM) and federal equivalent method (FEM) sites (EPA parameter code 88101) and 2) acceptable non-FRM sites that reasonably match the FRM (EPA parameter code 88502). To krige the data we follow the methods outlined in Lassman et al. 2017*.
Kriging is an inverse-distance-weighted data interpolation method that has been recently used in air quality research. Kriging estimates values between data points by assuming a functional form for the rate of decay of the sites' spatial autocorrelation. We select a spherical semivariogram, which has been shown to work well in previous studies using kriging with air quality data. The function is fit using three parameters: nugget, sill, and range.

The parameters are determined using a k-fold cross validation with ten folds. 2,700 different combinations of parameters are tested. Values tested for each parameter are as follows: sill: 0.2, 0.4, …, 2.8, 3.0; range: 0.5, 1.5, …, 9.5, 10.0; nugget: 0.1, 0.2, …, 0.9, 1.0. The parameters are evaluated over the western US for May - October of 2015. For each set of parameters, the available monitoring sites are divided into ten unique groups (or ‘folds’). We remove one group of monitors and krige the remaining monitors to obtain a continuous estimate of PM2.5 across the domain. We evaluate the kriged estimate against the PM2.5 concentrations reported by the removed monitors for each day by calculating R2, slope, mean bias, and mean absolute error. This process is repeated for each group of monitors. We then average these statistics across the ten folds. The set of parameters used for this dataset was then selected from the 15 sets of parameters that produced an R2 in the highest 10%, slope in the highest 10%, mean bias in the lowest 10% absolute values, and mean absolute error in the lowest 10%. Using this method we select the parameters: sill = 2.6, range = 8.5, nugget = 0.1.

The no-smoke background was estimated using smoke plume information from the Hazard Mapping System (HMS)**. Each day HMS provides aerial smoke plume polygons indicating the presence of smoke somewhere in the atmospheric column.
For a given day, if there is no smoke plume above a grid cell, we call this day's grid cell PM2.5 concentration a no-smoke PM2.5. We then take a seasonal (JFM AMJ JAS OND) median of these no-smoke PM2.5 days to estimate a seasonal non-smoke PM2.5 background concentration for each grid cell.

*Lassman, W., Ford, B., Gan, R. W., Pfister, G., Magzamen, S., Fischer, E. V., & Pierce, J. R. (2017). Spatial and Temporal Estimates of Population Exposure to Wildfire Smoke during the Washington State 2012 Wildfire Season Using Blended Model, Satellite, and In-Situ Data. GeoHealth, 2017GH000049. https://doi.org/10.1002/2017GH000049

**Rolph, G. D.; Draxler, R. R.; Stein, A. F.; Taylor, A.; Ruminski, M. G.; Kondragunta, S.; Zeng, J.; Huang, H.-C.; Manikin, G.; McQueen, J. T.; et al. Description and Verification of the NOAA Smoke Forecasting System: The 2007 Fire Season. Weather Forecast. 2009, 24 (2), 361–378. https://doi.org/10.1175/2008WAF2222165.1.

SOFTWARE USED: Data were processed using python version 3.8.5, python package numpy version 1.19.2, python package pandas version 1.1.3, and python package scipy version 1.5.2. To krige the data we use the python package PyKrige's OrdinaryKriging version 1.5.1 (https://media.readthedocs.org/pdf/pykrige/latest/pykrige.pdf). Figures were made using python package cartopy version 0.18.0 and matplotlib version 3.3.2. Data were stored in the netCDF file format using the netCDF python package version 1.5.3.
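
EXAMPLE (spherical semivariogram): For readers unfamiliar with the spherical semivariogram named under METHODS, the sketch below evaluates its standard textbook form with the selected parameters (sill = 2.6, range = 8.5, nugget = 0.1). It is an illustration only and is not taken from the dataset's code. The function name is hypothetical, and treating sill as the total sill (nugget plus partial sill) is an assumption; this readme does not state that convention or the units of range.

```python
def spherical_semivariogram(h, sill=2.6, range_=8.5, nugget=0.1):
    """Semivariance at separation distance h for a spherical model.

    Assumption: `sill` is the total sill (nugget + partial sill), and h is
    in the same (unstated) units as `range_`.
    """
    if h < 0:
        raise ValueError("separation distance must be non-negative")
    if h >= range_:
        # Beyond the range the semivariance is flat at the sill.
        return sill
    ratio = h / range_
    # Rises from the nugget at h = 0 to the sill at h = range_.
    return nugget + (sill - nugget) * (1.5 * ratio - 0.5 * ratio ** 3)


if __name__ == "__main__":
    for distance in (0.0, 4.25, 8.5, 20.0):
        print(distance, spherical_semivariogram(distance))
```

Under these assumptions, monitors separated by more than the range contribute no additional spatial correlation, which is why the curve is constant beyond h = range.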
CODE AVAILABILITY: All python code used to create this data is publicly available at: https://github.com/kaodell/US_EPA_AQS_krige

############################################################################
Quality control, data limitations, recommended use
############################################################################

QUALITY CONTROL: Although the data grid extends beyond the borders of the contiguous US, grid cells outside the US should not be included in any analysis. There are no, or very few, observations outside the US included in the kriging, so estimating PM2.5 there with the kriging methods would involve extrapolation and would be unreliable.

DATA LIMITATIONS:
1) These data contain several instances of negative PM25. This is unphysical. Negative values could come from the EPA AQS observations or as a product of the kriging methods. 0.2% of the data points across the full dataset are negative, and some fall below -10 µg m-3 (~23,000 data points, or 0.008% of all data points). As a result, the estimated non-smoke PM25 can also be negative. It is recommended that these points be removed or set to 0 when calculating smoke PM25.
2) In estimating the non-smoke PM25 background, we use the already kriged data for all monitors and subset smoke and non-smoke days in the kriged dataset. This allows a smoke-impacted monitor to influence a non-smoke day in a nearby location and artificially increase the non-smoke PM25 background. However, since we calculate the background as the median, not the mean, of non-smoke days, this impact is partially mitigated.

RECOMMENDED USE: If using this dataset to estimate wildland-fire smoke PM2.5, we recommend subtracting the non-smoke PM2.5 background from the total PM2.5 (simply PM2.5) and multiplying this by the HMS smoke flag. This will give an estimate of the daily PM2.5 enhancement due to smoke on days with a smoke plume and a 0 µg m-3 PM2.5 enhancement on days without a smoke plume.
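
EXAMPLE (smoke PM2.5): The recommended calculation can be sketched as below for a single grid cell on a single day. The arguments correspond to the PM25, Background_PM25, and HMS_Smoke variables; the function name is hypothetical. Setting negative inputs to 0 follows one of the two options given under DATA LIMITATIONS (the other is to remove them). How to treat a negative difference on a smoke day is not specified in this readme, so the sketch leaves it unchanged.

```python
def smoke_pm25(pm25, background_pm25, hms_smoke):
    """Estimate smoke PM2.5 [ug/m3] for one grid cell on one day.

    pm25            -- total kriged PM2.5 (PM25 variable)
    background_pm25 -- seasonal no-smoke background (Background_PM25 variable)
    hms_smoke       -- HMS smoke flag, 1 = smoke plume, 0 = no smoke plume
    """
    # Assumption: unphysical negative values are set to 0 rather than removed.
    total = max(pm25, 0.0)
    background = max(background_pm25, 0.0)
    # Multiplying by the flag gives 0 on days without a smoke plume.
    return (total - background) * hms_smoke


if __name__ == "__main__":
    print(smoke_pm25(20.0, 5.0, 1))  # smoke day
    print(smoke_pm25(20.0, 5.0, 0))  # no-smoke day
```

The same arithmetic applies element by element when the three variables are read from a datafile as arrays.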
############################################################################
Updates from previous version
############################################################################

Previous versions of these data are available at:
https://dx.doi.org/10.25675/10217/193258 for years 2006-2015
https://doi.org/10.25675/10217/208602 for years 2016-2018

This updated version of the data contains the following changes:
1) Many observations were double-counted in the previous version of this data for years 2006-2015. This has been resolved. There was not a large change in the data overall; in most years the 95th percentile of the daily grid-cell level differences was less than 10%.
2) Several EPA AQS monitors that always had an event flag ‘included’ for all observations had been removed. This was anywhere from 5-21 monitors each year of ~1500-2000 total monitors. This difference did alter kriging estimates at the location of those monitors, but overall the change was much smaller than that from the double-counting of the monitors.
3) Daily average concentrations with less than 25% of the hourly data available were removed from the kriging input in this version.
4) The most recent EPA AQS daily-average PM2.5 files were used. These files were downloaded on 01.15.21.
5) The seasons used to estimate the seasonal no-smoke background PM2.5 concentrations were changed from [DJF, MAM, JJA, SON] to [JFM, AMJ, JAS, OND].
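
EXAMPLE (season grouping): The JFM/AMJ/JAS/OND grouping in item 5 can be derived from the doy variable. The sketch below maps a day of year to its season and takes the median of no-smoke days per season for one grid cell. Function names are hypothetical, and handling a single year at a time is an assumption; this readme does not say whether the medians are pooled across years.

```python
from datetime import date, timedelta
from statistics import median

SEASONS = ("JFM", "AMJ", "JAS", "OND")


def season_from_doy(year, doy):
    """Return the season label for a 1-based day of year."""
    day = date(year, 1, 1) + timedelta(days=doy - 1)
    # Months 1-3 -> JFM, 4-6 -> AMJ, 7-9 -> JAS, 10-12 -> OND.
    return SEASONS[(day.month - 1) // 3]


def seasonal_background(year, doys, pm25, hms_smoke):
    """Median PM2.5 of no-smoke days per season for one grid cell.

    Seasons with no no-smoke days are returned as None.
    """
    by_season = {season: [] for season in SEASONS}
    for doy, value, flag in zip(doys, pm25, hms_smoke):
        if flag == 0:  # keep only days without an HMS smoke plume
            by_season[season_from_doy(year, doy)].append(value)
    return {s: (median(v) if v else None) for s, v in by_season.items()}


if __name__ == "__main__":
    print(season_from_doy(2018, 91))
    print(seasonal_background(2018, [1, 2, 3, 4],
                              [4.0, 10.0, 6.0, 50.0], [0, 0, 0, 1]))
```

Using the calendar date rather than fixed day-of-year cutoffs keeps the season boundaries correct in leap years.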