"Concentrations of criteria pollutants in the contiguous U.S., 1979 – 2" by Sun-Young Kim, Matthew Bechle et al.

UW Biostatistics Working Paper Series

Title

Concentrations of criteria pollutants in the contiguous U.S., 1979 – 2015: Role of model parsimony in integrated empirical geographic regression

Authors

Sun-Young Kim, University of Washington - Seattle Campus
Matthew Bechle, University of Washington
Steve Hankey, Virginia Tech
Elizabeth (Lianne) A. Sheppard, University of Washington
Adam A. Szpiro, University of Washington
Julian D. Marshall, University of Washington

Abstract

BACKGROUND: National- or regional-scale prediction models that estimate individual-level air pollution concentrations commonly include hundreds of geographic variables. However, this many variables may not be necessary: a parsimonious approach that includes only a small number of variables may achieve sufficient predictive ability. This parsimonious approach can also be applied to most criteria pollutants, and it is particularly valuable when generating publicly available datasets of model predictions that support research in environmental health and other fields.

OBJECTIVES: We aim to (1) build annual-average integrated empirical geographic (IEG) regression models for the contiguous U.S. for six criteria pollutants, for all years with regulatory monitoring data during 1979 – 2015; (2) explore the impact of model parsimony on model performance by comparing performance across the number of variables offered to a model; and (3) provide publicly available model predictions.

METHODS: We compute annual-average concentrations from regulatory monitoring data for PM10, PM2.5, NO2, SO2, CO, and ozone at all monitoring sites for 1979 – 2015. We also compute ~900 geographic characteristics at each location, including measures of traffic, land use, and satellite-based estimates of air pollution and land cover. We then develop IEG models, employing universal kriging and summary factors estimated by partial least squares (PLS) of the independent variables. For all pollutants and years, we compare three approaches for choosing variables to include in the model: (1) no variables (kriging only), (2) a limited number of variables chosen by forward selection, and (3) all variables. We evaluate model performance using 10-fold cross-validation (CV) with both conventional randomly selected and spatially clustered test data.

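
The two cross-validation designs named in METHODS differ only in how monitoring sites are assigned to the 10 folds. The sketch below is illustrative and is not the authors' code. It assumes synthetic site coordinates and concentrations, uses a simple k-means grouping as one possible way to form spatially clustered folds, substitutes a nearest-neighbor predictor for universal kriging, and assumes an MSE-based R2 truncated at zero. All function names are hypothetical.

```python
import math
import random


def random_folds(n_sites, k, rng):
    """Conventional CV: shuffle the sites and deal them into k folds."""
    order = list(range(n_sites))
    rng.shuffle(order)
    folds = [0] * n_sites
    for position, site in enumerate(order):
        folds[site] = position % k
    return folds


def clustered_folds(coords, k, rng, n_iter=20):
    """Spatially clustered CV (assumed design): basic k-means on the
    coordinates, so each test fold is one contiguous group of sites."""
    centers = rng.sample(coords, k)
    folds = [0] * len(coords)
    for _ in range(n_iter):
        # Assign every site to its nearest center.
        folds = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                 for p in coords]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, f in zip(coords, folds) if f == c]
            if members:
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return folds


def nearest_neighbor_predict(train, target):
    """Stand-in predictor (NOT universal kriging): value at the closest
    training site."""
    _, value = min(train, key=lambda site: math.dist(site[0], target))
    return value


def cv_r2(coords, values, folds):
    """Cross-validated, MSE-based R2, truncated at zero (an assumption)."""
    preds = [0.0] * len(values)
    for fold in set(folds):
        train = [(c, v) for c, v, f in zip(coords, values, folds) if f != fold]
        for i, f in enumerate(folds):
            if f == fold:
                preds[i] = nearest_neighbor_predict(train, coords[i])
    mean_obs = sum(values) / len(values)
    mse = sum((o - p) ** 2 for o, p in zip(values, preds)) / len(values)
    var = sum((o - mean_obs) ** 2 for o in values) / len(values)
    return max(0.0, 1.0 - mse / var)


# Synthetic example: 300 sites with a smooth spatial trend plus noise.
rng = random.Random(0)
coords = [(rng.random(), rng.random()) for _ in range(300)]
values = [10 + 5 * math.sin(3 * x) + 5 * math.cos(3 * y) + rng.gauss(0, 1)
          for x, y in coords]

conventional = cv_r2(coords, values, random_folds(len(coords), 10, rng))
clustered = cv_r2(coords, values, clustered_folds(coords, 10, rng))
print(f"conventional CV R2: {conventional:.2f}")
print(f"clustered CV R2:    {clustered:.2f}")
```

With clustered folds, every test site is predicted only from sites outside its own region, so the design probes how well a model extrapolates to areas without nearby monitors.
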
RESULTS: Models using 3 to 30 variables generally have the best performance across all pollutants and years (median R2 for conventional [clustered] CV: 0.66 [0.47]) compared to models with no variables (0.37 [0]) or all variables (0.64 [0.27]). Using the best models, most of which include 3 to 30 variables, we predicted annual-average concentrations of the six criteria pollutants for all Census Blocks in the contiguous U.S.

DISCUSSION: Our findings suggest that national prediction models can be built on only a small number (30 or fewer) of important variables and provide robust concentration estimates. Model estimates are freely available online.

Disciplines

Biostatistics | Environmental Public Health

Suggested Citation

Kim, Sun-Young; Bechle, Matthew; Hankey, Steve; Sheppard, Elizabeth (Lianne) A.; Szpiro, Adam A.; and Marshall, Julian D., "Concentrations of criteria pollutants in the contiguous U.S., 1979 – 2015: Role of model parsimony in integrated empirical geographic regression" (November 2018). UW Biostatistics Working Paper Series. Working Paper 425.
https://biostats.bepress.com/uwbiostat/paper425

Included in

Biostatistics Commons, Environmental Public Health Commons

Collection of Biostatistics Research Archive