BACKGROUND: National- or regional-scale prediction models that estimate individual-level air pollution concentrations commonly include hundreds of geographic variables. However, these many variables may not be necessary and parsimonious approach including small numbers of variables may achieve sufficient prediction ability. This parsimonious approach can also be applied to most criteria pollutants. This approach will be powerful when generating publicly available datasets of model predictions that support research in environmental health and other fields. OBJECTIVES: We aim to (1) build annual-average integrated empirical geographic (IEG) regression models for the contiguous U.S. for six criteria pollutants, for all years with regulatory monitoring data during 1979 – 2015; (2) explore the impact of model parsimony on model performance by comparing the model performance depending on the numbers or variables offered into a model; and (3) provide publicly available model predictions. METHODS: We compute annual-average concentrations from regulatory monitoring data for PM10, PM2.5, NO2, SO2, CO, and ozone at all monitoring sites for 1979-2015. We also compute ~900 geographic characteristics at each location including measures of traffic, land use, and satellite-based estimates of air pollution and landcover. We then develop IEG models, employing universal kriging and summary factors estimated by partial least squares (PLS) of independent variables. For all pollutants and years, we compare three approaches for choosing variables to include in the model: (1) no variables (kriging only), (2) a limited number of variables chosen by forward selection, and (3) all variables. We evaluate model performance using 10-fold cross-validation (CV) using conventional randomly-selected and spatially-clustered test data. RESULTS: Models using 3 to 30 variables generally have the best performance across all pollutants and years (median R2 conventional [clustered] CV: 0.66 [0.47]) compared to models with no (0.37 ) or all variables (0.64 [0.27]). Using the best models mostly including 3-30 variables, we predicted annual-average concentrations of six criteria pollutants for all Census Blocks in the contiguous U.S.
DISCUSSION: Our findings suggest that national prediction models can be built on only a small number (30 or fewer) of important variables and provide robust concentration estimates. Model estimates are freely available online.
Biostatistics | Environmental Public Health
Kim, Sun-Young; Bechle, Matthew; Hankey, Steve; Sheppard, Elizabeth (Lianne) A.; Szpiro, Adam A.; and Marshall, Julian D., "Concentrations of criteria pollutants in the contiguous U.S., 1979 – 2015: Role of model parsimony in integrated empirical geographic regression" (November 2018). UW Biostatistics Working Paper Series. Working Paper 425.