Simulation studies that are carefully designed under realistic
survey conditions can be used to evaluate the quality of new
statistical methodology for Census Bureau data. Furthermore, new
computationally intensive statistical methodology is often
beneficial because it can require weaker assumptions, offer more
flexibility in sampling or modeling, accommodate complex features
in the data, and enable valid inference where other methods might
fail. Statistical modeling is at the core of both the design of
realistic simulation studies and the development of computationally
intensive statistical methods; modeling also enables one to use all
available information efficiently when producing estimates. Such
studies also benefit from software for efficient data processing.
In addition, statistical disclosure avoidance methods are developed
and their properties studied.
Research Problem:
- Systematically develop an environment for simulating complex
surveys that can be used as a test-bed for new data analysis
methods.
- Develop flexible model-based estimation methods for survey
data.
- Develop new methods for statistical disclosure control that
simultaneously protect confidential data from disclosure while
enabling valid inferences to be drawn on relevant population
parameters.
- Investigate the bootstrap for analyzing data from complex
sample surveys.
- Develop models for the analysis of measurement errors in
demographic sample surveys (e.g., the Current Population Survey or
the Survey of Income and Program Participation).
- Identify and develop statistical models (e.g., loglinear
models, mixture models, and mixed-effects models) to characterize
relationships between variables measured in censuses, sample
surveys, and administrative records.
- Investigate noise multiplication for statistical disclosure
control (a minimal sketch follows this list).
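To make the noise multiplication item concrete, the following
minimal sketch (in R) masks a variable by multiplying it by
independent lognormal noise with mean one. The noise distribution,
its spread, and the stand-in data are illustrative assumptions,
not a recommended specification.

    # Minimal sketch of noise multiplication for disclosure control.
    # Illustrative assumptions: confidential values y are positive and
    # the noise is lognormal with mean 1 and a chosen spread sigma.
    set.seed(1)
    y <- rlnorm(1000, meanlog = 10, sdlog = 1)  # stand-in confidential values
    sigma <- 0.2                                # noise spread (a choice)
    noise <- rlnorm(length(y), meanlog = -sigma^2 / 2, sdlog = sigma)  # mean 1
    y_released <- y * noise                     # masked values for release

    # Since the noise has mean 1 and is independent of y, the released
    # mean remains approximately unbiased for the confidential mean.
    c(mean(y), mean(y_released))

Because the noise distribution is published along with the masked
data, an analyst can in principle adjust inferences for the
multiplication step; that adjustment is the subject of the research
above and is not shown here.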
Potential Applications:
- Simulating data collection operations using Monte Carlo
techniques can help the Census Bureau make changes to these
operations more efficiently.
- Use noise multiplication or synthetic data as an alternative to
top coding for statistical disclosure control in publicly released
data. Both noise multiplication and synthetic data have the
potential to preserve more information in the released data than
top coding does.
- Rigorous statistical disclosure control methods allow for the
release of new microdata products.
- An environment for simulating complex surveys can be used to
evaluate the statistical properties of new methods for missing data
imputation, model-based estimation, small area estimation, and
related problems.
- Model-based estimation procedures enable efficient use of
auxiliary information (for example, Economic Census information in
business surveys), and can be applied in situations where variables
are highly skewed and sample sizes are not sufficiently large to
justify normal approximations. These methods may also be applicable
to data arising from a mechanism other than random sampling.
- Variance estimates and confidence intervals in complex surveys
can be obtained via the bootstrap (a minimal sketch follows this
list).
- Modeling approaches with administrative records can help
enhance the information obtained from various sample surveys.
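As a minimal illustration of the bootstrap item above, the sketch
below resamples a stratified simple random sample with replacement
within strata and forms a percentile confidence interval for the
population mean. The design, the stratum shares, and the data are
illustrative assumptions; production survey designs generally
require rescaling adjustments not shown here.

    # Minimal sketch: stratified bootstrap percentile CI for a mean.
    # Illustrative assumptions: stratified SRS; resampling is done
    # independently within each stratum, with replacement.
    set.seed(2)
    dat <- data.frame(
      stratum = rep(1:3, times = c(30, 50, 20)),
      y       = c(rnorm(30, 10, 2), rnorm(50, 12, 3), rnorm(20, 8, 1))
    )
    W <- c(0.3, 0.5, 0.2)  # assumed known stratum population shares

    est_mean <- function(d) sum(W * tapply(d$y, d$stratum, mean))

    boot_once <- function() {
      idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$stratum),
                           function(i) sample(i, replace = TRUE)))
      est_mean(dat[idx, ])
    }
    boots <- replicate(2000, boot_once())
    quantile(boots, c(0.025, 0.975))  # percentile interval for the mean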
Accomplishments (October 2016 - September 2017):
- Developed new methodology that uses the principle of
sufficiency to create synthetic data whose distribution is
identical to the distribution of the original data under the normal
linear regression model (a baseline sketch follows this list).
- Further developed and refined several data visualization
methods for comparing populations and determining whether there is
a statistically significant difference between pairs of population
parameters; applied the methodology to American Community Survey
data.
- Developed finite sample methodology for drawing inference based
on multiply imputed synthetic data under the multiple linear
regression model.
- Evaluated bootstrap confidence intervals for unknown population
ranks using simulation and proposed new bootstrap-based uncertainty
measures for estimated ranks.
- Applied small area estimation methodology to compute state- and
county-level estimates based on the Tobacco Use Supplement to the
Current Population Survey (an illustrative model sketch follows
this list).
- Developed an interactive application using R Shiny to visualize
high-dimensional synthetic data and associated metrics.
- Further developed methodology for modeling response propensity
using data from the National Crime Victimization Survey Field
Representatives.
- Refined, expanded, and further developed a realistic artificial
population that can now be used to simulate Monthly Wholesale Trade
Survey data over a period representative of more than four years.
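The sufficiency-based synthetic data construction in the first
accomplishment above is not reproduced here. As a hedged baseline,
the sketch below shows the simpler plug-in approach: fit a normal
linear regression to the confidential data and draw synthetic
responses from the fitted model. All data and names are
hypothetical.

    # Baseline sketch (NOT the sufficiency-based method described
    # above): plug-in synthetic responses under normal linear regression.
    set.seed(3)
    n <- 200
    x <- runif(n)
    y <- 1 + 2 * x + rnorm(n, sd = 0.5)  # stand-in confidential data

    fit <- lm(y ~ x)
    y_synth <- fitted(fit) + rnorm(n, sd = summary(fit)$sigma)

    # The pairs (x, y_synth) would be released in place of (x, y);
    # regression fits on the two data sets should be close.
    rbind(original = coef(fit), synthetic = coef(lm(y_synth ~ x)))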
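Similarly, the small area estimation accomplishment does not name a
model; one common area-level choice is the Fay-Herriot model,
sketched below with a simple moment estimator of the model
variance. The direct estimates, sampling variances, and covariates
are simulated stand-ins, and the model choice itself is an
assumption.

    # Hedged sketch of a Fay-Herriot area-level model:
    #   y_i = x_i' beta + u_i + e_i, u_i ~ N(0, A), e_i ~ N(0, D_i),
    # with the sampling variances D_i treated as known.
    set.seed(4)
    m <- 40
    x <- cbind(1, runif(m))  # area-level covariates (with intercept)
    D <- runif(m, 0.3, 1.0)  # known sampling variances
    y <- drop(x %*% c(1, 2)) + rnorm(m, sd = sqrt(0.5)) + rnorm(m, sd = sqrt(D))

    # Moment estimator of A from OLS residuals (Prasad-Rao style).
    ols <- lm.fit(x, y)
    h <- diag(x %*% solve(crossprod(x)) %*% t(x))
    A_hat <- max(0, (sum(ols$residuals^2) - sum(D * (1 - h))) / (m - ncol(x)))

    # Weighted least squares for beta, then empirical Bayes estimates
    # that shrink each direct estimate toward the regression fit.
    w <- 1 / (A_hat + D)
    beta_hat <- solve(t(x) %*% (w * x), t(x) %*% (w * y))
    gamma <- A_hat / (A_hat + D)
    theta_hat <- gamma * y + (1 - gamma) * drop(x %*% beta_hat)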
Short-Term Activities (FY 2018):
- Continue developing finite sample methodology for drawing
inference based on multiply imputed synthetic data and extend to
multivariate models.
- Evaluate properties of bootstrap-based uncertainty measures for
unknown population ranks (a minimal sketch follows this list).
- Evaluate properties of synthetic data when the data-generating,
imputation, and analysis models differ, under multivariate
models.
- Use the constructed artificial population to implement
simulation studies to evaluate properties of model-based estimation
procedures for the Monthly Wholesale Trade Survey and other similar
surveys.
- Develop and refine visualizations for synthetic data in higher
dimensions.
- Implement model selection and diagnostics for a small area
model applied to the Tobacco Use Supplement to the Current
Population Survey.
- Develop methodology for drawing inference based on singly
imputed synthetic data.
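For the bootstrap-based rank uncertainty item above, a minimal
parametric-bootstrap sketch: redraw each area's estimate from an
assumed normal sampling model, re-rank, and summarize the
distribution of each area's rank. The estimates and standard errors
are made up for illustration.

    # Minimal sketch: parametric bootstrap intervals for estimated ranks.
    # Illustrative assumption: estimate est[i] ~ Normal(truth, se[i]^2).
    set.seed(5)
    est <- c(10.2, 11.5, 9.8, 12.0, 10.9)
    se  <- c(0.6, 0.5, 0.7, 0.4, 0.6)

    B <- 5000
    rank_draws <- replicate(B, rank(rnorm(length(est), mean = est, sd = se)))
    # rank_draws is length(est) x B; each column is one re-ranking.
    apply(rank_draws, 1, quantile, probs = c(0.025, 0.975))  # 95% intervals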
Longer-Term Activities (beyond FY 2018):
- Develop methodology for analyzing singly and multiply imputed
synthetic data under various realistic scenarios.
- Develop noise infusion methods for statistical disclosure
control.
- Study ways of quantifying the privacy protection/data utility
tradeoff in statistical disclosure control.
- Develop and study bootstrap methods for sample survey
data.
- Create an environment for simulating complex aspects of
economic/demographic surveys.
- Develop bootstrap and/or other methodology for quantifying
uncertainty in statistical rankings, and refine
visualizations.