A statistically representative synthetic sample of 20,000 Americans. Each record is a simulated survey respondent.
people
A tibble with 20,000 rows and 40 variables:
Sequential unique ID
Random first name, see details
Random last name, see details
Biological sex
Age capped at 85
Race and ethnicity
Educational attainment
Census regional division
Marital status
Household size
Has children
Is a US citizen
Was born in the US
Family income
Employment status
Employment sector
Hours worked per week
Hours vary week to week
Has served in the military
Home ownership
Lives in metropolitan area
Household has internet access
Receives food stamps
Moved in the last year
Contacted or visited a public official
Participated in a community association
Talked with neighbors
Trusts neighbors
Uses a tablet or e-reader
Uses text messaging
Uses social media
Volunteered
Is registered to vote
Voted in the 2014 midterm elections
Political party
Religious (evangelical) affiliation
Political ideology
Follows government and public affairs
Owns a gun
“For Weighting Online Opt-In Samples, What Matters Most?” Pew Research Center, Washington, D.C. (January 26, 2018) https://www.pewresearch.org/methods/2018/01/26/for-weighting-online-opt-in-samples-what-matters-most/
This dataset was originally produced by the Pew Research Center for their report "For Weighting Online Opt-In Samples, What Matters Most?" The synthetic population dataset was created to serve as a reference for making online opt-in surveys more representative of the overall population.
See "Appendix B: Synthetic population dataset" in that report for a more detailed description of how and why this dataset was created.
In short, the dataset was created to overcome the limitations of using large federal benchmark survey datasets such as the American Community Survey (ACS) or Current Population Survey (CPS). These surveys often do not contain the exact questions asked in online opt-in surveys, which keeps them from being used directly for adjustment.
This synthetic dataset was created by combining nine separate benchmark datasets. All shared a common set of demographic variables, but many added unique variables such as gun ownership or voter registration. The surveys were combined, stratified, sampled, combined, and imputed to fill missing values from each. From this large dataset, only the original 20,000 ACS records were kept to ensure an accurate demographic distribution.
The names were RANDOMLY assigned to respondents to better simulate a synthetic sample of the population. First names were taken from the babynames dataset, which contains the Social Security Administration's record of baby names from 1880 to 2017 along with each name's sex and proportion. First names were assigned at random, in proportion to their frequency, by birth year and sex. Last names were taken from the Census Bureau, which provides the 162,254 most common last names in the 2010 Census, covering over 90% of the population. For a given surname, the proportion of that name belonging to members of each race and ethnicity is provided. Last names were assigned at random, in proportion to these figures, by race.
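The name assignment described above amounts to weighted random sampling within a demographic group. The following is a minimal Python sketch of that idea, not the code used to build this dataset. The lookup tables, their toy values, the race labels, and the helper names pick_first_name and pick_last_name are all hypothetical. The surname weighting (overall count times the share of holders in the respondent's race) is an assumption about how the per-surname race proportions could be turned into sampling weights.

```python
import random

# Hypothetical toy tables with made-up values; the real sources are far larger.
# (birth_year, sex) -> list of (first_name, proportion within that group)
FIRST_NAMES = {
    (1950, "Female"): [("Mary", 0.6), ("Linda", 0.4)],
    (1950, "Male"): [("James", 0.7), ("Robert", 0.3)],
}

# surname -> overall count and the share of its holders in each race group
SURNAMES = {
    "Smith": {"count": 2000, "share": {"White": 0.70, "Black": 0.25, "Hispanic": 0.05}},
    "Garcia": {"count": 1000, "share": {"Hispanic": 0.95, "White": 0.05}},
    "Nguyen": {"count": 400, "share": {"Asian": 1.0}},
}


def pick_first_name(birth_year, sex, rng):
    """Draw a first name with probability proportional to its share
    among births of the given year and sex."""
    names, proportions = zip(*FIRST_NAMES[(birth_year, sex)])
    return rng.choices(names, weights=proportions, k=1)[0]


def pick_last_name(race, rng):
    """Draw a surname with weight count * share-of-holders-in-race.
    This weighting is an assumption made for illustration."""
    names = list(SURNAMES)
    weights = [
        SURNAMES[name]["count"] * SURNAMES[name]["share"].get(race, 0.0)
        for name in names
    ]
    return rng.choices(names, weights=weights, k=1)[0]


# A seeded generator keeps the assignment reproducible.
rng = random.Random(2018)
respondent = {"birth_year": 1950, "sex": "Female", "race": "White"}
first = pick_first_name(respondent["birth_year"], respondent["sex"], rng)
last = pick_last_name(respondent["race"], rng)
print(first, last)
```

Because a surname with a zero share for a given race gets a zero weight, it is never drawn for respondents of that race, while common names remain more likely than rare ones within each group.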