A statistically representative synthetic sample of 20,000 Americans. Each record is a simulated survey respondent.

people

Format

A tibble with 20,000 rows and 40 variables:

id

Sequential unique ID

fname

Random first name, see details

lname

Random last name, see details

gender

Biological sex

age

Age capped at 85

race

Race and ethnicity

edu

Educational attainment

div

Census regional division

married

Marital status

house_size

Household size

children

Has children

us_citizen

Is a US citizen

us_born

Was born in the US

house_income

Family income

emp_status

Employment status

emp_sector

Employment sector

hours_work

Hours worked per week

hours_vary

Hours vary week to week

mil

Has served in the military

house_own

Home ownership

metro

Lives in metropolitan area

internet

Household has internet access

foodstamp

Receives food stamps

house_moved

Moved in the last year

pub_contact

Contacted or visited a public official

boycott

Participated in a boycott

hood_group

Participated in a community association

hood_talks

Talked with neighbors

hood_trust

Trusts neighbors

tablet

Uses a tablet or e-reader

texting

Uses text messaging

social

Uses social media

volunteer

Volunteered

register

Is registered to vote

vote

Voted in the 2014 midterm elections

party

Political party

religion

Religious (evangelical) affiliation

ideology

Political ideology

govt

Follows government and public affairs

guns

Owns a gun

Source

“For Weighting Online Opt-In Samples, What Matters Most?” Pew Research Center, Washington, D.C. (January 26, 2018) https://www.pewresearch.org/methods/2018/01/26/for-weighting-online-opt-in-samples-what-matters-most/

Details

This dataset was originally produced by the Pew Research Center for its paper “For Weighting Online Opt-In Samples, What Matters Most?” The synthetic population dataset was created to serve as a reference for making online opt-in surveys more representative of the overall population.

See Appendix B: Synthetic population dataset for a more detailed description of the method for and rationale behind creating this dataset.

In short, the dataset was created to overcome the limitations of using large federal benchmark survey datasets such as the American Community Survey (ACS) or Current Population Survey (CPS). These surveys often do not contain the exact questions asked in online opt-in surveys, which keeps them from being used for proper adjustment.

This synthetic dataset was created by combining nine separate benchmark datasets. All nine shared a set of common demographic variables, and many added unique variables such as gun ownership or voter registration. The surveys were stratified, sampled, combined, and imputed to fill in the values missing from each. From this large dataset, only the original 20,000 records from the ACS were kept to ensure an accurate demographic distribution.

The names were randomly assigned to respondents so that the records better resemble a sample of real people. First names were taken from the babynames dataset, which contains the Social Security Administration's record of baby names from 1880 to 2017 along with sex and proportion. First names were randomly assigned in proportion to their frequency by birth year and sex. Last names were taken from the Census Bureau, which provides the 162,254 most common last names in the 2010 Census, covering over 90% of the population. For a given surname, the proportion of people with that name belonging to each race and ethnicity is provided. Last names were randomly assigned in proportion to their frequency by race.
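
The proportional assignment of first names described above can be illustrated with a minimal base R sketch. This is not the code used to build the dataset: `toy_names`, `respondents`, and `draw_name()` are hypothetical names introduced here, and the name proportions are invented. The toy table only borrows the column layout of the babynames dataset (`year`, `sex`, `name`, `prop`).

```r
# Toy stand-in for the babynames data. The columns mirror that dataset
# (year, sex, name, prop), but the values here are invented.
toy_names <- data.frame(
  year = c(1990, 1990, 1990, 1990, 1991, 1991),
  sex  = c("F", "F", "M", "M", "F", "M"),
  name = c("Jessica", "Ashley", "Michael", "Christopher", "Ashley", "Michael"),
  prop = c(0.6, 0.4, 0.7, 0.3, 1.0, 1.0),
  stringsAsFactors = FALSE
)

# Hypothetical respondents, each with a birth year and sex.
respondents <- data.frame(
  id         = 1:6,
  birth_year = c(1990, 1990, 1990, 1991, 1991, 1990),
  sex        = c("F", "M", "F", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Draw one name for a given birth year and sex, weighted by prop.
draw_name <- function(birth_year, sex, names_tbl) {
  pool <- names_tbl[names_tbl$year == birth_year & names_tbl$sex == sex, ]
  if (nrow(pool) == 0) return(NA_character_)
  # Sampling a row index avoids sample()'s special case for length-one vectors.
  pool$name[sample.int(nrow(pool), size = 1, prob = pool$prop)]
}

set.seed(1)  # make the random draws reproducible
respondents$fname <- mapply(
  draw_name, respondents$birth_year, respondents$sex,
  MoreArgs = list(names_tbl = toy_names)
)
```

Under the same assumptions, last names could be drawn the same way by filtering a surname table on race instead of birth year and sex, and weighting by that race's share of each surname.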