Section A.1 Data sets within the text
Each data set within the text is described in this appendix. For those data sets that are in multiple sections in a chapter, only the first section is listed in that chapter. Some contexts that are not based on a data set may not be listed in this data appendix, e.g. Chapter 3 Bayes’ Theorem lists imagined probabilities for whether a parking garage will fill up and whether there is a sporting event that same evening for an unnamed college. When a raw data set is available rather than just a description, there is a corresponding page for the data set at openintro.org/data. That webpage also includes many more data sets than are covered in this textbook, and each data set on the website includes a description, its source, a detailed overview of its variables, and download options.
1
openintro.org/data
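The data page links listed throughout this appendix all follow the pattern www.openintro.org/data/index.php?data=NAME. As a minimal sketch, the Python helper below builds such a link from a data set name. The function name `data_page_url` and the `https` scheme are assumptions made for illustration; they are not part of the textbook or the website.

```python
from urllib.parse import urlencode

# Assumption: the https scheme; the appendix lists these links without a scheme.
BASE_URL = "https://www.openintro.org/data/index.php"


def data_page_url(name: str) -> str:
    """Build the data page link for a data set name (hypothetical helper)."""
    # urlencode produces the "data=NAME" query string used by the links here.
    return f"{BASE_URL}?{urlencode({'data': name})}"


if __name__ == "__main__":
    # Data set names taken from the Chapter 1 entries in this appendix.
    for name in ["stent30", "stent365", "loan50"]:
        print(data_page_url(name))
```

The helper only constructs the link text; it does not check that a page exists for the given name.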
Subsection A.1.1 Chapter 1: Data Collection
In Section 1.1: stent30, stent365 \(\rightarrow\) The stent data is split across two data sets, one for the 0-30 day and one for the 0-365 day results. Chimowitz MI, Lynn MJ, Derdeyn CP, et al. 2011. Stenting versus Aggressive Medical Therapy for Intracranial Arterial Stenosis. New England Journal of Medicine 365:993-1003. www.nejm.org/doi/full/10.1056/NEJMoa1105335. NY Times article: www.nytimes.com/2011/09/08/health/research/08stent.html.
stent30
2
www.openintro.org/data/index.php?data=stent30
stent365
3
www.openintro.org/data/index.php?data=stent365
4
www.nejm.org/doi/full/10.1056/NEJMoa1105335
5
www.nytimes.com/2011/09/08/health/research/08stent.html
In Section 1.2: loan50, loans_full_schema \(\rightarrow\) This data comes from Lending Club (lendingclub.com), which provides a large set of data on the people who received loans through their platform. The data used in the textbook comes from a sample of the loans made in Q1 (Jan, Feb, March) 2018.
loan50
6
www.openintro.org/data/index.php?data=loan50
loans_full_schema
7
www.openintro.org/data/index.php?data=loans_full_schema
8
www.lendingclub.com/
In Section 1.2: county, county_complete \(\rightarrow\) These data come from several government sources. For those variables included in the county data set, only the most recent data is reported, as of what was available in late 2018. Data prior to 2011 is all from census.gov, where the specific Quick Facts page providing the data is no longer available. The more recent data comes from USDA (ers.usda.gov), Bureau of Labor Statistics (bls.gov/lau), SAIPE (census.gov/did/www/saipe), and American Community Survey (census.gov/programs-surveys/acs).
county
9
www.openintro.org/data/index.php?data=county
county_complete
10
www.openintro.org/data/index.php?data=county_complete
11
www.census.gov/
12
www.ers.usda.gov/data-products/county-level-data-sets/download-data/
13
www.bls.gov/lau/
14
www.census.gov/programs-surveys/saipe.html
15
www.census.gov/programs-surveys/acs/
In Section 1.4: The study we had in mind regarding chocolate and heart attack patients was Janszky et al. 2009. Chocolate consumption and mortality following a first acute myocardial infarction: the Stockholm Heart Epidemiology Program. Journal of Internal Medicine 266:3, p248-257.
16
onlinelibrary.wiley.com/doi/full/10.1111/j.1365-2796.2009.02088.x/
In Section 1.4: The Nurses’ Health Study was mentioned. For more information on this data set, see www.channing.harvard.edu/nhs
17
www.nurseshealthstudy.org/
In Section 1.5: The study we had in mind when discussing the simple randomization (no blocking) study was Anturane Reinfarction Trial Research Group. 1980. Sulfinpyrazone in the prevention of sudden death after myocardial infarction. New England Journal of Medicine 302(5):250-256.
Subsection A.1.2 Chapter 2: Summarizing Data
In Section 2.1: \(\rightarrow\) This data set is described in the data for Chapter 1.
county
18
www.openintro.org/data/index.php?data=county
In Section 2.1: email50, email \(\rightarrow\) These data represent emails sent to David Diez. Each data set includes 21 variables. The email50 data set is a random sample of 50 emails from email.
email50
19
www.openintro.org/data/index.php?data=email50
email
20
www.openintro.org/data/index.php?data=email
In Section 2.2: loan50, county \(\rightarrow\) These data sets are described in the data for Chapter 1. email50, email \(\rightarrow\) These data sets are described in the data for Section 2.1.
loan50
21
www.openintro.org/data/index.php?data=loan50
county
22
www.openintro.org/data/index.php?data=county
email50
23
www.openintro.org/data/index.php?data=email50
email
24
www.openintro.org/data/index.php?data=email
In Section 2.2: 2019 mean and median income https://data.census.gov/table/ACSST1Y2019.S1901?hidePreview=true
25
data.census.gov/table/ACSST1Y2019.S1901?hidePreview=true
In Section 2.2: possum \(\rightarrow\) The brushtail possum statistics are based on a sample of possums from Australia and New Guinea. The original source of this data is as follows: Lindenmayer DB, et al. 1995. Morphological variation among populations of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae: Marsupialia). Australian Journal of Zoology 43: 449-458.
possum
26
www.openintro.org/data/index.php?data=possum
In Section 2.3: SAT and ACT score distributions \(\rightarrow\) The SAT score data comes from the 2018 distribution, which is provided at https://reports.collegeboard.org/pdf/2018-total-group-sat-suite-assessments-annual-report.pdf#page=4&zoom=auto,-63,775. The ACT score data is available at https://www.act.org/content/dam/act/unsecured/documents/cccr2018/P_99_999999_N_S_N00_ACT-GCPR_National.pdf#page=15. We also acknowledge that the actual ACT score distribution is not nearly normal. However, since the topic is very accessible, we decided to keep the context and examples.
27
reports.collegeboard.org/pdf/2018-total-group-sat-suite-assessments-annual-report.pdf#page=4&zoom=auto,-63,775
28
www.act.org/content/dam/act/unsecured/documents/cccr2018/P_99_999999_N_S_N00_ACT-GCPR_National.pdf#page=15
In Section 2.3: \(\rightarrow\) Summary information from the NBA players for the 2018-2019 season. Data were retrieved from www.nba.com/players.
nba_players_19
29
www.openintro.org/data/index.php?data=nba_players_19
30
www.nba.com/players
In Section 2.4: \(\rightarrow\) This data set is described in the data for Chapter 1.
loans_full_schema
31
www.openintro.org/data/index.php?data=loans_full_schema
In Section 2.5: \(\rightarrow\) Lyke et al. 2017. PfSPZ vaccine induces strain-transcending T cells and durable protection against heterologous controlled human malaria infection. PNAS 114(10):2711-2716. www.pnas.org/content/114/10/2711
malaria
32
www.openintro.org/data/index.php?data=malaria
33
www.pnas.org/content/114/10/2711
Subsection A.1.3 Chapter 3: Probability
In Section 3.1: \(\rightarrow\) This data set is described in the data for Chapter 2.
email
34
www.openintro.org/data/index.php?data=email
In Section 3.1: \(\rightarrow\) A table of the 52 cards in a standard deck.
playing_cards
35
www.openintro.org/data/index.php?data=playing_cards
In Section 3.2: Machine learning on fashion. \(\rightarrow\) This is a simulated data set, not based on any specific machine learning classifier.
In Section 3.2: \(\rightarrow\) Fenner F. 1988. Smallpox and Its Eradication (History of International Public Health, No. 6). Geneva: World Health Organization. ISBN 92-4-156110-6.
smallpox
36
www.openintro.org/data/index.php?data=smallpox
In Section 3.2: \(\rightarrow\) A simulated data set based on real population summaries at nces.ed.gov/pubs2001/2001126.pdf.
family_college
37
www.openintro.org/data/index.php?data=family_college
38
nces.ed.gov/pubs2001/2001126.pdf
In Section 3.2: Mammogram screening, probabilities. \(\rightarrow\) The probabilities reported were obtained using studies reported at www.breastcancer.org and www.ncbi.nlm.nih.gov/pmc/articles/PMC1173421.
39
www.breastcancer.org/
40
www.ncbi.nlm.nih.gov/pmc/articles/PMC1173421/
In Section 3.4: \(\rightarrow\) Monthly returns for Caterpillar, Exxon Mobil Corp, and Google for November 2015 to October 2018.
stocks_18
41
www.openintro.org/data/index.php?data=stocks_18
In Section 3.5: Blood type prevalence. \(\rightarrow\) The fraction of people with O+ blood is about 38% according to https://www.redcrossblood.org/donate-blood/blood-types/o-blood-type.html. We used 35% for simplicity in the examples.
43
www.redcrossblood.org/donate-blood/blood-types/o-blood-type.html
Subsection A.1.4 Chapter 4: Distributions of random variables
In Section 4.1: Blood type prevalence. \(\rightarrow\) This data set is described in the data for Chapter 3.
In Section 4.2: run17, run17samp \(\rightarrow\) These data sets represent the full population and a sample of the runners and their run times in the 2017 Cherry Blossom Run in Washington, DC. For more details, see www.cherryblossom.org.
run17
44
www.openintro.org/data/index.php?data=run17
run17samp
45
www.openintro.org/data/index.php?data=run17samp
46
www.cherryblossom.org
In Section 4.2: \(\rightarrow\) The full data set includes poker winnings (and losses) for 50 days by a professional poker player, which represents their first 50 days trying to play for a living. Anonymity has been requested by the player.
poker
47
www.openintro.org/data/index.php?data=poker
Subsection A.1.5 Chapter 5: Foundations for inference
In Section 5.1: \(\rightarrow\) This data set is described in the data for Chapter 2.
email
48
www.openintro.org/data/index.php?data=email
In Section 5.1: pew_energy_2018 \(\rightarrow\) The actual data has more observations than were referenced in this chapter. That is, we used a subsample since it helped smooth some of the examples to have a bit more variability. The pew_energy_2018 data set represents the full data set for each of the different energy source questions, which covers solar, wind, offshore drilling, hydraulic fracturing, and nuclear energy. The statistics used to construct the data are from the following page: www.pewinternet.org/2018/05/14/majorities-see-government-efforts-to-protect-the-environment-as-insufficient/
pew_energy_2018
49
www.openintro.org/data/index.php?data=pew_energy_2018
50
www.pewresearch.org/science/2018/05/14/majorities-see-government-efforts-to-protect-the-environment-as-insufficient/
In Section 5.2: pew_energy_2018 \(\rightarrow\) This data set is described above in the data for Section 5.1.
pew_energy_2018
51
www.openintro.org/data/index.php?data=pew_energy_2018
In Section 5.2: \(\rightarrow\) In New York City on October 23rd, 2014, a doctor who had recently been treating Ebola patients in Guinea went to the hospital with a slight fever and was subsequently diagnosed with Ebola. Soon thereafter, an NBC 4 New York/The Wall Street Journal/Marist Poll found that 82% of New Yorkers favored a “mandatory 21-day quarantine for anyone who has come in contact with an Ebola patient”. This poll included responses of 1,042 New York adults between Oct 26th and 28th, 2014. Poll ID NY141026 on maristpoll.marist.edu.
ebola_survey
52
www.openintro.org/data/index.php?data=ebola_survey
53
maristpoll.marist.edu/wp-content/misc/nyspolls/NY141026/Cuomo/Complete%20NBC%204%20NY_WSJ_Marist%20Poll%20New%20York%20State%20Release%20and%20Tables_October%202014.pdf
In Section 5.3: transplant \(\rightarrow\) This is a made-up data set about the health outcomes for a hypothetical medical consultant. Note that the data set on the website has 62 patients, not 142 patients, so there will be a difference between what is covered in this book and the data set on the website.
transplant
54
www.openintro.org/data/index.php?data=transplant
In Section 5.3: Alaska residents under 5 years old. \(\rightarrow\) The 2010 statistic comes from the US census: https://data.census.gov.
55
data.census.gov/table/DECENNIALDPCD1132010.113DP1?q=alaska%20age%202010%20census&hidePreview=false
Subsection A.1.6 Chapter 6: Inference for categorical data
In Section 6.1: Supreme Court \(\rightarrow\) The Gallup organization began measuring the public’s view of the Supreme Court’s job performance in 2000, and has measured it every year since then with the question: “Do you approve or disapprove of the way the Supreme Court is handling its job?”. In 2018, the Gallup poll randomly sampled 1,033 adults in the U.S. and found that 53% of them approved. https://news.gallup.com/poll/237269/supreme-court-approval-highest-2009.aspx
56
news.gallup.com/poll/237269/supreme-court-approval-highest-2009.aspx
In Section 6.1: Life on other planets \(\rightarrow\) A February 2018 Marist Poll reported: “Many Americans (68%) think there is intelligent life on other planets”. The results were based on a random sample of 1,033 adults in the U.S. http://maristpoll.marist.edu/212-are-americans-poised-for-an-alien-invasion
57
maristpoll.marist.edu/212-are-americans-poised-for-an-alien-invasion/#sthash.VrjaqJNS.Pyp2lgqf.dpbs
In Section 6.1: Congressional approval rating. \(\rightarrow\) This survey data is from https://news.gallup.com/poll/237176/snapshot-congressional-job-approval-july.aspx
58
news.gallup.com/poll/237176/snapshot-congressional-job-approval-july.aspx
In Section 6.1: Tire inspection. \(\rightarrow\) This is a hypothetical scenario not based on real data.
In Section 6.1: Toohey poll. \(\rightarrow\) This is a hypothetical scenario not based on a real person or real data.
In Section 6.1: Support for nuclear energy. \(\rightarrow\) The results are from the following Gallup poll: https://news.gallup.com/poll/190064/first-time-majority-oppose-nuclear-energy.aspx
59
news.gallup.com/poll/190064/first-time-majority-oppose-nuclear-energy.aspx
In Section 6.2: \(\rightarrow\) Böttiger et al. Efficacy and safety of thrombolytic therapy after initially unsuccessful cardiopulmonary resuscitation: a prospective clinical trial. The Lancet, 2001.
cpr
60
www.openintro.org/data/index.php?data=cpr
In Section 6.2: \(\rightarrow\) This is a hypothetical scenario not based on real data.
gear_company
61
www.openintro.org/data/index.php?data=gear_company
In Section 6.2: \(\rightarrow\) Pew research survey on the Affordable Care Act (aka Obamacare) that ran the survey question with two variants. https://www.pewresearch.org/politics/2012/03/26/public-remains-split-on-health-care-bill-opposed-to-mandate/
healthcare_law_survey
62
www.openintro.org/data/index.php?data=healthcare_law_survey
63
www.pewresearch.org/politics/2012/03/26/public-remains-split-on-health-care-bill-opposed-to-mandate/
In Section 6.2: \(\rightarrow\) Manson JE, et al. 2018. Marine n-3 Fatty Acids and Prevention of Cardiovascular Disease and Cancer. NEJMoa1811403.
fish_oil_18
64
www.openintro.org/data/index.php?data=fish_oil_18
In Section 6.3: \(\rightarrow\) Simulated data set of registered voter proportions and representation on juries from a population.
jury
65
www.openintro.org/data/index.php?data=jury
In Section 6.3: M&Ms \(\rightarrow\) Rick Wicklin collected a sample of 712 candies, or about 1.5 pounds, and counted how many there were of each color. https://qz.com/918008/the-color-distribution-of-mms-as-determined-by-a-phd-in-statistics
66
qz.com/918008/the-color-distribution-of-mms-as-determined-by-a-phd-in-statistics/
In Section 6.4: \(\rightarrow\) Simulated (fake) data set for Google search experiment.
gsearch
67
www.openintro.org/data/index.php?data=gsearch
In Section 6.4: \(\rightarrow\) Experiment results from asking about iPods, where the original source is: Minson JA, Ruedy NE, Schweitzer ME. There is such a thing as a stupid question: Question disclosure in strategic communication. opim.wharton.upenn.edu/DPlab/papers/workingPapers/Minson working Ask%20(the%20Right%20Way)%20and%20You%20Shall%20Receive.pdf
ask
68
www.openintro.org/data/index.php?data=ask
69
www.acrwebsite.org/volumes/1012889/volumes/v40/NA-40
In Section 6.4: Obama and Congressional approval by political affiliation \(\rightarrow\) This survey was completed by Pew Research and the full results may be found at: https://www.pewresearch.org/politics/2012/03/14/romney-leads-gop-contest-trails-in-matchup-with-obama/
70
www.pewresearch.org/politics/2012/03/14/romney-leads-gop-contest-trails-in-matchup-with-obama/
In Section 6.4: Attitudes on climate change \(\rightarrow\) A Pew Research poll published in May of 2021 looks at how Americans’ attitudes about climate change differ by generation, party and other factors https://www.pewresearch.org/short-reads/2021/05/26/key-findings-how-americans-attitudes-about-climate-change-differ-by-generation-party-and-other-factors/
71
www.pewresearch.org/short-reads/2021/05/26/key-findings-how-americans-attitudes-about-climate-change-differ-by-generation-party-and-other-factors/
Subsection A.1.7 Chapter 7: Inference for numerical data
In Section 7.1: Risso’s dolphins \(\rightarrow\) Endo T and Haraguchi K. 2009. High mercury levels in hair samples from residents of Taiji, a Japanese whaling town. Marine Pollution Bulletin 60(5):743-747. Taiji was featured in the movie The Cove, and it is a significant source of dolphin and whale meat in Japan. Thousands of dolphins pass through the Taiji area annually, and we will assume these 19 dolphins represent a simple random sample from those dolphins.
In Section 7.1: Croaker white fish \(\rightarrow\) www.fda.gov/food/foodborneillnesscontaminants/metals/ucm115644.htm
72
www.fda.gov/food/metals/mercury-levels-commercial-fish-and-shellfish-1990-2012
In Section 7.1: run17samp \(\rightarrow\) This data set is described in the data for Chapter 4.
run17samp
73
www.openintro.org/data/index.php?data=run17samp
In Section 7.2: textbooks, ucla_textbooks_f18 \(\rightarrow\) Data were collected by OpenIntro staff in 2010 and again in 2018. For the 2018 sample, we sampled 201 UCLA courses. Of those, 68 required books that could be found on Amazon. The information was retrieved from the following websites: sa.ucla.edu/ro/public/soc, ucla.verbacompare.com and amazon.com.
textbooks
74
www.openintro.org/data/index.php?data=textbooks
ucla_textbooks_f18
75
www.openintro.org/data/index.php?data=ucla_textbooks_f18
76
sa.ucla.edu/ro/public/soc
77
ucla.verbacompare.com/
78
www.amazon.com/
In Section 7.2: \(\rightarrow\) This is a hypothetical (fake) data set for SAT improvement from an SAT preparation company.
sat_improve
79
www.openintro.org/data/index.php?data=sat_improve
In Section 7.3: Jennifer-John \(\rightarrow\) Moss-Racusin CA, et al. 2012. Science faculty’s subtle gender biases favor male students. PNAS 109(41):16474-16479. https://www.pnas.org/content/109/41/16474
80
www.pnas.org/content/109/41/16474
In Section 7.3: \(\rightarrow\) Bertrand M, Mullainathan S. 2004. Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. The American Economic Review 94:4 (991-1013). www.nber.org/papers/w9873
resume
81
www.openintro.org/data/index.php?data=resume
82
www.nber.org/papers/w9873
In Section 7.3: Exams variants. \(\rightarrow\) This is a simulated (fake) data set for exam performance of students for two different exam variations.
In Section 7.3: \(\rightarrow\) A random sample of 1000 NC births. A sample of that random sample was used for the example in the section.
ncbirths
83
www.openintro.org/data/index.php?data=ncbirths
In Section 7.3: \(\rightarrow\) Menard C, et al. 2005. Transplantation of cardiac-committed mouse embryonic stem cells to infarcted sheep myocardium: a preclinical study. The Lancet: 366:9490, p1005-1012. https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(05)67380-1/fulltext
stem_cell
84
www.openintro.org/data/index.php?data=stem_cell
85
www.thelancet.com/journals/lancet/article/PIIS0140-6736(05)67380-1/fulltext
Subsection A.1.8 Chapter 8: Introduction to linear regression
In Section 8.1: simulated_scatter \(\rightarrow\) Fake data used for the first three plots. The perfect linear plot uses group 4 data, where group is a variable in the data set (Figure 8.1.1). The group of 3 imperfect linear plots uses groups 1-3 (Figure 8.1.2). The sinusoidal curve uses group 5 data (Figure 8.1.3). The group of 3 scatterplots with residual plots uses groups 6-8 (Figure 8.1.13). The correlation plots use groups 9-19 data (Figure 8.1.14 and Figure 8.1.16).
simulated_scatter
86
www.openintro.org/data/index.php?data=simulated_scatter
In Section 8.1: possum \(\rightarrow\) This data set is described in the data for Chapter 2.
possum
87
www.openintro.org/data/index.php?data=possum
In Section 8.1: simulated_scatter \(\rightarrow\) The plots for things that can go wrong use groups 20-23 (Figure 8.4.1).
simulated_scatter
88
www.openintro.org/data/index.php?data=simulated_scatter
In Section 8.2: elmhurst \(\rightarrow\) These data were sampled from a table of data for all freshmen from the 2011 class at Elmhurst College that accompanied an article titled What Students Really Pay to Go to College published online by The Chronicle of Higher Education: chronicle.com/article/What-Students-Really-Pay-to-Go/131435.
elmhurst
89
www.openintro.org/data/index.php?data=elmhurst
90
www.chronicle.com/article/What-Students-Really-Pay-to-Go/131435
In Section 8.2: textbooks, ucla_textbooks_f18 \(\rightarrow\) These data sets are described in the data for Chapter 7.
textbooks
91
www.openintro.org/data/index.php?data=textbooks
ucla_textbooks_f18
92
www.openintro.org/data/index.php?data=ucla_textbooks_f18
In Section 8.2: \(\rightarrow\) This data is described in the data for Chapter 1.
loan50
93
www.openintro.org/data/index.php?data=loan50
In Section 8.2: mariokart \(\rightarrow\) Auction data from eBay (ebay.com) for the game Mario Kart for the Nintendo Wii. This data set was collected in early October 2009.
mariokart
94
www.openintro.org/data/index.php?data=mariokart
In Section 8.2: simulated_scatter \(\rightarrow\) The plots for types of outliers use groups 24-29 from Example 8.2.22.
simulated_scatter
95
www.openintro.org/data/index.php?data=simulated_scatter
In Section 8.3: county, county_complete \(\rightarrow\) These data sets are described in the data for Chapter 1.
county
96
www.openintro.org/data/index.php?data=county
county_complete
97
www.openintro.org/data/index.php?data=county_complete
In Section 8.4: \(\rightarrow\) Data was retrieved from Wikipedia.
midterms_house
98
www.openintro.org/data/index.php?data=midterms_house