Section 4.6 Investigation 1.15: Counting Concussions

Concussions, particularly among high school and collegiate athletes, have been given increased attention in recent years. A 2017 national Youth Risk Behavior Survey found about 15.1% of U.S. high school students reported having at least one concussion in the previous year. Prevalence was higher among those who played on a sports team.
A student project group was interested in the probability of concussion among the male soccer and football players at their university. They sent a survey to the team rosters and found that 12 of the 33 respondents reported at least one concussion in the previous two years (2017–2019). The anonymous survey was emailed to all 110 players, and included a list of injuries asking "yes or no questions about various injuries, in order to distract the athletes of the true purpose of our survey." Based on these data, is there evidence that the proportion of all 110 male soccer and football players at this university who had at least one concussion in the last 2 years is larger than 0.15?

Checkpoint 4.6.1. Identify observational units.

Checkpoint 4.6.2. Identify response variable.

Identify the response variable in this study. How are you defining a "success"?
Solution.
The response variable is whether the player had at least one concussion in the past 2 years. A "success" is having a concussion (or not having one; either is fine as long as you are consistent throughout the Investigation).

Checkpoint 4.6.3. Variable type.

Is the response variable quantitative or categorical?
  • Quantitative
  • Categorical
Solution.
This is a binary categorical variable.

Checkpoint 4.6.4. Identify sample and population.

Identify the sample, the population, and the sampling frame used in this study.
Sample:
Population:
Sampling frame:
Solution.
Sample: the 33 respondents
Population: the 110 male soccer and football players at this University
Sampling frame: The soccer and football team rosters from this University

Definition: Non-sampling Errors.

Non-sampling errors can occur even after we have a randomly selected sample. They are not associated with the sampling process, but rather with sources of bias that can arise after the sample has been selected.
Sources of non-sampling errors in surveys include biased, dishonest, or inaccurate responses by respondents due to leading word choice in survey questions, sensitive questions, faulty memory, the order in which questions appear, a leading tone, and the appearance of the interviewer.

Checkpoint 4.6.5. Identify precautions.

Identify some precautions taken by these students to avoid non-sampling errors.
Solution.
The surveys asked about various injuries, not just concussions, and the responses were anonymous.

Checkpoint 4.6.6. Examine sample data.

What proportion of students in the sample reported a concussion (rounded to 3 decimal places)?
Is this sample proportion larger than 0.15?
Hint.
Divide the number who reported a concussion by the total number of respondents. Then compare to 0.15.
Solution.
\(12/33 \approx 0.364\)
Yes, 0.364 is clearly larger than 0.15.

Checkpoint 4.6.7. Assess representativeness.

Do you think it is reasonable to consider this sample as representative of all varsity team football and soccer players at this school? Explain why or why not.
Solution.
It is unlikely to be representative of all football and soccer players at this university. The responses were voluntary and therefore this was not a random sample, and the response rate was low (0.3).
We can still ask: "If the population proportion were 0.15, would it be unlikely for a random sample from this population to produce such a large sample proportion?" This question should sound familiar, but now we need to consider that the sample did not come from a large population, which can affect our estimate of the amount of random sampling variation.

Checkpoint 4.6.8. Calculate probability for one athlete.

Suppose 15.5% of all 110 male football/soccer players at this school suffered a concussion. If we randomly select one athlete, what is the probability he had a concussion?
Hint.
How many athletes out of 110 would have had a concussion?
Solution.
Probability he had a concussion \(= 17/110 \approx 0.155\)

Checkpoint 4.6.9. Calculate conditional probabilities.

Now consider randomly selecting a second athlete from this population:
If the first athlete had a concussion, what is the probability the second athlete did as well?
If the first athlete did not have a concussion, what is the probability the second athlete did?
Hint.
If the first athlete had a concussion, how many athletes with concussions remain? How many total athletes remain?
Solution.
P(2nd has concussion | 1st has concussion) = \(16/109 \approx 0.147\)
P(2nd has concussion | 1st did not have concussion) = \(17/109 \approx 0.156\)
You should find your answers in Checkpoint 4.6.9 are not quite the same. This illustrates a violation of the independence assumption of our binomial model. However, if the population is large compared to the size of the sample, then the conditional probabilities are almost equal, and we can approximate with a binomial distribution. But when the population is not large, we need to take that lack of independence into account.
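The effect of population size on these conditional probabilities can be checked numerically. The following is a minimal Python sketch, not part of the original investigation; the helper name `second_draw_probs` and the larger population of 11,000 athletes (chosen only to keep the same 17/110 success proportion) are illustrative assumptions.

```python
from fractions import Fraction

def second_draw_probs(successes, population):
    """Hypothetical helper: when two units are drawn without replacement,
    return P(2nd is a success | 1st was a success) and
    P(2nd is a success | 1st was a failure) as exact fractions."""
    after_success = Fraction(successes - 1, population - 1)
    after_failure = Fraction(successes, population - 1)
    return after_success, after_failure

# Values from the checkpoint: 17 athletes with a concussion among 110.
small = second_draw_probs(17, 110)
# Assumed larger population with the same proportion of successes.
large = second_draw_probs(1700, 11000)

for label, (p_s, p_f) in [("N = 110", small), ("N = 11000", large)]:
    gap = float(p_f - p_s)
    print(f"{label}: {float(p_s):.4f} vs {float(p_f):.4f} (gap {gap:.5f})")
```

The gap between the two conditional probabilities is always 1/(N - 1), so it shrinks toward zero as the population grows, which is why the binomial model becomes a reasonable approximation for large populations.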

Key Result: Sampling Without Replacement.

When sampling without replacement from a finite population that is not more than 20 times the size of the sample, the dependence between the "trials" is too strong to ignore.
The consequence is that we should not use the binomial distribution to calculate \(P(X \geq 12)\text{.}\) Instead of a calculation like \(\binom{33}{12}(0.155)^{12}(1-0.155)^{21}\text{,}\) we would need a calculation like
\begin{equation*} (17/110)(16/109) \cdots + \ldots + (93/110)(17/109) \cdots \end{equation*}
for all the possible sequences of outcomes. It turns out there is a different probability distribution we can use, called the hypergeometric distribution.
To calculate hypergeometric probabilities, we need to consider another probability rule.
When outcomes are equally likely, then the probability of an event equals the number of ways for the event to happen divided by the total number of possible outcomes.
For example, of the 268 words in the Gettysburg Address, 125 contain the letter e (about 46.6%) and 143 do not. Using counting rules, there are \(\binom{268}{5} = 11,096,761,368\) different samples of 5 words that we could select from the Gettysburg Address. But if we want to select one e-word and four non-e-words, there are \(\binom{125}{1} \times \binom{143}{4} = 2,087,710,625\) such samples. Thus, if \(X\) = number of e-words in a sample of 5 words, we find \(P(\text{one e-word}) = P(X = 1) = 2,087,710,625 / 11,096,761,368 \approx 0.188\text{.}\)
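The counts in the Gettysburg Address example can be reproduced with Python's built-in `math.comb`. This is a small illustrative sketch; the variable names are assumptions made for this example only.

```python
from math import comb

total_words = 268   # words in the Gettysburg Address
e_words = 125       # words containing the letter e
non_e_words = total_words - e_words  # 143
sample_size = 5

# Number of equally likely samples of 5 words.
all_samples = comb(total_words, sample_size)
# Samples with exactly one e-word and four non-e-words.
one_e_samples = comb(e_words, 1) * comb(non_e_words, sample_size - 1)

p_one_e = one_e_samples / all_samples
print(all_samples, one_e_samples, round(p_one_e, 3))
```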

Probability Detour: Hypergeometric Random Variables.

To be a Hypergeometric random variable, a random process must have the following properties:
  • \(n\) observations are drawn without replacement from a population of \(N\) objects.
  • There are two distinct types of objects in the population, \(M\) successes and \(N - M\) failures.
The main distinction between a hypergeometric random variable and a binomial random variable is that the trials are no longer independent. The probability of drawing a success for the first object is \(M/N\text{.}\) But if we do draw a success, then the probability of success for the second object is \((M - 1)/(N - 1)\text{.}\) If we replaced the item, then we would be back to a binomial process.
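In general, the probability of obtaining \(k\) successes in a sample of \(n\) objects drawn from a population with \(M\) successes and \(N - M\) failures is
\begin{equation*} P(X = k) = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}} \end{equation*}
where \(\binom{a}{b} = \frac{a!}{b!(a-b)!}\text{.}\)
The expected value of the hypergeometric random variable is \(E(X) = (M/N) \times n\) and the standard deviation is
\begin{equation*} SD(X) = \sqrt{n \times \frac{M}{N} \times \left(1 - \frac{M}{N}\right) \times \frac{N-n}{N-1}} \end{equation*}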
(Compare these to a binomial with \(\pi = M/N\text{.}\))
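To make the comparison with the binomial concrete, here is a minimal Python sketch of the hypergeometric probability, expected value, and standard deviation, applied to the concussion setting (\(N = 110\), \(M = 17\), \(n = 33\)). The function names `hypergeom_pmf` and `hypergeom_mean_sd` are hypothetical helpers written for this illustration, not functions from the ISCAM workspace.

```python
from math import comb, sqrt

def hypergeom_pmf(k, N, M, n):
    """P(X = k) = C(M, k) * C(N - M, n - k) / C(N, n), or 0 if k is impossible."""
    if k < max(0, n - (N - M)) or k > min(n, M):
        return 0.0
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

def hypergeom_mean_sd(N, M, n):
    """Expected value n*(M/N) and SD including the factor (N - n)/(N - 1)."""
    p = M / N
    mean = n * p
    sd = sqrt(n * p * (1 - p) * (N - n) / (N - 1))
    return mean, sd

N, M, n = 110, 17, 33
mean, sd = hypergeom_mean_sd(N, M, n)

# Binomial SD with the same success probability pi = M/N, for comparison.
binom_sd = sqrt(n * (M / N) * (1 - M / N))

# Sanity checks: probabilities sum to 1 and reproduce the expected value.
pmf_total = sum(hypergeom_pmf(k, N, M, n) for k in range(n + 1))
pmf_mean = sum(k * hypergeom_pmf(k, N, M, n) for k in range(n + 1))

print(f"mean = {mean:.2f}, hypergeometric SD = {sd:.3f}, binomial SD = {binom_sd:.3f}")
```

The expected values agree, but the hypergeometric standard deviation is smaller than the binomial one because of the factor \((N - n)/(N - 1)\), which is noticeably below 1 when the sample is a large share of the population.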

Technology Detour: Calculating Hypergeometric Probabilities.

Continuing to suppose 17 of the 110 athletes had a concussion, use technology to find the probability that a random sample of 33 would have 12 or more successes.

Checkpoint 4.6.10. Calculating Hypergeometric Probabilities in R.

In R: The iscamhyperprob function takes the following inputs:
  • k, the observed value of interest (or the difference in conditional proportions, assumed if value is less than one, including negative)
  • total, the total number of observations in the population
  • succ, the overall number of successes in the population
  • n, the sample size
  • lower.tail, a Boolean which is TRUE or FALSE
For example: iscamhyperprob(k=12, total=110, succ=17, n=33, lower.tail=FALSE)
Solution.

Checkpoint 4.6.11. Calculating Hypergeometric Probabilities in JMP.

Checkpoint 4.6.12. Report p-value.

Report the p-value (4 decimal places).
Hint.
Use the hypergeometric distribution with \(N = 110\text{,}\) \(M = 17\text{,}\) \(n = 33\text{.}\) Find \(P(X \geq 12)\text{.}\)
Solution.
For \(X\) hypergeometric with \(N = 110\text{,}\) \(M = 17\text{,}\) \(n = 33\text{,}\) \(P(X \geq 12) = 0.0002\)
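For readers without R or JMP at hand, the same tail probability can be sketched in Python by summing hypergeometric probabilities directly. The function `hypergeom_upper_tail` is a hypothetical helper for this example; its argument names merely echo the inputs listed for `iscamhyperprob`, and it is not a port of that function.

```python
from math import comb

def hypergeom_upper_tail(k, total, succ, n):
    """P(X >= k) when n units are drawn without replacement from a
    population of `total` units containing `succ` successes."""
    largest = min(n, succ)  # X cannot exceed the sample size or the success count
    favorable = sum(
        comb(succ, x) * comb(total - succ, n - x)
        for x in range(k, largest + 1)
    )
    return favorable / comb(total, n)

p_value = hypergeom_upper_tail(12, total=110, succ=17, n=33)
print(round(p_value, 4))
```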

Checkpoint 4.6.13. Interpret the probability.

Write a one-sentence interpretation of the probability you calculated.
Solution.
If 15.5% of all football and soccer players at this university have had a concussion during the past two years, the chance we would find at least 12 of a random sample of 33 of these athletes reporting a concussion is 0.0002.

Checkpoint 4.6.14. Generalization to larger population.

What does the p-value in Checkpoint 4.6.12 tell you about the proportion of all U.S. collegiate football and soccer players with a concussion?
Solution.
This sample was not randomly selected from the population of all U.S. collegiate football and soccer players, so this sample really tells us nothing about the likelihood of a concussion among the larger population.

Discussion.

The hypergeometric distribution tells us, if we listed all possible samples of 33 players from this population, what proportion would be at least as extreme as the observed sample. In many sampling situations, our population size is large enough that the binomial distribution gives a very reasonable approximation to the exact p-value. For this reason, the binomial p-value is much more commonly used than the hypergeometric p-value.

Probability Detour: Binomial Approximation to Hypergeometric Distribution.

If we instead used the binomial distribution to approximate the probability of selecting exactly one e-word in a sample of 5 words from the Gettysburg Address, we would find
\(P(X = 1) = \binom{5}{1}(125/268)^1(143/268)^4 \approx 0.1890\)
Figure: Comparison of binomial and hypergeometric distributions.
These probability calculations work out to be very similar! This happens when the finite population we are sampling from is large compared to our sample size (e.g., \(N > 20 \times n\)). In that case, we can continue to use the binomial distribution to determine such probabilities, along with the expected number of successes, \(5(0.466) \approx 2.33\text{,}\) and the standard deviation, \(SD(X) \approx 1.115\) successes. And when the sample size is also large enough (a common guideline is at least 10 expected successes and 10 expected failures), we can approximate the binomial with the normal distribution.
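How much the population-to-sample ratio matters can be seen by running both calculations side by side. The sketch below, using only the Python standard library, compares the two distributions for the Gettysburg Address sample (\(N = 268\), \(n = 5\)) and for the concussion study (\(N = 110\), \(n = 33\)); the helper names are assumptions made for this illustration.

```python
from math import comb

def hypergeom_pmf(k, N, M, n):
    """Exact P(X = k) when sampling n units without replacement."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

def binom_pmf(k, n, p):
    """Binomial P(X = k), which treats the n trials as independent."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Gettysburg Address: population is more than 20 times the sample size.
hyper_words = hypergeom_pmf(1, N=268, M=125, n=5)
binom_words = binom_pmf(1, n=5, p=125 / 268)

# Concussion study: the sample is 30% of the population.
hyper_tail = sum(hypergeom_pmf(k, N=110, M=17, n=33) for k in range(12, 18))
binom_tail = sum(binom_pmf(k, n=33, p=17 / 110) for k in range(12, 34))

print(f"Gettysburg P(X = 1): hypergeometric {hyper_words:.4f}, binomial {binom_words:.4f}")
print(f"Concussion P(X >= 12): hypergeometric {hyper_tail:.5f}, binomial {binom_tail:.5f}")
```

For the Gettysburg sample the two answers agree to about three decimal places, while for the concussion study the binomial tail probability is several times larger than the exact hypergeometric value, so the approximation is not appropriate there.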

Study Conclusions.

Only about 0.02% of random samples of 33 players from a population of 110 players with 17 successes would have 12 or more successes in the sample by random chance (random sampling) alone. This survey would give us very strong evidence that the proportion of players with at least one concussion in this population was larger than 0.155, but there are many problems with this conclusion. This was not a random sample from the population of all athletes at this school (voluntary response, no follow-up) and is likely susceptible to sampling bias. Perhaps athletes who recently had a concussion would be more motivated to voluntarily complete the survey. Because of these issues with the sampling method, the p-value that we calculated is really not all that appropriate or meaningful. We should also consider that the 15% value cited by the YRBS refers to concussions in the previous year, whereas this survey asked about the previous two years. Nevertheless, getting accurate reports on concussion rates is a challenging but critical task in helping to prevent possibly debilitating injuries, especially in high-contact high school and college sports.

Subsection 4.6.1 Practice Problems 1.15

Checkpoint 4.6.15. First-year student voting preferences.

A 2004 student project asked 30 students from the 705 first-year students at their school whether they planned to vote for Bush or Kerry in the upcoming election. For half of the surveys, Kerry was listed first and for half of the surveys Bush was listed first. (Why?) They found 22 planned to vote for Kerry. Does this convince you that more than 2/3 of first-year students at this school planned to vote for Kerry?

Checkpoint 4.6.16. GSS non-sampling errors.

Recall the GSS survey in Checkpoint 4.4.10. Identify potential non-sampling errors in such a study. What are some steps the GSS has taken/could take to avoid non-sampling errors?