Understanding scientific phenomena means analyzing data, and statistics can help us decide whether what we observe is meaningful or if the results happened by chance.
Descriptive statistics are the foundation of data analysis - to the point that most of them are things you've probably seen since elementary school. They help you summarize patterns and identify variability in your data. The central tendency (often called an average) can be determined in a few different ways.
Mean: the most commonly used "average" and what most people mean when they say average. To determine the mean, add up all the values, and divide by the number of values you added. This is most useful when there are no extreme outliers.
Median: this is the middle number. It's determined by arranging the values in order and finding the middle number (if there are two middle numbers, take the mean of those two). This is best used when there are extreme outliers or when the data is skewed.
Mode: the "most" common value. This is whichever result occurs most frequently. The mode is the only measure of central tendency that works for categorical data rather than numerical data.
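As a quick sketch of these three measures, here's how they can be computed with Python's built-in statistics module, using a made-up dataset that contains one extreme outlier:

```python
import statistics

# Made-up dataset; the 100 is an extreme outlier
data = [2, 3, 3, 5, 7, 100]

mean = statistics.mean(data)      # (2 + 3 + 3 + 5 + 7 + 100) / 6 = 20
median = statistics.median(data)  # middle two of the sorted data are 3 and 5, so 4
mode = statistics.mode(data)      # 3 occurs most often

print(mean, median, mode)
```

Notice how the single outlier drags the mean (20) far above the median (4), which is exactly why the median is the better choice for skewed data.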
While mean, median, and mode are useful tools for looking at the center of a dataset, other descriptive statistics are useful to understand how spread out (or variable) the data is.
Range: the simplest way to look at spread. The range is just the maximum value minus the minimum value. It is very sensitive to outliers, but gives an easily calculated quick look at how much variability exists.
Standard Deviation: this measures the average distance between data points and the mean. It is a much more accurate look at the variation within a dataset. A low SD means there is little variation and the data is clustered around the mean, while a high SD means there is widely spread data. It can be calculated using the equation below.
As a note, standard deviation is most meaningful when the data is roughly normally distributed, and it cannot tell you about the asymmetry (skew) of the data.
s = √( Σ(x − x̅)² / (n − 1) )

If you're unfamiliar with the notations here:
s is the sample standard deviation
Σ means "take the sum"
n refers to the number of data points
x is each individual data point
x̅ is the mean
(Dividing by n − 1 is the usual formula for a sample; if you have measured the entire population, divide by n instead.)
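Here's a short Python sketch of that calculation with made-up numbers, assuming the usual sample formula (dividing by n − 1); the standard library's statistics.stdev uses the same formula, so it serves as a check:

```python
import math
import statistics

data = [4, 8, 6, 5, 3, 7]  # made-up data points

n = len(data)
xbar = sum(data) / n  # the mean, x̅ = 5.5

# Sum the squared deviations from the mean, divide by n - 1, take the square root
sd = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))

print(sd)                      # about 1.87
print(statistics.stdev(data))  # same result from the standard library
```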
When you've calculated your "sample mean" - how confident in it can you be? How much does it vary from the true population mean? The standard error of the mean (SEM) measures the uncertainty in the mean; it is calculated as SEM = s/√n, the standard deviation divided by the square root of the sample size.
SEM values are often used to draw error bars when graphing.
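A sketch of the SEM calculation, again with made-up data:

```python
import math
import statistics

data = [4, 8, 6, 5, 3, 7]  # made-up sample

s = statistics.stdev(data)      # sample standard deviation
sem = s / math.sqrt(len(data))  # standard error of the mean: s / sqrt(n)

print(sem)  # smaller than the SD: more data points means a more certain mean
```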
Chi-Square tests can be used to determine whether observed results differ significantly from expected results. There are two major types, both of which use the formula χ² = Σ (O − E)²/E (where O = the observed number and E = the expected number).
These tests work through the use of hypotheses. A null hypothesis is the "default assumption" that there is no effect/difference/relationship between your variables. An alternative hypothesis is what you think might be true instead - that there is some sort of effect/difference/relationship.
The chi-square is not a test to prove your alternative hypothesis; it is a test to see whether you have enough evidence to "reject" the null hypothesis.
This test (the goodness-of-fit test) is used to see if a population is the same as another known/predicted population. It is commonly used in genetics, for example to see whether seed color follows Mendel's predicted 3:1 ratio.
H0: There is no significant difference between the observed and expected populations.
Ha: There is a significant difference between the observed and expected populations.
This test (the test for independence) is used to determine whether two categorical variables are unrelated (independent) or related (dependent). It typically uses a contingency table.
H0: The two variables are independent.
Ha: The two variables are dependent.
If you have the expected ratio, you can use that to determine the expected numbers. Using the above example, if you're expecting a 3:1 ratio of green vs yellow seeds, you would expect 3/4 of the seeds to be green, and 1/4 of them to be yellow.
Take the total number of seeds that you observed and multiply by those fractions to determine your expected counts.
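For instance, with hypothetical counts of 400 total seeds and the expected 3:1 green-to-yellow ratio:

```python
# Hypothetical observed counts for the seed-color example (numbers are made up)
observed = {"green": 290, "yellow": 110}
total = sum(observed.values())  # 400 seeds in total

# A 3:1 ratio means 3/4 green and 1/4 yellow
expected = {"green": total * 3 / 4, "yellow": total * 1 / 4}

print(expected)  # expect 300 green and 100 yellow
```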
You'll want to make a contingency table, as can be seen above. The orange cells are column totals, the green cells are row totals, and the purple is a grand total. Use these values to determine the expected.
Expected = (row total * column total)/grand total
So, expected for male + dog would be: (164 * 197)/315 = 102.6
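The same arithmetic in Python, using the totals from that example:

```python
row_total = 164    # "male" row total from the example table
col_total = 197    # "dog" column total
grand_total = 315

# Expected count for the male + dog cell
expected = (row_total * col_total) / grand_total

print(round(expected, 1))  # 102.6
```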
Once you have both your observed and expected values, you'll need to use the equation at the top of this section. You will do the (O-E)^2/E part for each category, and then take the sum of the results. This is your calculated chi-square value.
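As a sketch of that summation, using hypothetical seed counts (290 green and 110 yellow observed, against 300 and 100 expected from a 3:1 ratio):

```python
# Hypothetical observed vs. expected counts (green and yellow seeds)
observed = [290, 110]
expected = [300, 100]

# (O - E)^2 / E for each category, then sum the results
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(chi_square)  # 100/300 + 100/100 = 1.33...
```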
Next, you will need to determine the degrees of freedom. For a goodness-of-fit test, this is the number of options/categories - 1, so if you have 4 potential phenotypes, you have 3 degrees of freedom. For a test of independence, it is (number of rows - 1) × (number of columns - 1) in the contingency table.
Finally, use the below table to determine whether your data is statistically significant. The top row of numbers contains the p-values, which give the probability that results this extreme arose from random chance alone. We typically take a p-value of .05 as the threshold for statistical significance, which is why the values under p = .05 and p = .01 are red. The value at the intersection of your calculated degrees of freedom and p = .05 is the critical value. If your chi-square value is greater than or equal to the critical value, you can reject the null hypothesis, meaning that your data is statistically significant. If that is not the case, then you fail to reject the null. Be careful with your wording here - make sure you don't say you "accept" it.
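The final comparison can be sketched like this; the p = .05 critical values below come from a standard chi-square table, and the calculated value is the hypothetical seed example:

```python
# p = .05 critical values from a standard chi-square table, keyed by degrees of freedom
critical_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

chi_square = 1.33  # hypothetical calculated value from the seed example
df = 2 - 1         # two seed-color categories, so 1 degree of freedom

if chi_square >= critical_05[df]:
    result = "reject the null hypothesis (statistically significant)"
else:
    result = "fail to reject the null hypothesis"

print(result)
```

Here 1.33 falls below the critical value of 3.841, so the observed seed counts are consistent with the expected 3:1 ratio and we fail to reject the null.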