Reading

Consider the dataset shown below (Fig. 1.1), adapted from Rauscher, Shaw, and Ky’s (1993) “Mozart effect” study. In this study, students’ spatial reasoning IQ was measured after listening to ten minutes of Mozart, a relaxation tape, or silence. It looks as though spatial reasoning was highest after listening to Mozart, but was this effect significant? How would you test the significance of the differences among these conditions? You could do a t-test, but a t-test can compare only two of your three means (Mozart and relaxation, for example). You could use another t-test to compare the Mozart and silence conditions, but let’s take a moment to consider the possibility of Type I error. Remember that you will never know whether or not you have made a Type I error; you can only estimate the probability that you will make one when the null hypothesis is true. The problem is that you will never know whether the null is true or not! Nevertheless, alpha is our estimate of the probability of making a Type I error. So when you use a t-test to compare the Mozart and relaxation means, and you set your alpha at a reasonable level (e.g., .05), you have a .05 or 5% chance of making a Type I error that time. But when you use a second test to compare the Mozart and silence means, you have ANOTHER 5% chance of making a Type I error in the event the null hypothesis is true, which brings the overall risk to roughly 10%, as the short sketch below shows. How can we solve this problem? One way is to simply use a lower value for alpha, and there’s nothing wrong with that strategy, but there is another way.
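To see how quickly that error rate grows, here is a minimal sketch (in Python, which is not part of the original chapter) of the familywise error rate, 1 − (1 − α)^m, for m independent tests at α = .05. The "roughly 10%" figure above is this quantity for m = 2:

```python
# Familywise Type I error rate for m independent tests at alpha = .05.
# The chapter's "roughly 10%" figure is the m = 2 case, rounded.
alpha = 0.05

for m in (1, 2, 3, 10):
    familywise = 1 - (1 - alpha) ** m
    print(f"{m} test(s): familywise error rate = {familywise:.3f}")

# Output:
# 1 test(s): familywise error rate = 0.050
# 2 test(s): familywise error rate = 0.097
# 3 test(s): familywise error rate = 0.143
# 10 test(s): familywise error rate = 0.401
```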

[Fig. 1.1: ANOVA data. Mean spatial reasoning scores after ten minutes of Mozart, a relaxation tape, or silence.]

ANOVA stands for Analysis Of VAriance, variance in the statistical sense of the term. The word analysis literally means to cut up into smaller parts, and that’s exactly what happens in ANOVA.  Let’s look at some scores that might work for the means listed in Fig. 1.1.

[Data table: five spatial reasoning scores per condition, with each group’s mean in bold (Mozart = 10, relaxation = 7, silence = 6) and its standard deviation below.]

This is just one way the data could have worked out. You can see the means in bold and the standard deviations below each mean. Five scores are shown for each treatment group. These aren’t the actual scores from the study, and the means don’t match the original study, but the small number of scores will make it easier to make a few crucial points.

First, look at the means themselves – they aren’t all the same. Why not? Well, obviously it looks as though the listening condition may contribute to this variability. That is, one of the reasons the numbers aren’t the same is because the people in each condition were listening to something different. In fact, that’s exactly what we’re trying to determine. Because the means aren’t the same, they have some non-zero variability that we can measure with any number of statistics: range, deviation scores (actually the sum of the squared deviation scores, SS), variance, standard deviation, etc. Remember, this isn’t the variability in each group; it’s the variability among the group means.

Of course it might not be the listening condition producing this variability (between-treatments variance). There are lots of reasons why the group means might be different. If this is a between-subjects design, there may be differences among individuals in each group producing the variability. Some people in the treatment group may have performed differently after listening to Mozart for reasons that have nothing to do with the music. It may be the case that people with higher spatial IQ were randomly assigned to this condition and not one of the others. It may be that one or more people were inspired to perform better because they watched Jeopardy the night before the experiment. So there are really three potential sources of variability among these three means: a treatment effect, individual differences, and other effects of chance or random variability.

Let’s take a look at a different kind of variability. Consider the scores in the silence condition. The average is 6, lower than that of the other two groups. But the scores that contribute to this average are not the same. There is variability within this group as well as among the different group means (Mozart, relaxation, and silence). The variability within a group is called within-treatments variance. Why are the scores in this group different from each other? Why didn’t each person in the group score the same? It seems like a silly question when talking about an IQ test; people are different, so we shouldn’t expect them to have the same IQ score. So variability within a group can be attributed to individual differences just as can variability among the groups. Random chance might also play a role; one of the participants in the silence group might have skipped breakfast and thus performed worse than average within that group. Another might have been inspired by a newspaper story the morning of the experiment and performed better than he or she otherwise would have. Variability within an individual group thus has only TWO potential sources: individual differences and otherwise random chance.

Let’s review! The means for each group are different, i.e. there is between-treatments variance.  Why? Three reasons: there are different people in each group, otherwise random chance, and perhaps a treatment effect. The scores within each group are different, i.e. there is within-treatments variance. Why? Two reasons: there are different people within the group, and otherwise random chance.

All the analysis of variance really does is compare the variability among the different means (three means in this example) with the average variability within each group.

We use the same steps of null hypothesis testing as for the t-test:

      1. State the hypotheses
      2. Determine the critical region
      3. Calculate the test statistic
      4. Make your decision

STEP ONE: State the hypotheses

We use ANOVA to test a null hypothesis. Like the null hypothesis for the t-test, the null for ANOVA is that there is no difference among the means of the populations from which our samples were selected:

$$H_0: \mu_{\text{Mozart}} = \mu_{\text{relaxation}} = \mu_{\text{silence}}$$

We calculate the analysis of variance under the assumption of the null hypothesis, then make a decision as to whether we reject this original assumption or not. The rationale for doing this is as follows:

If the treatment actually does something, then we should expect the variability among the three means to be pretty big, relative to the average variability within the groups. We define variability among the group means by how different the means are from each other. If they are much more different from each other than the scores within each group are from each other, then we’ll take that as evidence that the treatment actually did something. We do this all in the context of a hypothesis test, and use the rules of probability to help us make our decision. Again, it’s all about comparing how different the means are from each other to how different the scores within each group are from each other.

In addition to the null hypothesis, we need to set up an alternative hypothesis. Instead of a hypothesis using statistical symbols, you may set it up like this:

$H_1$: There is at least one group difference

We set it up this way because there are MANY different possible combinations for alternative hypotheses that a researcher could come up with. For instance, you might think that all of the group means are different, which would look like this:

$$H_1: \mu_{\text{Mozart}} \neq \mu_{\text{relaxation}} \neq \mu_{\text{silence}}$$

Or, you might think only one group mean is different, which might look like this:

$$H_1: \mu_{\text{Mozart}} = \mu_{\text{relaxation}} \neq \mu_{\text{silence}}$$

It is best to go in with a specific hypothesis, but that is not always possible. You should also note that an ANOVA will only be able to tell you whether or not group differences exist. It will NOT be able to tell you which groups are different. We use post hoc tests or planned comparisons for that.

STEP TWO: Determine the critical region

To find $F_{crit}$, you will need $df_{\text{between treatments}}$ and $df_{\text{within treatments}}$.

$df_{\text{between treatments}} = k - 1$, where k is the number of treatments or conditions. For this dataset, $df_{\text{between treatments}} = 3 - 1 = 2$.

$df_{\text{within treatments}} = N - k$, where N is the total number of scores. For this dataset N = 15, so $df_{\text{within treatments}} = 15 - 3 = 12$.

Sometimes these are referred to as upper and lower, respectively, to correspond with the part of the F-ratio (explained in the next section) they go with. Upper, or between treatments, aligns with the numerator of the ratio; lower, or within treatments, aligns with the denominator. You will also need an alpha value. You will have only one critical value for the F-ratio; it won’t be a one-tailed or two-tailed value as you have used with other statistics. Here is a table: F-Table. The numbers within the table that are in bold represent an alpha value of 0.01; the other numbers (i.e., not in bold) represent an alpha value of 0.05. For an alpha value of 0.05, the critical value for this dataset (df = 2, 12) is 3.89. An F-table is also available on page 541 in the Morling book.
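If you have Python handy, you can look up the same critical value in code rather than in a printed table. This is a sketch assuming scipy is installed; scipy.stats.f.ppf is the inverse CDF of the F distribution:

```python
# Looking up an F critical value in code instead of a printed table.
from scipy.stats import f

alpha = 0.05
df_between = 2   # k - 1 = 3 - 1
df_within = 12   # N - k = 15 - 3

f_crit = f.ppf(1 - alpha, df_between, df_within)
print(f"F critical at alpha {alpha}: {f_crit:.2f}")  # 3.89
```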

STEP THREE: Calculate the test statistic

To understand where the critical values of F come from, it helps to understand how the statistic itself is calculated. Earlier, we stated that ANOVA compares how different the treatment group means are from each other to how different the scores within each group are from each other. It does this as a ratio, dividing the variance among the means (between groups variance) by the variance within the groups (within groups variance). Remember that the variance of a distribution is the sum of the squared deviations from the mean (sum of squares; SS) divided by the number of scores in that distribution minus one (degrees of freedom). To compute the average variance within the groups, we’ll first calculate the SS for each group separately, then add them together to get the sum of squares within treatment groups ($SS_{\text{within treatments}}$). We’ll use the computational formula for SS:

$$SS = \sum X^2 - \frac{(\sum X)^2}{n}$$

applied separately to the Mozart, relaxation, and silence scores. When you add the three group SS values up, you get $SS_{\text{within treatments}} = 56$. But ANOVA stands for analysis of variance. So how do we compute the variance?

We compute sample variance by dividing sum of squares by degrees of freedom, and we’ll do the same thing here. But what are the degrees of freedom for these three samples? The answer is simple: it’s the sum of degrees of freedom for each group, $df_{\text{within treatments}} = (n_{\text{Mozart}} - 1) + (n_{\text{relaxation}} - 1) + (n_{\text{silence}} - 1) = (5-1) + (5-1) + (5-1) = 12$. Note that we just subtracted the number of groups from the total number of scores. If we call the total number of scores N, then $df_{\text{within treatments}} = N - k$, where k is the number of treatments. So the variance would be:

$$s^2 = \frac{SS_{\text{within treatments}}}{df_{\text{within treatments}}} = \frac{56}{12} = 4.67$$

but we’re not going to call it the variance; we’re going to call it the mean squared deviation from the mean, and that’s not a bad name for it. In fact, it’s the adjusted average sum of squares across all the treatment groups. Thus:

$$MS_{\text{within treatments}} = \frac{SS_{\text{within treatments}}}{df_{\text{within treatments}}} = \frac{56}{12} = 4.67$$
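The scores themselves aren’t reproduced in this text, but here is a minimal sketch of this arithmetic with hypothetical scores, chosen only so that the group means (10, 7, and 6) and $SS_{\text{within treatments}} = 56$ match the numbers above:

```python
# Computational formula SS = sum(X^2) - (sum(X))^2 / n, applied per group.
# These scores are hypothetical: the individual group SS values depend on
# them; only the means (10, 7, 6) and their SS total (56) match the chapter.
groups = {
    "mozart":     [7, 9, 10, 11, 13],
    "relaxation": [5, 6, 7, 8, 9],
    "silence":    [3, 4, 6, 8, 9],
}

ss_within = 0.0
for name, scores in groups.items():
    n = len(scores)
    ss = sum(x**2 for x in scores) - sum(scores)**2 / n
    ss_within += ss
    print(f"SS_{name} = {ss:.0f}")

df_within = sum(len(scores) for scores in groups.values()) - len(groups)
ms_within = ss_within / df_within
print(f"SS_within = {ss_within:.0f}, MS_within = {ms_within:.2f}")  # 56, 4.67
```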

This value ($MS_{\text{within treatments}}$) is often called the error term because it represents the amount of variance that may be due to random error, that is, the variability you would naturally find within the groups. So what about measuring the variability among the three different means? Again, our first step is to compute the sum of the squared deviations between the treatment groups, $SS_{\text{between treatments}}$. We could do this by subtracting the mean of the three means from each group mean, squaring those differences, multiplying each by the number of scores in its group, and adding the results across the treatment groups, but it’s much easier to use the following computational formula:

$$SS_{\text{between treatments}} = \sum \frac{T^2}{n} - \frac{G^2}{N}$$

where T is the sum of the scores within a treatment group, n is the number of scores in the treatment group, G is the total of all scores, and N is the number of all scores. These variables may look new to you, but consider the equation for the regular sum of squares:

$$SS = \sum X^2 - \frac{(\sum X)^2}{N}$$

The last part, the square of the sum of the scores divided by the number of scores, is the same as $\frac{G^2}{N}$. The first part, $\frac{T^2}{n}$, is sort of like computing the average squared score, but within each group.

$$SS_{\text{between treatments}} = \frac{50^2}{5} + \frac{35^2}{5} + \frac{30^2}{5} - \frac{115^2}{15} = 500 + 245 + 180 - 881.67 = 43.33$$
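As a check, the definitional approach described a moment ago gives the same answer. The grand mean of the fifteen scores is 115/15 = 7.67, so:

$$SS_{\text{between treatments}} = 5(10 - 7.67)^2 + 5(7 - 7.67)^2 + 5(6 - 7.67)^2 = 27.22 + 2.22 + 13.89 = 43.33$$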

One thing to keep in mind as you are making these calculations is that there is no such thing as negative variability. You should never get a negative number for SS, $SS_{\text{between treatments}}$, or $SS_{\text{within treatments}}$.

Now that you have the sum of squares between treatments, it’s easy to calculate the variance, or mean squared deviation from the mean. Simply divide $SS_{\text{between treatments}}$ by the degrees of freedom between treatments, k − 1. Remember that we are interested in the variability of the three group means in this example, and only two are free to vary: $df_{\text{between treatments}} = k - 1 = 3 - 1 = 2$. Thus:

$$MS_{\text{between treatments}} = \frac{SS_{\text{between treatments}}}{df_{\text{between treatments}}} = \frac{43.33}{2} = 21.67$$

Now that we have our variances, we can compare them according to the following ratio:

$$F = \frac{MS_{\text{between treatments}}}{MS_{\text{within treatments}}} = \frac{21.67}{4.67} = 4.64$$

One way that researchers commonly organize the numbers from an ANOVA is in a table, more specifically a source table. For our dataset, it would look like this:

Source                 SS       df    MS      F
Between treatments     43.33     2    21.67   4.64
Within treatments      56.00    12     4.67
Total                  99.33    14
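If you would like to verify the table, here is a sketch that builds it from the same hypothetical scores used earlier and cross-checks the F value against scipy.stats.f_oneway (scipy is an assumption here; any stats package will do):

```python
# Building the ANOVA source table in code, using the same hypothetical
# scores as above (chosen to match the chapter's totals), and checking
# the result against scipy's one-way ANOVA.
from scipy.stats import f_oneway

mozart     = [7, 9, 10, 11, 13]
relaxation = [5, 6, 7, 8, 9]
silence    = [3, 4, 6, 8, 9]
groups = [mozart, relaxation, silence]

k = len(groups)
N = sum(len(g) for g in groups)
G = sum(sum(g) for g in groups)  # grand total of all scores

ss_between = sum(sum(g)**2 / len(g) for g in groups) - G**2 / N
ss_within = sum(sum(x**2 for x in g) - sum(g)**2 / len(g) for g in groups)

df_between, df_within = k - 1, N - k
ms_between = ss_between / df_between
ms_within = ss_within / df_within
F = ms_between / ms_within

print(f"Between: SS={ss_between:.2f} df={df_between} MS={ms_between:.2f} F={F:.2f}")
print(f"Within:  SS={ss_within:.2f} df={df_within} MS={ms_within:.2f}")

# Cross-check with scipy's one-way ANOVA.
F_scipy, p = f_oneway(mozart, relaxation, silence)
print(f"scipy: F={F_scipy:.2f}, p={p:.4f}")  # F=4.64, p < .05
```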

A source table is often the output given by common statistical software programs, such as the Statistical Package for Social Sciences (SPSS). Now back to our hypothesis test.

Remember our null hypothesis? It was that the means came from virtually the same population. If that’s true, and the treatment really had no effect, then the variability among the means should usually be about the same as the variability among the scores within each group, on average, and F should usually be around 1. In fact, when the null really is true and scores are randomly sampled from a single population and randomly assigned to fake “groups,” the long-run average value of F is close to 1. Of course, it’s possible to get values much greater than 1 when the null is true; imagine randomly placing some extremely high values in one group just by chance. That would make one mean much higher than the others and increase the variability between the groups without affecting the variability within the groups, and F would become very large. But that’s pretty unlikely, as the simulation below shows. This is where our critical value comes in, and it is time to make a decision.
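Here is a small simulation of that idea, a sketch assuming numpy and scipy are available:

```python
# Simulating the null: sample 15 scores from ONE population, split them
# into three arbitrary "groups," and compute F. Repeated many times, the
# F values cluster near 1, and values beyond F_crit = 3.89 occur about
# 5% of the time.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
n_sims = 10_000

fs = []
for _ in range(n_sims):
    scores = rng.normal(loc=100, scale=15, size=15)  # one population
    F, _ = f_oneway(scores[:5], scores[5:10], scores[10:])
    fs.append(F)

fs = np.array(fs)
print(f"average F under the null: {fs.mean():.2f}")         # close to 1
print(f"proportion of F > 3.89:   {np.mean(fs > 3.89):.3f}")  # about .05
```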

STEP FOUR: Make your decision

Now it is time to compare our critical value to our calculated F. If our calculated value is higher, we reject the null hypothesis and say that there is at least one difference among the group means. If our calculated F is lower, we fail to reject the null hypothesis and conclude that we do not have evidence that the group means differ. In this case our calculated F is 4.64, which is larger than our critical value of 3.89; thus, we reject the null and say there is a difference among the groups.

We need to include an interpretation in APA style that states our decision, names the variables, describes what happened, and reports the statistical results. For this study, one example would be:

The researchers examined whether 10 minutes of auditory exposure to Mozart music, a relaxation tape, or silence impacted performance on a spatial IQ test. There was a difference between the groups, F(2, 12) = 4.64, p < .05.

From here, a researcher would want to investigate this difference further using planned comparisons or post hoc tests. 

POST HOC TESTS

ANOVA simply tells you whether or not a difference exists among the groups. It does not tell you which group means are different. You have to use these other techniques to determine that. Planned comparisons are analyses of differences that you expected to find PRIOR to conducting your hypothesis test. Thus, you can plan them ahead of time. You will only test the group means that you expect to be different, not all the possible comparisons. Just as ANOVA reduces your chance of making a Type I error compared with running multiple t-tests, planned comparisons do the same. Planned comparisons can be conducted with many different types of statistical tests. This chapter is going to focus on Post Hoc tests.

Post Hoc tests are conducted when you aren’t sure which groups may be different and your ANOVA has revealed statistically significant differences. Now you will go on to test all the possible comparisons of the groups to determine exactly which groups are different. Two Post Hoc tests that can be used are Tukey’s Honestly Significant Difference (HSD) and the Scheffé Test.

TUKEY’S HONESTLY SIGNIFICANT DIFFERENCE

For Tukey’s HSD test, mean differences between conditions are analyzed. If a difference is larger than the critical value that you calculate (the HSD), then you say the conditions are significantly different. One requirement of Tukey’s HSD test is that all of the conditions have the same number of participants, n.

The following formula is used to compute the critical value:

$$HSD = q\sqrt{\frac{MS_{\text{within treatments}}}{n}}$$

$MS_{\text{within treatments}}$ comes from your prior calculations for the ANOVA. The value for q comes from a table, such as this one. To get q, you need k (i.e., the number of treatments) and $df_{\text{within treatments}}$. For our dataset from the prior section, k = 3 and $df_{\text{within treatments}} = 12$. That would give us a value of q = 3.77 for an alpha of 0.05.

Next you are going to compute a mean difference for each comparison you make and compare it to the Honestly Significant Difference you compute using the formula. For our data, $HSD = 3.77\sqrt{4.67/5} = 3.64$.

Here is a reminder of the data.

[Data table repeated from above: five scores per condition; M = 10 (Mozart), 7 (relaxation), 6 (silence).]

If we compare Mozart and Relaxation, the mean difference is 3. This is smaller than the HSD and indicates they are not significantly different. If we compare Mozart and Silence, the mean difference is 4. This is larger than the HSD and indicates they are significantly different. If we compare Relaxation and Silence, the mean difference is 1. This is smaller than the HSD and indicates they are not significantly different.
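Here are the same three comparisons as a short sketch, with q = 3.77 taken from the studentized range table as above (recent versions of scipy can compute it directly via scipy.stats.studentized_range.ppf(0.95, 3, 12)):

```python
# Tukey's HSD for our three conditions, using the chapter's values.
from itertools import combinations
from math import sqrt

q = 3.77        # studentized range value for k = 3, df = 12, alpha = .05
ms_within = 4.67
n = 5           # scores per group

hsd = q * sqrt(ms_within / n)
print(f"HSD = {hsd:.2f}")  # about 3.64

means = {"mozart": 10, "relaxation": 7, "silence": 6}
for a, b in combinations(means, 2):
    diff = abs(means[a] - means[b])
    verdict = "significant" if diff > hsd else "not significant"
    print(f"{a} vs {b}: |diff| = {diff} -> {verdict}")
```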

The information for these post hoc tests would be included in a results section of a research paper. An example statement might include: 

“Tukey’s HSD post hoc test revealed a significant difference between the Mozart condition and the silence condition; no other differences were found.”

SCHEFFÉ TEST

The Scheffé Test requires a more complex calculation but can be used when your groups have different numbers of participants. To complete a Scheffé Test, you essentially do an ANOVA calculation for each group comparison. You calculate an F-ratio for each comparison, with a different numerator but the SAME denominator you used for the original ANOVA (in this case 4.67). You compare the calculated F to the same critical value you used for the original ANOVA (in this case 3.89).

If we were to compare the Mozart and relaxation conditions, we would first calculate a new $SS_{\text{between treatments}}$, using only the data from those two conditions.

$$SS_{\text{between treatments}} = \frac{50^2}{5} + \frac{35^2}{5} - \frac{85^2}{10} = 500 + 245 - 722.5 = 22.5$$

You will also need a new $MS_{\text{between treatments}}$, calculated using the original k value:

$$MS_{\text{between treatments}} = \frac{22.5}{2} = 11.25$$

Next, for this comparison, you will calculate an F-ratio:

$$F = \frac{11.25}{4.67} = 2.41$$

This F value is not larger than the original critical value so this indicates these conditions are not significantly different.

We repeat this process for all other comparisons.

Here is Mozart and Silence:

$$SS_{\text{between treatments}} = \frac{50^2}{5} + \frac{30^2}{5} - \frac{80^2}{10} = 500 + 180 - 640 = 40$$

$$MS_{\text{between treatments}} = \frac{40}{2} = 20$$

$$F = \frac{20}{4.67} = 4.28$$

This F value is higher than the original critical value, indicating a significant difference between these two groups.

The final comparison is between the relaxation condition and the silence condition.

$$SS_{\text{between treatments}} = \frac{35^2}{5} + \frac{30^2}{5} - \frac{65^2}{10} = 245 + 180 - 422.5 = 2.5$$

$$MS_{\text{between treatments}} = \frac{2.5}{2} = 1.25$$

$$F = \frac{1.25}{4.67} = 0.27$$

Similar to the first comparison, the calculated F value is smaller than the original critical value, indicating there is not a significant difference between these conditions.
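All three Scheffé comparisons can also be wrapped up in a short sketch, using the group totals (T = n × M) implied by the means:

```python
# Scheffe test for each pair: a new SS_between from just the two groups'
# totals, divided by the ORIGINAL df_between (k - 1 = 2), then divided by
# the original MS_within and compared to the original critical value.
from itertools import combinations

k_original = 3
ms_within = 4.67
n = 5
f_crit = 3.89
totals = {"mozart": 50, "relaxation": 35, "silence": 30}  # T = n * mean

for a, b in combinations(totals, 2):
    Ta, Tb = totals[a], totals[b]
    G, N = Ta + Tb, 2 * n
    ss_between = Ta**2 / n + Tb**2 / n - G**2 / N
    ms_between = ss_between / (k_original - 1)
    F = ms_between / ms_within
    verdict = "significant" if F > f_crit else "not significant"
    print(f"{a} vs {b}: F = {F:.2f} -> {verdict}")

# Output: 2.41 (not significant), 4.28 (significant), 0.27 (not significant)
```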

SUMMARY

ANOVA only tells you whether or not a significant difference exists between groups. Post Hoc tests and planned comparisons can tell you which groups are significantly different. Tukey’s HSD and the Scheffé Test are two potential post hoc tests that can be used. They will give you similar results but Tukey’s HSD test can only be used when all of the groups have the same number of participants. The Scheffé Test does not have that restriction but is a more complicated calculation.

 
