Lesson 3: Experiments with a Single Factor - the Oneway ANOVA - in the Completely Randomized Design (CRD)
By the end of this chapter, we will understand how to proceed when the ANOVA tells us that the mean responses differ among our treatment levels (i.e., the levels are significantly different). We will also briefly discuss the situation in which the levels are a random sample from a larger set of possible levels, such as a sample of brands for a product. We will briefly discuss multiple comparison procedures for qualitative factors, and regression approaches for quantitative factors. These are covered in more detail in the STAT 502 course and discussed only briefly here.
We focus more on the design and planning aspects of these situations. By the end of this lesson you should be able to:
- achieve a desired precision when the goal is estimating a parameter,
- achieve a desired level of power when hypothesis testing,
- understand which multiple comparison procedure is appropriate for your situation,
- allocate observations among the k treatment groups,
- understand that the Dunnett test situation has a different optimum allocation, and
- describe the F-test as an example of the General Linear Test.
3.1 - Experiments with One Factor and Multiple Levels
Lesson 3 is the beginning of the one-way analysis of variance part of the course, which extends the two-sample situation to k samples.
Text Reading : In addition to these notes, read Chapter 3 of the text and the online supplement. (If you have the 7th edition, also read 13.1.)
We review the issues related to a single factor experiment, which we see in the context of a Completely Randomized Design (CRD). In a single factor experiment with a CRD, the levels of the factor are randomly assigned to the experimental units. Alternatively, we can think of randomly assigning the experimental units to the treatments or in some cases, randomly selecting experimental units from each level of the factor.
Example 3-1: Cotton Tensile Strength
This is an investigation into the formulation of synthetic fibers that are used to make cloth. The response is tensile strength, the strength of the fiber. The experimenter wants to determine the best level of the cotton in terms of percent, to achieve the highest tensile strength of the fiber. Therefore, we have a single quantitative factor, the percent of cotton combined with synthetic fabric fibers.
The five treatment levels of percent cotton are evenly spaced from 15% to 35%. We have five replicates, five runs on each of the five cotton weight percentages.
The box plot of the results indicates that strength increases as the percent cotton increases, and then seems to drop off rather dramatically above 30%.
Makes you wonder about all of those 50% cotton shirts that you buy?!
The first question asks: does the cotton percent make a difference? (The null hypothesis is that it does not.) Now, it seems that it doesn't take statistics to answer this question. All we have to do is look at the side by side box plots of the data and there appears to be a difference – however, this difference is not so obvious from the table of raw data. A second question, frequently asked when the factor is quantitative: what is the optimal level of cotton if you only want to consider strength?
There is a point that I probably should emphasize now and repeatedly throughout this course. There is often more than one response measurement that is of interest. You need to think about multiple responses in any given experiment. In this experiment, for some reason, we are interested in only one response, tensile strength, whereas in practice the manufacturer would also consider comfort, ductility, cost, etc.
This single factor experiment can be described as a completely randomized design (CRD). The completely randomized design means there is no structure among the experimental units. There are 25 runs which differ only in the percent cotton, and these will be done in random order. If there were different machines or operators, or other factors such as the order or batches of material, this would need to be taken into account. We will talk about these kinds of designs later. This is an example of a completely randomized design where there are no other factors that we are interested in other than the treatment factor percentage of cotton.
Reference: Problem 3.10 of Montgomery (3.8 in the \(7^{th}\) edition)
Analysis of Variance
The Analysis of Variance (ANOVA) is a somewhat misleading name for this procedure. But we call it the analysis of variance because we are partitioning the total variation in the response measurements.
The Model Statement
Each measured response can be written as the overall mean plus the treatment effect plus a random error.
\(Y_{ij} = \mu + \tau_i +\epsilon_{ij}\)
\(i = 1, \dots , a,\) and \( j = 1, \dots , n_i\)
Generally, we will define our treatment effects so that they sum to 0, a constraint on our definition of our parameters, \(\sum \tau_{i}=0\). This is not the only constraint we could choose: one treatment level could be a reference, such as the zero level for cotton, and then everything else would be a deviation from that. However, generally, we will let the effects sum to 0. The experimental error terms are assumed to be normally distributed with zero mean, and if the experiment has constant variance then there is a single variance parameter \(\sigma^2\). All of these assumptions need to be checked. This is called the effects model.
An alternative way to write the model expresses each observation in terms of its own treatment mean, \(E\left(Y_{i j}\right)=\mu_i=\mu+\tau_{i}\), the overall mean plus the treatment effect. This is called the means model and is written as:

\(Y_{ij} = \mu_i +\epsilon_{ij}\)
In looking ahead there is also the regression model. Regression models can also be employed but for now, we consider the traditional analysis of variance model and focus on the effects of the treatment.
Analysis of variance formulas that you should be familiar with by now are provided in the textbook.
The total variation is the sum of the squared deviations of the observations from the overall mean, summed over all a × n observations.
The analysis of variance simply takes this total variation and partitions it into the treatment component and the error component. The treatment component is the difference between the treatment mean and the overall mean. The error component is the difference between the observations and the treatment mean, i.e. the variation not explained by the treatments.
Notice when you square the deviations there are also cross-product terms, (see equation 3-5), but these sum to zero when you sum over the set of observations. The analysis of variance is the partition of the total variation into treatment and error components. We want to test the hypothesis that the means are equal versus at least one is different, i.e.
\(H_0 \colon \mu_{1}=\ldots=\mu_{a}\) versus \(H_a \colon \mu_{i} \neq \mu_{i'}\) for at least one pair \(i \neq i'\)
Corresponding to the sum of squares (SS) are the degrees of freedom associated with the treatments, \(a - 1\), and the degrees of freedom associated with the error, \(a × (n - 1)\), and finally one degree of freedom is due to the overall mean parameter. These add up to the total \(N = a × n\), when the \(n_i\) are all equal to \(n\), or \( N=\sum n_{i}\) otherwise.
The mean square treatment (MST) is the sum of squares due to treatment divided by its degrees of freedom.
The mean square error (MSE) is the sum of squares due to error divided by its degrees of freedom.
If the true treatment means are equal to each other, i.e. the \(\mu_i\) are all equal, then these two quantities should have the same expectation. If they are different, then the treatment component, MST, will be larger. This is the basis for the F-test.

The basic test statistic for testing the hypothesis that the means are all equal is the F ratio, MST/MSE, with degrees of freedom \(a - 1\) and \(a(n - 1)\), or equivalently \(a - 1\) and \(N - a\).

We reject \(H_0\) if this quantity is greater than the \(1-\alpha\) percentile of the F distribution.
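To make this partition concrete, here is a short Python sketch (NumPy and SciPy) that computes the treatment and error sums of squares and the F ratio. The data values used are the tensile strength observations as commonly reported for this cotton example; verify them against the course dataset before relying on them.

```python
import numpy as np
from scipy import stats

# Tensile strength observations at each cotton weight percent
# (values as commonly reported for this example -- check against the course data file)
groups = {
    15: [7, 7, 15, 11, 9],
    20: [12, 17, 12, 18, 18],
    25: [14, 18, 18, 19, 19],
    30: [19, 25, 22, 19, 23],
    35: [7, 10, 11, 15, 11],
}

y = np.concatenate([np.asarray(v, float) for v in groups.values()])
a = len(groups)                        # number of treatment levels
n = len(next(iter(groups.values())))   # replicates per level (balanced design)
N = a * n

grand_mean = y.mean()
group_means = np.array([np.mean(v) for v in groups.values()])

# Partition of the total variation
ss_total = np.sum((y - grand_mean) ** 2)
ss_treat = n * np.sum((group_means - grand_mean) ** 2)
ss_error = ss_total - ss_treat

ms_treat = ss_treat / (a - 1)          # MST
ms_error = ss_error / (N - a)          # MSE
F = ms_treat / ms_error
p = stats.f.sf(F, a - 1, N - a)        # upper-tail p-value

print(f"SS_Treat = {ss_treat:.2f}, SS_E = {ss_error:.2f}")
print(f"F = {F:.2f} on {a - 1} and {N - a} df, p = {p:.5f}")

# SciPy's one-way ANOVA helper gives the same F and p:
F2, p2 = stats.f_oneway(*groups.values())
```

With these data the sketch reproduces SSE = 161.20 and F = 14.76 on 4 and 20 df, the values quoted in the Minitab output below and in the General Linear Test example later in this lesson.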
Example 3-1: Continued - Cotton Weight Percent
Here is the Analysis of Variance table from the Minitab output:
One-way ANOVA: Observations versus Cotton Weight % (ANOVA table and individual 95% CIs for each mean, based on the pooled StDev)
Note the very large F statistic, 14.76. The p-value for this F-statistic is < 0.0005, obtained from an F distribution with 4 and 20 degrees of freedom, pictured below.

We can see that most of the distribution lies between zero and about four. Our statistic, 14.76, is far out in the tail, confirming what the data show: indeed the means are not the same. Hence, we reject the null hypothesis.
Model Assumption Checking
We should check whether the data are approximately normal within each group, and they should certainly have constant variance among the groups. Independence is harder to check, but plotting the residuals in the order in which the operations were done can sometimes detect a lack of independence. The question, in general, is how do we fit the right model to represent the data observed. In this case, there's not too much that can go wrong since we only have one factor and it is a completely randomized design. It is hard to argue with this model.
Let's examine the residuals, which are just the observations minus the predicted values, in this case, treatment means. Hence, \(e_{ij}=y_{ij}-\bar{y}_{i}\).
These plots don't look exactly normal, but at least they don't seem to have any wild outliers. The normal scores plot looks reasonable. The residuals versus order plot shows the residuals in the order in which the observations were taken. This looks a little suspect in that the first six data points all have small negative residuals, a pattern not reflected in the following data points. Does this look like it might be a startup problem? These are the kinds of clues that you look for... if you are conducting this experiment you would certainly want to find out what was happening in the beginning.
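If you want to produce these residual diagnostics yourself, here is a minimal sketch using matplotlib and SciPy. It assumes the `groups` dictionary from the ANOVA sketch above; note that concatenating the data by treatment does not recover the actual run order, so the order plot below is only illustrative unless the data are stored in run order.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# residuals e_ij = y_ij - ybar_i, using the 'groups' dict from the ANOVA sketch above
resid = np.concatenate([np.asarray(v, float) - np.mean(v) for v in groups.values()])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

stats.probplot(resid, dist="norm", plot=ax1)          # normal scores (Q-Q) plot
ax1.set_title("Normal scores plot of residuals")

ax2.plot(np.arange(1, len(resid) + 1), resid, "o-")   # residuals vs. observation order
ax2.axhline(0, color="gray", linewidth=1)
ax2.set_xlabel("Observation order")
ax2.set_ylabel("Residual")
ax2.set_title("Residuals vs. order")

plt.tight_layout()
plt.show()
```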
Post-ANOVA Comparison of Means
So, we found the means are significantly different. Now what? In general, if we had a qualitative factor rather than a quantitative factor we would want to know which means differ from which others. We would probably want to do t-tests or Tukey studentized range comparisons, or some set of contrasts to examine the differences in means. There are many multiple comparison procedures.
Two methods, in particular, are Fisher's Least Significant Difference (LSD), and the Bonferroni Method. Both of these are based on the t -test. Fisher's LSD says do an F -test first and if you reject the null hypothesis, then just do ordinary t -tests between all pairs of means. The Bonferroni method is similar but only requires that you decide in advance how many pairs of means you wish to compare, say g, and then perform the g t -tests with a type I level of \(\alpha / g\). This provides protection for the entire family of g tests that the type I error is no more than \(\alpha \). For this setting, with a treatments, g = a ( a -1)/2 when comparing all pairs of treatments.
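As a minimal sketch of the Bonferroni idea (again assuming the `groups` dictionary defined in the ANOVA sketch above), we can run the g pairwise t-tests and compare each p-value with α/g. This simple version uses ordinary two-sample t-tests; Fisher's LSD as described here would instead use the pooled MSE from the ANOVA in the denominator.

```python
from itertools import combinations
from scipy import stats

alpha = 0.05
pairs = list(combinations(groups.keys(), 2))
g = len(pairs)                       # g = a(a - 1)/2 pairwise comparisons

for i, j in pairs:
    t, p = stats.ttest_ind(groups[i], groups[j])
    # Bonferroni: declare a difference only if p < alpha / g
    verdict = "significant" if p < alpha / g else "not significant"
    print(f"{i}% vs {j}%: t = {t:5.2f}, p = {p:.4f} ({verdict} at family alpha = {alpha})")
```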
All of these multiple comparison procedures are simply aimed at interpreting or understanding the overall F -test --which means are different? They apply to many situations especially when the factor is qualitative. However, in this case, since cotton percent is a quantitative factor, doing a test between two arbitrary levels e.g. 15% and 20% level, isn't really what you want to know. What you should focus on is the whole response function as you increase the level of the quantitative factor, cotton percent.
Whenever you have a quantitative factor you should be thinking about modeling that relationship with a regression function.
Review the video that demonstrates the use of polynomial regression to help explain what is going on.
Here is the Minitab output where regression was applied:
Polynomial Regression Analysis: Observation versus Cotton Weight %
The regression equation is:

Observations = 62.61 − 9.011 Cotton Weight % + 0.4814 Cotton Weight %**2 − 0.007600 Cotton Weight %**3
Sequential Analysis of Variance
Here is a link to the Cotton Weight % dataset ( cotton_weight.mwx | cotton_weight.csv ). Open this in Minitab so that you can try this yourself.
You can see that the linear term in the regression model is not significant but the quadratic is highly significant. Even the cubic term is significant with p -value = 0.015. In Minitab we can plot this relationship in the fitted line plot as seen below:
This shows the actual fitted equation. Why wasn't the linear term significant? If you just fit a straight line to these data it would be almost flat, not quite but almost. As a result, the linear term by itself is not significant. We should still leave it in the polynomial regression model, however, because we like to have a hierarchical model when fitting polynomials. What we can learn from this model is that the tensile strength is probably highest somewhere between 25% and 30% cotton weight.
This is a more focused conclusion than we get from simply comparing the means of the actual levels in the experiment because the polynomial model reflects the quantitative relationship between the treatment and the response.
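A sketch of this polynomial fit in Python (again assuming the `groups` dictionary defined earlier). If the raw, uncentered polynomial is fit, the coefficients should agree with the Minitab equation above up to rounding, and the fitted curve can be searched for the cotton percent giving the highest predicted strength.

```python
import numpy as np

# expand the grouped data into (x, y) pairs
x = np.concatenate([[pct] * len(v) for pct, v in groups.items()]).astype(float)
y = np.concatenate([np.asarray(v, float) for v in groups.values()])

coefs = np.polyfit(x, y, deg=3)           # cubic polynomial in cotton weight %
cubic = np.poly1d(coefs)

grid = np.linspace(15, 35, 201)
best = grid[np.argmax(cubic(grid))]
print("fitted cubic coefficients (highest degree first):", np.round(coefs, 5))
print(f"predicted strength is maximized near {best:.1f}% cotton")
```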
We should also check whether the observations have constant variance \(\sigma^2\) across all treatments; this is an assumption of the analysis and we need to confirm it. We can test it with Bartlett's test or Levene's test, or simply use the 'eyeball' technique of plotting the residuals versus the fitted values and seeing whether the spreads are roughly equal. The eyeball approach is almost as good as using these tests, since by testing we cannot 'prove' the null hypothesis.
Bartlett's test is very susceptible to non-normality because it is based on the sample variances, which are not robust to outliers. We must assume that the data are normally distributed and thus not very long-tailed. When one of the residuals is large and you square it, you get a very large value which explains why the sample variance is not very robust. One or two outliers can cause any particular variance to be very large. Thus simply looking at the data in a box plot is as good as these formal tests. If there is an outlier you can see it. If the distribution has a strange shape you can also see this in a histogram or a box plot. The graphical view is very useful in this regard.
Levene's test is preferred to Bartlett’s in my view because it is more robust. To calculate the Levene's test you take the observations and obtain (not the squared deviations from the mean but) the absolute deviations from the median. Then, you simply do the usual one way ANOVA F -test on these absolute deviations from the medians. This is a very clever and simple test that has been around for a long time, created by Levene back in the 1950s. It is much more robust to outliers and non-normality than Bartlett's test.
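Both tests are available in SciPy; `levene` with `center='median'` uses the absolute deviations from the group medians, exactly the version described above. A sketch, again assuming the `groups` dictionary from the ANOVA sketch:

```python
from scipy import stats

samples = list(groups.values())

stat_b, p_b = stats.bartlett(*samples)                    # sensitive to non-normality
stat_l, p_l = stats.levene(*samples, center="median")     # absolute deviations from medians

print(f"Bartlett: statistic = {stat_b:.3f}, p = {p_b:.3f}")
print(f"Levene (median-centered): statistic = {stat_l:.3f}, p = {p_l:.3f}")
# Large p-values are consistent with the constant-variance assumption.
```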
3.2 - Sample Size Determination
An important aspect of designing an experiment is to know how many observations are needed to make conclusions of sufficient accuracy and with sufficient confidence. We review what we mean by this statement. The sample size needed depends on many things, including what type of experiment is being contemplated, how it will be conducted, the resources available, and the desired sensitivity and confidence.
Sensitivity refers to the difference in means that the experimenter wishes to detect, i.e., sensitive enough to detect important differences in the means.
Generally, increasing the number of replications increases the sensitivity and makes it easier to detect small differences in the means. Both power and the margin of error are a function of n and a function of the error variance. Most of this course is about finding techniques to reduce this unexplained residual error variance, and thereby improving the power of hypothesis tests, and reducing the margin of error in estimation.
Hypothesis Testing Approach to Determining Sample Size
Our usual goal is to test the hypothesis that the means are equal, versus the alternative that the means are not equal.
The null hypothesis that the means are all equal implies that the \(\tau_i\)'s are all equal to 0. Under this framework, we want to calculate the power of the F -test in the fixed effects case.
Example 3.2: Blood Pressure
Consider the situation where we have four treatment groups that will be using four different blood pressure drugs, a = 4 . We want to be able to detect differences between the mean blood pressure for the subjects after using these drugs.
One possible scenario is that two of the drugs are effective and two are not. e.g. say two of them result in blood pressure at 110 and two of them at 120. In this case the sum of the \(\tau_{i}^{2}\) for this situation is 100, i.e. \(\tau_i = (-5, -5, 5, 5) \) and thus \(\Sigma \tau_{i}^{2}=100\).
Another scenario is the situation where we have one drug at 110, two of them at 115 and one at 120. In this case the sum of the \(\tau_{i}^{2}\) is 50, i.e. \(\tau_i = (-5, 0, 0, 5) \) and thus \(\Sigma \tau_{i}^{2}=50\).
Considering both of these scenarios: although the difference between the minimum and maximum means is the same in both, the quantities \(\Sigma \tau_{i}^{2}\) are very different.
Of the two scenarios, the second is the least favorable configuration (LFC). It is the configuration of means for which you get the least power. The first scenario would be much more favorable. But generally, you do not know which situation you are in. The usual approach is to not to try to guess exactly what all the values of the \(\tau_i\) will be but simply to specify \(\delta\), which is the maximum difference between the true means, or \(\delta = \text{max}(\tau_i) - \text{min}(\tau_i)\).
Going back to our LFC scenario, we can calculate \(\Sigma \tau_{i}^{2} = \delta^{2} /2\), i.e. the maximum difference squared over 2. This holds for the LFC for any number of treatments: since all but the two extreme values of \(\tau_i\) are zero under the LFC, \(\Sigma \tau_i^{2} = 2 \times (\delta/2)^2 = \delta^2 / 2\).
The Use of Operating Characteristic Curves
The OC curves for the fixed effects model are given in Appendix V of the text.
The usual way to use these charts is to define the difference in the means, \(\delta = \text{max}(\mu_i) - \text{min}(\mu_i)\), that you want to detect, specify the value of \(\sigma^2\), and then for the LFC use :
\(\Phi^2=\dfrac{n\delta^2}{2a\sigma^2}\)
for various values of n. Appendix V gives \(\beta\), where \(1 - \beta\) is the power for the test with \(\nu_1 = a - 1\) and \(\nu_2 = a(n - 1)\). Thus, after setting n, you must calculate \(\nu_1\) and \(\nu_2\) to use the table.
Example: We consider an \(\alpha = 0.05\) level test for \(a = 4\) using \(\delta = 10\) and \(\sigma^2 = 144\) and we want to find the sample size n to obtain a test with power = 0.9.
Let's guess at a value of n and see how this works. Say we let n equal 20, with \(\delta = 10\) and \(\sigma = 12\); then we can calculate the power using Appendix V. Plugging in these values to find \(\Phi\), we get \(\Phi = 1.3\).
Now go to the chart where \(\nu_2\) is 80 - 4 = 76 and \(\Phi = 1.3\). This gives us a Type II error of \(\beta = 0.45\) and \(\text{power} = 1 - \beta = 0.55\).
It seems that we need a larger sample size.
Well, let's use a sample size of 30. In this case we get \(\Phi^2 = 2.604\), so \(\Phi = 1.6\).
Now with \(\nu_2\) a bit more at 116, we have \(\beta = 0.30\) and power = 0.70.
So we need a sample size larger than n = 30 per group to achieve a test with the desired power of 0.9.
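The same power calculation can be done without the OC charts by working with the noncentral F distribution directly: under the LFC the noncentrality parameter is \(\lambda = n \Sigma \tau_i^2 / \sigma^2 = n\delta^2/(2\sigma^2) = a\Phi^2\). Here is a sketch in Python; the powers it returns for n = 20 and n = 30 should be close to the (approximate) values read from Appendix V above.

```python
from scipy import stats

def lfc_power(n, a, delta, sigma, alpha=0.05):
    """Power of the one-way ANOVA F-test under the least favorable configuration."""
    nu1, nu2 = a - 1, a * (n - 1)
    lam = n * delta**2 / (2 * sigma**2)        # noncentrality parameter = a * Phi^2
    fcrit = stats.f.ppf(1 - alpha, nu1, nu2)   # rejection cutoff under H0
    return stats.ncf.sf(fcrit, nu1, nu2, lam)  # P(reject | LFC alternative)

for n in (20, 30, 40, 50):
    print(f"n = {n}: power = {lfc_power(n, a=4, delta=10, sigma=12):.3f}")
```

Increasing n until the printed power reaches the target gives the required sample size per group.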
Review the video below for a 'walk-through' this procedure using Appendix V in the back of the text.
3.3 - Multiple Comparisons
Scheffé's Method
Scheffé's method for investigating all possible contrasts of the means corresponds exactly to the F -test in the following sense. If the F -test rejects the null hypothesis at level \(\alpha\), then there exists at least one contrast which would be rejected using the Scheffé procedure at level \(\alpha\). Therefore, Scheffé provides \(\alpha\) level protection against rejecting the null hypothesis when it is true, regardless of how many contrasts of the means are tested.
Fisher's LSD
Fisher's LSD is the F-test, followed by ordinary t-tests among all pairs of means, but only if the F-test rejects the null hypothesis. The F-test provides the overall protection against rejecting \(H_0\) when it is true. The t-tests are each performed at \(\alpha\) level and thus likely will reject more than they should, when the F-test rejects. A simple example may explain this statement: assume there are eight treatment groups, one treatment has a mean higher than the other seven, which all have the same value, and the F-test rejects \(H_0\). However, when following up with the pairwise t-tests, the \(7 \times 6 / 2 = 21\) pairwise t-tests among the seven means which are all equal will, by chance alone, likely reject at least one pairwise hypothesis, \(H_0 \colon \mu_i = \mu_{i^{\prime}}\), at \(\alpha = 0.05\). Despite this drawback, Fisher's LSD remains a favorite method since it has overall \(\alpha\) level protection and is simple to understand and interpret.
Bonferroni Method
Bonferroni method for \(g\) comparisons – use \(\alpha / g\) instead of \(\alpha\) for testing each of the \(g\) comparisons.
Comparing the Bonferroni Procedure with Fisher's LSD

Fisher's LSD method is an alternative to other pairwise comparison methods (for post-ANOVA analysis). This method controls the \(\alpha\text{-level}\) error rate for each pairwise comparison, so it does not control the family error rate. The procedure uses the t statistic for testing \(H_0 \colon \mu_i = \mu_j\) for all i and j pairs:
\(t=\dfrac{\bar{y}_i-\bar{y}_j}{\sqrt{MSE(\frac{1}{n_i}+\frac{1}{n_j})}}\)
Alternatively, the Bonferroni method does control the family error rate, by performing the pairwise comparison tests at the \(\alpha/g\) level of significance, where g is the number of pairwise comparisons. Hence, the Bonferroni confidence intervals for differences of the means are wider than those of Fisher's LSD. In addition, it can be easily shown that the p-value of each pairwise comparison calculated by the Bonferroni method is g times the p-value calculated by Fisher's LSD method.
Tukey's Studentized Range
Tukey’s Studentized Range considers the differences among all pairs of means divided by the estimated standard deviation of the mean and compares them with the tabled critical values provided in Appendix VII. Why is it called the studentized range? The denominator uses an estimated standard deviation, hence, the statistic is studentized like the student t -test. The Tukey procedure assumes all \(n_i\) are equal say to \(n\).
\(q=\dfrac{\bar{y}_i-\bar{y}_j}{\sqrt{MSE(\frac{1}{n})}}\)
Comparing the Tukey Procedure with the Bonferroni Procedure
The Bonferroni procedure is a good all around tool, but for all pairwise comparisons the Tukey studentized range procedure is slightly better as we show here.
The studentized range is the distribution of the difference between the maximum and minimum sample means divided by the estimated standard error of a single mean. When we compare this with a t-test, or with the Bonferroni adjustment where g is the number of comparisons, we have to be careful not to compare apples and oranges: in one case (Tukey) the statistic's denominator is the standard error of a single mean, and in the other case (t-test) it is the standard error of the difference between two means, as seen in the equations for t and q above.
Example 3.3: Tukey vs. Bonferroni approaches
Here is an example we can work out. Let's say we have 5 means, so a = 5, we will let \(\alpha = 0.05\), and the total number of observations N = 35, so each group has seven observations and df = 30.
If we look at the studentized range distribution for 5, 30 degrees of freedom, we find a critical value of 4.11.
If we took a Bonferroni approach, we would use \(g = 5 \times 4 / 2 = 10\) pairwise comparisons since a = 5. Thus, again for an \(\alpha = 0.05\) test, all we need to look at is the t-distribution for \(\alpha / (2g) = 0.0025\) and N − a = 30 df. Looking at the t-table we get the value 3.03. However, to compare with the Tukey studentized range statistic, we need to multiply the tabled critical value by \(\sqrt{2} = 1.414\); therefore 3.03 × 1.414 = 4.28, which is slightly larger than the 4.11 obtained from the Tukey table.
The point that we want to make is that the Bonferroni procedure is slightly more conservative than the Tukey result, since the Tukey procedure is exact in this situation whereas the Bonferroni is only approximate.
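This critical-value comparison is easy to reproduce with SciPy, which includes the studentized range distribution in recent versions (1.7 or later); a sketch for a = 5 means and 30 error df:

```python
import numpy as np
from scipy import stats

a, df, alpha = 5, 30, 0.05
g = a * (a - 1) // 2                                       # 10 pairwise comparisons

q_tukey = stats.studentized_range.ppf(1 - alpha, a, df)    # Tukey critical value
t_bonf = stats.t.ppf(1 - alpha / (2 * g), df)              # Bonferroni-adjusted t

print(f"Tukey q critical value: {q_tukey:.2f}")
print(f"Bonferroni t * sqrt(2): {t_bonf * np.sqrt(2):.2f}")  # same scale as q
```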
Tukey's procedure is exact for equal sample sizes. However, there is an approximate procedure called the Tukey-Kramer test for unequal \(n_i\).
If you are looking at all pairwise comparisons then Tukey's exact procedure is probably the best procedure to use. The Bonferroni, however, is a good general procedure.
Contrasts of Means
A pairwise comparison is just one example of a contrast of the means. A general contrast can be written as a set of coefficients of the means that sum to zero. This will often involve more than just a pair of treatments. In general, we can write a contrast to make any comparison we like. We will also consider sets of orthogonal contrasts.
Example 3.4: Gas Mileage
We want to compare the gas mileage on a set of cars: Ford Escape (hybrid), Toyota Camry, Toyota Prius (hybrid), Honda Accord, and the Honda Civic (hybrid). A consumer testing group wants to test each of these cars for gas mileage under certain conditions. They take n prescribed test runs and record the mileage for each vehicle.
Now they first need to define some contrasts among these means. A contrast is a set of coefficients that defines a meaningful comparison of the means; the coefficients of each contrast must sum to 0, and for convenience we only use integers. Then they can test and estimate these contrasts. For the first contrast, \(C_1\), they could compare the American brand to the foreign brands. How about comparing Toyota to Honda (that is \(C_2\)), or hybrid compared to non-hybrid (that is \(C_3\))?

So the first three contrasts specify the comparisons just described, and \(C_4\) and \(C_5\) are comparisons within the two brands that have two models each.

After we develop a set of contrasts, we can then test these contrasts or estimate them. We can also calculate a confidence interval for the true contrast of the means by using the estimated contrast ± the t-distribution value times the estimated standard deviation of the contrast. See equation 3-30 in the text.
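Here is an illustrative sketch in Python. The contrast coefficients below are one plausible set for the comparisons just described, not necessarily the ones in the course table, and the helper function simply computes the estimate and standard error that go into the confidence interval of equation 3-30.

```python
import numpy as np

# Treatment order: Escape (hybrid), Camry, Prius (hybrid), Accord, Civic (hybrid)
# Hypothetical coefficients -- one plausible set, for illustration only
C1 = np.array([ 4, -1, -1, -1, -1])   # American (Ford) vs. foreign brands
C2 = np.array([ 0,  1,  1, -1, -1])   # Toyota vs. Honda
C3 = np.array([ 2, -3,  2, -3,  2])   # hybrid vs. non-hybrid
C4 = np.array([ 0,  1, -1,  0,  0])   # within Toyota: Camry vs. Prius
C5 = np.array([ 0,  0,  0,  1, -1])   # within Honda: Accord vs. Civic

def estimate_contrast(c, ybar, mse, n):
    """Estimate and standard error of a contrast of treatment means
    (equal sample size n per treatment)."""
    est = float(np.dot(c, ybar))
    se = float(np.sqrt(mse * np.sum(c.astype(float) ** 2) / n))
    return est, se

# Orthogonality check: the sum of products of coefficients should be zero
print("C2 . C3 =", np.dot(C2, C3))    # 0 -> orthogonal
print("C1 . C3 =", np.dot(C1, C3))    # nonzero -> not orthogonal

# Hypothetical mileage means (mpg), MSE, and n -- for illustration only
ybar = np.array([32.0, 31.0, 50.0, 30.0, 42.0])
est, se = estimate_contrast(C3, ybar, mse=4.0, n=10)
print(f"hybrid vs. non-hybrid contrast: estimate = {est:.1f}, SE = {se:.2f}")
```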
Concerning Sets of Multiple Contrasts
Scheffé's Method provides \(\alpha\text{-level}\) protection for all possible contrasts - especially useful when we don't really know how many contrasts we will have in advance. This test is quite conservative because it is valid for all possible contrasts of the means. The Scheffé procedure is equivalent to the F-test in the sense that if the F-test rejects, there will be some contrast whose confidence interval does not contain zero.
What is an orthogonal contrast?
Two contrasts are orthogonal if the sum of the products of their corresponding coefficients is zero. Each contrast in an orthogonal set is also orthogonal to the overall mean, since its coefficients sum to zero.
Look at the table of contrast coefficients (for example, the illustrative set sketched above) and locate which contrasts are orthogonal.
There always exist \(a - 1\) orthogonal contrasts of \(a\) means. When the sample sizes are equal, the sums of squares for these contrasts add up to the sum of squares due to treatment: any set of \(a - 1\) orthogonal contrasts partitions the variation so that the total corresponding to those contrasts equals the total sum of squares among treatments. When the sample sizes are not equal, the definition of orthogonal contrasts involves the sample sizes.
Dunnett's Procedure
Dunnett’s procedure is another multiple comparison procedure specifically designed to compare each treatment to a control. If we have a groups, let the last one be a control group and the first a - 1 be treatment groups. We want to compare each of these treatment groups to this one control. Therefore, we will have a - 1 contrasts or a - 1 pairwise comparisons. To perform multiple comparisons on these a - 1 contrasts we use special tables for finding hypothesis test critical values, derived by Dunnett.
Comparing Dunnett’s procedure to the Bonferroni procedure
We can compare the Bonferroni approach to the Dunnett procedure. The Dunnett procedure calculates the difference of means for the control versus treatment one, control versus treatment two, and so on, which provides a − 1 pairwise comparisons.
So, we now consider an example where we have six groups, a = 6, and t = 5 and n = 6 observations per group. Then, Dunnett's procedure will give the critical point for comparing the difference of means. From the table, we get \(\alpha =0.05\) two-sided comparison d ( a -1, f ) = 2.66, where a - 1 = 5 and f = df = 30.
Using the Bonferroni approach, if we look at the t -distribution for g = 5 comparisons and a two-sided test with 30 degrees of freedom for error we get 2.75.
Comparing the two, we can see that the Bonferroni approach is a bit more conservative. Dunnett's is an exact procedure for comparing a control to the a − 1 treatments; Bonferroni is a general tool but not exact. However, there is not much of a difference in this example.
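For completeness, recent SciPy versions (1.11 or later) include an exact Dunnett procedure. The sketch below applies it to hypothetical data for one control and five treatments of six observations each, and contrasts it with Bonferroni-adjusted t-tests; the simple two-sample t-tests used here are a simplification of the pooled-MSE comparisons described above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(120, 12, size=6)                          # hypothetical control group
treatments = [rng.normal(m, 12, size=6) for m in (118, 115, 112, 120, 117)]

# Exact Dunnett comparisons of each treatment against the control (SciPy >= 1.11)
res = stats.dunnett(*treatments, control=control)
print("Dunnett p-values:", np.round(res.pvalue, 3))

# Bonferroni alternative: ordinary two-sided t-tests judged at alpha / (a - 1)
alpha, g = 0.05, len(treatments)
for k, trt in enumerate(treatments, start=1):
    t, p = stats.ttest_ind(trt, control)
    verdict = "reject" if p < alpha / g else "do not reject"
    print(f"treatment {k} vs. control: p = {p:.3f} -> {verdict} at alpha/g")
```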
Fisher's LSD has the practicality of always using the same measuring stick, the unadjusted t-test. Everyone knows that if you do a lot of these tests, then for roughly every 20 tests you do, one could be wrong by chance. The multiple comparison procedures are ways of handling this uncertainty. All of these methods protect you from making too many Type I errors, whether you are doing hypothesis testing or constructing confidence intervals. In your lifetime how many tests are you going to do?
So in a sense, you have to ask yourself the question of what is the set of tests that I want to protect against making a Type I error. So, in Fisher's LSD procedure each test is standing on its own and is not really a multiple comparisons test. If you are looking for any type of difference and you don't know how many you are going to end up doing, you should probably be using Scheffé to protect you against all of them. But if you know it is all pairwise and that is it, then Tukey's would be best. If you're comparing a bunch of treatments against a control then Dunnett's would be best.
There is a whole family of step-wise procedures which are now available, but we will not consider them here. Each can be shown to be better in certain situations. Another approach to this problem is called False Discovery Rate control. It is used when there are hundreds of hypotheses - a situation that occurs for example in testing gene expression of all genes in an organism, or differences in pixel intensities for pixels in a set of images. The multiple comparisons procedures discussed above all guard against the probability of making one false significant call. But when there are hundreds of tests, we might prefer to make a few false significant calls if it greatly increases our power to detect the true difference. False Discovery Rate methods attempt to control the expected percentage of false significant calls among the tests declared significant.
3.4 - The Optimum Allocation for the Dunnett Test
The Dunnett test for comparing means is a multiple comparison procedure that is specifically designed to test t treatments against a control.
We compared the Dunnett test to the Bonferroni - and there was only a slight difference, reflecting the fact that the Bonferroni procedure is an approximation. This is a situation where we have a = t + 1 groups; a control group and t treatments.
I like to think of an example where we have a standard therapy, (a control group), and we want to test t new treatments to compare them against the existing acceptable therapy. This is a case where we are not so much interested in comparing each of the treatments against each other, but instead, we are interested in finding out whether each of the new treatments is better than the original control treatment.
We have \(Y_{ij}\) distributed with mean \(\mu_i\) and variance \(\sigma^{2}\), where \(i = 1, \dots , t \text{ and } j = 1, \dots , n_i\), for the t treatment groups, and a control group with mean \(\mu_0\) and the same variance \(\sigma^2\).
We are assuming equal variance among all treatment groups.
The question that I want to address here is the design question.
The Dunnett procedure is based on t comparisons for testing \(H_0\) that \(\mu_i = \mu_0\), for \(i = 1, \dots , t\). This is really t different tests where t = a - 1.
The \(H_A\) is that the \(\mu_i\) are not equal to \(\mu_0\).
Or, viewing this as an estimation problem, we want to estimate the t differences \(\mu_i - \mu_0\).
How Should We Allocate Our Observations?
This is the question we are trying to answer. We have a fixed set of resources and a budget that allows for only N observations. So, how should we allocate our resources?
Should we assign half to the control group and the rest spread out among the treatments? Or, should we assign an equal number of observations among all treatments and the control? Or what?
We want to answer this question by seeing how we can maximize the power of these tests with the N observations that we have available. We approach this using an estimation approach where we want to estimate the t differences \(\mu_i - \mu_0\). Let's estimate the variance of these differences.
What we want to do is minimize the total variance. Remember that the variance of \((\bar{y}_i-\bar{y}_0)\) is \(\sigma^{2} / n_i + \sigma^{2} / n_0\). The total variance is the sum of these t parts.
We need to find \(n_0\), and \(n_i\) that will minimize this total variance. However, this is subject to a constraint, the constraint being that \(N = n_0 + (t \times n)\), if the \(n_i = n\) for all treatments, an assumption we can reasonably make when all treatments are of equal importance.
Given N observations and a groups, where \(a = t + 1\):
the model is:
\(y_{ij} = \mu_i + \epsilon_{ij}\), where \(i = 0, 1, \dots , t\) and \(j = 1, \dots , n_i\)
sample mean: \(\bar{y}_{i.}=\dfrac{1}{n_i} \sum\limits_j^{n_i} y_{ij}\) and \(Var(\bar{y}_{i.})=\dfrac{\sigma^2}{n_i}\)
Furthermore, \(Var(\bar{y}_{i.}-\bar{y}_0)=\dfrac{\sigma^2}{n_i}+\dfrac{\sigma^2}{n_0}\)
Use \(\hat{\sigma}^2=MSE\) and assume \(n_i= n\) for \(i = 1, \dots , t\).
Then the Total Sample Variance is \(TSV=\sum\limits_{i=1}^t \widehat{var} (\bar{y}_{i.}-\bar{y}_{0.})=t\left(\dfrac{\sigma^2}{n}+\dfrac{\sigma^2}{n_0}\right)\)
We want to minimize \(t\sigma^2(\frac{1}{n}+\frac{1}{n_0})\) where \(N = tn + n_0\)
This is a Lagrange multiplier problem (calculus): minimize \(TSV + \lambda(N - tn - n_0)\):

1) \(\dfrac{\partial(\ast)}{\partial n}=\dfrac{-t\sigma^2}{n^2}-\lambda t=0\)
2) \(\dfrac{\partial(\ast)}{\partial n_0}=\dfrac{-t\sigma^2}{n_0^2}-\lambda =0\)
From 2) \(\lambda=\dfrac{-t\sigma^2}{n_0^2}\) we can then substitute into 1) as follows:
\(\dfrac{-t\sigma^2}{n^2}=\lambda t=\dfrac{-t^2\sigma^2}{n_0^2} \Longrightarrow n^2=\dfrac{n_0^2}{t} \Longrightarrow n=\dfrac{n_0}{\sqrt{t}} \Longrightarrow n_0=n \sqrt{t}\)
Therefore, from \(N=tn+n_0=tn+\sqrt{t} n=n(t+\sqrt{t})\Longrightarrow n=\dfrac{N}{(t+\sqrt{t})}\)
When this is all worked out we have a nice simple rule to guide our decision about how to allocate our observations:
\(n_{0}=n\sqrt{t}\)
Or, the number of observations in the control group should be the square root of the number of treatments times the number of observations in the treatment groups.
If we want to get the exact n based on our resources, let \(n=N/(t+\sqrt{t})\) and \(n_{0}=\sqrt{t}\times n\) and then round to the nearest integers.
Back to our example...
In our example, we had N = 60 and t = 4. Plugging these values into the equation above gives us \(n = 10\) and \(n_0 = 20\). We should allocate 20 observations in the control and 10 observations in each of the treatments. The purpose is not to compare each of the new drugs to each other but rather to answer whether or not the new drug is better than the control.
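A few lines of code capture the allocation rule; the rounding at the end is only a convenience, and for some N and t a small adjustment is needed so that \(tn + n_0 = N\) exactly.

```python
import math

def dunnett_allocation(N, t):
    """Split N observations between one control group and t equally important
    treatment groups so that the total variance of the t treatment-vs-control
    differences is minimized (n0 = sqrt(t) * n)."""
    n = N / (t + math.sqrt(t))        # per-treatment sample size
    n0 = math.sqrt(t) * n             # control-group sample size
    return round(n), round(n0)

print(dunnett_allocation(60, 4))      # -> (10, 20), as in the example above
```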
These calculations demonstrate once again, that the design principles we use in this course are almost always based on trying to minimize the variance and maximizing the power of the experiment. Here is a case where equal allocation is not optimal because you are not interested equally in all comparisons. You are interested in specific comparisons i.e. treatments versus the control, so the control takes on special importance. In this case, we allocate additional observations to the control group for the purpose of minimizing the total variance.
3.5 - One-way Random Effects Models
With quantitative factors, we may want to make inference to levels not measured in the experiment by interpolation or extrapolation on the measurement scale. With categorical factors, we may only be able to use a subset of all possible levels - e.g. brands of popcorn - but we would still like to be able to make inference to other levels. Imagine that we randomly select a of the possible levels of the factor of interest. In this case, we say that the factor is random. As before, the usual single factor ANOVA applies which is
\(y_{ij}=\mu +\tau_i+\varepsilon_{ij} \left\{\begin{array}{c} i=1,2,\ldots,a \\ j=1,2,\ldots,n \end{array}\right. \)
However, here both the error term and treatment effects are random variables, that is
\(\varepsilon_{ij}\ \mbox{is } NID(0,\sigma^2)\ \mbox{and}\ \tau_i\ \mbox{is } NID(0,\sigma^2_{\tau})\)
Also, \(\tau_i\) and \(\epsilon_{ij}\) are independent. The variances \(\sigma^2_{\tau} \) and \(\sigma^2\) are called variance components.
In the fixed effect models we test the equality of the treatment means. However, this is no longer appropriate because treatments are randomly selected and we are interested in the population of treatments rather than any individual one. The appropriate hypothesis test for a random effect is:
\(H_0\colon\sigma^2_{\tau}=0\) versus \(H_1\colon\sigma^2_{\tau}>0\)
The standard ANOVA partition of the total sum of squares still works and leads to the usual ANOVA display. However, as before, the form of the appropriate test statistic depends on the Expected Mean Squares. In this case, the appropriate test statistic would be
\(F_0=MS_{Treatments}/MS_E\)
which follows an F distribution with a − 1 and N − a degrees of freedom. Furthermore, we are also interested in estimating the variance components \(\sigma_{\tau}^{2}\) and \(\sigma^2\). To do so, we use the analysis of variance method, which consists of equating the expected mean squares to their observed values.
\({\hat{\sigma}}^2=MS_E\ \mbox{and}\ {\hat{\sigma}}^2+n{\hat{\sigma}}^2_{\tau}=MS_{Treatments}\)
\({\hat{\sigma}}^2_{\tau}=\dfrac{MS_{Treatment}-MS_E}{n}\)
\({\hat{\sigma}}^2=MS_E\)
A potential problem that may arise here is that the estimated treatment variance component may be negative. In such a case, it is proposed either to set the estimate to zero or to use another estimation method that always results in a positive estimate. A negative estimate for the treatment variance component can also be viewed as evidence that the model is not appropriate, which suggests looking for a better one.
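A small sketch of these ANOVA (method-of-moments) estimators for a balanced one-way random effects design; the input format (a dictionary mapping each randomly selected level to its observations) is just one convenient choice.

```python
import numpy as np

def variance_components(groups):
    """ANOVA-method estimates of sigma^2 and sigma^2_tau for a balanced
    one-way random effects model; 'groups' maps level -> list of observations."""
    a = len(groups)
    n = len(next(iter(groups.values())))
    y = np.concatenate([np.asarray(v, float) for v in groups.values()])
    level_means = np.array([np.mean(v) for v in groups.values()])

    ms_treat = n * np.sum((level_means - y.mean()) ** 2) / (a - 1)
    ms_error = sum(np.sum((np.asarray(v, float) - np.mean(v)) ** 2)
                   for v in groups.values()) / (a * (n - 1))

    sigma2_hat = ms_error
    sigma2_tau_hat = max((ms_treat - ms_error) / n, 0.0)   # truncate a negative estimate at 0
    return sigma2_hat, sigma2_tau_hat
```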
Example 3.11 (13.1 in the 7th ed) discusses a single random factor case about the differences among looms in a textile weaving company. Four looms have been chosen randomly from a population of looms within a weaving shed and four observations were made on each loom. Table 13.1 illustrates the data obtained from the experiment. Here is the Minitab output for this example using Stat > ANOVA > Balanced ANOVA command.
The interpretation made from the ANOVA table is as before. With the p-value equal to 0.000 it is obvious that the looms in the plant are significantly different, or more accurately stated, the variance component among the looms is significantly larger than zero. Confidence intervals can also be found for the variance components. The \(100(1-\alpha)\%\) confidence interval for \(\sigma^2\) is
\(\dfrac{(N-a)MS_E}{\chi^2_{\alpha/2,N-a}} \leq \sigma^2 \leq \dfrac{(N-a)MS_E}{\chi^2_{1-\alpha/2,N-a}}\)
Confidence intervals for other variance components are provided in the textbook. It should be noted that a closed form expression for the confidence interval on some parameters may not be obtained.
3.6 - The General Linear Test
This is just a general representation of an F -test based on a full and a reduced model. We will use this frequently when we look at more complex models.
Let's illustrate the general linear test here for the single factor experiment:
First we write the full model, \(Y_{ij} = \mu + \tau_i + \epsilon_{ij}\) and then the reduced model, \(Y_{ij} = \mu + \epsilon_{ij}\) where you don't have a \(\tau_i\) term, you just have an overall mean, \(\mu\). This is a pretty degenerate model that just says all the observations are just coming from one group. But the reduced model is equivalent to what we are hypothesizing when we say the \(\mu_i\) would all be equal, i.e.:
\(H_0 \colon \mu_1 = \mu_2 = \dots = \mu_a\)
This is equivalent to our null hypothesis where the \(\tau_i\)'s are all equal to 0.
The reduced model is just another way of stating our hypothesis. But in more complex situations this is not the only reduced model that we can write, there are others we could look at.
The general linear test is stated as an F ratio:
\(F=\dfrac{(SSE(R)-SSE(F))/(dfR-dfF)}{SSE(F)/dfF}\)
This is a very general test. You can apply any full and reduced model and test whether or not the difference between the full and the reduced model is significant just by looking at the difference in the SSEs appropriately. The statistic has an F distribution with (dfR − dfF) and dfF degrees of freedom, which correspond to the numerator and denominator degrees of freedom of this F ratio.
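The general linear test is easy to code once the two error sums of squares and their degrees of freedom are in hand. The sketch below defines a small helper; the usage lines reproduce the full-versus-reduced comparisons worked out for the cotton example in the remainder of this section.

```python
from scipy import stats

def general_linear_test(sse_reduced, sse_full, df_reduced, df_full):
    """F statistic and p-value for comparing a reduced model nested within a full model."""
    F = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
    p = stats.f.sf(F, df_reduced - df_full, df_full)
    return F, p

# Single-mean (reduced) model vs. separate-means (full) model for the cotton data:
print(general_linear_test(636.96, 161.20, 24, 20))    # F = 14.76, the usual ANOVA F-test

# Lack of fit of the quadratic polynomial relative to the full ANOVA model:
print(general_linear_test(260.126, 161.20, 22, 20))   # F = 6.14
```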
Let's take a look at this general linear test using Minitab...
Example 3.5: Cotton Weight
Remember this experiment had treatment levels 15, 20, 25, 30, 35 % cotton weight and the observations were the tensile strength of the material.
The full model allows a different mean for each level of cotton weight %.
We can demonstrate the General Linear Test by viewing the ANOVA table from Minitab:
STAT > ANOVA > Balanced ANOVA
The \(SSE(R) = 636.96\) with a \(dfR = 24\), and \(SSE(F) = 161.20\) with \(dfF = 20\). Therefore:
\(F^\ast =\dfrac{(636.96-161.20)/(24-20)}{161.20/20}=\dfrac{118.94}{8.06}=14.76\)
This demonstrates the equivalence of this test to the F -test. We now use the General Linear Test (GLT) to test for Lack of Fit when fitting a series of polynomial regression models to determine the appropriate degree of polynomial.
We can demonstrate the General Linear Test by comparing the quadratic polynomial model (Reduced model), with the full ANOVA model (Full model). Let \(Y_{ij} = \mu + \beta_{1}x_{ij} + \beta_{2}x_{ij}^{2} + \epsilon_{ij}\) be the reduced model, where \(x_{ij}\) is the cotton weight percent. Let \(Y_{ij} = \mu + \tau_i + \epsilon_{ij}\) be the full model.
The General Linear Test - Cotton Weight Example (no sound)
The video above shows SSE(R) = 260.126 with dfR = 22 for the quadratic regression model. The ANOVA shows the full model with SSE(F) = 161.20 and dfF = 20.
Therefore the GLT is:
\(\begin{eqnarray} F^\ast &=&\dfrac{(SSE(R)-SSE(F))/(dfR-dfF)}{SSE(F)/dfF} \nonumber\\ &=&\dfrac{(260.126-161.200)/(22-20)}{161.20/20}\nonumber\\ &=&\dfrac{98.926/2}{8.06}\nonumber\\ &=&\dfrac{49.46}{8.06}\nonumber\\&=&6.14 \nonumber \end{eqnarray}\)
We reject \(H_0\colon\) Quadratic Model and claim there is Lack of Fit if \(F^{*} > F_{1-\alpha}(2, 20) = 3.49\).
Therefore, since 6.14 > 3.49, we reject the null hypothesis of no Lack of Fit from the quadratic equation and fit a cubic polynomial. From the viewlet above we noticed that the cubic term in the equation was indeed significant with p-value = 0.015.
We can apply the General Linear Test again, now testing whether the cubic equation is adequate. The reduced model is:
\(Y_{ij} = \mu + \beta_{1}x_{ij} + \beta_{2}x_{ij}^{2} + \beta_{3}x_{ij}^{3} + \epsilon_{ij}\)
and the full model is the same as before, the full ANOVA model:
\(Y_{ij} = \mu + \tau_i + \epsilon_{ij}\)
The General Linear Test is now a test for Lack of Fit from the cubic model:
\begin{aligned} F^{*} &=\frac{(\operatorname{SSE}(R)-\operatorname{SSE}(F)) /(d f R-d f F)}{\operatorname{SSE}(F) / d f F} \\ &=\frac{(195.146-161.200) /(21-20)}{161.20 / 20} \\ &=\frac{33.95 / 1}{8.06} \\ &=4.21 \end{aligned}
We reject if \(F^{*} > F_{0.95} (1, 20) = 4.35\).
Therefore, since 4.21 < 4.35, we do not reject \(H_0\) and conclude there is no significant Lack of Fit: the data are consistent with the cubic regression model, and higher order terms are not necessary.
Research Methods in Psychology
4. Single Factor Experiments
To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. —Ronald Fisher
4.1. Experiments in a Nutshell
The primary goal of experiments is to identify causal relationships between things in the world. Experiments do this by a systematic process of measuring how things behave under different conditions.
Fig. 4.1 Which switch turns on which light?
People conduct informal experiments all of the time. For example, when you walk into an unfamiliar room and want to turn on a light, what do you do? You find the light switch panel, and then you flip the switches until you find the one that turns a particular light on or off. This process is called trial and error, and involves trying things out until they work. The process of figuring out which switch causes a particular light to turn on or off is very similar to the process of conducting experiments. Let’s use the light-switch example to define some important terms, and then discuss the logic of running an experiment and making inferences about the data that is collected in an experiment.
Experiments begin with a question about a potential cause and effect relationship between two variables. For example, which switch on the panel caused light #1 to turn on? When you flick the switches and look at the lights, you are actually accomplishing three important parts of an experiment: manipulating the independent variable(s), measuring the dependent variable(s), and controlling extraneous variables.
The independent variable is the potential cause in our cause and effect relationship and is the variable that a researcher directly manipulates. The light switches are independent variables that you (the “researcher” in this scenario) can manipulate. For example, the first light switch can be up or down, the second light switch can be up or down, and so on.
The dependent variable is the potential effect in our cause and effect relationship and is measured by the researcher. Each light bulb is a dependent variable that we can measure. For example, we observe whether a light is on or off, or perhaps use a special photometer to measure the brightness of the light.
Fig. 4.2 Possible outcomes of an experiment asking if Switch #1 controls Light #1
Let’s look at an experiment asking if light switch #1 causes the first light to turn on or off. The experiment involves manipulating switch #1 by flipping it up or down, and then observing whether the light turns on or off. There are two simple outcomes. Possible outcome #1 is that the light stays off in both conditions. What inference can we make based off of this pattern of data? In most situations, people would be comfortable inferring that switch 1 does not cause Light #1 to turn on and off. Possible outcome #2 is that the light turns on when the switch is up, and turns off when the switch is down. What inference can we make based off of this pattern of data? In most situations, people would be comfortable inferring that switch #1 does cause Light #1 to turn on and off.
It would be nice if the process of figuring out what causes what were as simple as the light switch example, but even this example is not as simple as it seems. The biggest complication is the inference. We discussed two plausible inferences for outcomes 1 and 2. However, these inferences might be mistaken.
For outcome 1, when the light doesn’t turn on, what could be wrong about the inference that switch 1 does not control light #1? Perhaps that switch is wired up to control light #1, but the light is broken. Or perhaps the wire got disconnected. Or, perhaps the light did turn on, but you couldn’t see it because the light was very dim.
For outcome 2, when the light does turn on, what could be wrong about the inference that switch 1 does control light 1? Here, we at least know that the light works, so we can be confident in our measure of the dependent variable. But, how confident are we that our manipulation of the light switch was the only variable changing in our experiment? This depends on how well the experiment controls extraneous variables, or confounds. If you can guarantee that switch 1 going up and down was the only variable changing (i.e., there are no extraneous variables), then you can be confident of the inference that switch 1 caused light 1 to turn on and off.
Let’s consider an obviously problematic version of the experiment where you are not controlling extraneous variables. For example, imagine that every time you test switch #1, a friend is also testing a different switch. When you turn switch #1 up, your friend turns switch #2 up. When you turn switch #1 down, your friend turns switch #2 down. If the light turns on and off when you flip switch #1, what can you infer about the causal influence of switch 1 on the light? Well, switch #1 might influence the light. But it seems just as plausible that switch #1 does nothing and that switch #2 influences the light. The only way to conclusively infer that switch #1 controls the light is to eliminate the influence of other possible variables. So, you need to ask your friend to stop testing other switches while you focus on testing switch #1.
To summarize, experiments attempt to discover the cause and effect relationships between variables. Researchers systematically manipulate the independent variable and measure the dependent variable. Researchers then see if the dependent variable changes in a way that seems related to the manipulation of the independent variable. When we observe a systematic relationship between independent and dependent variables, we conclude that the independent variable causally influences the dependent variable. When we observe no relationship between independent and dependent variables, we conclude that the independent variable does not causally influence the dependent variable. However, these inferences are only valid when the experiment is designed properly, so that the researcher is confident that all extraneous variables are appropriately dealt with.
In psychological experiments, the goal of figuring out what causes what is rarely accomplished by a single experiment. Instead, our inferences about causal relationships are strengthened over many experiments that improve our ability to measure variables of interest, and to create well-controlled conditions where the independent variables are not confounded by extraneous influences.
4.2. An Example Psychology Experiment
In the late 1960s social psychologists John Darley and Bibb Latané proposed a counter-intuitive hypothesis: the more witnesses there are to an accident or crime, the less likely any of them is to help the victim [DL68] .
They also suggested the theory that this phenomenon occurs because each witness feels less responsible for helping - a process referred to as the “diffusion of responsibility”. Darley and Latané noted that their ideas were consistent with many real-world cases. For example, a New York woman named Catherine “Kitty” Genovese was assaulted and murdered while several witnesses evidently failed to help. But Darley and Latané also understood that such isolated cases did not provide convincing evidence for their hypothesized “bystander effect”. There was no way to know, for example, whether any of the witnesses to Kitty Genovese’s murder would have helped had there been fewer of them.
So to test their hypothesis, Darley and Latané created a simulated emergency situation in a laboratory. Each of their university student participants was isolated in a small room and told that he or she would be having a discussion about university life with other students via an intercom system. Early in the discussion, however, one of the students began having what seemed to be an epileptic seizure. Over the intercom came the following: “I could really-er-use some help so if somebody would-er-give me a little h-help-uh-er-er-er-er-er c-could somebody-er- er-help-er-uh-uh-uh (choking sounds)…I’m gonna die-er-er-I’m…gonna die-er-help-er-er-seizure-er- [chokes, then quiet]” [DL68] .
In actuality, there were no other students. These comments had been prerecorded and were played back to create the appearance of a real emergency. The key to the study was that some participants were told that the discussion involved only one other student (the victim), others were told that it involved two other students, and still others were told that it included five other students. Because this was the only difference between these three groups of participants, any difference in their tendency to help the victim would have to have been caused by it. And sure enough, the likelihood that the participant left the room to seek help for the “victim” decreased from 85% to 62% to 31% as the number of “witnesses” increased.
The story of Kitty Genovese has been told and retold in numerous psychology textbooks. The standard version is that there were 38 witnesses to the crime, that all of them watched (or listened) for an extended period of time, and that none of them did anything to help. However, recent scholarship suggests that the standard story is inaccurate in many ways [MLC07] . For example, only six eyewitnesses testified at the trial, none of them was aware that he or she was witnessing a lethal assault, and there have been several reports of witnesses calling the police or even coming to the aid of Kitty Genovese. Although the standard story inspired a long line of research on the bystander effect and the diffusion of responsibility, it may also have directed researchers’ and students’ attention away from other equally interesting and important issues in the psychology of helping - including the conditions in which people do in fact respond collectively to emergency situations.
The research that Darley and Latané conducted was a particular kind of study called an experiment. Experiments are used to determine not only whether there is a meaningful relationship between two variables but also whether the relationship is a causal one that is supported by statistical analysis. For this reason, experiments are one of the most common and useful tools in the psychological researcher’s toolbox. In this chapter, we look at experiments in detail. We will first consider what sets experiments apart from other kinds of studies and why they support causal conclusions while other kinds of studies do not. We then look at two basic ways of designing an experiment—between-subjects designs and within-subjects designs—and discuss their pros and cons. Finally, we consider several important practical issues that arise when conducting experiments.
4.3. More Experimental Basics

4.3.1. Learning Objectives
Explain what an experiment is and recognize examples of studies that are experiments and studies that are not experiments.
Explain what internal validity is and why experiments are considered to be high in internal validity.
Explain what external validity is and evaluate studies in terms of their external validity.
Distinguish between the manipulation of the independent variable and control of extraneous variables and explain the importance of each.
Recognize examples of confounding variables and explain how they affect the internal validity of a study.
4.3.2. What Is an Experiment?
As we saw earlier in the book, an experiment is a type of study designed specifically to answer the question of whether there is a causal relationship between two variables. In other words, whether changes in an independent variable cause changes in a dependent variable. Experiments have two fundamental features. The first is that the researchers manipulate, or systematically vary, the level of the independent variable. The different levels of the independent variable are called conditions. For example, in Darley and Latané’s experiment, the independent variable was the number of witnesses that participants believed to be present. The researchers manipulated this independent variable by telling participants that there were either one, two, or five other students involved in the discussion, thereby creating three conditions. For a new researcher, it is easy to confuse these terms by believing there are three independent variables in this situation: one, two, or five students involved in the discussion, but there is actually only one independent variable (number of witnesses) with three different conditions (one, two or five students). The second fundamental feature of an experiment is that the researcher controls, or minimizes the variability in, variables other than the independent and dependent variable. These other variables are called extraneous variables. Darley and Latané tested all their participants in the same room, exposed them to the same emergency situation, and so on. They also randomly assigned their participants to conditions so that the three groups would be similar to each other to begin with. Notice that although the words manipulation and control have similar meanings in everyday language, researchers make a clear distinction between them. They manipulate the independent variable by systematically changing its levels and control other variables by holding them constant.
4.3.3. Four Big Validities ¶
When we read about psychology experiments with a critical view, one question to ask is “is this study valid?” However, that question is not as straightforward as it seems because in psychology, there are many different kinds of validities. Researchers have focused on four validities to help assess whether an experiment is sound [KJ81] [Mor14] : internal validity, external validity, construct validity, and statistical validity. We will explore each validity in depth.
4.3.4. Internal Validity ¶
Recall that two variables being statistically related does not necessarily mean that one causes the other. “Correlation does not imply causation.” For example, if it were the case that people who exercise regularly are happier than people who do not exercise regularly, this implication would not necessarily mean that exercising increases people’s happiness. It could mean instead that greater happiness causes people to exercise (the directionality problem) or that something like better physical health causes people to exercise and be happier (the third-variable problem).
The purpose of an experiment, however, is to show that two variables are statistically related and to do so in a way that supports the conclusion that the independent variable caused any observed differences in the dependent variable. The logic is based on this assumption: if the researcher creates two or more highly similar conditions and then manipulates the independent variable to produce just one difference between them, then any later difference between the conditions must have been caused by the independent variable. For example, because the only difference between Darley and Latané’s conditions was the number of students that participants believed to be involved in the discussion, this difference in belief must have been responsible for differences in helping between the conditions.
An empirical study is said to be high in internal validity if the way it was conducted supports the conclusion that the independent variable caused any observed differences in the dependent variable. Thus experiments are high in internal validity because the way they are conducted—with the manipulation of the independent variable and the control of extraneous variables—provides strong support for causal conclusions.
4.3.5. External Validity ¶
At the same time, the way that experiments are conducted sometimes leads to a different kind of criticism. Specifically, the need to manipulate the independent variable and control extraneous variables means that experiments are often conducted under conditions that seem artificial [BMBW14] . In many psychology experiments, the participants are all undergraduate students and come to a classroom or laboratory to fill out a series of paper-and-pencil questionnaires or to perform a carefully designed computerized task. Consider, for example, an experiment in which researcher Barbara Fredrickson and her colleagues had undergraduate students come to a laboratory on campus and complete a math test while wearing a swimsuit [FRN+98] . At first, this manipulation might seem silly. When will undergraduate students ever have to complete math tests in their swimsuits outside of this experiment?
The issue we are confronting is that of external validity. An empirical study is high in external validity if the way it was conducted supports generalizing the results to people and situations beyond those actually studied. As a general rule, studies are higher in external validity when the participants and the situation studied are similar to those that the researchers want to generalize to and encounter every day, a quality often described as mundane realism. Imagine, for example, that a group of researchers is interested in how shoppers in large grocery stores are affected by whether breakfast cereal is packaged in yellow or purple boxes. Their study would be high in external validity and have high mundane realism if they studied the decisions of ordinary people doing their weekly shopping in a real grocery store. If the shoppers bought much more cereal in purple boxes, the researchers would be fairly confident that this increase would be true for other shoppers in other stores. Their study would be relatively low in external validity, however, if they studied a sample of undergraduate students in a laboratory at a selective university who merely judged the appeal of various colors presented on a computer screen. That version of the study would still have high psychological realism, because the same mental process is used in both the laboratory and in the real world. If the students judged purple to be more appealing than yellow, the researchers would not be very confident that this preference is relevant to grocery shoppers’ cereal-buying decisions because of the low external validity, but they could be confident that the visual processing of colors has high psychological realism.
We should be careful, however, not to draw the blanket conclusion that experiments are low in external validity. One reason is that experiments need not seem artificial. Consider that Darley and Latané’s experiment provided a reasonably good simulation of a real emergency situation. Or consider field experiments that are conducted entirely outside the laboratory. In one such experiment, Robert Cialdini and his colleagues studied whether hotel guests choose to reuse their towels for a second day as opposed to having them washed as a way of conserving water and energy [Cia05] . These researchers manipulated the message on a card left in a large sample of hotel rooms. One version of the message emphasized showing respect for the environment, another emphasized that the hotel would donate a portion of their savings to an environmental cause, and a third emphasized that most hotel guests choose to reuse their towels. The result was that guests who received the message that most hotel guests choose to reuse their towels reused their own towels substantially more often than guests receiving either of the other two messages. Given the way they conducted their study, it seems very likely that their result would hold true for other guests in other hotels.
A second reason not to draw the blanket conclusion that experiments are low in external validity is that they are often conducted to learn about psychological processes that are likely to operate in a variety of people and situations. Let us return to the experiment by Fredrickson and colleagues. They found that the women in their study, but not the men, performed worse on the math test when they were wearing swimsuits. They argued that this gender difference was due to women’s greater tendency to objectify themselves—to think about themselves from the perspective of an outside observer—which diverts their attention away from other tasks. They argued, furthermore, that this process of self-objectification and its effect on attention is likely to operate in a variety of women and situations—even if none of them ever finds herself taking a math test in her swimsuit.
4.3.6. Construct Validity ¶
In addition to the generalizability of the results of an experiment, another element to scrutinize in a study is the quality of the experiment’s manipulations, or the construct validity. The research question that Darley and Latané started with is “does helping behavior become diffused?” They hypothesized that participants in a lab would be less likely to help when they believed there were more potential helpers besides themselves. This conversion from research question to experiment design is called operationalization (see Chapter 2 for more information about the operational definition). Darley and Latané operationalized the independent variable of diffusion of responsibility by increasing the number of potential helpers. In evaluating this design, we would say that the construct validity was very high because the experiment’s manipulations very clearly speak to the research question: there was a crisis, a way for the participant to help, and, by increasing the number of other students involved in the discussion, the researchers provided a way to test diffusion.
What if the number of conditions in Darley and Latané’s study changed? Consider if there were only two conditions: one student involved in the discussion or two. Even though we may see a decrease in helping by adding another person, it may not be a clear demonstration of diffusion of responsibility, merely the presence of others. We might think it was a form of Bandura’s social inhibition (discussed in Chapter 4). The construct validity would be lower. However, had there been five conditions, perhaps we would see the decrease continue with more people in the discussion or perhaps it would plateau after a certain number of people. In that situation, we may not necessarily be learning more about diffusion of responsibility, or it may become a different phenomenon. Adding more conditions does not necessarily raise the construct validity. When designing your own experiment, consider how well your study operationalizes the research question.
4.3.7. Statistical Validity ¶
A common critique of experiments is that a study did not have enough participants. The main reason for this criticism is that it is difficult to generalize about a population from a small sample. At the outset, it seems as though this critique is about external validity but there are studies where small sample sizes are not a problem (Chapter 11 will discuss how small samples, even of only 1 person, are still very illuminating for psychology research). Therefore, small sample sizes are actually a critique of statistical validity. The statistical validity speaks to whether the statistics conducted in the study support the conclusions that are made.
Proper statistical analysis should be conducted on the data to determine whether the difference or relationship that was predicted was found. The expected size of the effect, the number of conditions, and the total number of participants together determine the statistical power of the study. With this information, a power analysis can be conducted to ascertain whether you are likely to detect a real difference if one exists; a simple sketch of such an analysis appears below. When designing a study, it is best to think about the power analysis so that the appropriate number of participants can be recruited and tested (more on effect sizes in Chapter 12). To design a statistically valid experiment, thinking about the statistical tests at the beginning of the design will help ensure the results can be believed.
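Below is a minimal sketch of such an a priori power analysis in Python, using the statsmodels library. The effect size (Cohen's d), alpha level, and target power are illustrative assumptions rather than values from any study discussed in this chapter.

```python
# Hedged sketch: solve for the sample size per group needed to detect an
# assumed medium effect in a two-condition between-subjects experiment.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,  # assumed standardized effect size (Cohen's d)
    alpha=0.05,       # conventional Type I error rate
    power=0.80,       # desired probability of detecting a true effect
)
print(f"Participants needed per condition: about {n_per_group:.0f}")
```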
4.3.8. Prioritizing Validities ¶
These four big validities–internal, external, construct, and statistical–are useful to keep in mind when both reading about other experiments and designing your own. However, researchers must prioritize and often it is not possible to have high validity in all four areas. In Cialdini’s study on towel usage in hotels, the external validity was high but the statistical validity was more modest. This discrepancy does not invalidate the study but it shows where there may be room for improvement for future follow-up studies [GCG08] . Morling [Mor14] points out that most psychology studies have high internal and construct validity but sometimes sacrifice external validity.
4.3.9. Manipulation of the Independent Variable ¶
Again, to manipulate an independent variable means to change its level systematically so that different groups of participants are exposed to different levels of that variable, or the same group of participants is exposed to different levels at different times. For example, to see whether expressive writing affects people’s health, a researcher might instruct some participants to write about traumatic experiences and others to write about neutral experiences. As discussed earlier in this chapter, the different levels of the independent variable are referred to as conditions, and researchers often give the conditions short descriptive names to make it easy to talk and write about them. In this case, the conditions might be called the “traumatic condition” and the “neutral condition.”
Notice that the manipulation of an independent variable must involve the active intervention of the researcher. Comparing groups of people who differ on the independent variable before the study begins is not the same as manipulating that variable. For example, a researcher who compares the health of people who already keep a journal with the health of people who do not keep a journal has not manipulated this variable and therefore not conducted an experiment. This distinction is important because groups that already differ in one way at the beginning of a study are likely to differ in other ways too. For example, people who choose to keep journals might also be more conscientious, more introverted, or less stressed than people who do not. Therefore, any observed difference between the two groups in terms of their health might have been caused by whether or not they keep a journal, or it might have been caused by any of the other differences between people who do and do not keep journals. Thus the active manipulation of the independent variable is crucial for eliminating the third-variable problem.
Of course, there are many situations in which the independent variable cannot be manipulated for practical or ethical reasons and therefore an experiment is not possible. For example, whether or not people have a significant early illness experience cannot be manipulated, making it impossible to conduct an experiment on the effect of early illness experiences on the development of hypochondriasis. This caveat does not mean it is impossible to study the relationship between early illness experiences and hypochondriasis—only that it must be done using nonexperimental approaches. We will discuss this type of methodology in detail later in the book.
In many experiments, the independent variable is a construct that can only be manipulated indirectly. For example, a researcher might try to manipulate participants’ stress levels indirectly by telling some of them that they have five minutes to prepare a short speech that they will then have to give to an audience of other participants. In such situations, researchers often include a manipulation check in their procedure. A manipulation check is a separate measure of the construct the researcher is trying to manipulate. For example, researchers trying to manipulate participants’ stress levels might give them a paper-and-pencil stress questionnaire or take their blood pressure—perhaps right after the manipulation or at the end of the procedure—to verify that they successfully manipulated this variable.
4.3.10. Control of Extraneous Variables ¶
As we have seen previously in the chapter, an extraneous variable is anything that varies in the context of a study other than the independent and dependent variables. In an experiment on the effect of expressive writing on health, for example, extraneous variables would include participant variables (individual differences) such as their writing ability, their diet, and their shoe size. They would also include situational or task variables such as the time of day when participants write, whether they write by hand or on a computer, and the weather. Extraneous variables pose a problem because many of them are likely to have some effect on the dependent variable. For example, participants’ health will be affected by many things other than whether or not they engage in expressive writing. This influencing factor can make it difficult to separate the effect of the independent variable from the effects of the extraneous variables, which is why it is important to control extraneous variables by holding them constant.
4.3.11. Extraneous Variables as “Noise” ¶
Extraneous variables make it difficult to detect the effect of the independent variable in two ways. One is by adding variability or “noise” to the data. Imagine a simple experiment on the effect of mood (happy vs. sad) on the number of happy childhood events people are able to recall. Participants are put into a positive or negative mood (by showing them a happy or sad video clip) and then asked to recall as many happy childhood events as they can. The two leftmost columns of Fig. 4.3 show what the data might look like if there were no extraneous variables and the number of happy childhood events participants recalled was affected only by their moods. Every participant in the happy mood condition recalled exactly four happy childhood events, and every participant in the sad mood condition recalled exactly three. The effect of mood here is quite obvious. In reality, however, the data would probably look more like those in the two rightmost columns of Fig. 4.3. Even in the happy mood condition, some participants would recall fewer happy memories because they have fewer to draw on, use less effective recall strategies, or are less motivated. And even in the sad mood condition, some participants would recall more happy childhood memories because they have more happy memories to draw on, they use more effective recall strategies, or they are more motivated. Although the mean difference between the two groups is the same as in the idealized data, this difference is much less obvious in the context of the greater variability in the data. Thus one reason researchers try to control extraneous variables is so their data look more like the idealized data in Fig. 4.3, which makes the effect of the independent variable easier to detect (although real data never look quite that good).
Fig. 4.3 Hypothetical Noiseless Data and Realistic Noisy Data ¶
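The contrast in Fig. 4.3 can be illustrated with a small simulation. In the sketch below the numbers are made up for the example: the same one-point difference between condition means is exact in the noiseless data but much less obvious once random participant-to-participant variability is added.

```python
# Hypothetical sketch of noiseless vs. noisy data for a mood-and-memory study.
import random
from statistics import mean

random.seed(1)

happy_ideal = [4] * 10   # every happy-mood participant recalls exactly 4 events
sad_ideal = [3] * 10     # every sad-mood participant recalls exactly 3 events

# Realistic data: same underlying means, plus participant-level "noise".
happy_noisy = [4 + random.randint(-2, 2) for _ in range(10)]
sad_noisy = [3 + random.randint(-2, 2) for _ in range(10)]

print(mean(happy_ideal) - mean(sad_ideal))   # exactly 1.0
print(mean(happy_noisy) - mean(sad_noisy))   # near 1, but obscured by the spread
```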
One way to control extraneous variables is to hold them constant. This technique can mean holding situation or task variables constant by testing all participants in the same location, giving them identical instructions, treating them in the same way, and so on. It can also mean holding participant variables constant. For example, many studies of language limit participants to right-handed people, who generally have their language areas isolated in their left cerebral hemispheres. Left-handed people are more likely to have their language areas isolated in their right cerebral hemispheres or distributed across both hemispheres, which can change the way they process language and thereby add noise to the data.
In principle, researchers can control extraneous variables by limiting participants to one very specific category of person, such as 20-year-old, heterosexual, female, right-handed psychology majors. The obvious downside to this approach is that it would lower the external validity of the study—in particular, the extent to which the results can be generalized beyond the people actually studied. For example, it might be unclear whether results obtained with a sample of younger heterosexual women would apply to older homosexual men. In many situations, the advantages of a diverse sample outweigh the reduction in noise achieved by a homogeneous one.
4.3.12. Extraneous Variables as Confounding Variables ¶
The second way that extraneous variables can make it difficult to detect the effect of the independent variable is by becoming confounding variables. A confounding variable is an extraneous variable that differs on average across levels of the independent variable. For example, in almost all experiments, participants’ intelligence quotients (IQs) will be an extraneous variable. But as long as there are participants with lower and higher IQs at each level of the independent variable so that the average IQ is roughly equal, then this variation is probably acceptable (and may even be desirable). What would be bad, however, would be for participants at one level of the independent variable to have substantially lower IQs on average and participants at another level to have substantially higher IQs on average. In this case, IQ would be a confounding variable.
Fig. 4.4 Hypothetical Results From a Study on the Effect of Mood on Memory. Because IQ also differs across conditions, it is a confounding variable. ¶
To confound means to confuse, and this effect is exactly why confounding variables are undesirable. Because they differ across conditions—just like the independent variable—they provide an alternative explanation for any observed difference in the dependent variable. Figure 4.4 shows the results of a hypothetical study, in which participants in a positive mood condition scored higher on a memory task than participants in a negative mood condition. But if IQ is a confounding variable—with participants in the positive mood condition having higher IQs on average than participants in the negative mood condition—then it is unclear whether it was the positive moods or the higher IQs that caused participants in the first condition to score higher. One way to avoid confounding variables is by holding extraneous variables constant. For example, one could prevent IQ from becoming a confounding variable by limiting participants only to those with IQs of exactly 100. But this approach is not always desirable for reasons we have already discussed. A second and much more general approach—random assignment to conditions—will be discussed in detail shortly.
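When participant-level measures are available, one simple check (purely illustrative here) is to compare the mean of a suspected extraneous variable across conditions; roughly equal means suggest it is ordinary noise, while a large gap flags a potential confound. The IQ values and condition names below are hypothetical.

```python
# Hypothetical data: mean IQ by mood condition.
from statistics import mean

iq_by_condition = {
    "positive_mood": [98, 104, 110, 95, 102],
    "negative_mood": [101, 97, 105, 99, 103],
}

for condition, iqs in iq_by_condition.items():
    print(condition, round(mean(iqs), 1))
# Similar means across conditions suggest IQ is not acting as a confound here.
```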
4.3.13. Key Takeaways ¶
An experiment is a type of empirical study that features the manipulation of an independent variable, the measurement of a dependent variable, and control of extraneous variables.
Studies are high in internal validity to the extent that the way they are conducted supports the conclusion that the independent variable caused any observed differences in the dependent variable. Experiments are generally high in internal validity because of the manipulation of the independent variable and control of extraneous variables.
Studies are high in external validity to the extent that the result can be generalized to people and situations beyond those actually studied. Although experiments can seem “artificial”—and low in external validity—it is important to consider whether the psychological processes under study are likely to operate in other people and situations.
4.3.14. Exercises ¶
Practice: List five variables that can be manipulated by the researcher in an experiment. List five variables that cannot be manipulated by the researcher in an experiment.
Practice: For each of the following topics, decide whether that topic could be studied using an experimental research design and explain why or why not.
Effect of parietal lobe damage on people’s ability to do basic arithmetic.
Effect of being clinically depressed on the number of close friendships people have.
Effect of group training on the social skills of teenagers with Asperger’s syndrome.
Effect of paying people to take an IQ test on their performance on that test.
4.4. Experimental Design ¶
4.4.1. Learning Objectives ¶
Explain the difference between between-subjects and within-subjects experiments, list some of the pros and cons of each approach, and decide which approach to use to answer a particular research question.
Define random assignment, distinguish it from random sampling, explain its purpose in experimental research, and use some simple strategies to implement it.
Define what a control condition is, explain its purpose in research on treatment effectiveness, and describe some alternative types of control conditions.
Define several types of carryover effect, give examples of each, and explain how counterbalancing helps to deal with them.
In this section, we look at some different ways to design an experiment. The primary distinction we will make is between approaches in which each participant experiences one level of the independent variable and approaches in which each participant experiences all levels of the independent variable. The former are called between-subjects experiments and the latter are called within-subjects experiments.
4.4.2. Between-Subjects Experiments ¶
In a between-subjects experiment, each participant is tested in only one condition. For example, a researcher with a sample of 100 university students might assign half of them to write about a traumatic event and the other half to write about a neutral event. Or a researcher with a sample of 60 people with severe agoraphobia (fear of open spaces) might assign 20 of them to receive each of three different treatments for that disorder. It is essential in a between-subjects experiment that the researcher assign participants to conditions so that the different groups are, on average, highly similar to each other. Those in a trauma condition and a neutral condition, for example, should include a similar proportion of men and women, and they should have similar average intelligence quotients (IQs), similar average levels of motivation, similar average numbers of health problems, and so on. This matching is a matter of controlling these extraneous participant variables across conditions so that they do not become confounding variables.
4.4.3. Random Assignment ¶
The primary way that researchers accomplish this kind of control of extraneous variables across conditions is called random assignment, which means using a random process to decide which participants are tested in which conditions. Do not confuse random assignment with random sampling. Random sampling is a method for selecting a sample from a population, and it is rarely used in psychological research. Random assignment is a method for assigning participants in a sample to the different conditions, and it is an important element of all experimental research in psychology and other fields too.
In its strictest sense, random assignment should meet two criteria. One is that each participant has an equal chance of being assigned to each condition (e.g., a 50% chance of being assigned to each of two conditions). The second is that each participant is assigned to a condition independently of other participants. Thus one way to assign participants to two conditions would be to flip a coin for each one. If the coin lands heads, the participant is assigned to Condition A, and if it lands tails, the participant is assigned to Condition B. For three conditions, one could use a computer to generate a random integer from 1 to 3 for each participant. If the integer is 1, the participant is assigned to Condition A; if it is 2, the participant is assigned to Condition B; and if it is 3, the participant is assigned to Condition C. In practice, a full sequence of conditions—one for each participant expected to be in the experiment—is usually created ahead of time, and each new participant is assigned to the next condition in the sequence as he or she is tested. When the procedure is computerized, the computer program often handles the random assignment.
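As a rough illustration of strict random assignment, the sketch below assigns each of twelve hypothetical participants to one of three conditions independently and with equal probability, generating the full sequence before testing begins; the condition labels and sample size are assumptions for the example.

```python
# Strict random assignment: independent, equal-probability draws per participant.
import random

conditions = ["A", "B", "C"]
n_participants = 12
sequence = [random.choice(conditions) for _ in range(n_participants)]
print(sequence)  # group sizes will usually come out unequal
```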
One problem with coin flipping and other strict procedures for random assignment is that they are likely to result in unequal sample sizes in the different conditions. Unequal sample sizes are generally not a serious problem, and you should never throw away data you have already collected to achieve equal sample sizes. However, for a fixed number of participants, it is statistically most efficient to divide them into equal-sized groups. It is standard practice, therefore, to use a kind of modified random assignment that keeps the number of participants in each group as similar as possible. One approach is block randomization. In block randomization, all the conditions occur once in the sequence before any of them is repeated. Then they all occur again before any of them is repeated again. Within each of these “blocks,” the conditions occur in a random order. Again, the sequence of conditions is usually generated before any participants are tested, and each new participant is assigned to the next condition in the sequence. Fig. 4.5 shows such a sequence for assigning nine participants to three conditions. The Research Randomizer website http://www.randomizer.org will generate block randomization sequences for any number of participants and conditions. Again, when the procedure is computerized, the computer program often handles the block randomization.
Fig. 4.5 Block Randomization Sequence for Assigning Nine Participants to Three Conditions ¶
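A sequence like the one in Fig. 4.5 can also be generated programmatically. The sketch below is one way to do it; the condition labels and number of participants are illustrative.

```python
# Block randomization: every condition appears once, in random order, within
# each block, which keeps group sizes as equal as possible.
import random

def block_randomize(conditions, n_participants):
    sequence = []
    while len(sequence) < n_participants:
        block = conditions[:]
        random.shuffle(block)   # random order within this block
        sequence.extend(block)
    return sequence[:n_participants]

print(block_randomize(["A", "B", "C"], 9))
# e.g. ['B', 'A', 'C', 'C', 'A', 'B', 'A', 'C', 'B']
```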
Random assignment is not guaranteed to control all extraneous variables across conditions. It is always possible that just by chance, the participants in one condition might turn out to be substantially older, less tired, more motivated, or less depressed on average than the participants in another condition. However, there are some reasons that this possibility is not a major concern. One is that random assignment works better than one might expect, especially for large samples. Another is that the inferential statistics that researchers use to decide whether a difference between groups reflects a difference in the population take the “fallibility” of random assignment into account. Yet another reason is that even if random assignment does result in a confounding variable and therefore produces misleading results, this confound is likely to be detected when the experiment is replicated. The upshot is that random assignment to conditions—although not infallible in terms of controlling extraneous variables—is always considered a strength of a research design.
4.4.4. Treatment and Control Conditions ¶
Between-subjects experiments are often used to determine whether a treatment works. In psychological research, a treatment is any intervention meant to change people’s behavior for the better. This intervention includes psychotherapies and medical treatments for psychological disorders but also interventions designed to improve learning, promote conservation, reduce prejudice, and so on. To determine whether a treatment works, participants are randomly assigned to either a treatment condition, in which they receive the treatment, or a control condition, in which they do not receive the treatment. If participants in the treatment condition end up better off than participants in the control condition—for example, they are less depressed, learn faster, conserve more, express less prejudice—then the researcher can conclude that the treatment works. In research on the effectiveness of psychotherapies and medical treatments, this type of experiment is often called a randomized clinical trial.
There are different types of control conditions. In a no-treatment control condition, participants receive no treatment whatsoever. One problem with this approach, however, is the existence of placebo effects. A placebo is a simulated treatment that lacks any active ingredient or element that should make it effective, and a placebo effect is a positive effect of such a treatment. Many folk remedies that seem to work—such as eating chicken soup for a cold or placing soap under the bed sheets to stop nighttime leg cramps—are probably nothing more than placebos. Although placebo effects are not well understood, they are probably driven primarily by people’s expectations that they will improve. Having the expectation to improve can result in reduced stress, anxiety, and depression, which can alter perceptions and even improve immune system functioning [PFB08] .
Placebo effects are interesting in their own right (see Note “The Powerful Placebo”), but they also pose a serious problem for researchers who want to determine whether a treatment works. Figure 4.6 shows some hypothetical results in which participants in a treatment condition improved more on average than participants in a no-treatment control condition. If these conditions (the two leftmost bars in Figure 4.6 ) were the only conditions in this experiment, however, one could not conclude that the treatment worked. It could be instead that participants in the treatment group improved more because they expected to improve, while those in the no-treatment control condition did not.
Fortunately, there are several solutions to this problem. One is to include a placebo control condition, in which participants receive a placebo that looks much like the treatment but lacks the active ingredient or element thought to be responsible for the treatment’s effectiveness. When participants in a treatment condition take a pill, for example, then those in a placebo control condition would take an identical-looking pill that lacks the active ingredient in the treatment (a “sugar pill”). In research on psychotherapy effectiveness, the placebo might involve going to a psychotherapist and talking in an unstructured way about one’s problems. The idea is that if participants in both the treatment and the placebo control groups expect to improve, then any improvement in the treatment group over and above that in the placebo control group must have been caused by the treatment and not by participants’ expectations. This difference is what is shown by a comparison of the two outer bars in Figure 4.6 .
Fig. 4.6 Hypothetical Results From a Study Including Treatment, No-Treatment, and Placebo Conditions ¶
Of course, the principle of informed consent requires that participants be told that they will be assigned to either a treatment or a placebo control condition—even though they cannot be told which until the experiment ends. In many cases the participants who had been in the control condition are then offered an opportunity to have the real treatment. An alternative approach is to use a wait-list control condition, in which participants are told that they will receive the treatment but must wait until the participants in the treatment condition have already received it. This disclosure allows researchers to compare participants who have received the treatment with participants who are not currently receiving it but who still expect to improve (eventually). A final solution to the problem of placebo effects is to leave out the control condition completely and compare any new treatment with the best available alternative treatment. For example, a new treatment for simple phobia could be compared with standard exposure therapy. Because participants in both conditions receive a treatment, their expectations about improvement should be similar. This approach also makes sense because once there is an effective treatment, the interesting question about a new treatment is not simply “Does it work?” but “Does it work better than what is already available?”
Many people are not surprised that placebos can have a positive effect on disorders that seem fundamentally psychological, including depression, anxiety, and insomnia. However, placebos can also have a positive effect on disorders that most people think of as fundamentally physiological. These include asthma, ulcers, and warts [SS00] . There is even evidence that placebo surgery—also called “sham surgery”—can be as effective as actual surgery.
Medical researcher J. Bruce Moseley and his colleagues conducted a study on the effectiveness of two arthroscopic surgery procedures for osteoarthritis of the knee [MOmalleyP+02] . The control participants in this study were prepped for surgery, received a tranquilizer, and even received three small incisions in their knees. But they did not receive the actual arthroscopic surgical procedure. The surprising result was that all participants improved in terms of both knee pain and function, and the sham surgery group improved just as much as the treatment groups. According to the researchers, “This study provides strong evidence that arthroscopic lavage with or without débridement [the surgical procedures used] is not better than and appears to be equivalent to a placebo procedure in improving knee pain and self-reported function” (p. 85).
4.4.5. Within-Subjects Experiments ¶
In a within-subjects experiment, each participant is tested under all conditions. Consider an experiment on the effect of a defendant’s physical attractiveness on judgments of his guilt. Again, in a between-subjects experiment, one group of participants would be shown an attractive defendant and asked to judge his guilt, and another group of participants would be shown an unattractive defendant and asked to judge his guilt. In a within-subjects experiment, however, the same group of participants would judge the guilt of both an attractive and an unattractive defendant.
The primary advantage of this approach is that it provides maximum control of extraneous participant variables. Participants in all conditions have the same mean IQ, same socioeconomic status, same number of siblings, and so on—because they are the very same people. Within-subjects experiments also make it possible to use statistical procedures that remove the effect of these extraneous participant variables on the dependent variable and therefore make the data less “noisy” and the effect of the independent variable easier to detect. We will look more closely at this idea later in the book. However, not all experiments can use a within-subjects design, nor would it always be desirable to do so.
4.4.6. Carryover Effects and Counterbalancing ¶
The primary disadvantage of within-subjects designs is that they can result in carryover effects. A carryover effect is an effect of being tested in one condition on participants’ behavior in later conditions. One type of carryover effect is a practice effect, where participants perform a task better in later conditions because they have had a chance to practice it. Another type is a fatigue effect, where participants perform a task worse in later conditions because they become tired or bored. Being tested in one condition can also change how participants perceive stimuli or interpret their task in later conditions. This type of effect is called a context effect. For example, an average-looking defendant might be judged more harshly when participants have just judged an attractive defendant than when they have just judged an unattractive defendant. Within-subjects experiments also make it easier for participants to guess the hypothesis. For example, a participant who is asked to judge the guilt of an attractive defendant and then is asked to judge the guilt of an unattractive defendant is likely to guess that the hypothesis is that defendant attractiveness affects judgments of guilt. This knowledge could lead the participant to judge the unattractive defendant more harshly because he thinks this is what he is expected to do. Or it could make participants judge the two defendants similarly in an effort to be “fair.”
Carryover effects can be interesting in their own right. (Does the attractiveness of one person depend on the attractiveness of other people that we have seen recently?) But when they are not the focus of the research, carryover effects can be problematic. Imagine, for example, that participants judge the guilt of an attractive defendant and then judge the guilt of an unattractive defendant. If they judge the unattractive defendant more harshly, this might be because of his unattractiveness. But it could be instead that they judge him more harshly because they are becoming bored or tired. In other words, the order of the conditions is a confounding variable. The attractive condition is always the first condition and the unattractive condition the second. Thus any difference between the conditions in terms of the dependent variable could be caused by the order of the conditions and not the independent variable itself.
There is a solution to the problem of order effects, however, that can be used in many situations. It is counterbalancing, which means testing different participants in different orders. For example, some participants would be tested in the attractive defendant condition followed by the unattractive defendant condition, and others would be tested in the unattractive condition followed by the attractive condition. With three conditions, there would be six different orders (ABC, ACB, BAC, BCA, CAB, and CBA), so some participants would be tested in each of the six orders. With counterbalancing, participants are assigned to orders randomly, using the techniques we have already discussed. Thus random assignment plays an important role in within-subjects designs just as in between-subjects designs. Here, instead of being randomly assigned to conditions, participants are randomly assigned to different orders of conditions. In fact, it can safely be said that if a study does not involve random assignment in one form or another, it is not an experiment.
An efficient way of counterbalancing is through a Latin square design, which balances order by using an equal number of rows and columns. For example, if you have four treatments, you need four different orders. Like a Sudoku puzzle, no treatment can repeat in a row or column. For four orders of four treatments, the Latin square design would look like Fig. 4.7 below.
Fig. 4.7 Latin Square for four variables ¶
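A square like the one in Fig. 4.7 can be built by rotating the list of conditions one position per row, as in the sketch below. This yields a systematic (cyclic) Latin square; fully randomized Latin squares additionally shuffle the rows, columns, and condition labels.

```python
# Cyclic Latin square: each condition appears once in every row (the order
# given to one group of participants) and once in every column (serial position).
conditions = ["A", "B", "C", "D"]
n = len(conditions)
latin_square = [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]
for row in latin_square:
    print(row)
# ['A', 'B', 'C', 'D']
# ['B', 'C', 'D', 'A']
# ['C', 'D', 'A', 'B']
# ['D', 'A', 'B', 'C']
```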
There are two ways to think about what counterbalancing accomplishes. One is that it controls the order of conditions so that it is no longer a confounding variable. Instead of the attractive condition always being first and the unattractive condition always being second, the attractive condition comes first for some participants and second for others. Likewise, the unattractive condition comes first for some participants and second for others. Thus any overall difference in the dependent variable between the two conditions cannot have been caused by the order of conditions. A second way to think about what counterbalancing accomplishes is that if there are carryover effects, it makes it possible to detect them. One can analyze the data separately for each order to see whether it had an effect.
Researcher Michael Birnbaum has argued that the lack of context provided by between-subjects designs is often a bigger problem than the context effects created by within-subjects designs. To demonstrate this problem, he asked participants to rate how large a number was on a scale of 1-to-10, where 1 was “very very small” and 10 was “very very large”. One group of participants was asked to rate the number 9 and another group was asked to rate the number 221 [Bir99] . Participants in this between-subjects design gave the number 9 a mean rating of 5.13 and the number 221 a mean rating of 3.10. In other words, they rated 9 as larger than 221! According to Birnbaum, this difference is because participants spontaneously compared 9 with other one-digit numbers (in which case it is relatively large) and compared 221 with other three-digit numbers (in which case it is relatively small).
4.4.7. Simultaneous Within-Subjects Designs ¶
So far, we have discussed an approach to within-subjects designs in which participants are tested in one condition at a time. There is another approach, however, that is often used when participants make multiple responses in each condition. Imagine, for example, that participants judge the guilt of 10 attractive defendants and 10 unattractive defendants. Instead of having people make judgments about all 10 defendants of one type followed by all 10 defendants of the other type, the researcher could present all 20 defendants in a sequence that mixed the two types. The researcher could then compute each participant’s mean rating for each type of defendant. Or imagine an experiment designed to see whether people with social anxiety disorder remember negative adjectives (e.g., “stupid,” “incompetent”) better than positive ones (e.g., “happy,” “productive”). The researcher could have participants study a single list that includes both kinds of words and then have them try to recall as many words as possible. The researcher could then count the number of each type of word that was recalled. There are many ways to determine the order in which the stimuli are presented, but one common way is to generate a different random order for each participant.
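As an illustration of this kind of mixed presentation, the sketch below interleaves ten hypothetical “attractive” and ten “unattractive” defendant stimuli in a different, reproducible random order for each participant; the stimulus labels and per-participant seeding are assumptions for the example.

```python
# Simultaneous within-subjects presentation: one shuffled order per participant.
import random

stimuli = [("attractive", i) for i in range(10)] + [("unattractive", i) for i in range(10)]

def presentation_order(participant_id):
    order = stimuli[:]                            # copy the full stimulus list
    random.Random(participant_id).shuffle(order)  # reproducible shuffle per participant
    return order

print(presentation_order(participant_id=1)[:5])   # first five trials for participant 1
```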
4.4.8. Between-Subjects or Within-Subjects? ¶
Almost every experiment can be conducted using either a between-subjects design or a within-subjects design. This possibility means that researchers must choose between the two approaches based on their relative merits for the particular situation.
Between-subjects experiments have the advantage of being conceptually simpler and requiring less testing time per participant. They also avoid carryover effects without the need for counterbalancing. Within-subjects experiments have the advantage of controlling extraneous participant variables, which generally reduces noise in the data and makes it easier to detect a relationship between the independent and dependent variables.
A good rule of thumb, then, is that if it is possible to conduct a within-subjects experiment (with proper counterbalancing) in the time that is available per participant—and you have no serious concerns about carryover effects—this design is probably the best option. If a within-subjects design would be difficult or impossible to carry out, then you should consider a between-subjects design instead. For example, if you were testing participants in a doctor’s waiting room or shoppers in line at a grocery store, you might not have enough time to test each participant in all conditions and therefore would opt for a between-subjects design. Or imagine you were trying to reduce people’s level of prejudice by having them interact with someone of another race. A within-subjects design with counterbalancing would require testing some participants in the treatment condition first and then in a control condition. But if the treatment works and reduces people’s level of prejudice, then they would no longer be suitable for testing in the control condition. This difficulty is true for many designs that involve a treatment meant to produce long-term change in participants’ behavior (e.g., studies testing the effectiveness of psychotherapy). Clearly, a between-subjects design would be necessary here.
Remember also that using one type of design does not preclude using the other type in a different study. There is no reason that a researcher could not use both a between-subjects design and a within-subjects design to answer the same research question. In fact, professional researchers often take exactly this type of mixed methods approach.
4.4.9. Key Takeaways ¶
Experiments can be conducted using either between-subjects or within-subjects designs. Deciding which to use in a particular situation requires careful consideration of the pros and cons of each approach.
Random assignment to conditions in between-subjects experiments or to orders of conditions in within-subjects experiments is a fundamental element of experimental research. Its purpose is to control extraneous variables so that they do not become confounding variables.
Experimental research on the effectiveness of a treatment requires both a treatment condition and a control condition, which can be a no-treatment control condition, a placebo control condition, or a wait-list control condition. Experimental treatments can also be compared with the best available alternative.
4.4.10. Exercises ¶
Discussion: For each of the following topics, list the pros and cons of a between-subjects and within-subjects design and decide which would be better.
You want to test the relative effectiveness of two training programs for running a marathon.
Using photographs of people as stimuli, you want to see if smiling people are perceived as more intelligent than people who are not smiling.
In a field experiment, you want to see if the way a panhandler is dressed (neatly vs. sloppily) affects whether or not passersby give him any money.
You want to see if concrete nouns (e.g., dog) are recalled better than abstract nouns (e.g., truth).
Discussion: Imagine that an experiment shows that participants who receive psychodynamic therapy for a dog phobia improve more than participants in a no-treatment control group. Explain a fundamental problem with this research design and at least two ways that it might be corrected.
4.5. Conducting Experiments ¶
4.5.1. Learning Objectives ¶
Describe several strategies for recruiting participants for an experiment.
Explain why it is important to standardize the procedure of an experiment and several ways to do this.
Explain what pilot testing is and why it is important.
The information presented so far in this chapter is enough to design a basic experiment. When it comes time to conduct that experiment, however, several additional practical issues arise. In this section, we consider some of these issues and how to deal with them. Much of this information applies to nonexperimental studies as well as experimental ones.
4.5.2. Recruiting Participants ¶
Of course, at the start of any research project you should be thinking about how you will obtain your participants. Unless you have access to people with schizophrenia or incarcerated juvenile offenders, for example, then there is no point designing a study that focuses on these populations. But even if you plan to use a convenience sample, you will have to recruit participants for your study.
There are several approaches to recruiting participants. One is to use participants from a formal subject pool—an established group of people who have agreed to be contacted about participating in research studies. For example, at many colleges and universities, there is a subject pool consisting of students enrolled in introductory psychology courses who must participate in a certain number of studies to meet a course requirement. Researchers post descriptions of their studies and students sign up to participate, usually via an online system. Participants who are not in subject pools can also be recruited by posting or publishing advertisements or making personal appeals to groups that represent the population of interest. For example, a researcher interested in studying older adults could arrange to speak at a meeting of the residents at a retirement community to explain the study and ask for volunteers.
Even if the participants in a study receive compensation in the form of course credit, a small amount of money, or a chance at being treated for a psychological problem, they are still essentially volunteers. This is worth considering because people who volunteer to participate in psychological research have been shown to differ in predictable ways from those who do not volunteer. Specifically, there is good evidence that on average, volunteers have the following characteristics compared with nonvolunteers [RR75] :
They are more interested in the topic of the research.
They are more educated.
They have a greater need for approval.
They have higher intelligence quotients (IQs).
They are more sociable.
They are higher in social class.
This difference can be an issue of external validity if there is reason to believe that participants with these characteristics are likely to behave differently than the general population. For example, in testing different methods of persuading people, a rational argument might work better on volunteers than it does on the general population because of their generally higher educational level and IQ.
In many field experiments, the task is not recruiting participants but selecting them. For example, researchers Nicolas Guéguen and Marie-Agnès de Gail conducted a field experiment on the effect of being smiled at on helping, in which the participants were shoppers at a supermarket. A confederate walking down a stairway gazed directly at a shopper walking up the stairway and either smiled or did not smile. Shortly afterward, the shopper encountered another confederate, who dropped some computer diskettes on the ground. The dependent variable was whether or not the shopper stopped to help pick up the diskettes [GDG03] . Notice that these participants were not “recruited,” but the researchers still had to select them from among all the shoppers taking the stairs that day. It is extremely important that this kind of selection be done according to a well-defined set of rules that is established before the data collection begins and can be explained clearly afterward. In this case, with each trip down the stairs, the confederate was instructed to gaze at the first person he encountered who appeared to be between the ages of 20 and 50. Only if the person gazed back did he or she become a participant in the study. The point of having a well-defined selection rule is to avoid bias in the selection of participants. For example, if the confederate was free to choose which shoppers he would gaze at, he might choose friendly-looking shoppers when he was set to smile and unfriendly-looking ones when he was not set to smile. As we will see shortly, such biases can be entirely unintentional.
4.5.3. Standardizing the Procedure ¶
It is surprisingly easy to introduce extraneous variables during the procedure. For example, the same experimenter might give clear instructions to one participant but vague instructions to another. Or one experimenter might greet participants warmly while another barely makes eye contact with them. To the extent that such variables affect participants’ behavior, they add noise to the data and make the effect of the independent variable more difficult to detect. If they vary across conditions, they become confounding variables and provide alternative explanations for the results. For example, if participants in a treatment group are tested by a warm and friendly experimenter and participants in a control group are tested by a cold and unfriendly one, then what appears to be an effect of the treatment might actually be an effect of experimenter demeanor. When there are multiple experimenters, the possibility of introducing extraneous variables is even greater, but using multiple experimenters is often necessary for practical reasons.
It is well known that whether research participants are male or female can affect the results of a study. But what about whether the experimenter is male or female? There is plenty of evidence that this matters too. Male and female experimenters have slightly different ways of interacting with their participants, and of course participants also respond differently to male and female experimenters [Ros73] . For example, in a recent study on pain perception, participants immersed their hands in icy water for as long as they could [KallaiBV04] . Male participants tolerated the pain longer when the experimenter was a woman, and female participants tolerated it longer when the experimenter was a man.
Researcher Robert Rosenthal has spent much of his career showing that this kind of unintended variation in the procedure does, in fact, affect participants’ behavior. Furthermore, one important source of such variation is the experimenter’s expectations about how participants “should” behave in the experiment. This outcome is referred to as an experimenter expectancy effect [RR75] . For example, if an experimenter expects participants in a treatment group to perform better on a task than participants in a control group, then he or she might unintentionally give the treatment group participants clearer instructions or more encouragement or allow them more time to complete the task. In a striking example, Rosenthal and Kermit Fode had several students in a laboratory course in psychology train rats to run through a maze. Although the rats were genetically similar, some of the students were told that they were working with “maze-bright” rats that had been bred to be good learners, and other students were told that they were working with “maze-dull” rats that had been bred to be poor learners. Sure enough, over five days of training, the “maze-bright” rats made more correct responses, made the correct response more quickly, and improved more steadily than the “maze-dull” rats [RF63] . Clearly it had to have been the students’ expectations about how the rats would perform that made the difference. But how? Some clues come from data gathered at the end of the study, which showed that students who expected their rats to learn quickly felt more positively about their animals and reported behaving toward them in a more friendly manner (e.g., handling them more).
The way to minimize unintended variation in the procedure is to standardize it as much as possible so that it is carried out in the same way for all participants regardless of the condition they are in. Here are several ways to do this:
Create a written protocol that specifies everything that the experimenters are to do and say from the time they greet participants to the time they dismiss them.
Create standard instructions that participants read themselves or that are read to them word for word by the experimenter.
Automate the rest of the procedure as much as possible by using software packages for this purpose or even simple computer slide shows.
Anticipate participants’ questions and either raise and answer them in the instructions or develop standard answers for them.
Train multiple experimenters on the protocol together and have them practice on each other.
Be sure that each experimenter tests participants in all conditions.
Another good practice is to arrange for the experimenters to be “blind” to the research question or to the condition that each participant is tested in. The idea is to minimize experimenter expectancy effects by minimizing the experimenters’ expectations. For example, in a drug study in which each participant receives the drug or a placebo, it is often the case that neither the participants nor the experimenter who interacts with them knows which condition each participant has been assigned to. Because both the participants and the experimenters are blind to the condition, this technique is referred to as a double-blind study. (A single-blind study is one in which the participant, but not the experimenter, is blind to the condition.) Of course, there are many times when this blinding is not possible. For example, if you are both the investigator and the only experimenter, it is not possible for you to remain blind to the research question. Also, in many studies the experimenter must know the condition because he or she must carry out the procedure in a different way in the different conditions.
4.5.4. Record Keeping ¶
It is essential to keep good records when you conduct an experiment. As discussed earlier, it is typical for experimenters to generate a written sequence of conditions before the study begins and then to test each new participant in the next condition in the sequence. As you test them, it is a good idea to add to this list basic demographic information; the date, time, and place of testing; and the name of the experimenter who did the testing. It is also a good idea to have a place for the experimenter to write down comments about unusual occurrences (e.g., a confused or uncooperative participant) or questions that come up. This kind of information can be useful later if you decide to analyze sex differences or effects of different experimenters, or if a question arises about a particular participant or testing session.
It can also be useful to assign an identification number to each participant as you test them. Simply numbering them consecutively beginning with 1 is usually sufficient. This number can then also be written on any response sheets or questionnaires that participants generate, making it easier to keep them together.
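As a concrete, purely hypothetical illustration of such a log, the Python sketch below appends one row per participant to a CSV file. The file name, column names, and example values are assumptions made for illustration, not part of the original procedure.

```python
import csv
import os
from datetime import datetime

LOG_FILE = "testing_log.csv"  # hypothetical file name
FIELDS = ["participant_id", "condition", "date_time", "place",
          "experimenter", "age", "sex", "comments"]

def log_participant(**record):
    """Append one row to the testing log, writing the header on first use."""
    new_file = not os.path.exists(LOG_FILE)
    record.setdefault("date_time", datetime.now().isoformat(timespec="minutes"))
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(record)

# Example: participant 1 tested in a hypothetical "treatment" condition
log_participant(participant_id=1, condition="treatment", place="Room 204",
                experimenter="J. Smith", age=21, sex="F",
                comments="asked to repeat the instructions once")
```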
4.5.5. Pilot Testing ¶
It is always a good idea to conduct a pilot test of your experiment. A pilot test is a small-scale study conducted to make sure that a new procedure works as planned. In a pilot test, you can recruit participants formally (e.g., from an established participant pool) or you can recruit them informally from among family, friends, classmates, and so on. The number of participants can be small, but it should be enough to give you confidence that your procedure works as planned. There are several important questions that you can answer by conducting a pilot test:
Do participants understand the instructions?
What kind of misunderstandings do participants have, what kind of mistakes do they make, and what kind of questions do they ask?
Do participants become bored or frustrated?
Is an indirect manipulation effective? (You will need to include a manipulation check.)
Can participants guess the research question or hypothesis?
How long does the procedure take?
Are computer programs or other automated procedures working properly?
Are data being recorded correctly?
Of course, to answer some of these questions you will need to observe participants carefully during the procedure and talk with them about it afterward. Participants are often hesitant to criticize a study in front of the researcher, so be sure they understand that their participation is part of a pilot test and you are genuinely interested in feedback that will help you improve the procedure. If the procedure works as planned, then you can proceed with the actual study. If there are problems to be solved, you can solve them, pilot test the new procedure, and continue with this process until you are ready to proceed.
4.5.6. Key Takeaways ¶
There are several effective methods you can use to recruit research participants for your experiment, including through formal subject pools, advertisements, and personal appeals. Field experiments require well-defined participant selection procedures.
It is important to standardize experimental procedures to minimize extraneous variables, including experimenter expectancy effects.
It is important to conduct one or more small-scale pilot tests of an experiment to be sure that the procedure works as planned.
4.5.7. Exercises ¶
Practice: List two ways that you might recruit participants from each of the following populations: a. elderly adults b. unemployed people c. regular exercisers d. math majors
Discussion: Imagine a study in which you will visually present participants with a list of 20 words, one at a time, wait for a short time, and then ask them to recall as many of the words as they can. In the stressed condition, they are told that they might also be chosen to give a short speech in front of a small audience. In the unstressed condition, they are not told that they might have to give a speech. What are several specific things that you could do to standardize the procedure?
4.6. Single Factor Designs with 2 levels ¶
The simplest kind of experiment has one independent variable (a single factor) with two levels, and one dependent measure of interest. It is important to note that any experiment must have at least two levels. If you only measured the dependent variable in one condition, then you would simply be taking a measurement, not conducting an experiment to see whether the measurement changes between different conditions. In order to find out whether the measure changes across conditions, we need more than one condition.
There are three general ways to manipulate an independent variable between two conditions: 1) present/absent, 2) differing magnitudes, and 3) qualitatively different conditions.
For example, consider a drug company researching a drug to reduce headache pain. They could run a present/absent experiment by having one group of participants receive the drug, and another group receive no drug, and then find out if headache pain was reduced for the group that received the drug. They could also run a magnitude experiment by having one group take one pill and the other group take two (or more) pills. This experiment could test whether taking 2 pills reduces headache pain more than taking 1 pill. Finally, they could run an experiment with qualitatively different conditions. For example, one group could take drug #1 and another group could take a different drug #2. This experiment could test whether one drug is better than another at reducing headache pain.
4.6.1. The basic empirical question: Is there a difference? ¶
All experiments have the same basic empirical question: Did the dependent variable change between conditions of the independent variable? There are many other important questions, such as how much change happened, is the change meaningful, and did the independent variable really cause the change or did some other confounding variable cause the change?
At first blush, it is easy to find out if there was any change in the dependent measure. We simply look at the measurement in condition 1 and condition 2. If they are the same, then there was no change. If they are different, then there was a change.
However, in most psychology experiments the measurements in condition 1 and 2 will always be different. This is because most measurements in Psychology are variable. In other words, the measurements themselves change from one person to the next, or within the same person from one time to the next. Imagine measuring something in condition 1 twice. If you were measuring the length of a door twice, you would expect to get the same number twice (no change). However, if you were measuring how fast someone can say a word that begins with “a” twice, you would probably find two different reaction times.
So, there are two kinds of change that researchers have to deal with: real change caused by the independent variable, and random change caused by measuring the dependent variable. Any difference that is found in an experiment could be the result of one or both of these kinds of change. As a result, it is critically important to determine whether an observed change is real or due to random chance. For example, if an observed difference was due to random change that occurs by chance, then we should not conclude that the independent variable caused the change. If a researcher did not recognize that their observed difference could have been caused by random change, then they might wrongly conclude that it was their manipulation that caused the change; this kind of inferential error is called a type-I error. The opposite can happen as well. A researcher might find a difference, but conclude that the difference was caused by random change, even though in reality their manipulation caused the change. This kind of inferential error is called a type-II error.
In order to avoid making type I and II inferential errors, researchers need to determine whether the change they observe was real or random. Fortunately, this is a problem that can be solved with inferential statistics. We will go into more detail about how statistics are used to solve this problem. The solution usually does not involve eliminating the influence of random change, although this influence can be minimized by improving the quality of the measurement (by reducing measurement error and variability). In most cases, there will always be some random change that cannot be eliminated. So, researchers are always faced with determining whether there was a real change above and beyond the change that occurs randomly.
The nice thing about random chance is that it can be estimated very precisely. As a result, for a given experiment, we can determine both how much change can be produced by random chance and how often (or how likely it is that) chance alone would produce changes of different sizes. For example, we could show that in some experiment, chance often produces a change of, say, 10 units of the measurement, but very rarely (say only 5% or 1% of the time) produces a change of 20 units. If a researcher found a change of 20 units or greater, then they could be confident that chance did not produce this change, and they would then conclude that the independent variable caused the change. If a researcher found a change of 5 units, then they would recognize that chance alone could have easily produced this change, and they would not be confident that their independent variable caused the change.
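To make this concrete, here is a small simulation of our own (not from the original text). It assumes two groups of 10 hypothetical scores drawn from the same distribution, i.e., no real effect, and counts how often chance alone produces mean differences of various sizes; the particular percentages depend entirely on the assumed sample size and variability.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_experiments = 10, 10_000
sd = 15  # hypothetical measurement variability

# Simulate many "null" experiments: both groups sampled from the same distribution
group1 = rng.normal(100, sd, size=(n_experiments, n_per_group))
group2 = rng.normal(100, sd, size=(n_experiments, n_per_group))
diffs = np.abs(group1.mean(axis=1) - group2.mean(axis=1))

for cutoff in (5, 10, 20):
    pct = 100 * np.mean(diffs >= cutoff)
    print(f"Chance alone produced a difference of {cutoff}+ units "
          f"in {pct:.1f}% of simulated experiments")
```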
4.6.2. Chance and Change ¶
In order to understand how to estimate the probability that chance caused a change between conditions, we first need to understand how it is that chance can produce changes in the first place.
Chance can produce changes in a measurement for two simple reasons: measurement variability and sampling. Measurement variability refers to change or instability in a measurement. Sampling refers to the process of taking measurements of a variable.
The easiest way to see how this works is by understanding the concept of sampling from a distribution.
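As a minimal sketch of that idea (our own hypothetical example), the snippet below draws several samples from one and the same distribution and prints their means. The means differ from sample to sample even though nothing real has changed; the distribution and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# One underlying distribution: mean 500 ms, SD 100 ms (hypothetical reaction times)
for i in range(5):
    sample = rng.normal(500, 100, size=20)   # one "condition" of 20 observations
    print(f"Sample {i + 1}: mean = {sample.mean():.1f} ms")
```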
4.7. Single Factor Designs with multiple levels ¶
The experiments we have discussed so far are fairly simple. They have one independent variable with two levels, and a single dependent variable. Experiments can become much more complicated by adding more levels to the independent variable, adding more independent variables, and/or adding more dependent variables. As experiments become more complicated, the basic empirical question remains the same: did the manipulation(s) cause change in the measure(s)? To ease into more complex designs, we will discuss single factor designs with more than two levels.
4.7.1. Quantitative vs. Qualitative Independent variables ¶
A single factor design with more than two levels involves a single independent variable (factor) and, typically, a single dependent variable. Importantly, the independent variable has more than two levels. Two common kinds of multi-level designs involve either quantitative or qualitative manipulations of the independent variable.
Fig. 4.8 Example of quantitative independent variable (i.e., number of pills). ¶
A quantitative manipulation is a change in magnitude, or amount. For example, a drug company might be interested in testing not only whether Drug A reduces headache pain (perhaps by comparing one group that gets the drug and another that does not), but also how the amount of the drug influences reductions in headache pain. So, a multi-level experiment might have several groups that receive 0, 1, 2, 3, 4, or more pills, respectively.
Fig. 4.9 Example of qualitative independent variable (i.e., drug type). ¶
A qualitative manipulation involves categorically different conditions. For example, a drug company might be interested in comparing the relative effectiveness of different kinds of drugs in reducing headache pain. They could conduct a multi-level experiment with each group receiving a different drug, drug A, drug B, drug C, and so on.
4.7.2. Interpreting the pattern of results ¶
Designs with only two levels are fairly straightforward to interpret because there are only a few possible patterns of differences that can be observed: \(A>B\), \(A=B\), and \(A<B\). Or, even more simply, A is the same as B (\(A=B\)), or A is not the same as B (\(A>B\) or \(A<B\)).
The number of possible patterns that could be observed increases with each additional level. For example, in an experiment with three levels A, B, and C, all three means may be equal, all three may differ, or any one mean may differ from the other two (e.g., \(A=B=C\), \(A>B=C\), \(A=B>C\), \(A>B>C\), and so on).
As with two level designs, when reporting the results of experiments with multiple levels, it is very important to explain the pattern of means across conditions. This involves telling the reader which means were different from one another, and which means were the same.
Again, as with 2-level designs, the process of random sampling can produce differences in the sample means for each of the levels. So, researchers also conduct statistical tests to determine the likelihood that the results that they observed could have been obtained by chance alone. The most common statistical test used in this case is the one-way ANOVA (Analysis of Variance). The chapter on inferential statistics goes into more detail about ANOVAs, and we assume that you have some memories of how ANOVAs work from your statistics class. Nevertheless, we go through an example to illustrate the basic process. Note, this example is the same one discussed in chapter four of your lab manual.
4.7.3. Writing it all up ¶
The following is an example results section for a hypothetical experiment. This could serve as a model for your own results section.
The number of correctly recalled words for each subject in each condition was submitted to a one-way ANOVA, with memorization condition (A, B, C, D, and E) as the sole between-subjects factor. Mean recall scores in each condition are displayed in Figure 1. The main effect of memorization condition was significant, F(4, 45) = 48.6, MSE = 3.19, p < .001. Figure 1 shows that Groups A and B had higher recall scores than Groups C and D, which had higher recall scores than Group E. This pattern was confirmed across four independent-samples t-tests. Group A (M = 20.3) and Group B (M = 18.8) were not significantly different, t(18) = 1.98, p = .063. Group A recalled significantly more words than Group C (M = 15.1), t(18) = 7.9, p < .001. Group C and Group D (M = 15.7) were not significantly different, t(18) = -0.8, p = .436. Finally, Group D recalled significantly more words than Group E (M = 10.1), t(18) = 6.4, p < .001.
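For readers who want to see how numbers like these could be produced, here is a hedged sketch using SciPy. The recall scores are simulated placeholders, so the F and t values it prints will not match the write-up above; it simply shows the general form of a one-way between-subjects ANOVA followed by independent-samples t-tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical recall scores: 10 participants per memorization condition
groups = {
    "A": rng.normal(20, 2, 10),
    "B": rng.normal(19, 2, 10),
    "C": rng.normal(15, 2, 10),
    "D": rng.normal(15, 2, 10),
    "E": rng.normal(10, 2, 10),
}

# One-way between-subjects ANOVA across the five conditions
f_stat, p_val = stats.f_oneway(*groups.values())
print(f"F(4, 45) = {f_stat:.2f}, p = {p_val:.4g}")

# Follow-up independent-samples t-tests between selected pairs of groups
for a, b in [("A", "B"), ("A", "C"), ("C", "D"), ("D", "E")]:
    t, p = stats.ttest_ind(groups[a], groups[b])
    print(f"{a} vs {b}: t(18) = {t:.2f}, p = {p:.3f}")
```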
Single-Factor Experimental Design
Often, we wish to investigate the effect of a factor (independent variable) on a response (dependent variable). We then carry out an experiment where the levels of the factor are varied. Such experiments are known as single-factor experiments. There are many designs available to carry out such experiments. The most popular ones are the completely randomized design, randomized block design, Latin square design, and balanced incomplete block design. In this chapter, we will discuss these four designs along with the statistical analysis of the data obtained by following such designs of experiments.
A completely randomized experiment was carried out to examine the effect of an inductor current sense resistor on the power factor of a PFC circuit. Four resistor values were chosen and three replicates of each treatment were carried out. The results are displayed in Table 7.27 .
Draw a scatter plot and a box plot of power factor of the PFC circuit.
Fit a descriptive model to the above-mentioned data and comment on the adequacy of the model.
Test the hypothesis that the four sensor resistor values result in the same power factor. Use \(\alpha =0.05\).
Use Tukey’s test to compare the means of power factors obtained at different sensor resistor values.
Use Fisher’s test to compare the means of power factors obtained at different sensor resistor values. Is the conclusion the same as in 8.1(d)? (A code sketch of the ANOVA and Tukey comparisons follows this exercise.)
Fit a suitable regression model to the data.
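One way to approach an exercise like this is sketched below in Python: a one-way ANOVA with SciPy followed by Tukey's test with statsmodels. Because Table 7.27 is not reproduced here, the resistor labels and power-factor readings are placeholders assumed purely for illustration.

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder data: four sense-resistor levels, three replicates each
data = pd.DataFrame({
    "resistor": ["10m", "10m", "10m", "15m", "15m", "15m",
                 "20m", "20m", "20m", "25m", "25m", "25m"],
    "power_factor": [0.95, 0.96, 0.94, 0.97, 0.98, 0.97,
                     0.96, 0.95, 0.96, 0.93, 0.92, 0.94],
})

# One-way ANOVA: do the resistor values give the same mean power factor?
samples = [g["power_factor"].values for _, g in data.groupby("resistor")]
f_stat, p_val = stats.f_oneway(*samples)
print(f"F = {f_stat:.2f}, p = {p_val:.4f}")

# Tukey's test for pairwise comparisons at alpha = 0.05
print(pairwise_tukeyhsd(data["power_factor"], data["resistor"], alpha=0.05))
```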
Austenite stainless steels are used for making various engineering components in power plants and automobile industries. In order to examine the effect of cyclic loading on the fatigue properties of austenite stainless steel, a completely randomized experiment was conducted. Four levels of the number of cycles to failure were chosen and the corresponding maximum stress was observed. This was replicated three times. The results of the experiment are shown in Table 7.28.
Does the number of cycles to failure affect the maximum stress of the austenite stainless steel? Use \(\alpha =0.05\) .
Use Tukey’s test to compare the means of maximum stress obtained at the different numbers of cycles to failure.
In order to examine the effect of temperature on the output voltage of a thermocouple, a completely randomized experiment was carried out. Four different temperatures (250, 500, 750, and 1000 \(^{\circ }\) C) were chosen and the output voltages were measured. This was replicated three times. The results are shown in Table 7.29 .
Does the temperature affect the output voltage of the thermocouple? Use \(\alpha =0.05\) .
Use Tukey’s test to compare the means of output voltages obtained at different temperatures.
Four catalysts that may affect the yield of a chemical process are investigated. A completely randomized design is followed where each process using a specific catalyst was replicated four times. The yields obtained are shown in Table 7.30 .
Do the four catalysts have the same effect on yield? Use \(\alpha =0.05\).
A completely randomized experiment was carried out to examine the effect of diet on coagulation time for blood of animals. For this, 24 animals were randomly allocated to four different diets A, B, C, and D and the blood samples of the animals were tested. The blood coagulation times of the animals are shown in Table 7.31 .
Draw a scatter plot and a box plot of blood coagulation time.
Test the hypothesis that the four diets result in the same blood coagulation time. Use \(\alpha =0.05\).
Use Tukey’s test to compare the mean coagulation times.
Use Fisher’s test to compare the mean coagulation times. Is the conclusion the same as in 8.5(d)?
In order to study the effect of domestic cooking on phytochemicals of fresh cauliflower, a completely randomized experiment was conducted. Cauliflower was processed by four different cooking methods, and the resulting glucosinolates were measured. Each treatment was replicated three times. The experimental data are shown in Table 7.32 . Does the cooking method affect the concentration of glucosinolates in cooked cauliflower? Use \(\alpha =0.05\) .
Three brands of batteries were investigated for their life in clocks by performing a completely randomized design of experiment. Four batteries of each brand were tested, and the results on the life (weeks) of batteries were obtained and are shown in Table 7.33 .
Are the lives of the three brands of batteries different? Use \(\alpha =0.05\).
A quality control manager wishes to test the effect of four test methods on the percentage rejection of produced items. He performed a completely randomized design with four replicates under each test method and obtained the data shown in Table 7.34.
Are the percentage rejections resulting from the four different test methods different? Use \(\alpha =0.05\).
A completely randomized design of experiment was carried out to compare five brands of air-conditioning filters in terms of their filtration efficiency. Four filters of each brand were tested, and the results on the filtration efficiency \((\%)\) of the filters are shown in Table 7.35.
Are the filtration efficiencies of the five brands of filters different? Use \(\alpha =0.05\).
An experiment was conducted to examine the effect of lecture timing on the marks (out of 100) obtained by students in a common first-year undergraduate course of Mathematics. A randomized block design was chosen with three blocks in such a way that the students majoring in Computer Science, Electrical Engineering, and Mechanical Engineering constituted Block 1, Block 2, and Block 3, respectively. The marks obtained by the students out of 100 are shown in Table 7.36 .
Analyze the data and draw appropriate conclusions. Use \(\alpha =0.05\).
Suppose, in the above problem, that the experimenter had not been aware of the randomized block design and had instead assigned the lecture timings to the students randomly, regardless of their major. Further, suppose that, by chance, the experimenter had obtained the same results as in the above problem. Would the experimenter have drawn the same conclusion as in the above problem? Use \(\alpha =0.05\).
Three different test methods (A, B, and C) are compared to examine the strength of four different materials (I, II, III, and IV) in accordance with a randomized block design of experiment such that each material acts as a block. The results of the experiment (strength, in N) are given in Table 7.37.
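For randomized block problems such as the two above, the analysis can be sketched as a linear model with treatment and block terms. The Python example below uses placeholder strength values, since Table 7.37 is not reproduced here; it is a sketch of the general approach rather than a solution to the exercise.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Placeholder data: three test methods (treatments) x four materials (blocks)
data = pd.DataFrame({
    "method":   ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
    "material": ["I", "II", "III", "IV"] * 3,
    "strength": [120, 135, 128, 140, 118, 130, 126, 137, 125, 139, 131, 145],
})

# Randomized block analysis: treatment effect adjusted for blocks
model = smf.ols("strength ~ C(method) + C(material)", data=data).fit()
print(anova_lm(model))
```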
A sales manager was interested in comparing the sales of three products (A, B, and C). A Latin Square Design of experiment was conducted to systematically control the effects of region and season on the sales of the products. The data on sales revenue (in thousand dollars) are given in Table 7.38.
An oil company wishes to test the effect of four different blends of gasoline (A, B, C, and D) on the fuel efficiency of cars. It is decided to run the experiment according to a Latin Square Design so that the effects of drivers and car models may be systematically controlled. The fuel efficiency, measured in km/l after driving the cars over a standard course, is shown in Table 7.39.
A chemical engineer wishes to test the effect of five catalysts (A, B, C, D, and E) on the reaction time of a chemical process. An experiment is conducted according to a Latin Square Design so that the effects of batches and experimenters may be systematically controlled. The experimental data, expressed in hours, are shown in Table 7.40.
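A Latin square analysis follows the same pattern, with row (batch), column (experimenter), and treatment (catalyst) terms in the model. The sketch below builds a hypothetical 5 x 5 layout with made-up reaction times, since Table 7.40 is not reproduced here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(7)
catalysts = ["A", "B", "C", "D", "E"]

# Hypothetical 5x5 Latin square: each catalyst appears once per batch and per experimenter
rows = []
for batch in range(5):
    for col in range(5):
        rows.append({
            "batch": f"b{batch}",
            "experimenter": f"e{col}",
            "catalyst": catalysts[(batch + col) % 5],
            "time_h": 10 + 0.5 * ((batch + col) % 5) + rng.normal(0, 0.5),
        })
data = pd.DataFrame(rows)

# Latin square ANOVA: catalyst effect with batch and experimenter controlled
model = smf.ols("time_h ~ C(catalyst) + C(batch) + C(experimenter)", data=data).fit()
print(anova_lm(model))
```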
A process engineer is interested in studying the effect of oil weight on the filtration efficiency of automotive engine intake air filters. In the road test, he wishes to use cars as blocks; however, because of time constraints, he used a balanced incomplete block design. The results of the experiment are shown in Table 7.41. Analyze the data using \(\alpha =0.05\).
Single-Factor Designs
Between-subjects versus within-subjects experimental designs.
In between-subjects experimental designs, we randomly assign different subjects to each of the levels of the independent variable. That is, for an experiment with one IV with two levels or conditions, half of the subjects are exposed to the first level of the independent variable and the other half of the subjects are exposed to the second level of the independent variable. For each participant, his/her score on the dependent variable is collected following exposure to the independent variable. Dr. Z's examination of the effects of type of course material on grades in the course is a between-subjects design, in that one set of students was assigned to receive traditional course materials while a second set of students was assigned to receive the traditional materials accompanied by interactive tutorials. In other words, subjects could only belong to one level of the independent variable.
In within-subjects experimental designs, each subject in the study is exposed to each level of the independent variable. Therefore, for each subject a score on the dependent variable is collected more than once (once for each level of the independent variable). At times you may hear this design referred to as a repeated-measures design, since all subjects are repeatedly measured on the dependent measure for each level of the independent variable. For example, suppose that you hypothesized that you could alleviate the fear of public speaking by training people to engage in some deep breathing before beginning their speech. For the control condition (absence of treatment) you have a number of participants give a short speech introducing themselves to a small crowd of on-lookers. Immediately following the speech, you measure the participants' heart rates as a measure of the stress or fear they have experienced. A week later, you expose the same participants to the treatment condition. You give these same participants some training in deep breathing exercises and instruct them to use this technique before speaking in public. Once again, you ask these same participants to give a short speech introducing themselves to a small crowd and, once again, you take the participants' heart rate immediately following the speech. This experiment uses a within-subjects design, in that all participants in the study were exposed to each level of the IV (control condition and deep breathing condition).
To illustrate the differences between within-subjects and between-subjects designs, suppose we wanted to test the effects of eating sweets on children's attention. We might have two levels of our independent variable: exposed to no sweets in the hour preceding the attention test, and fed a 50g chocolate bar in the hour preceding the attention test. If we had 8 children available to us to participate in the study (this is a hypothetical situation; typically we would require a larger sample size), then here is an illustration of how the children may be assigned to the groups or conditions of the independent variable and when the dependent variable may be measured.
Between-Subjects Design
No Sweets Condition (Control Group): Collect attention scores for Joe, Jim, Judy, & Jill
Chocolate Bar Condition (Experimental Group): Collect attention scores for Kyle, Kurt, Kathy, & Kali
Within-Subjects Design
No Sweets Condition (Control Condition): Collect attention scores for Joe, Jim, Judy, Jill, Kyle, Kurt, Kathy, & Kali
Chocolate Bar Condition (Experimental Condition): On the following day, collect attention scores for Joe, Jim, Judy, Jill, Kyle, Kurt, Kathy, & Kali
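To connect the two layouts to their analyses (our own illustration, not part of the original tutorial), the sketch below uses made-up attention scores: an independent-samples t-test for the between-subjects version and a paired t-test for the within-subjects version.

```python
from scipy import stats

# Between-subjects: different children in each condition (hypothetical scores)
no_sweets_group = [82, 75, 88, 79]          # Joe, Jim, Judy, Jill
chocolate_group = [70, 66, 74, 71]          # Kyle, Kurt, Kathy, Kali
t_between, p_between = stats.ttest_ind(no_sweets_group, chocolate_group)
print(f"Between-subjects: t = {t_between:.2f}, p = {p_between:.3f}")

# Within-subjects: the same eight children tested in both conditions
no_sweets_all = [82, 75, 88, 79, 81, 77, 85, 80]
chocolate_all = [70, 66, 74, 71, 73, 69, 76, 72]
t_within, p_within = stats.ttest_rel(no_sweets_all, chocolate_all)
print(f"Within-subjects:  t = {t_within:.2f}, p = {p_within:.3f}")
```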
Experimental Design: Types, Examples & Methods
Experimental design refers to how participants are allocated to different groups in an experiment. Types of design include repeated measures, independent groups, and matched pairs designs.
Probably the most common way to design an experiment in psychology is to divide the participants into two groups, the experimental group and the control group, and then introduce a change to the experimental group, not the control group.
The researcher must decide how he/she will allocate their sample to the different experimental groups. For example, if there are 10 participants, will all 10 participants participate in both groups (e.g., repeated measures), or will the participants be split in half and take part in only one group each?
Three types of experimental designs are commonly used:
1. Independent Measures
Independent measures design, also known as between-groups, is an experimental design where different participants are used in each condition of the independent variable. This means that each condition of the experiment includes a different group of participants.
This should be done by random allocation, ensuring that each participant has an equal chance of being assigned to each group.
Independent measures involve using two separate groups of participants, one in each condition.
- Con : More people are needed than with the repeated measures design (i.e., more time-consuming).
- Pro : Avoids order effects (such as practice or fatigue) as people participate in one condition only. If a person is involved in several conditions, they may become bored, tired, and fed up by the time they come to the second condition or become wise to the requirements of the experiment!
- Con : Differences between participants in the groups may affect results, for example, variations in age, gender, or social background. These differences are known as participant variables (i.e., a type of extraneous variable ).
- Control : After the participants have been recruited, they should be randomly assigned to their groups. This should ensure the groups are similar, on average (reducing participant variables).
2. Repeated Measures Design
Repeated Measures design is an experimental design where the same participants participate in each independent variable condition. This means that each experiment condition includes the same group of participants.
Repeated Measures design is also known as within-groups or within-subjects design.
- Pro : As the same participants are used in each condition, participant variables (i.e., individual differences) are reduced.
- Con : There may be order effects. Order effects refer to the order of the conditions affecting the participants’ behavior. Performance in the second condition may be better because the participants know what to do (i.e., practice effect). Or their performance might be worse in the second condition because they are tired (i.e., fatigue effect). This limitation can be controlled using counterbalancing.
- Pro : Fewer people are needed as they participate in all conditions (i.e., saves time).
- Control : To combat order effects, the researcher counter-balances the order of the conditions for the participants. Alternating the order in which participants perform in different conditions of an experiment.
Counterbalancing
Suppose we used a repeated measures design in which all of the participants first learned words in “loud noise” and then learned them in “no noise.”
We expect the participants to learn better in “no noise” because of order effects, such as practice. However, a researcher can control for order effects using counterbalancing.
The sample would be split into two groups, with the two conditions labeled A (‘loud noise’) and B (‘no noise’). For example, group 1 does ‘A’ then ‘B,’ and group 2 does ‘B’ then ‘A.’ This is to eliminate order effects.
Although order effects occur for each participant, they balance each other out in the results because they occur equally in both groups.
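Generating a counterbalanced schedule can be done mechanically. The short sketch below, with hypothetical participant codes, randomizes who gets which order and then alternates the AB and BA orders so each is used equally often.

```python
import random

participants = ["P01", "P02", "P03", "P04", "P05", "P06", "P07", "P08"]
orders = [("loud noise", "no noise"), ("no noise", "loud noise")]  # AB and BA

random.seed(3)
random.shuffle(participants)  # randomize who gets which order

# Alternate the two orders so that each order is used equally often
schedule = {p: orders[i % 2] for i, p in enumerate(participants)}
for p, order in sorted(schedule.items()):
    print(p, "->", " then ".join(order))
```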
3. Matched Pairs Design
A matched pairs design is an experimental design where pairs of participants are matched in terms of key variables, such as age or socioeconomic status. One member of each pair is then placed into the experimental group and the other member into the control group.
One member of each matched pair must be randomly assigned to the experimental group and the other to the control group.
- Con : If one participant drops out, you lose two participants’ data.
- Pro : Reduces participant variables because the researcher has tried to pair up the participants so that each condition has people with similar abilities and characteristics.
- Con : Very time-consuming trying to find closely matched pairs.
- Pro : It avoids order effects, so counterbalancing is not necessary.
- Con : Impossible to match people exactly unless they are identical twins!
- Control : Members of each pair should be randomly assigned to conditions. However, this does not solve all these problems.
Experimental design refers to how participants are allocated to an experiment’s different conditions (or IV levels). There are three types:
1. Independent measures / between-groups : Different participants are used in each condition of the independent variable.
2. Repeated measures /within groups : The same participants take part in each condition of the independent variable.
3. Matched pairs : Each condition uses different participants, but they are matched in terms of important characteristics, e.g., gender, age, intelligence, etc.
Learning Check
Read about each of the experiments below. For each experiment, identify (1) which experimental design was used; and (2) why the researcher might have used that design.
1 . To compare the effectiveness of two different types of therapy for depression, depressed patients were assigned to receive either cognitive therapy or behavior therapy for a 12-week period.
The researchers attempted to ensure that the patients in the two groups had similar severity of depressed symptoms by administering a standardized test of depression to each participant, then pairing them according to the severity of their symptoms.
2 . To assess the difference in reading comprehension between 7 and 9-year-olds, a researcher recruited each group from a local primary school. They were given the same passage of text to read and then asked a series of questions to assess their understanding.
3 . To assess the effectiveness of two different ways of teaching reading, a group of 5-year-olds was recruited from a primary school. Their level of reading ability was assessed, and then they were taught using scheme one for 20 weeks.
At the end of this period, their reading was reassessed, and a reading improvement score was calculated. They were then taught using scheme two for a further 20 weeks, and another reading improvement score for this period was calculated. The reading improvement scores for each child were then compared.
4 . To assess the effect of the organization on recall, a researcher randomly assigned student volunteers to two conditions.
Condition one attempted to recall a list of words that were organized into meaningful categories; condition two attempted to recall the same words, randomly grouped on the page.
Experiment Terminology
Ecological validity.
The degree to which an investigation represents real-life experiences.
Experimenter effects
These are the ways that the experimenter can accidentally influence the participant through their appearance or behavior.
Demand characteristics
The clues in an experiment that lead the participants to think they know what the researcher is looking for (e.g., the experimenter’s body language).
Independent variable (IV)
The variable the experimenter manipulates (i.e., changes); it is assumed to have a direct effect on the dependent variable.
Dependent variable (DV)
Variable the experimenter measures. This is the outcome (i.e., the result) of a study.
Extraneous variables (EV)
All variables which are not independent variables but could affect the results (DV) of the experiment. Extraneous variables should be controlled where possible.
Confounding variables
Variable(s) that have affected the results (DV), apart from the IV. A confounding variable could be an extraneous variable that has not been controlled.
Random Allocation
Randomly allocating participants to independent variable conditions means that all participants should have an equal chance of taking part in each condition.
The principle of random allocation is to avoid bias in how the experiment is carried out and limit the effects of participant variables.
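In practice, random allocation can be done with a coin flip, a random-number table, or a couple of lines of code. The sketch below (with hypothetical names) shuffles the participant list and splits it into two equal groups; the fixed seed is there only so the allocation can be reproduced.

```python
import random

participants = ["Ava", "Ben", "Caleb", "Dana", "Eli", "Fay", "Gus", "Hana"]

random.seed(42)           # fixed seed so the allocation can be reproduced
random.shuffle(participants)

experimental_group = participants[:len(participants) // 2]
control_group = participants[len(participants) // 2:]
print("Experimental:", experimental_group)
print("Control:     ", control_group)
```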
Order effects
Changes in participants’ performance due to their repeating the same or similar test more than once. Examples of order effects include:
(i) practice effect: an improvement in performance on a task due to repetition, for example, because of familiarity with the task;
(ii) fatigue effect: a decrease in performance of a task due to repetition, for example, because of boredom or tiredness.
Causality and confounds
Three criteria must be met in order to make a causal inference: covariation of X and Y, temporal order (the cause precedes the effect), and the absence of plausible alternative explanations. A confounding variable is a factor that covaries with the IV; when a confound is present, we cannot tell whether the IV or the confound affected the DV.