Monday, April 24, 2017

T-Test in Action

If you wanted to see a t-test in action, you've come to the right place. (If watching me conduct statistical analyses isn't what you were hoping for, I don't know what to tell you. Here's a video of two corgis playing tetherball.)

This month, I've been using an ongoing example of a study on the effect of caffeine on test performance. In fact, in my post on p-values, I gave fictional means and standard deviations to conduct a t-test. All I told you was the p-value, but I didn't go into how that was derived.

First, I used those fictional means and standard deviations to generate some data. I used the rnorm function in R to generate two random samples that were normally distributed and matched up with the descriptive statistics I provided. (And since the data are fake anyway, I've made the dataset publicly available as a tab-delimited file here. For the group variable, 0 = control and 1 = experimental.) So I have a sample of 60 people, 30 in each group. I know the data are normally distributed, which is one of the key assumptions of the t-test. The descriptive statistics are slightly different from what I reported in the p-value post - I made those values up on the spot - but what I got from the generated data is really close:

Experimental group: M = 83.2, SD = 6.21
Control group: M = 79.3, SD = 6.40
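
For the curious, here's roughly what that data generation looks like in R. This is a sketch rather than the exact code I ran - the seed and the column names are mine, chosen for illustration:

    set.seed(2017)                                   # arbitrary seed so the sketch is reproducible
    experimental <- rnorm(30, mean = 83.2, sd = 6.1) # caffeine group
    control      <- rnorm(30, mean = 79.3, sd = 6.5) # no-caffeine group
    caffeine <- data.frame(group = rep(c(1, 0), each = 30),   # 1 = experimental, 0 = control
                           score = c(experimental, control))  # "score" is a hypothetical name
    mean(experimental); sd(experimental)             # lands near, but not exactly on, the inputs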

The difference in means - the numerator of the t statistic - is easy to get: you just subtract one mean from the other. Here, the difference between groups is 3.933. The less straightforward part is the denominator - the pooled standard error. I'm about to get into a more advanced statistical concept, so bear with me.

Each sample has its own standard deviation, which you can see above. That tells you how much variation among individuals to expect by chance alone. But when you conduct a t-test on two independent samples (that is, no overlap or matching between your groups), you're testing the probability that you would get a mean difference of that size. The normal distribution gives you probabilities of individual scores, but what you actually want is the probability of mean differences, where each sample is treated as a collective unit.

Your curve is actually a distribution of mean differences, and your measure of variability is how much samples deviate from the center of that distribution (the mean of the mean differences). Essentially, that measure of variability is how much we would expect mean differences to vary by chance alone. We expect mean differences based on larger samples to reflect the true mean difference (what we would get if we could measure everyone in the population) more accurately than those based on smaller samples. We correct our standard deviations by sample size to get what we call the standard error (full name: standard error of the difference). Specifically, the equation divides each group's variance (s²) by its sample size, adds the two results together, and takes the square root to get the standard error.
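
In symbols, the equation that last sentence describes is:

    $$ SE_{difference} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} $$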


Using the two standard deviations above (squared, they are 38.51 and 40.96, respectively) and plugging those values into this equation, our standard error is 1.63. If we divide the mean difference (3.933) by this standard error, we get a t of 2.41. We would use the t-distribution with 58 degrees of freedom (60 - 2). This t-value corresponds to a p of 0.02. If our alpha is 0.05, we would say this difference is significant (unlikely to be due to chance alone).
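
If you want to check that arithmetic yourself, it's three lines of R (using the rounded values above, so the last decimal place may wobble a bit):

    se_diff <- sqrt(38.51/30 + 40.96/30)   # standard error of the difference, about 1.63
    t_stat  <- 3.933 / se_diff             # mean difference over standard error, about 2.41
    2 * pt(-abs(t_stat), df = 58)          # two-tailed p from the t-distribution, about 0.02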

You could replicate this by hand if you'd like. You'd have to use a table to look up your p-value, but this would only give you an approximation, because the table won't give you values for every possible t. Instead, you can replicate these exact results by:
  • Using an online t-test calculator 
  • Pulling the data into Excel and using the T.TEST function (whichever group is array 2, their mean will be subtracted from the mean of array 1, so keep in mind depending on how you assign groups that your mean difference might be negative; for tails, select 2, and for type, select 2)
  • Computing your t by hand then using the T.DIST.2T function to get your exact p (x is your t - don't ask me why they didn't just use t instead of x in the arguments; maybe because Excel was not created by or for statisticians)
(Note: If you get into conducting statistics, Excel is not a good tool, especially for more advanced stats. But for basic demonstrations like this, it's fine.)
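
And since the data came out of R in the first place, you could also skip Excel entirely and run the test there. A minimal sketch, assuming you've read the tab-delimited file into a data frame with columns named group and score (the file and column names are my placeholders):

    caffeine <- read.delim("caffeine.txt")                     # hypothetical file name
    t.test(score ~ group, data = caffeine, var.equal = TRUE)   # classic Student's t-test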

Bonus points if you do the t-test while drinking a beer (Guinness if you really want to be authentic).

T is for T-Test

And now the long-awaited post about the intersection of two things I love: statistics and beer. In fact, as I was working on this post Sunday evening, I was enjoying a Guinness:


I'll get to why I specifically chose Guinness in a moment. But first, let's revisit our old friend, the standard normal distribution:


This curve describes the properties of a normally distributed variable in the population. We can determine the exact proportion of scores that will fall within a certain area of the curve. The thing is, this guy describes population-level data very well, but not so much samples, even though a sample would be drawn from the population reflected in this curve. Think back to the post about population versus sample standard deviation; samples tend to have less variance than populations. The proportions in certain areas of the standard normal distribution are not just the proportion of people who fall in that range; they are also the probabilities that you will end up with a person falling within that range in your sample. So you have a very high probability of getting someone who falls in the middle, and a very low probability of getting someone who falls in one of the tails.

Your sample standard deviation is going to be an underestimate of the population standard deviation, so we apply the correction of N-1. The degree of underestimation is directly related to sample size - the bigger the sample, the better the estimate. So if you drew a normal distribution for your sample, it would look different depending on the sample size. As sample size increases, the distribution would look more and more like the standard normal distribution. But the areas under different parts of the curve (the probabilities of certain scores) would be different depending on sample size. So you need to use a different curve to determine your p-value depending on your sample size. If you use the standard normal distribution instead, your p-values won't be accurate.
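
You can see this for yourself in R; the test statistic of 2.0 below is just an arbitrary example value I picked for illustration:

    2 * pnorm(-2.0)          # standard normal: p is about 0.046
    2 * pt(-2.0, df = 4)     # t-distribution with only 4 degrees of freedom: p is roughly 0.12
    2 * pt(-2.0, df = 500)   # t-distribution with a big sample: nearly identical to the normal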

In the early 1900s, a chemist named William Sealy Gosset was working at the Guinness Brewing Company. Guinness frequently hired scientists and statisticians, and even allowed their technical staff to take sabbaticals to do research - it's like an academic department but with beer. Gosset was dealing with very small samples in his research on the chemical properties of barley, and he needed a statistic (and distribution) that would allow him to conduct statistical analyses with a very small number of cases (sometimes as few as 3). Population-level tests and distributions would not be well-suited for such small samples, so Gosset used his sabbatical to spend some time at University College London, developed the t-test and t-distribution, and published his results to share with the world. (You can read the paper here.)

Every person who has taken a statistics course has learned about the t-test, but very few know Gosset's name. Why? Because he published the paper under the pseudonym "Student," and to this day, the t-test is known as Student's t-test (and the corresponding curves as Student's t-distribution). There are many explanations for why he used a pseudonym, and unfortunately, I don't know which one is accurate. I had always heard the first one below, but as I did some digging, I found other stories:
  • Gosset feared people wouldn't respect a statistic created by a brewer, so he hid his identity
  • Guinness didn't allow its staff to publish
  • Guinness did allow staff to publish, but only under a pseudonym
  • Gosset didn't want competitors to know Guinness was using statistics to improve brewing
I'd like to show you a worked example, but since this post is getting long, I'm going to stop here. But I'll have a second post this afternoon showing a t-test in action (if you're into that kind of thing). Stay tuned!

Saturday, April 22, 2017

S is for Scatterplot

Visualizing your data is incredibly important. I talked previously about the importance of creating histograms of your interval/ratio variables to check the shape of your distribution. Today, I'm going to talk about another way to visualize data: the scatterplot.

Let's say you have two interval/ratio variables that you think are related to each other in some way. You might think they're simply correlated, or you might think that one causes the other one. You would first want to look at the relationship between the two variables. Why? Correlation assumes a linear relationship between variables, meaning a consistent positive (as one increases so does the other) or negative (as one increases the other decreases) relationship across all values. We wouldn't want it to be positive at first, and then flatten out before turning negative. (I mean, we might, if that's the kind of relationship we expect, but we would need to analyze our data with a different statistic - one that doesn't assume a linear relationship.)

So we create a scatterplot, which maps out each participant's pair of scores on the two variables we're interested in. In fact, you've probably done this before in math class, on a smaller scale.

As I discussed in yesterday's bonus post, I had 257 people respond to a rather long survey about how they use Facebook, and how that use impacts health outcomes. My participants completed a variety of measures, including measures of rumination, savoring, life satisfaction, Big Five personality traits, physical health complaints, and depression. There are many potential relationships that could exist between and among these concepts. For instance, people who ruminate more (fixate on negative events and feelings) also tend to be more depressed. In fact, here's a scatterplot created with those two variables from my study data:


And sure enough, these two variables are positively correlated with each other: r = 0.568. (Remember that r ranges from -1 to +1, and that 1 would indicate a perfect relationship. So we have a strong relationship here, but there are still other variables that explain part of the variance in rumination and/or depression.)
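
Making a plot like this and getting the correlation takes only a couple of lines of R. The data frame and column names below are placeholders, not the actual names in my survey file:

    plot(survey$rumination, survey$depression,
         xlab = "Rumination", ylab = "Depression")   # the scatterplot
    cor(survey$rumination, survey$depression)        # Pearson's r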

Savoring, on the other hand, is in some ways the opposite of rumination; it involves fixating on positive events and feelings. So we would expect these two to be negatively correlated with each other. And they are:


The correlation between these two variables is -0.351, so not as strong as the relationship between rumination and depression, and in the opposite direction.

Unfortunately, I couldn't find any variables in my study that had a nonlinear relationship to show (i.e., one with curves). But I could find two variables that were not correlated with each other: the Extraversion scale from the Big Five and physical health complaints. Unsurprisingly, being an extravert (or introvert) has nothing to do with health problems (r = -0.087; pretty close to 0):


But if you really want to see what a nonlinear relationship might look like, check out this post on the Dunning-Kruger effect; look at the relationship between actual performance and perceived ability.

As I said yesterday, r also comes with a p-value to tell you whether the relationship is stronger than we would expect by chance. We would usually report the exact p-value, but for some of these, the p-value is so small (a really small probability of occurring by chance) that the program doesn't display the whole thing. In those cases, we choose a really small value (the convention seems to be 0.001) and say p was less than that. Here are the r's and p-values for the 3 scatterplots above:

  1. Rumination and Depression, r = 0.568, p < 0.001
  2. Rumination and Savoring, r = -0.351, p < 0.001
  3. Extraversion and Health Complaints, r = -0.087, p = 0.164
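
In R, the cor.test function produces the r and the p-value (plus a confidence interval) in one go - again, with hypothetical variable names:

    cor.test(survey$rumination, survey$depression)   # r, the p-value, and a 95% confidence interval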

Friday, April 21, 2017

Bonus Post: Explained Variance and a Power Analysis in Action

In my beta post, I talked about power analysis, and how I've approached it if I don't have previous studies to guide me on what kind of effect I should expect. For instance, I referenced my study on Facebook use and health outcomes among college students. When I conducted the study (Fall 2011), there wasn't as much published research on Facebook effects. Instead, I identified the smallest effect I was interested in seeing - that is, the smallest effect that would be meaningful.

I used an analysis technique called multiple linear regression, which produces an equation to predict a single dependent variable. Multiple refers to the number of predictor variables being used to predict the dependent variable. And linear means that I expected a consistent positive or negative relationship between each predictor and the outcome. You probably remember working with linear equations in math class:

y = ax + b

where y is the variable you're predicting, a is the slope (how much y changes for each 1 unit change in x), and b is the constant (the value of y when x is 0). (You might have instead learned it as y = mx + b, but same thing.) That's what a regression equation looks like. When there's more than one predictor, you add in more "a*x" terms: y = a1x1 + a2x2 + ... + b.
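
In R, all of that is wrapped up in the lm function. The sketch below uses made-up variable names, not the actual measures from my study:

    fit <- lm(depression ~ rumination + savoring + gender + age, data = survey)
    summary(fit)   # the a's (slopes), the b (intercept), and R-squared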

When you conduct a regression, one piece of information you get is R-squared. This month, I've talked about how statistics is about explaining variance. Your goal is to move as much of the variance as you can from the error (unexplained) column into the systematic (explained) column. Since you know what the total variance is (because it's a descriptive statistic - something you can quantify), when you move some of the variance over to the explained column, you can figure out what proportion of the variance is explained. You just divide the amount of variance you could explain by the total variance. R-squared is that proportion - it is the proportion of variance in your dependent variable that can be explained by where people were on the predictor variable(s).

By the way, R-squared is based on correlation. For a single predictor variable, R-squared will be the squared correlation between x and y (that is, R² = r²). For multiple predictor variables, R-squared will be the squared correlation of all the x's with y, after the correlation between/among the x's is removed (the overlap between/among the predictors).
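
You can verify that single-predictor case yourself with a few lines of R and some simulated data:

    set.seed(123)                    # arbitrary seed
    x <- rnorm(100)
    y <- 0.5 * x + rnorm(100)        # y is partly driven by x, partly by noise
    cor(x, y)^2                      # squared Pearson correlation
    summary(lm(y ~ x))$r.squared     # same number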

My main predictor variable was how people used Facebook (to fixate on negative events or to celebrate positive events - so actually, there were two predictor variables). The outcomes were the health measures. The other predictor variables were control variables - other variables I thought would affect the outcomes beyond Facebook use; these included characteristics like gender, race, ethnicity, and so on.

For my power analysis prior to conducting my Facebook study, I examined how many people I would need to find an R-squared of 0.05 or greater (up to 0.50 - and I knew it was unlikely I'd find an R-squared that high). I also included the following assumptions when I conducted the power analysis: my alpha would be 0.05 (Type I error rate), my power would be at least 0.80 (so beta, the Type II error rate, would be 0.20 or less), and my control variables would explain about 0.25 of the variance. Using a program called PASS (Power Analysis and Sample Size), I was able to generate a table and a graph of target sample sizes for each R-squared from 0.05 to 0.50:


For the smallest expected R-squared (0.05), I would have needed 139 people in my study to have adequate power for an R-squared that small to be significant (unlikely to have occurred by chance). The curve flattens out around 0.25, where having a large R-squared doesn't really change how many people you need.
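
PASS is commercial software, but if you wanted to run a roughly similar calculation in R, the pwr package is one option. The sketch below is only a loose analogue of the analysis described above - the effect size is the increase in R-squared for the two Facebook predictors over the control variables, and converting the result to a sample size depends on the total number of predictors, which I haven't specified here:

    library(pwr)
    f2 <- 0.05 / (1 - (0.25 + 0.05))   # increment in R-squared over the unexplained variance
    pwr.f2.test(u = 2, f2 = f2, sig.level = 0.05, power = 0.80)
    # u = 2 tested predictors; the result's v is the error degrees of freedom,
    # and the needed sample size is roughly v + (total number of predictors) + 1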

So based on the power analysis, I knew I needed about 140 people. The survey was quite long, so we expected a lot of people to stop before they were finished; as a result, I adjusted this number up so that even if I had to drop a bunch of data because people didn't finish, I would still have at least 140 usable cases. Surprisingly, this wasn't an issue - we ended up with complete data for 257 participants, 251 of whom were Facebook users.

R is for r (Correlation)

You've probably heard the term "correlation" before. It's used to say that two things are related to each other. Two things can be correlated with each other but that says nothing about cause - one could cause the other OR another variable could cause both (also known as the "third variable problem" or "confound").

BTW, my favorite correlation-related cartoon:


There are different statistics that measure correlation, but the best known is Pearson's correlation coefficient, also known as r. This statistic, which is used when you have two interval or ratio variables, communicates a great deal of information:
  • Strength of the relationship: r ranges from -1 to +1; scores of +/-1 indicate a perfect relationship, while scores of 0 indicate no relationship
  • Direction of the relationship: positive values indicate a positive relationship, where as one variable increases so does the other; negative values indicate a negative or inverse relationship, where as one variable increases the other decreases
Just like the t-test I hinted at in my post on p-values, r also has a p-value to let you know if the relationship is significant (stronger than we would expect by chance alone). And as with any statistic we've talked about thus far, there's the potential for Type I error. We could, just by luck, get a significant correlation between two variables that actually have nothing to do with each other. Why? Because probability, that's why.

Here's a demonstration of that concept. I created 20 samples of 30 participants measured on two randomly generated continuous variables. Because these are randomly generated, they should not be significantly correlated other than by chance alone. I then computed correlation coefficients for each of these samples. If you recall from the alpha post, with an alpha of 0.05, we would expect about 1 of 20 to be significant just by chance. It could be more or less, because, well, probability. It's a 5% chance each time, just like you have a 50% chance of heads each time you flip a coin - you could still get 10 heads in a row. And you could figure out the probability of getting multiple significant results just by chance in the same way as you would multiple heads in a row: with joint probability.
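
Here's a sketch of that simulation in R (the seed is arbitrary; without it, your count will bounce around from run to run):

    set.seed(42)
    p_values <- replicate(20, cor.test(rnorm(30), rnorm(30))$p.value)  # 20 samples of n = 30
    sum(p_values < 0.05)   # how many correlations came out "significant" by chance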

The results? 3 were significant.


BTW, using joint probability, the chance that three particular tests would all come out significant just by chance is 0.05 * 0.05 * 0.05 = 0.000125, or 0.0125%. Small, but not 0.

Tomorrow I'll talk about how we visualize these relationships.

Thursday, April 20, 2017

Q is for Quota Sampling

I'm kind of cheating, because this is more of a methods topic than a statistics topic. But, as I've argued from atop my psychology pedagogy soapbox, the two are very much connected ("and should be taught as a combined course!" I shout from atop my... well, you get the idea). Your methods can introduce bias, increasing the probability of things like Type I error, and the methods you use can also impact what statistical analyses you can/should use.

Here's something you may not realize: nearly every statistical analysis you learn about in an introductory statistics course assumes random sampling, meaning the sample you used in the study had to be randomly selected from the population of interest. In other words, every person in the population you're interested in (who you want to generalize back to) should have an equal probability of being included in the study.

Here's something you probably do realize: many studies are conducted on college students, mainly students currently taking introductory psychology (and thus, mostly freshmen). Further, students are usually given access to a list of studies needing participants and they select the ones to participate in.

See the issue here? We analyze data using statistics meant for random sampling, on studies that used convenience sampling (i.e., not random). In fact, there's even some potential for selection bias since people choose which studies to participate in. There is much disagreement on whether this is a big deal or not. This is why I balk when people act as though statistics and research issues are clear-cut and unanimously agreed upon.

In fact, true random sampling is pretty much impossible. If your study requires people to come into the lab, you can't exactly recruit people at random from around the world, or even more narrowly, around the US. Survey research firms probably come the closest to true random sampling, but even then, there are limitations. Random digit dialing will miss people who don't have a phone (which, true, is very few people), and people who share a phone with others will have a different probability of being selected than people with their own line. If your population is narrower than, say, the entire US population, it might be a little more doable to have nearly random sampling, but there's also that pesky issue of consent. You can't force people to participate in your study unless you're the Census Bureau and can threaten them with legal action if they fail to comply. No matter what, you're going to have selection bias.

But fine, let's say we can actually have truly random sampling. We still might not end up with a sample that accurately represents the population. Why? Because probability. (For those playing along at home, that's been the answer to nearly every rhetorical question this month.)

Weird things can happen when you let something be random. Like 10 heads in a row, or snake eyes twice in a row, or a sample of 70% women from a population that is 50% women. Sometimes we have to give probability a hand, so we might stratify our sample, to ensure we have even representation for different characteristics. So if our population is 50% women, we would force our sample to be 50% women.

We select the characteristics that matter to us - usually things like gender, race, ethnicity, socioeconomic status, and so on, but it also depends on what you're studying - and draw our sample to ensure it has essentially the same proportions of these different characteristics as we see in the population. We call this stratified random sampling.
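
Here's a toy sketch of what that looks like in base R, assuming a data frame called population with a gender column (both names are mine, for illustration):

    rows_by_stratum <- split(seq_len(nrow(population)), population$gender)
    chosen <- unlist(lapply(rows_by_stratum, sample, size = 50))  # 50 random rows per stratum
    stratified_sample <- population[chosen, ]                     # assumes each stratum has at least 50 people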

So why is the title of this post quota sampling? As I said, many studies are conducted using convenience samples, especially when random sampling would be costly, time-consuming, and/or impossible. But it might still be important to us to have similar characteristics as the population. So we set quotas.

If I want to make sure my sample is 50% women, I would open up half my slots for women, and when I had as many women as I needed, I would close that portion of the study. Probably the easiest way to accomplish this is with a screening questionnaire or interview. Screening is done to exclude people who don't qualify for the study for some reason (e.g., they had 5 cups of coffee this morning), but it can also be used to enforce quotas. Quota sampling is the non-random counterpart to stratified random sampling.

So if you're using a convenience sample (and let's face it, most researchers are), but want it to mirror the characteristics of the population, use quota sampling.

Wednesday, April 19, 2017

P is for P-Value

Hopefully you're picking up on a recurring theme in these posts - that statistics is, by and large, about determining the likelihood that some outcome would happen by chance alone, and using that information to conclude whether something caused that outcome. If something is unlikely to occur by chance alone, we decide that it didn't occur by chance alone and the effect we saw has a systematic explanation.

We use measures like standard deviation to give us an idea of how much scores vary on their own, and we make assumptions (which we should confirm with histograms) about how the data are distributed (usually, we want them to be normally distributed). These pieces of information allow us to generate probabilities of different scores. When we conduct a statistical analysis, one of the pieces of output we get is the probability that we would see the effect we saw just by chance. That, my friends, is called a p-value. We compare our p-value to the alpha we set beforehand. If our alpha is 0.05, and our p-value is less than or equal to 0.05, we conclude there is a real difference/effect.

Let's use our caffeine study example once again. Say I conducted the study and found the following (note - M = mean, SD = standard deviation):

Experimental group: M = 83.2, SD = 6.1
Control group: M = 79.3, SD = 6.5

Let's also say there are 30 people in each group. This is all the information I need to conduct a simple statistical analysis, in this case a t-test, which I'll talk more about in the not-so-distant future. I conduct my t-test and obtain a p-value of 0.02. A difference in mean test performance this large (83.2 versus 79.3) has only a 2% chance of happening by chance alone. That's less than 0.05, so I would conclude there is a real difference here - caffeine helped the experimental group perform better than the control group.

But 2% isn't 0. The finding could still be just a fluke, and I could have just committed a Type I error. The only way to know for certain would be to replicate the study.