Statistical power analysis for EEG experiment(s)?

As EEG amateurs some of us like exploratory experiments and informal, heuristic research, and sometimes we make the results of these DIY investigations available to others as preliminary findings. However, occasionally a strange thing happens and other people actually look at the results from these informal experiments!

Sometimes, professionals (like Grant R. Vousden-Dishington) go so far as to suggest that we get serious and analyze the significance of our proposed home-brew experiments before we even do them.

At this point we might say: “Hey! We’re amateurs here, just trying to do some cool things and see what we find. We don’t need a Bagger 293-sized pile of statistical machinery to see whether our experiment is interesting or not. Leave that to the professionals!”

We could say that. But, since we’re starry-eyed swell-headed auto-didactic upstarts who like biting off things slightly bigger than we can chew, we don’t.

Instead, we say, “Sure! Let’s do a power analysis!”

But then there’s this problem: we are, in fact, amateurs. Which means that we haven’t ever done a power analysis, especially not on something as quantitatively murky as an EEG experiment. In other words, we need professional help.

Wikipedia and other sources spell out the basics of power analysis fairly clearly. But, there are certain peculiarities of the case we’re looking at (a future binaural beats experiment) that are not so clear. Specifically, we’re looking for your input on these topics:

  • In hypothesis-testing experiments, α (Type I, false-positive error) is conventionally kept at or below 0.05, and β (Type II, false-negative error) is considered acceptable up to 0.2. This assumes, though, that there is a baseline null hypothesis and a novel alternate hypothesis being tested. In the case of a question such as “do binaural beats entrain neural oscillations?”, a negative answer would in many ways be more controversial and unexpected than a positive one. (Saying nothing of subjective or behavioural/cognitive effects, only entrainment.) How should we determine appropriate α and β values for an experiment of this sort?

  • In our case, the hypothesis (assuming, notwithstanding the above, that H1 = entrainment from binaural beats) is that there will be an increase in the amplitude of a particular frequency corresponding to the beat frequency of the binaural audio. So we could probably say that the effect is equal to at - ac, where at and ac are the amplitudes at the target frequency for trials and controls, respectively. The power spectra of control and trial segments are averaged, so each data point represents an entire segment. In each session there will probably be only a small number of these segments (likely two each of control and trial segments), so over n sessions the total number of data points for controls or for trials will be 2*n. What are the appropriate data to use in calculating the standard deviation in this scenario?

  • Using spectrograms and power plots it’s pretty easy to get a visual sense of whether there is an effect or not — at least when there is no effect or, as in the previous case, a slight negative effect. If we did see an effect, or in order to complete a power analysis, we would need to determine if that effect qualified as significant in the practical sense (not in the statistical sense). Which brings us to our final question, for now: Is there a rigorous way to set the expected effect size? If not, any thoughts on coming to a reasonable effect size in this case?

Thanks for your help!




Very inspiring, loving it!

Hahaha, even professionals are eager to skip it, since recruiting subjects is always hard. Let’s see if we can do better! Power analysis for the win! … Even if it’s not my cup of tea… My contribution to this conversation will likely be limited, but I’ll try to brush up on my stats.

In the meantime, one thing I would love to see (and would definitely be interested in helping with the data processing) is replicate some of the findings in

Ross & Fujioka - Human cortical responses to slow and fast binaural beats reveal multiple mechanisms of binaural hearing

Interested?



Hey, it’s that guy that I am! Glad that I could offer a concrete suggestion, but I wouldn’t go so far as to call me a professional, not until I have a job doing these things at least.

Power analysis is one of those things everyone’s supposed to do, but you can never quite be sure who’s doing it the right way unless you’re a grand master of the stuff, which I’m certainly not. Even my grad course in biomedical statistics didn’t cover it well, and there aren’t many online resources that do a good job of explaining both the theory behind power analysis and the tools that implement it. That said, I think it’s always easier to do something wrong and get feedback on how to correct it than to have nothing at all and ask how to do something. So let’s see what we can get.

A basic power analysis involves four quantities:

  • Effect size, aka d
  • Sample size, aka n
  • Significance Level: P(Type-1 error), aka α
  • Power: 1 - P(Type-2 error), aka 1 - β

Two of these you’ve essentially mentioned already. I say four quantities are “involved” because this is actually a basic constraint-satisfaction problem: you enter any three of the values into your power analysis calculator and it solves for the fourth. For example, if you know the significance level, statistical power, and effect size you’re looking for, power analysis will tell you how big your sample needs to be.
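Since you already work in IPython notebooks, it may help to see the same solve-for-the-missing-quantity idea in Python. This is just an illustrative sketch (not from the original analysis), using the statsmodels package, whose TTestPower class handles the one-sample/paired t test:

```python
# Sketch: power analysis as a constraint-satisfaction problem.
# Supply any three of (effect size, alpha, power, n); solve for the fourth.
from statsmodels.stats.power import TTestPower

analysis = TTestPower()  # one-sample / paired t test

# Known: effect size d = 0.5, significance level 0.05, desired power 0.8.
# Unknown: the number of pairs (left out, so solve_power returns it).
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                alternative='two-sided')
print(n_needed)  # a float; round up to get the required number of pairs
```

Leaving out a different argument (say, `power`) makes solve_power return that quantity instead, which is exactly the "enter three, get the fourth" behavior described above.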

Research scientists may have fancy tools at their disposal, but fortunately, the tools for doing statistics are freely available. The statistics and plotting software du jour for many labs is [R][1], which can be called and used from a command line in much the same way as Python. People who prefer a GUI to go along with it tend to use [RStudio][2]. There’s even a working R kernel for Jupyter notebooks, [IRKernel][3], which has a fairly straightforward installation process. This format should be familiar to you, since you already use IPython notebooks, and you can even display the output of both languages in the same notebook, as long as you don’t delete the output and you switch to the correct kernel before running commands. For all the stuff below, I’m just going to be copying and pasting from my terminal.

The “pwr” package is a good basic starting point for doing power analysis and performs many of the functions described in [Cohen’s classic 1988 book][4], which popularized power analysis for the behavioral sciences. Once you have R up and running, you can install the package with one line and import it into your current workspace with another:

install.packages("pwr")
library(pwr)


The last thing you need to decide is what kind of statistical test is most appropriate for your design. From your post and the hypothesis you’ve put above, you’re essentially looking to see whether the mean amplitude at the entrained frequency - we’ll get to the question another time of whether entrainment at the beat frequency is a reasonable expectation - differs between the control condition and the binaural beat trial condition. Note that, of course, it isn’t the place of these statistical tests to tell you whether your control condition was correct - that’s a whole different area of experimental design.

For these parameters, a paired two-tailed t test seems to work just fine. The point of a t test is to look for a difference in the means of two sample populations. If you want to do something else, you might want to look into ANOVA, Pearson correlation, or one of the other tests available in the pwr package.

We use two-tailed because, presumably, you’ll consider it an effect if the binaural beats increase or decrease amplitude at that frequency in the EEG readout, and indeed the direction of the effect, if it exists, might differ by frequency band. We use the paired condition because your set-up is a type of “repeated measure” on subjects in and out of the experimental condition of interest. In other words, because each subject serves as his or her own control, the “noise” factors that influence the trial condition are probably also present in the control condition. In general, this increases the statistical power of your design vs. comparing two independent populations whose noise factors are more difficult to correct for.
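For concreteness, here is what that paired two-tailed t test looks like in code. This is only a sketch with made-up amplitude numbers (using scipy rather than R), not data from the experiment:

```python
# Sketch: a paired two-tailed t test on made-up amplitudes at the
# target frequency. Each subject contributes one control and one
# trial value, so ttest_rel pairs each subject with itself - the
# "repeated measures" design described above.
from scipy import stats

control = [17.1, 16.8, 17.4, 16.9]  # hypothetical control amplitudes
trial   = [19.0, 18.6, 19.3, 18.8]  # hypothetical binaural-beat amplitudes

t_stat, p_value = stats.ttest_rel(trial, control)
print(t_stat, p_value)  # positive t means the trial mean exceeds control
```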

With that, we’re ready to start crunching some numbers for your experiment. So why don’t we start off with some back-of-the-napkin work based on the information in your post. You have two subjects in your sample, you want a significance level of .05 and β of 0.2, meaning the desired power is 0.8. Let’s see what the effect size would need to be for us to get useful results from that.

pwr.t.test(n = 2, d = NULL, sig.level = .05, power = .8, alternative = "two.sided", type = "paired")
Error in uniroot(function(d) eval(p.body) - power, c(1e-07, 10)) : 
f() values at end points not of opposite sign

Well, crud. That error usually means the effect size pwr.t.test would have had to return fell outside the search range the function uses, so the solver failed. Let’s try a slightly lower power and see if that works better.

pwr.t.test(n = 2, d = NULL, sig.level = .05, power = .7, alternative = "two.sided", type = "paired")

 Paired t test power calculation 

          n = 2
          d = 9.340765
  sig.level = 0.05
      power = 0.7
alternative = two.sided

NOTE: n is number of *pairs*

Much better, now we have some real output. You’ll notice the function returned the values I entered in the first place, but the value you want to pay attention to is the effect size, d = 9.340765. In plain English, that means your experiment is underpowered unless the difference between your trial and control means is at least 9.34 standard deviations. I can’t really say for sure off the top of my head, but that sounds physiologically impossible.

Don’t worry about the NOTE at the end of the output, that’s just a reminder of how to interpret n here: it counts pairs of measurements, not individual measurements. In this case, each subject is his or her own partner, control vs. trial. I won’t paste that part in future copy-pastes.

But we already knew the study was underpowered. This function just puts a number on how much of an effect we’d need to see for this to work. Let’s crunch some numbers that are a little more interesting. The biggest difference in PSD/Hz in your post seems to be around 10 Hz, where the trial condition gets to about 19 and the control looks to be 17. If we assume this is a true effect, the raw difference is 19 - 17 = 2, which gives an effect size of d = 2 if the standard deviation is near 1. Given two subjects, how much power does the experiment have?

pwr.t.test(n = 2, d = 2, sig.level = .05, power = NULL, alternative = "two.sided", type = "paired")

 Paired t test power calculation 

          n = 2
          d = 2
  sig.level = 0.05
      power = 0.1757074
alternative = two.sided

Well, that doesn’t bode well. The output essentially means that if we have an effect size of 2, there’s only about a 17% chance we’d detect it with this set-up. All this really does is confirm your original description in the post that no significant effects were found, but it means that result could just be due to underpowering.
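As an aside, for a paired design the effect size d is just the mean of the within-subject differences divided by their standard deviation. A quick sketch with made-up numbers (not data from the post):

```python
# Sketch: Cohen's d for a paired design is
#   d = mean(trial - control differences) / sd(differences)
import statistics

control = [17.0, 16.5, 17.5, 17.0]  # hypothetical control amplitudes
trial   = [19.5, 17.5, 20.5, 18.5]  # hypothetical trial amplitudes

diffs = [t - c for t, c in zip(trial, control)]
d = statistics.mean(diffs) / statistics.stdev(diffs)
print(d)  # the standardized effect size you'd feed to pwr.t.test
```

This is also a partial answer to the standard-deviation question in the original post: for a paired test, the relevant spread is the spread of the per-segment differences, not of the raw amplitudes.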

Without reviewing the literature on binaural beats and other calming/relaxing stimuli, it’s hard to say just what effect size we should expect. Cohen, who popularized power analysis, suggested that effect sizes of 0.2, 0.5, and 0.8 correspond to low, medium, and high effects, respectively, which would make our possible effect size very high indeed - but Cohen was speaking in terms of the social sciences, which may not be appropriate here. Just out of curiosity, we can ask how many subjects we would need to run (sig. level = .05, power = 0.8) to reliably detect an effect size of 2, if it’s real:

pwr.t.test(n = NULL, d = 2, sig.level = .05, power = .8, alternative = "two.sided", type = "paired")

 Paired t test power calculation 

          n = 4.220731
          d = 2
  sig.level = 0.05
      power = 0.8
alternative = two.sided

Fancy that. Assuming a simple difference in mean value at the peak is what you’re after, you should be able to detect the effect pretty easily, if it exists, with just five subjects. Honestly, this is much more optimistic than I expected, and though I’ve seen consumer-grade tests with as few as 7 subjects, I suspect this might be due to my using an inappropriate effect size or the wrong statistical test. There’s a lot to be said about interesting ways to analyze time series and electrophysiological data like EEG, so there is almost certainly a better way to formulate a hypothesis about the effects of binaural beats that fits a more appropriate statistical test. Some of them probably rely less on sample size and would be easier to carry out - one of the weaknesses of null-hypothesis statistical tests is their reliance on sample size.
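If you want to sanity-check that sample-size number outside of R, the equivalent statsmodels call (my sketch, not part of the original analysis) should land on roughly the same value, since both solve the same noncentral-t problem:

```python
# Sketch: cross-check pwr.t.test(d = 2, sig.level = .05, power = .8,
# type = "paired", two-sided) with statsmodels' paired/one-sample solver.
from statsmodels.stats.power import TTestPower

n_pairs = TTestPower().solve_power(effect_size=2, alpha=0.05, power=0.8,
                                   alternative='two-sided')
print(n_pairs)  # roughly 4.2; round up to 5 subjects
```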

If you’re interested in doing more with the R pwr package, I suggest you start with the very brief materials from the [Open Science Framework][5] and read this [overview of power analysis in R.][6] Each uses a slightly different definition of effect size, so it’ll help you answer the question of which is most appropriate for you, and keep in mind that it’s inextricably linked to how your hypothesis is formulated. Heck, OSF offers free statistical and methodological consulting for experiments and projects. Maybe you can buzz them and see if they’d help you out if you have specific questions.

As a cherry on top, I’ve gone ahead and made a little graph. It shows statistical power on the x-axis and the required sample size on the y-axis, with each line representing a different effect size. It’ll give you an idea of how many samples you need for a given effect size and desired power. The long and short of it is that if the effect size is > 1, you probably don’t need more than 10 subjects for a well-powered study; anything smaller, and you probably need one to three dozen, depending on the actual effect size. It’s not much, but at least it’ll help you plan a bit. You can find the code for making the plot here:
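For anyone who wants to reproduce that kind of figure, here is a rough Python/matplotlib reconstruction of a power-curve plot (my sketch, not the original plotting code):

```python
# Sketch: required number of pairs vs. desired power, one curve per
# effect size, for a paired two-sided t test at alpha = 0.05.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this for interactive use
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestPower

analysis = TTestPower()
powers = np.linspace(0.5, 0.95, 40)

for d in (0.5, 1.0, 1.5, 2.0):
    # For each desired power, solve for the number of pairs needed.
    n = [analysis.solve_power(effect_size=d, alpha=0.05, power=p,
                              alternative='two-sided') for p in powers]
    plt.plot(powers, n, label=f"d = {d}")

plt.xlabel("Statistical power")
plt.ylabel("Pairs needed")
plt.legend()
plt.savefig("power_curves.png")
```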

Let me know if there’s any other way I can help. -Grant


@GrantRVD: WOW! That’s a fantastic post! You’ve gone well beyond the call of duty. This could be published as a very serviceable tutorial on power analysis which — unlike most of what I’ve found online — includes almost all the ingredients needed for the complete meal. Excellent!

We’ve got a bunch of background reading to do before we embark on the next binaural experiment (see this thread). In the mean time, I’m going to experiment with R and try to familiarize myself with the process that you’ve laid out so helpfully.

It’s encouraging that we may be able to get a well-powered test with a modest number of subjects.

Once again, thanks for the fantastic tutorial and examples.


Sebastien (replying out of sequence here): just reading the Ross & Fujioka paper now. They’re using MEG, so proper replication is out of our reach, but I think it would be easy and super interesting to try a beat frequency sweep, as they did, and see if the subjective experience matches their reports.

Also, it’s interesting to see how the response changes with beat frequency - once again the alpha and beta ranges show the lowest response, with the highest responses at 40 Hz and at low frequencies. I don’t know at all what the neurological relationship is between the activity the MEG is detecting and what an EEG would detect, but this pattern of frequency response fits quite well with the rough consensus of the binaural beat EEG experiments I ran across after following @wjcroft’s link from the other binaural experiment thread.

Thanks for the nice explanation. Can you please let me know how to select the effect size for a specific EEG dataset?