15 (Ex-ante) Power calculations for (Experimental) study design

Power is the ability to distinguish signal from noise. - Coppock

The ‘statistical power’ of an analysis is the probability that this analysis diagnoses that ‘an effect is present’ (or a parameter is nonzero). The power of an analysis can only be considered in light of (as a function of)

  • a particular true effect size (parameter magnitude),
  • or an effect size stated as relative to the underlying dispersion,
  • or as ‘the probability of achieving a certain desired precision.’*

Thus one often sees the power of an analysis ‘plotted’ against particular effect sizes (sometimes ‘alternative hypotheses’).

* E.g., we may see power calculated as a function of a measure of ‘effect relative to variation’ … like Cohen’s \(d\) of … effect size/SD.

Considering standard frequentist null hypothesis testing, ‘the power of a test’ (or analysis) represents one minus the probability of a type-II error (approximately ‘one minus the false-negative rate’).
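As a rough illustration of power ‘plotted’ against effect sizes, the sketch below uses base R's `power.t.test()` for a hypothetical two-arm design. The sample size (100 per arm), the grid of standardised effect sizes, and the 5% significance level are assumptions chosen only for illustration.

```r
# Hypothetical design: two-arm comparison, 100 units per arm, two-sided alpha = 0.05
effect_sizes <- seq(0.1, 0.8, by = 0.1)   # Cohen's d (effect size / SD)

# Power of a two-sample t-test at each assumed true effect size
power_by_d <- sapply(effect_sizes, function(d) {
  power.t.test(n = 100, delta = d, sd = 1, sig.level = 0.05)$power
})

power_curve <- data.frame(d = effect_sizes, power = round(power_by_d, 3))
print(power_curve)

# plot(power_curve$d, power_curve$power, type = "b") would draw the power curve
```

Each row is one minus the type-II error rate at that assumed effect size; the same design has very different power against different alternatives.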

Todo: an aside here about power calculations in a Bayesian context.

A basic primer: Egap - 10 things to know about statistical power

15.1 What is the point of doing a ‘power analysis’ or ‘power calculations?’

There are several reasons to consider power and to do ‘power calculations’ in advance of running an experiment or doing an analysis.

An ‘underpowered study’ is one with a low likelihood of diagnosing that a ‘substantial’ effect is present when it is in fact present. Such a study is also likely to be an uninformative study. Furthermore, if ‘rejecting a particular null hypothesis’ is important**, an underpowered test is unlikely to reject this null hypothesis even when it is ‘substantially and meaningfully false.’

** But please see McElreath’s and others’ discussions of the follies of the ways NHST is used in science on this point.

An underpowered study is likely to yield wide ‘confidence intervals’ (or wide ‘posterior Bayesian credible intervals’). Simply put, after an underpowered study (at least in isolation), we still won’t have a good sense of what the true value (of the parameter or effect of interest) is.

All else equal, you want to run a study that is as high-powered as possible (or a study that is part of a larger project that collectively yields substantial power) because:

  1. you want your study to be informative and to contribute to science (empirical analysis) and

  2. ‘low powered studies’ (at least in certain publication contexts) can potentially harm the accuracy of our scientific consensus.

Other than perhaps being ‘too costly,’ is it bad for a study to be ‘overpowered?’ I argue that this is not a problem.

15.1.1 What are the practical benefits of doing a power analysis

A power analysis may allow you to:

  • consider the cost/benefits of ‘more data’, to help you determine ‘how much to collect,’

  • consider and optimise over the tradeoffs in design choices (e.g., introducing more treatments usually involves a loss of power), and

  • understand whether you ‘have enough funds to gather enough data to make it worth doing a study?’

  • (maybe) be more credible in making ex-post statements about null effects.*

Furthermore, if you are trying to do a replication exercise to diagnose the credibility of previous work you want to be able to claim ‘I have power to do this credibly.’*

* I’m not sure about these latter points; this needs to be stated more carefully.

In general, doing power analyses and ensuring sufficient power are good for science, avoiding the harm from ‘underpowered studies.’ There are arguments that individual ‘underpowered’ studies may undermine science, particularly in conjunction with ‘publication bias.’

15.2 Key ingredients for doing a power analysis (and designing an experimental study in light of this)

  1. Our assumptions (or existing prior data that we can use for simulations) over the data-generating process. In particular, ‘what do we expect the distribution of the outcomes to look like’ (e.g., ‘normally distributed’), and ‘with what dispersion?’*

However, a measure of dispersion is not necessary for all power calculations. As noted throughout, we can calculate the power to detect an effect of a particular size relative to the dispersion, or to attain a particular confidence interval over such an effect.

Also note that for binary outcomes the choice of a ‘distribution function’ is obvious and ‘dispersion’ only depends on the share of units with each outcome.

  2. The nature of the sampling or assignment to experimental treatments *, **

* E.g., "complete random assignment to a single treatment and a single control, each with probability 1/2".

** Remember, here we are focusing on power calculations in experimental contexts. However, power calculations are also relevant in other empirical contexts, particularly where data-collection is costly.

  3. The specific proposed statistical test (or procedure) to be used (e.g., a t-test or a rank-sum test)

and perhaps the most important and debated ingredient:

  4. Which ‘metrics of power and effect size’ are we considering, and what are our targets?

  • Are we simply seeking ‘precise estimates,’ estimates with small confidence/credible intervals? If so, how precise, and how do we measure this precision?

  • Do we seek power to detect some ‘minimal effect size of interest?’

If so, what is this ‘MESOI?’

Note that there is (often? always?) a mathematical equivalency between the confidence interval and the standard ‘power to detect X’ criteria.

Another criterion might be “power to do an ‘equivalency test’”… but I need to learn more about this.

Considering: ‘should I use a function of previous estimated effect sizes to determine the MESOI?’

From David Moss (unfold)


… not basing power calculations on previously observed effect sizes?

Lakens uses the SESOI approach, which we often do, but the SESOI can be specified based on the effect sizes previously found in the literature, though obviously there are a bunch of ways to do it.

Moss, citing Lakens: … use earlier work to decide which effect sizes are deemed to be ‘meaningful,’ with particular specific recommendations:

Subjective justification of a smallest effect size of interest … Second, the SESOI can be based on related studies in the literature. Ideally, researchers who publish novel research would always specify their SESOI, but this is not yet common practice. It is thus up to researchers who build on earlier work to decide which effect size is too small to be meaningful when they examine the same hypothesis. Simonsohn (2015) recently proposed setting the SESOI as the effect size that an earlier study would have had 33% power to detect.

With this small-telescopes approach, the equivalence bounds are thus primarily based on the sample size in the original study. For example, consider a study in which 100 participants answered a question, and the results were analyzed with a one-sample t test. A two-sided test with an alpha of .05 would have had 33% power to detect an effect of d = 0.15. Another example of how previous research can be used to determine the SESOI can be found in Kordsmeyer and Penke (2017), who based the SESOI on the mean of effect sizes reported in the literature. Thus, in their replication study, they tested whether they could reject effects at least as extreme as the average reported in the literature. Given random variation and bias in the literature, a more conservative approach could be to use the lower end of a confidence interval around the meta-analytic estimate of the effect size (cf. Perugini, Gallucci, & Costantini, 2014).

Another justifiable option when choosing the SESOI on the basis of earlier work is to use the smallest observed effect size that could have been statistically significant in a previous study. In other words, the researcher decides that effects that could not have yielded a p less than \(\alpha\) in an original study will not be considered meaningful in the replication study either, even if those effects are found to be statistically significant in the replication study. The assumption here is that the original authors were interested in observing a significant effect, and thus were not interested in observed effect sizes that could not have yielded a significant result. It might be likely that the original authors did not consider which effect sizes their study had good statistical power to detect, or that they were interested in smaller effects but gambled on observing an especially large effect in the sample purely as a result of random variation. Even then, when building on earlier research that does not specify a SESOI, a justifiable starting point might be to set the SESOI to the largest effect size that, when observed in the original study, would not have been statistically significant.
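The small-telescopes example in the quote above (100 participants, one-sample t test, two-sided alpha of .05) can be checked with base R's `power.t.test()`, solving for the standardised effect that such a study would have had 33% power to detect. This is only a sketch of that single calculation, with the SD normalised to 1 so that `delta` reads as Cohen's d.

```r
# Solve for the effect size (in SD units) that an original study with
# n = 100 and a one-sample, two-sided t-test would detect with 33% power
small_telescope <- power.t.test(
  n = 100, power = 1 / 3, sd = 1, sig.level = 0.05,
  type = "one.sample", alternative = "two.sided"
)

d33 <- small_telescope$delta
print(round(d33, 2))
```

The result should land close to the d = 0.15 quoted above; small differences can arise from rounding and from how the power function is approximated.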

DR response: Why do we assume previous authors considered MESOI?

I’m missing the logic in the quotes above as to “why the previously detected effects, or some bounds on these, should represent the minimum effect size of interest.”

Perhaps there is some justification in “assuming that previous authors have powered their study correctly to detect such a minimum effect.” But to me this just seems like kicking the can down the road, and I do not assume this in general. We know that people run underpowered studies all the time (see the discussion of the ‘harm to science’ in section 15.3).

DR: A second reason why one might see that as the “minimum effect size of interest” simply has to do with being able to publish a paper that can in some way “refute” previously claimed findings.

But that is flawed, in my view, as a way of doing science. We should power the study that is most informative, either by itself, or when made part of a meta-analysis. I don’t see the value of this adversarial back-and-forth approach.

DR preferred approach - power a study based on policy concerns, also considering its use in meta-analysis.

One basic argument is: I want to power a study as a practical goal based on my policy concerns. Typically the value of the study will depend on how precisely you are able to estimate and “bound” a parameter. (This may be expressed as a confidence interval, or as a credible interval if we are thinking of a Bayesian posterior.)

So, in determining how much power I wish to achieve, I need to weigh the benefits of this precision against the cost of a larger sample size.
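To make the precision-versus-cost tradeoff concrete, here is a minimal sketch of how the half-width of a 95% confidence interval for a difference in means shrinks with sample size. It assumes a two-arm design with equal arms, a common outcome SD of 1, and a pooled-variance t interval; the function name `ci_half_width` and the grid of sample sizes are hypothetical.

```r
# Expected half-width of a confidence interval for a difference in means,
# assuming equal arm sizes and a known common SD (here in SD units)
ci_half_width <- function(n_per_arm, sd = 1, conf_level = 0.95) {
  df <- 2 * n_per_arm - 2
  qt(1 - (1 - conf_level) / 2, df) * sd * sqrt(2 / n_per_arm)
}

n_grid <- c(50, 100, 200, 400, 800)
precision <- data.frame(
  n_per_arm = n_grid,
  half_width = round(ci_half_width(n_grid), 3)
)
print(precision)
```

Because the half-width falls roughly with the square root of the sample size, halving the interval requires about four times the data, which is the cost side of the tradeoff described above.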

This is a very different consideration from “what power do I have to detect that a previously estimated effect size, if true, is ‘statistically significant’” (by the standard definition).

Of course “We should power” is subject to constraints and cost concerns.

So, when I am designing a study, I am almost never able to have the power I want for all possible tests/hypotheses. These considerations come into play in deciding whether to run the study now or wait to get more funding, and in considering which hypotheses to test, which treatments to put in the study, et cetera.

15.3 The ‘harm to science’ from running underpowered studies

"One worries about underpowered tests. Your result (may have) relatively large effect sizes that are still insignificant, which makes me wonder whether it has low power. Low powered studies undermine the reliability of our results." - From an anonymous referee report

Perhaps most of us consider power largely in thinking about

  1. “Is our analysis going to be fruitful for ourselves as researchers?”

and perhaps also, where we find a null result…

  2. “Is the analysis powerful enough to plausibly rule out an effect of a meaningful size?”

The conventional wisdom has been that, at least for papers reporting non-null effects, running a low-power study is mostly done at the authors’ own peril. We might think “if I am lucky enough to observe a strong effect in a low-powered study then I have managed to mine a vein of truth on a relatively unproductive plot, and have thus earned my reward.”

However, Button et al. (2013) point out that running lower-powered studies reduces the positive predictive value (the probability that a “positive” research finding reflects a true effect) of a typical study reported to find a statistically significant result.

In combination with publication bias, this could lead to a large rate of type-I error in our body of scientific knowledge (false-positive cases, where the true effect was null and the authors had a very “lucky” draw). True non-null effects will be underrepresented, as underpowered tests will too often fail to detect (and publish) these. Furthermore, in both cases (true null, true non-null), underpowered tests will be far more likely to find a significant result when they have a random draw that estimates an effect size substantially larger than the true effect size. Thus, the published evidence base will tend to overstate the size of effects.
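The positive predictive value argument can be illustrated with a small calculation. This sketch assumes a simple setup in which a share `prior` of tested hypotheses are truly non-null, there is no bias, and tests use a 5% significance level; the function name `ppv` and the 10% prior are hypothetical choices for illustration.

```r
# Positive predictive value: the share of 'significant' findings that reflect
# a true effect, given power, the share of true effects among tested
# hypotheses (prior), and the type-I error rate (alpha)
ppv <- function(power, prior = 0.1, alpha = 0.05) {
  true_positives <- power * prior
  false_positives <- alpha * (1 - prior)
  true_positives / (true_positives + false_positives)
}

power_levels <- c(0.2, 0.5, 0.8)
ppv_table <- data.frame(power = power_levels, ppv = round(ppv(power_levels), 3))
print(ppv_table)
```

Under these assumptions, lowering power while holding the false-positive rate fixed lowers the share of significant results that are true, which is the mechanism described above.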

DR: However, I speculate that this idea might be less clear-cut than it seems. E.g., if we consider a (“non-sparse”) world where every factor indeed has an effect, lower powered studies are more likely to detect effects that are truly larger, which are arguably more policy-relevant; moreover, overstated effect sizes might be adjusted with a standard correction.

Ferraro discussion on magnitude error due to underpowered studies:

… if you are looking at an under-powered design then sure, you might pick up a significant result which is actually spurious. But on top of that even if there is a genuine effect there, the effect that you actually pick up as being significant will (likely) be overestimated. The intuition behind that result is (I think) that for an effect to be picked up in a study then it has to be large enough to overcome the issue that you face with power. Low-powered studies can only detect really large effects, and so the large effect you pick up in such a study could be genuine, but it equally could be a poorly-estimated coefficient. By using a low powered study you sift through for these kind of effects.
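The overestimation described in the quote above can be seen in a short simulation. This is a sketch under assumed values: a true standardised effect of 0.2, normally distributed outcomes, a Welch t-test, and two hypothetical sample sizes (20 and 400 per arm). The function name `sim_exaggeration` is made up for this example.

```r
set.seed(1)

# For a given sample size, simulate many studies of a true effect and report
# (a) power and (b) the average size of the *significant* estimates relative
# to the true effect (an 'exaggeration ratio')
sim_exaggeration <- function(n_per_arm, true_d = 0.2, reps = 2000, alpha = 0.05) {
  est <- numeric(reps)
  sig <- logical(reps)
  for (i in seq_len(reps)) {
    y0 <- rnorm(n_per_arm)
    y1 <- rnorm(n_per_arm, mean = true_d)
    est[i] <- mean(y1) - mean(y0)
    sig[i] <- t.test(y1, y0)$p.value < alpha
  }
  c(power = mean(sig), exaggeration = mean(abs(est[sig])) / true_d)
}

low <- sim_exaggeration(20)    # underpowered design
high <- sim_exaggeration(400)  # well-powered design
print(round(rbind(low, high), 2))
```

In the low-powered design, only draws with unusually large estimates clear the significance threshold, so the significant estimates overstate the true effect by much more than in the well-powered design.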

15.4 Power calculations without real data

The vignette for the R package ‘paramtest’ is helpful here.
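As a minimal base-R sketch of a power calculation without real data (not using the package mentioned above), we can simulate outcomes under parametric assumptions and count how often the test rejects. The normal outcomes, the effect of 0.3 SD, and the sample size of 100 per arm are all assumptions for illustration; `sim_power` is a hypothetical helper name.

```r
set.seed(42)

# Simulated power: the share of simulated experiments in which a Welch t-test
# rejects at level alpha, under an assumed data-generating process
sim_power <- function(n_per_arm, effect = 0.3, sd = 1, reps = 2000, alpha = 0.05) {
  p_values <- replicate(reps, {
    control <- rnorm(n_per_arm, mean = 0, sd = sd)
    treated <- rnorm(n_per_arm, mean = effect, sd = sd)
    t.test(treated, control)$p.value
  })
  mean(p_values < alpha)
}

simulated <- sim_power(100)
analytic <- power.t.test(n = 100, delta = 0.3, sd = 1)$power
print(round(c(simulated = simulated, analytic = analytic), 3))
```

The simulated and analytic values should agree up to simulation noise; the advantage of simulation is that the data-generating process and the test can be swapped for ones with no closed-form power formula.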

15.5 Power calculations using prior data

Adapt example in ‘scopingwork.Rmd’ to this

15.5.1 From Reinstein upcoming experiment preregistration

We are searching for a design and sample size that has sufficient power to detect (or ‘statistically rule out’) an effect of ‘minimal interest’ size, given our somewhat-limited budget. The ‘design parameters’ we can play with are given above.

While conventional practice seems to involve completely simulated data based on parametric assumptions (normality, etc) we prefer to draw from comparable ‘untreated’ real-world data (see (Barrios 2014) for a related discussion). Assuming the general distribution of outcomes (and covariates) is in some sense constant or predictable over time (perhaps stationarity?), this should give us more accurate estimates of power.

We will consider each design’s power to detect particular ‘treatment effects’ (of a minimum relevant size) on particular outcomes, which may be linear, proportional, or otherwise. Our calculations do not depend on any assumptions over the ‘true treatment effect.’

Assignment procedures to consider

We consider three categories of possible assignment criteria: (We are coding these below.)

1. Simple data-based

(Coded here)

Here we imagine a very simple dynamic assignment to treatments with alternation (or repeated draws from an urn with one ball per treatment, refilling the urn once empty). This procedure will essentially guarantee an equal share of observations in each treatment.
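A minimal sketch of the urn procedure described above, assuming units arrive one at a time and are assigned in arrival order. The function name `urn_assign` and the three treatment labels are hypothetical.

```r
# Urn assignment: draw without replacement from an urn holding one ball per
# treatment; refill the urn whenever it is empty
urn_assign <- function(n, treatments = c("control", "treat")) {
  k <- length(treatments)
  n_urns <- ceiling(n / k)
  # each urn is a random permutation of the treatments
  draws <- unlist(lapply(seq_len(n_urns), function(i) sample(treatments)))
  draws[seq_len(n)]
}

set.seed(7)
assignment <- urn_assign(11, c("control", "low", "high"))
print(table(assignment))
```

At any stopping point the treatment counts differ by at most one, which is the sense in which the procedure essentially guarantees equal shares.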

We will also consider an unbalanced design, both in this and in other categories, which may achieve greater power, especially considering the differential costs of our treatments.

We have no evidence on the treatments, and thus no reason to anticipate a differential variance between treatments and control. Nonetheless, an unbalanced design may allow greater power for the same cost, as observing controls is costless and the ‘low-donation’ treatment is lower cost.

2. Ad-hoc (Reinstein adapts Barrios), using prediction.Rmd quantiles

General summary: Fit a predictive model of the outcome (total donations) based on pre-treatment observables, using set-aside training data. Generate quantiles of ‘predicted donation’ (tuning parameter=number of quantiles?). Power-test block randomisation with these blocks as quantiles, using set-aside testing data.

Caveats: If we test the power with multiple models on the test data (e.g., ‘tuning’ the number of quantiles) we will be overly optimistic and maybe overfitting.

Also, this assignment procedure is not necessarily robust to TE heterogeneity. By luck, it may assign substantial imbalance across any particular dimension.

Prediction algorithm (folded)
  • Adapt from code in CharitySubstitutionExperiment repo, in assignments_power.Rmd and analysis_subst.Rmd; examples of Elastic net etc
  1. Define and organize the set of variables available at intervention

  2. Define and calculate the outcome variables (total amount raised)

  3. Split and set-aside validation and simulation data (?within prediction also)

  4. Model the outcome, using a Ridge regression with all features. The regularization/penalty parameter could be optimized for best fit. (Cross-fold).
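A minimal sketch of the predict-then-block idea on simulated data. Note that it swaps in ordinary least squares (`lm`) for the ridge regression mentioned above, purely to stay within base R; the covariates `x1` and `x2`, the data-generating process, and the choice of four blocks are hypothetical.

```r
set.seed(3)

# Hypothetical pre-treatment data: two observables and a donation outcome
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$sum_don <- 50 + 10 * dat$x1 + 5 * dat$x2 + rnorm(n, sd = 10)

# Set aside half of the data; fit the predictive model on the training half only
train_rows <- sample(n, n / 2)
train <- dat[train_rows, ]
holdout <- dat[-train_rows, ]
fit <- lm(sum_don ~ x1 + x2, data = train)   # OLS stand-in for ridge

# Predict the outcome in the set-aside data and cut predictions into quantile blocks
holdout$pred <- predict(fit, newdata = holdout)
n_blocks <- 4                                 # tuning parameter: number of quantiles
breaks <- quantile(holdout$pred, probs = seq(0, 1, length.out = n_blocks + 1))
holdout$block <- cut(holdout$pred, breaks = breaks, include.lowest = TRUE, labels = FALSE)

print(table(holdout$block))
```

The resulting `block` column is what a block-randomisation function would take as its blocking variable.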

Blocking process (folded)
  5. We can test block randomization by the predicted quantile of this model with a bootstrapped procedure using set-aside ‘testing’ data not used in the above regression. We want to run the simulations until we find the optimal “block width” or quantile to use for blocking the randomization. Ideally, this procedure should take into account the fact that if we stopped at a random time we may have uneven cell sizes.

(Caveat: this resampling may need to be done based on a randomised ‘start time’ to address random time-specific effects.) Note: Because of this and for feasibility, we may abridge step 5, and just try out a few reasonably large block widths (e.g., quartiles).

Probably the way to do this is, for each proposed block width (e.g., quartiles, deciles, 15 bins, etc) we draw random samples from the set-aside data according to this procedure, and estimate an “effect size” for each sample. We then consider the 99% bounds of these simulated effect sizes. The block width (and regularization parameter?) that consistently gives us the tightest bounds should be the one that allows us the greatest power to rule out an effect of a certain size, given the null hypothesis of no treatment effect.

The intuition: The smaller the largest difference in mean total donations (between treatment and control) that occurs by chance in 99% of draws… the smaller the actual effect that we will be powered to detect (able to be judged as statistically significant a reasonable share of the time). [Note: these latter notes may not go with the procedure proposed below; these are older.]
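The placebo-bounds intuition above can be sketched as follows: for each candidate block width, repeatedly assign a placebo ‘treatment’ within blocks (so the true effect is zero), record the difference in means, and take the 99% bound of its absolute value. The data here are simulated and all names (`placebo_bound`, `pred`, `block_widths`) are hypothetical; this is a simplified stand-in that re-randomises a fixed sample rather than resampling from set-aside data.

```r
set.seed(11)
n <- 400
pred <- rnorm(n)                          # hypothetical predicted outcome
y <- 50 + 10 * pred + rnorm(n, sd = 5)    # hypothetical realised outcome

placebo_bound <- function(y, pred, n_blocks, reps = 1000, level = 0.99) {
  # Equal-sized blocks by rank of the prediction (n_blocks = 1 means no blocking)
  block <- ceiling(rank(pred, ties.method = "first") * n_blocks / length(pred))
  diffs <- replicate(reps, {
    # Within each block, randomly label half the units as placebo-treated
    treat <- ave(seq_along(y), block, FUN = function(i) {
      sample(rep(c(0, 1), length.out = length(i)))
    })
    mean(y[treat == 1]) - mean(y[treat == 0])
  })
  quantile(abs(diffs), probs = level, names = FALSE)
}

block_widths <- c(none = 1, quartiles = 4, deciles = 10)
bounds <- sapply(block_widths, function(k) placebo_bound(y, pred, k))
print(round(bounds, 2))
```

When the prediction is informative, blocking on it tightens the bound on chance differences, and a tighter bound corresponds to a smaller detectable effect.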

Kasy method?

We may or may not get to considering the method proposed in Kasy (2016). It is ideal for Bayesian approaches to policy, but it may make frequentist inference difficult (?). (See steps/notes folded below.)

See kasy_2016_dont_randomise.md; kasy_dynamic.md may also be relevant.

  • Prepare concise data set (csv?) to throw into his app, choose baseline covariates
  • Estimator: Difference of means or Bayes
  • Prior: Squared exponential or Linear?
  • Re-randomization draws (default=1000); Expected R-sq (default=0.7)

Notes:

  • Stratify on ‘discrete strata’
  • Conservative: difference in means without controls or interactions
  • More reasonable, fully general: Make estimator in Power calc regression with strata dummies and interactions with treatment, usual Robust standard errors
  • But Kasy’s technique makes frequentist inference difficult (Bayesian OK)

See Guidance/code:



Treatment assignment functions, used in power calculations

  1. simple_assign: assign treatment dummy to first t-share of n rows
simple_assign <- function(df, tshare=0.5, blockvar="NA") { #note: no blocking here; `blockvar` is accepted but unused, just to keep a homogeneous interface
  #note: the data must be randomised first! Rows up to the t-share cutoff get d_t = TRUE
  dplyr::mutate(df, d_t = dplyr::row_number() <= (dplyr::n() * tshare))
}

Data-based power calculation: create simulation function

library(dplyr) # sample_n(), mutate(), select(), do(), %>%
library(broom) # tidy()

power_data <- function(ds, reps, yvar, N, tshare=0.5, linTE=0, propTE=0, alpha=0.05, test_nm=wilcox.test, f_assign=simple_assign, bv) {

  results <- sapply(1:reps, function(r) { #TODO - replace with purrr::map (Toby)
#1. Sample size N from data for each iteration
    exp_sample <- sample_n(ds, size = N, replace = TRUE)

#2. Selection of control and treatment group using function `f_assign`
    exp_sample <- exp_sample %>%
      f_assign(tshare=tshare, blockvar=bv) %>%

#3. Add treatment effects (`propTE` and `linTE`) to the treatment group
      mutate({{yvar}} := ifelse(
        d_t == "TRUE", {{yvar}} * (1 + propTE) + linTE, {{yvar}})) %>%
#TODO: Do this for several y-variables in each run; 'map' these?
      mutate(yv = {{yvar}}) %>% #copy to a fixed name so the test formula below can refer to it #TODO-fix

#4. Run chosen test `test_nm`, output p-value
      dplyr::select(yv, d_t)
    test <- exp_sample %>%
      do(tidy(test_nm(yv ~ d_t, data = .))) #unpaired is the default; recent R versions reject `paired` in the formula interface
    test$p.value #p-value for this iteration
  })

#5. Output share of p-values below alpha
  sum(results < alpha) / reps
}
  2. block_1d_assign: assign treatment dummy to first t-share within each (pre-calculated) one-dimensional block group (blockvar)

Note: I am using ‘randomizr’ here to assign blocks

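block_1d_assign <- function(df, tshare=0.5, blockvar) {
  #The call below was lost in extraction; randomizr::block_ra() is assumed, given the note above
  block_rand <- tibble::as_tibble(randomizr::block_ra(
    blocks = df[[blockvar]], conditions = c("control","treat"), prob_each = c(1-tshare, tshare)
  ))
  dplyr::bind_cols(df, block_rand) %>%
    dplyr::rename(d_t = value) %>%
    dplyr::mutate(d_t = d_t == "treat") #assumed final step: logical dummy, matching simple_assign
}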

Power calculation (just simple examples): adapt to built-in data

#pwr_n400_L50_p15 <- power_data(ds=df,reps=100,yvar=sum_don,N=400,tshare=0.5,linTE=50,propTE=0.15,alpha=0.05, f_assign=simple_assign)

Loop and plot over…

linTE.try <- c(0,50,100)
propTE.try <- seq(from=0.05, to = 0.2, by = 0.05)

outcomes.try <- c("sum_don","count_don")
tests.try <- c(t.test, wilcox.test)

Testing equicost pairings, determine necessary cost

With only control and treatment we have

\[cost = multiplier \times avgcost \times N \times tshare\] \[\rightarrow N = cost/\big(multiplier \times avgcost \times tshare\big)\]

Setting a 2x multiplier (for the ‘large’ treatment) and avgcost=£30 (hard coded for now), and imagining a £6000 initial budget, yields

\[\rightarrow N = 6000/\big(60 \times tshare\big) \]

cost <- 6000 #make this an entry in the 'design_params' list
multip <- 2
avdon <- 30

tshare.try <- seq(from=0.1, to=0.5, by=0.1)
sample.try <- seq(from=400, to=1200, by=200)
equi_sample <- cost/(avdon*multip*tshare.try)

# now maybe make a vector of tshare.try and equi_sample, for iterating over
# I assume we can consider the 'optimal tradeoff' and this will be invariant to the cost; am I right?

For ‘total x’ outcome variable

(Some sample code below, needs discussion)

#Unvarying parameters up here:
linTE <- 0

#power_vals: tibble to collect parameters and results #TODO: faster to generate a list?
power_vals <- tibble(
  n_try = numeric(), prop_te = numeric(), lin_te = numeric(), test = character(), power_est = numeric()
) #empty, typed columns matching the add_row() call below

#tic() #timer

#for (o in seq_along(outcomes.try)) { #TODO: map instead of loops. See R4ds 21.7 'mapping over multiple arguments'

  #OUTCOME <- outcomes.try[[o]] #TODO: may be faster to generate all outcomes for each sample ; also, I haven't got the syntax to work

  df_x <- df #%>%
    #select(OUTCOME) #Minimal data set to speed it up; (seems to save about 50% of the time)

  #loop over tests
  for (t in seq_along(tests.try)) {
    TEST <- tests.try[[t]] #extract the function itself, not a length-one list
    TEST_NM <- c("t.test", "wilcox.test")[t] #labels in the same order as tests.try

    #Loop over proportional TE
    for (p in seq_along(propTE.try)) {
      PT <- propTE.try[[p]]

      #Loop over sample sizes
      for (s in seq_along(sample.try)) {
        N <- sample.try[s]
        PW <- power_data(
            ds = df_x, reps = 80, yvar = sum_don, N = N, tshare = 0.5, linTE = linTE, propTE = PT, alpha = 0.05, test_nm = TEST
        )
        power_vals <- add_row(
            power_vals, n_try = N, prop_te = PT, lin_te = linTE, test = TEST_NM, power_est = PW
          ) ##TODO - more efficient to save the results in a list and combine it into a single vector or dataframe at end (see r4ds 21.3.3)
      }
    }
  }



15.6 Digression: Power calculations/optimal sample size for ‘lift’ in a ranking case

We want to know what the ‘best title for our new movie’ is. Twenty titles have been suggested. We have funds to do a survey of a relevant representative audience.

We need to decide on a general experimental design, a statistical analysis, and on sample sizes considering power (or perhaps ‘lift’).

Note that although we are mainly framing this in terms of statistical inference, it might also/instead be considered a ‘reinforcement learning’ problem.*

* See Max Kasy’s slides and articles on adaptive field experiments, particularly considering ‘exploration sampling.’

A long debate on this in the folds below:

Is classical statistical inference the right framing here?

David: I think classical statistical inference is the wrong framing here to think about ‘what sample size (or number of arms) to maximise value’ … even if we don’t do adaptive sampling

Matt: I’m (currently) not sure there is any meaningful alternative (or, therefore, a better alternative)

David: The reinforcement-learning and value-maximization (lift) framework used in industry (I believe). The challenge here would be to map between the survey responses given and the actual outcomes of interest… e.g., if a person responds “I would definitely watch a movie with this title,” what is the probability they actually will, relative to someone who responds “I might watch…” (also, what is the ‘population size’ we are sampling from, to understand the scale)

Matt (“The challenge here would be to map between the survey responses given and the actual outcomes of interest”) This is the thing I’m just assuming we cannot do. Given that, it’s just not clear how reinforcement learning framework helps us?

David: why not make reasonable assumptions about the above, for example? But even if we framed it strictly in terms of ‘what we learn about how the population would respond to similar questions asked in the survey,’ I still don’t see why (e.g.) our goal should be to be “strongly powered to reject an H0 that ‘all titles rank just as highly on average.’” Instead we want to ‘learn as much as possible about whatever we think, in the survey, is likely to be the most valuable response.’

Is there a better approach to determining sample size?

Matt: What is the better approach to determining sample size?

David: It’s a cost/benefit calculation. If we are cost-constrained, this question becomes ‘how many arms (titles)’ and ‘is this worth doing.’ NHST framework logic may say ‘given your budget, only try 2 titles, even if you have to choose them at random’ and/or “don’t bother, the chance of statistical significance is too low.”

In contrast, my loosely-informed guess is that an RL approach says that, if we cannot do adaptive learning and the ‘client’ finds it equally likely that each of the 20 titles is optimal, the ‘most value-increasing learning’ (highest ‘lift’) will simply come from dividing their budget equally across all 20 titles and choosing the one that ‘performs best’ (obviously ‘performs best’ is another can of worms) … even if this yields little power for strong statistical inference.
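A toy simulation can make David's conjecture concrete. Everything here is an assumption for illustration: each title's true value is drawn from the same normal prior, survey ratings are unbiased but noisy, the budget of 400 responses is split equally across the tested titles, and we simply pick the title with the highest observed mean. The function name `simulate_choice` is hypothetical.

```r
set.seed(2024)

# Average true value of the title we end up choosing, when the budget is
# split equally across `n_arms` titles picked at random from `n_titles`
simulate_choice <- function(n_arms, budget = 400, n_titles = 20,
                            sd_true = 0.3, sd_noise = 1, reps = 2000) {
  per_arm <- budget %/% n_arms
  mean(replicate(reps, {
    true_value <- rnorm(n_titles, sd = sd_true)           # unknown title quality
    tested <- sample(n_titles, n_arms)                     # titles we survey
    observed <- rnorm(n_arms, mean = true_value[tested],
                      sd = sd_noise / sqrt(per_arm))       # noisy survey means
    true_value[tested[which.max(observed)]]                # value of chosen title
  }))
}

value_2 <- simulate_choice(2)    # a lot of data on 2 random titles
value_20 <- simulate_choice(20)  # a little data on all 20 titles
print(round(c(two_titles = value_2, twenty_titles = value_20), 3))
```

Under these particular assumptions, spreading the budget over all 20 titles yields a better chosen title on average, even though each pairwise comparison has little power. The conclusion could change with other priors or noise levels.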

Matt: The approach I describe/propose (using classical approach to power analysis/hypothesis testing) also involves cost/benefit calculation. So I think this misdescribes the contrast.

David: You write “can detect Y size effects for X dollars while testing 20 titles vs can detect Z size effects for X dollars while testing 10 titles.” But I don’t think ‘can detect Y effect sizes’ is entirely accurate/descriptive.

CLASSICAL (NHST): At a certain cost you can achieve a certain probability (power) that you will conclude that a particular difference is ‘less than 5% likely to occur by chance under H0.’ However: 1. Even if you don’t ‘reject the null,’ the evidence (if interpreted using Bayesian methods) may already suggest that the null is very unlikely and the center of your belief distribution should be substantially shifted. 2. How does the Classical approach weigh the benefit of learning more about a few titles vs less about many titles?

Matt: (setting aside number of titles question): NHST approach doesn’t tell us what sample size is best given the cost/benefit. NHST just tells us what sample size we need to have power to do X. Our overall judgement about what sample size to use (and what design etc.) is based on our independent judgements about what amount of cost/power is optimal.

Matt: I think this may explain some of the apparent disagreement, where you propose some approach to cost/benefit as some alternative to NHST, whereas I think NHST and the cost/benefit calculation of what we should do are separate questions. On my approach we also judge cost/benefit of having a given level of power to detect an effect of a given size (and a given number of titles). So what’s the alternative approach to establishing sample size required for [a certain level of precision/ability to detect effect.. construed in whatever terms you like], and separately judging what level is optimal given the cost?

DR: I think these are linked questions, and it is not meaningful to simply state ‘power to detect an effect’ … as there are something like 20 x 20 possible ‘effects’ we could be looking for here.

Related discussion: should “Achieving a minimum statistical significance for a particular comparison” be our criterion?

David: “Achieving a minimum statistical significance for a particular comparison” should probably not be our criterion … we can generate a lot of value even without being able to make a strong statistical inference (‘rejecting the null… meh’)

Matt: This seems a separate question. The fundamental Q here just seems to be how much confidence we want/need to think this valuable to our decision-making. Statistical significance is an arbitrary threshold, but the point remains that less significant results give less value. Concretely, I would update little on noisy, less significant differences.

David: My impression is that ‘optimal’ Bayesian updating, even starting with diffuse priors, actually does move a lot in response to data that would not yield strongly significant results. Yes, it does update less with less data and more noisy data, but it still updates a lot in ways that substantially change decisions and add value even in cases where NHST yawns and says ‘nothing to see here, move on.’ If “p=0.25” in a NHST statistical test, that (very loosely and probably not strictly correct here) might suggest that ‘there is a 25% chance that one option is substantially better.’ (I do think there is a place for NHST in scientific inquiry and ‘establishing results without giving the benefit of the doubt.’ But where you ‘must make a decision’ a different approach is justified IMHO).
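One way to make the ‘p = 0.25’ remark above more precise is a textbook special case: with a normal likelihood and a flat (improper) prior on the difference between two options, the posterior probability that the option that looks better in the sample really is better equals one minus the one-sided p-value. This is a sketch of that special case only; it says nothing about whether the difference is ‘substantial.’

```r
# Two-sided p-value from a z-test comparing two options
p_two_sided <- 0.25

# Implied size of the test statistic
z <- qnorm(1 - p_two_sided / 2)

# Under a flat prior and normal likelihood, the posterior for the true
# difference is centred on the estimate, so the posterior probability that
# the apparently better option is truly better is pnorm(z)
posterior_prob_better <- pnorm(z)
print(posterior_prob_better)
```

So under these assumptions a result that NHST would call clearly ‘non-significant’ still leaves the decision-maker well away from a 50/50 belief about which option is better.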

Matt: I agree that non-sig results can be worth updating on and potentially useful for practical purposes. We could have a meaningful discussion about how confident/precise we want our estimates to be / how small an effect is worth measuring without thereby ditching NHST: and you could argue that we should shoot for a lower level of power. It’s not clear there’s a simply better alternative framework for approaching the question. (Even translating these discussions into Bayesian terms seems to just leave us in exactly the same position.)

David: Not sure if it’s the same position; see above (previous fold, “it’s a cost/benefit calculation”)

Mostly I agree with what you are saying but you may be missing some of the complications here. The problem is not uni-dimensional, as we are asked to consider a large set of (20) titles. I am claiming that you might gain more value, and indeed substantial value, from learning “a little about 20 titles” instead of “a lot about 2 randomly selected titles”

Matt: Question of the value of 20 titles vs 2 titles seems distinct from the question of how much we can learn from noisy/sub-significant results. (Links other post…)

… informed guess about what would be worth measuring. In the simple case, this just involves comparing ‘$X’ vs ‘can detect Y size effects.’ In the slightly more complex case this also involves comparing ‘can detect Y size effects for X dollars while testing 20 titles’ vs ‘can detect Z size effects for X dollars while testing 10 titles.’

Concretely I have been informing my judgement of how small an effect we might want to detect based on our prior XXX message testing which found (by convention) small effects, and which seems as reasonable a proxy for what we might expect to find here. As noted, the key ingredient to better knowing how big an effect we want to be able to detect would be knowing how well these ratings correspond to real world differences, but we do not know that… Of course there is also a hard ceiling on [costs] i.e. our client won’t/can’t pay more than X dollars …

I agree that focusing on 2 random titles out of the 20 would probably provide little practical value as to which out of the proposed titles is best (due to providing no data about 18/20 titles).

David: :)

At the other extreme: if a test with 100 titles would only be able to detect effects that were enormous (given resource constraints on our sample size)- then it’s likely we should change test/test fewer titles, because a test that won’t be able to detect reasonable effect sizes with reasonable power, won’t be able to offer practically significant evidence.

David: I was thinking of the same reductio that I think you are getting at here, and I think the answer may be ‘strictly speaking yes, if our prior is that all titles have identical distributions on average, better to test each title 1x than to test some titles 2x or more.’ For decision-making purposes, you learn the most and update the most from the first piece of evidence. So I think it actually would be practically significant to the decision problem. Highly recommended: read/listen to the Bayes’ Rule chapter of Algorithms to Live By.
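One way to make the ‘first piece of evidence’ point concrete, under assumptions that are mine and not part of the discussion above: suppose each title’s value has a Normal prior \(\theta_k \sim N(\mu,\tau^2)\) and each observation on title \(k\) is \(y \sim N(\theta_k,\sigma^2)\). The posterior variance after \(n\) observations on that title is then

\[V_n = \Big(\frac{1}{\tau^2} + \frac{n}{\sigma^2}\Big)^{-1}\]

With (hypothetically) \(\tau^2=\sigma^2=1\), this gives \(V_0=1\), \(V_1=\frac{1}{2}\), \(V_2=\frac{1}{3}\): the first observation removes half of the prior variance, while the second removes only a further \(\frac{1}{6}\). In this simple setup, a marginal observation reduces uncertainty most when spent on a title not yet tested.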

15.6.1 Design: Which questions to ask the audience about the proposed titles, and in what order?

This is an ‘experimental design for internal identification and external generalisability’ question. (See ‘Identifying meaningful and useful (causal) relationships and parameters.’)

Some possibilities:

  • Subjects asked to rank (or rate) all \(K=20\) titles (titles \(k=1,2,...,K=20\))
  • Subjects asked to identify ‘top \(C\)’ and ‘bottom \(C\)’ (e.g., top and bottom 3) titles
  • Subjects presented a series of pairwise comparisons
  • Subjects asked to rate (or say whether they would attend) a single title, with between-subject variation

Which statistical test(s)/analyses to run (if any) and what measures to report?

Suppose we asked each subject to rank all \(K=20\) titles.

How could we test if there were any ‘substantial difference in the title rankings’ and what would be a meaningful measure of the ‘extent’ of this difference? We might want to consider some ‘minimum effect size of interest’ and ensure that we have a large enough sample to diagnose such an effect with (e.g.) 80% probability (while maintaining a false-positive type-1 error rate of less than 5%).*

* However, it is not clear why this is the most relevant question. Simply determining ‘there is a difference of some minimum size’ doesn’t tell us how confident we are about the best title, nor how much value is gained by choosing that title. This suggests a reinforcement learning approach.

Friedman’s \(Q\) is a measure of whether ‘any (at least one?) items are systematically ranked higher or lower.’ \(Q\) can be normalized into Kendall’s \(W\), a measure of ‘inter-rater agreement’ going from 0 to 1. There is a significance test for \(W\) “against a null hypothesis of no agreement (i.e. random rankings).”

Kendall’s \(W\) uses Cohen’s interpretation guidelines, with 0.1 to 0.3 being a ‘small effect.’

“A significant Friedman test can be followed up by pairwise Wilcoxon signed-rank tests for identifying which groups are different,” with multiple testing corrections. datanovia website
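As a minimal sketch of these two statistics, assuming complete rankings with no ties (the function name and data layout are my own, for illustration), Friedman’s \(Q\) can be computed from the per-title rank sums and then rescaled into Kendall’s \(W = Q/(N(K-1))\). Under the null of random rankings, \(Q\) is approximately chi-squared with \(K-1\) degrees of freedom; the p-value step is left out here to keep the sketch dependency-free.

```python
import random


def friedman_q_and_kendall_w(rankings):
    """Return (Q, W) for complete, tie-free rankings.

    rankings[i][k] is the rank (1..K) that subject i gives to title k.
    """
    n = len(rankings)          # number of subjects (raters)
    k = len(rankings[0])       # number of titles (items)
    # Sum of the ranks each title received across all subjects
    rank_sums = [sum(subject[j] for subject in rankings) for j in range(k)]
    # Friedman's Q (no-ties formula)
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    # Kendall's W: Q normalized to lie between 0 (no agreement) and 1 (full agreement)
    w = q / (n * (k - 1))
    return q, w


# Illustration with purely random rankings of 20 titles by 100 subjects
rng = random.Random(0)
random_rankings = [rng.sample(range(1, 21), 20) for _ in range(100)]
q, w = friedman_q_and_kendall_w(random_rankings)
print(f"Q = {q:.2f}, W = {w:.3f}")
```

With random rankings \(W\) should come out close to 0; if every subject gave the same ranking, \(W\) would equal 1.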

How to assign the ‘treatments,’ and how large a sample is optimal, considering ‘power’ (or ‘lift’)?

Simple assignment

Suppose we are restricted to a single allocation of treatments across the 20 titles. Suppose we asked all subjects to rank all of the \(K=20\) titles, or perhaps only to focus on the ‘best’ and ‘worst’ \(C\) titles.

The true population has a large number of subjects (individuals). In our survey we are sampling some number \(N\) of individuals. Let each subject be indexed by \(i\), so a particular sample will contain subjects \(i=1,...,N\).

Calc. 1: Detect ‘minimally important effect?’ (with simple assignment)

We might frame our test and power calculation as the following:

Suppose the ‘Minimal effect of interest’ that we want to be able to detect is (sort of the ‘alternative hypothesis HA’)…

HA: “One title is ranked first by a share of the population that is one and a half times as high as the share for any other title.”

If all the other titles have the same (lower) ranking on average, this should offer the greatest chance of detecting such a difference. Thus, if we assume all other titles share the same average ranking, the computations (below) should underestimate the necessary sample size.

I.e., defining \(r_{k,i}\) as the rank given to title \(k\) by subject \(i\), and letting \(\bar{R^1_k}=\frac{1}{N}\sum_{i=1}^{N} I(r_{k,i}=1)\) be the share of the sampled population (subjects) ranking title \(k\) as first…

we may consider a case where \(\bar{R^1_j} > \frac{3}{2}\frac{1}{19}\sum_{k\neq j}\bar{R^1_k}\) for some title \(j\) relative to all other titles \(k\).*

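* Because the first-place shares sum to one across the 20 titles, I think the latter term simplifies to \(\frac{3}{2}\frac{1-\bar{R^1_j}}{19}\), i.e., one and a half times the average first-place share of the other titles.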

Perhaps we want to power our test so that, for the “HA” described above, we have an 80% chance that we ‘find an effect.’ I.e., an 80% chance that our test statistic (whatever it is) tells us that “it is less than 5% likely that this title would have performed as well in our sample by chance if (H0) all titles had been perceived as equally good and thus randomly ranked in the population.”**

** However, I don’t think that this is really what we are looking for, as we are facing a decision problem.

To test for this we would follow a certain procedure, e.g.,***

*** I am not sure this is an appropriate procedure.

  1. Find the title ‘j’ that has the most people in our sample who rank it first; call this share \(\bar{R^1_j}\)

  2. Compute (perhaps through simulation) the probability that, if all titles were randomly ranked by the population, in a sample of size \(N\) (our actual sample size), the first-place share of the title most often ranked first would be at least as high as \(\bar{R^1_j}\).****

    **** Ideally there is an analytical formula for this; worth looking up.

  3. If this computed ‘probability of such an extreme result’ given our sample size \(N\), which we denote \(P(N, \bar{R^1_j})\), is below our threshold \(\alpha=0.05\), we ‘reject the null.’
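Steps 1 to 3 can be sketched by simulation. This is only an illustration under my own assumptions: I work with first-place counts rather than shares (equivalent for a fixed \(N\)), the function names are hypothetical, and, as noted, I am not sure this is an appropriate procedure.

```python
import random
from collections import Counter


def max_first_place_count(n, k, rng):
    """Under the null of random rankings, each of n subjects picks a
    first-ranked title uniformly among k titles; return the largest count."""
    picks = Counter(rng.randrange(k) for _ in range(n))
    return max(picks.values())


def null_p_value(observed_count, n, k=20, sims=5000, seed=1):
    """Simulated probability that the most-often-first title gets at least
    `observed_count` first-place votes in a sample of n, under random ranks."""
    rng = random.Random(seed)
    extreme = sum(
        max_first_place_count(n, k, rng) >= observed_count for _ in range(sims)
    )
    return extreme / sims


# Hypothetical example: in a sample of N=200, the top title is ranked first 22 times
p = null_p_value(observed_count=22, n=200)
print("reject the null" if p < 0.05 else "do not reject", f"(simulated p = {p:.3f})")
```

Because the statistic is the maximum over all 20 titles, the simulated null distribution already accounts for having picked the best-performing title after seeing the data.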

We have defined a (design and) testing procedure. We can now simulate (or perhaps analytically calculate) the power (of our design and test) to detect the above HA, as follows.

Run \(T\) simulations \((t=1,...,T)\). For each simulated sample \(t\):

  • Draw \(N\) observations from an imagined ‘true population,’ i.e.,
    • for each of the \(N\) subjects drawn, say, ‘subject \(i\),’ for each of their ‘ranks \(r=1,2,..,20\),’ draw a title (at random) to be assigned this rank (\(r_{k,i}=r\)),
    • with one title (the same one always) having a \(\frac{3}{41}\) probability of being drawn for the first rank, and the other 19 titles each having a \(\frac{2}{41}\) probability of being drawn for the first rank (so that the favored title is one and a half times as likely to be ranked first, as in HA)
  • Compute \([\bar{R^1_j}]^t\): the share of subjects ranking each title first in simulation \(t\),
    • and then compute (or simulate) \(P(N, [\bar{R^1_j}]^t) \equiv [P]^t\), the test statistic as defined above, the ‘likelihood of such an extreme result under the null hypothesis of random ranks’

Over a sufficient number of simulations, determine the average probability of ‘rejecting the null’ in favor of the above HA (specifically for the ‘correct’ title \(j\)).* This is the estimated power of the test.

* This last point is a wrinkle I’ve not seen in previous work involving power calculations, so I hope I am not missing something here. Perhaps we should say something like “power of the test in the right direction.”
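The simulation above can be sketched as follows. Since the test only uses first-place shares, I only simulate each subject’s first-ranked title. The function names, the default `ratio=1.5` (taken from HA), and the use of a simulated critical value in place of a per-sample p-value are my own simplifications, not an established procedure.

```python
import random
from collections import Counter


def first_place_counts(n, weights, rng):
    """Each of n subjects picks a first-ranked title with the given weights."""
    picks = Counter(rng.choices(range(len(weights)), weights=weights, k=n))
    return [picks.get(j, 0) for j in range(len(weights))]


def critical_count(n, k, alpha, sims, rng):
    """Smallest count c such that, under random rankings (the null),
    P(max first-place count >= c) <= alpha, estimated by simulation."""
    maxima = [max(first_place_counts(n, [1.0] * k, rng)) for _ in range(sims)]
    for c in range(n + 2):
        if sum(m >= c for m in maxima) / sims <= alpha:
            return c


def power(n, k=20, ratio=1.5, alpha=0.05, sims=2000, seed=0):
    """Share of simulated samples in which the truly favored title (index 0)
    is the unique top title AND the null is rejected ('power in the right direction')."""
    rng = random.Random(seed)
    crit = critical_count(n, k, alpha, sims, rng)
    weights = [ratio] + [1.0] * (k - 1)   # favored title is `ratio` times as likely
    hits = 0
    for _ in range(sims):
        counts = first_place_counts(n, weights, rng)
        top = max(counts)
        if counts[0] == top and counts.count(top) == 1 and top >= crit:
            hits += 1
    return hits / sims


for n in (100, 500, 2000):
    print(n, power(n, sims=500))
```

Scanning over \(N\) in this way gives a rough sense of the sample size needed to reach (e.g.) 80% power for this particular HA and test.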

Calc. 2: Find a title in the top \(\eta\) (e.g., third) of the ‘value of title’ distribution (with simple assignment)

For any target outcome (movie box office, general acclaim, etc.), each of the (\(K=20\)) titles will have some true population parameter.

Let us call the parameter of interest \(\theta_k\) for title \(k\), which we will call the ‘value of title \(k\).’ For now, let us assume that this ‘population parameter’ represents the mean (or some other function?) of something that we can observe in our survey from each sampled subject.*

* Of course, for this example, and in general, there are many reasons why we may not be able to observe exactly the thing that informs our parameter from a particular survey. Our survey/experiment design may not be able to get subjects to tell us exactly what we want to know about their impressions of the title. (E.g., they may not themselves know which title would be most likely to get them to attend the film. Even if we ask them something related to this, like “would you like us to email you a discount coupon for a movie with this title?”, this may not perfectly track movie-attendance choices in other contexts.)

Consider a ‘(prior) distribution over parameter \(\theta_k\) for each title \(k\).’**

** I won’t get into the details here about what this distribution means in a philosophical sense. There are various types of Bayesian conceptions of this. Perhaps another way of thinking about it is that each title was as if randomly drawn from some distribution of titles.

Without further information, we might let (our belief about) the distribution of this parameter for each title be the same. We might consider that, e.g., the “expected revenue from each title” is Normally distributed \(\theta_k \sim N(\mu,\sigma^2) \forall k\).

Perhaps each subject’s ranking of titles is monotonic (i.e., rank-preserving) in the true probability that they would attend a movie with each of these titles.

Perhaps we want to choose a sample size such that the title we claim has the “best overall ranking in our sample” has a \(\beta=0.8\) or greater chance of being in the top-third of the true most-profitable titles.

Note that choosing a title that is “most likely to be in the top third” is not necessarily the same as “choosing the title with the greatest expected profit.” Our ranks are a multi-dimensional measure; thus one title need not ‘dominate’ the other title. E.g., in a sample of \(N=100\) title \(A\) may be ranked first by 10 people and last by 90 people, while title \(B\) may be ranked second by all 100 people. Which is more profitable on average will depend on the relationship between ranking and probability of attendance.*

* E.g., perhaps this implies a 90% probability that \(B\) will be more profitable than \(A\), but in the 10% of the states-of-the-world where \(A\) is more profitable than \(B\), \(A\) is 100 times more profitable. This would imply that title \(A\) has a greater expected profit, but also involves more risk.


Consider whether title \(k\) is among the titles that are ‘ranked in the top-third’ by the largest share of the population. More specifically, consider \(\bar{R}^{7+}_k \equiv P(r_{k,i}\leq 7)\) as the ‘share of the population ranking title \(k\) top 7 out of 20 or better.’ Let \(D_k=1\) if \(\sum_{j\neq k}1\big(\bar{R}^{7+}_k>\bar{R}^{7+}_j\big)\geq 13\); i.e., if \(k\) beats at least 13 of the other 19 titles, and is thus one of the top-7 titles in terms of the measure ‘ranked in the top third.’

We might want to…

  1. based on our sample (and some assumptions?) choose the title with the highest chance of being ranked in the top-third by the population as a whole (\(k^\ast \equiv argmax_k\big(P(D_k=1)\big)\)), and

  2. sample \(N\) large enough so that this ‘chosen title’ has a \(\beta=0.8\) chance of indeed being ranked in the top-third by the population as a whole, i.e., \(P(D_{k^\ast}=1)>\beta=0.8\).

I imagine that under certain (overly restrictive?) assumptions about the distribution of rankings, we would be able to calculate:

  1. The appropriate procedure for selecting the title that is best by the above metric (\(k^\ast\))

  2. The minimum necessary \(N\) to achieve this \(\beta=0.8\) chance of …
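Absent an analytical result, one could explore this by simulation. The sketch below rests on strong assumptions of my own: each title’s value is \(\theta_k \sim N(0,1)\), each subject ranks titles by \(\theta_k\) plus independent Normal noise (with a hypothetical `noise_sd`), and we simply choose the title with the largest sample share of top-7 rankings. Under these assumptions the population top-7 share is increasing in \(\theta_k\), so \(D_k=1\) exactly when \(\theta_k\) is among the 7 largest.

```python
import random


def prob_choice_in_top_third(n, k=20, top=7, noise_sd=3.0, sims=200, seed=0):
    """Simulated P(D_{k*}=1): the chance that the title chosen from a sample
    of n subjects is truly among the `top` titles in the population."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        # Draw the (unobserved) true value of each title
        theta = [rng.gauss(0, 1) for _ in range(k)]
        true_top = set(sorted(range(k), key=lambda j: theta[j], reverse=True)[:top])
        top_counts = [0] * k
        for _ in range(n):
            # Each subject ranks titles by a noisy perception of their value
            util = [theta[j] + rng.gauss(0, noise_sd) for j in range(k)]
            ranked = sorted(range(k), key=lambda j: util[j], reverse=True)
            for j in ranked[:top]:
                top_counts[j] += 1
        # Choose the title most often ranked in the top 7 (ties: lowest index)
        chosen = max(range(k), key=lambda j: top_counts[j])
        hits += chosen in true_top
    return hits / sims


for n in (10, 50, 200):
    print(n, prob_choice_in_top_third(n, sims=100))
```

One would then look for the smallest \(N\) at which this simulated probability exceeds \(\beta=0.8\); the answer depends heavily on the assumed noise level.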

But do we want to do this?!

Sequential/adaptive designs, multi-armed bandits

More generally, see ??.

15.7 Survey design digression: sample size for a “precise estimate of a ‘population parameter’” (focus: mean of a Likert scale response)

15.7.1 How to measure and consider the precision of Likert-item responses

Considering ‘precision of Likert-item responses’ and sample-size calculations:

What are commonly used/justifiable measures of central tendency and dispersion for Likert-items?

How can we think about ‘precision of estimated Likert-item responses?’ and attaining sufficient precision, and a metric for this?

“How precise is precise, and by what metric?”

A simple naive approach?

Interval coding: \(y=[1,2,3,4,5]\) for a 5-item

Outcome: \(\bar{y} :=\) Sample mean of numeric-coded responses,

Measure of dispersion: \(s\) := Sample standard deviation of \(y\) *

* Perhaps with the \(n-1\) correction, but who cares?

Measure of (inverse of) precision: estimated standard error of the mean \(\hat{SE_m} = s/\sqrt{n}\)

If we assume \(y\) is normally distributed (which obviously can’t be precisely the case)…**

** But Wikipedia (Derrick and White 2017?) claims “responses often show a quasi-normal distribution.”

… then a 95% confidence interval for \(\bar{y}\) would be

\[\bar{y} \pm 1.96 \: \hat{SE_m}\]

A. ‘Absolute’ metric?: Target a 95% CI range less than (e.g.) 1 ‘Likert scale unit,’ i.e.,

\[2 * 1.96 \: \hat{SE_m} < 1\] *

* Or perhaps considering the actual rather than estimated 95% CI this should be “\(2 * 1.96 \: SE_m < 1\).”

Recall that \(\hat{SE_m} = s/\sqrt{n}\).

Thus, to choose a sample size to achieve these bounds we need to have a measure/guess/estimate of \(s\), the standard deviation of \(y\), perhaps based on previous data.**
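As a worked illustration, rearranging the target for \(n\) gives

\[2 \times 1.96 \times \frac{s}{\sqrt{n}} < 1 \iff n > (2 \times 1.96 \times s)^2 = 15.3664 \, s^2\]

If (purely hypothetically) previous data suggested \(s \approx 1.2\) for a 5-point item, this would imply \(n > 15.3664 \times 1.44 \approx 22.13\), i.e., at least 23 respondents. This treats the guess for \(s\) as if it were known.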

To have (e.g.) an 80% probability of getting these bounds for the actual confidence intervals we would also need a measure of the dispersion of our estimate of this sd. (Hmm, it’s getting complicated).

B. ‘Relative’ metric?: Target a 95% CI range below \(B\) sd of the Likert-item-integer-response \(y\): i.e.,

\[2*1.96 SE_m < B*sd(y)\]

i.e., \(2*1.96 * s/\sqrt{n} < B*s\), i.e., \(2*1.96/\sqrt{n} < B\), i.e., \(\sqrt{n} > 2*1.96/B\), i.e.,

\[n > 15.3664/(B^2)\]

… where \(sd(y)\) is the true standard deviation of the outcome.

Note: This \(n\) should give us an estimated CI with a range of \(B\) standard deviations of the outcome. I’m not sure if it implies that, after collecting the sample, our estimates of the CI will always have a range equal to the estimated sd. Need to think about this more.

As you can see (caveat: calculations need doublechecking), if we assume the Likert-integer-thing is normally distributed, the calculation of ‘how big a sample size (n) we need in order to get, on average, a CI of 1 sd or smaller’ is straightforward.
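A small helper for the relative metric, as a sketch under the same normality assumption (the function name is mine): it returns the smallest integer \(n\) satisfying \(n > (2 \times 1.96/B)^2\).

```python
import math


def n_for_relative_ci_width(b, z=1.96):
    """Smallest integer n such that the full 95% CI width, 2*z*s/sqrt(n),
    is strictly below b standard deviations of the outcome."""
    bound = (2 * z / b) ** 2      # n must strictly exceed this bound
    return math.floor(bound) + 1


for b in (1.0, 0.5, 0.25):
    print(f"B = {b}: n >= {n_for_relative_ci_width(b)}")
```

Because the required \(n\) scales with \(1/B^2\), halving the target CI width roughly quadruples the required sample.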

15.7.2 Computing sample size to achieve this precision

Initial thoughts (unfold):

If we assume normality, there should be a simple analytical formula for

‘Minimum sample size … for (e.g.) 80% likelihood … of achieving an (e.g.) 95% “confidence interval (CI) over the mean” of a variable … that is within (e.g.) 1 standard deviation of the variable on either side.’


  • This is to get CI bounds on the means stated relative to the SD of the variable. If we wanted bounds in ‘units of the variable’ we would need to know, guess, or estimate the SD of the variable.

  • For a Likert variable normality is not a great assumption. We should probably make another assumption over the distribution (or even draw from past data), and then we can either do a similar analytical computation or a simulation based computation (which should be fairly easy)

  • I put ‘CI over the mean’ in scare quotes because these are frequentist confidence intervals which are hard to interpret. A Bayesian approach might be more appropriate… worth thinking about

  • Not sure whether ‘mean of a Likert-item response’ is important anyway. I’ll read more on Likert scales.
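A simulation-based version of this computation might look like the sketch below. It drops the normality assumption and instead draws 5-point responses from an assumed distribution; the response probabilities `probs` are made up for illustration and would ideally come from past data. For each candidate \(n\) it estimates the probability that the realized 95% CI (full width \(2 \times 1.96 \times s/\sqrt{n}\)) is narrower than a target, and returns the first \(n\) that reaches the desired (e.g., 80%) probability.

```python
import math
import random
import statistics


def prob_ci_narrow_enough(n, probs, max_width=1.0, z=1.96, sims=2000, seed=0):
    """Share of simulated samples of size n whose 95% CI for the mean
    has full width below max_width (in Likert units)."""
    rng = random.Random(seed)
    levels = list(range(1, len(probs) + 1))   # interval coding 1..5
    hits = 0
    for _ in range(sims):
        y = rng.choices(levels, weights=probs, k=n)
        s = statistics.stdev(y)               # sample sd with the n-1 correction
        hits += (2 * z * s / math.sqrt(n)) < max_width
    return hits / sims


def smallest_n(probs, assurance=0.8, start=2, limit=1000, **kwargs):
    """First n in [start, limit] reaching the target probability, else None."""
    for n in range(start, limit + 1):
        if prob_ci_narrow_enough(n, probs, **kwargs) >= assurance:
            return n
    return None


# Hypothetical response distribution over the 5 Likert levels
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
print(smallest_n(probs, assurance=0.8, max_width=1.0))
```

The same loop works for a ‘relative’ target by comparing the CI width against \(B\) times the assumed population sd rather than a fixed number of Likert units.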