14 Robust experimental design: pre-registration and efficient assignment of treatments

14.1 Pre-registration and Pre-analysis plans

14.1.1 The benefits and costs of pre-registration: a typical discussion

BB: That said, I would be interested to think about the benefits – and more importantly limitations to – pre-registration. I think it could solve some of the p-hacking problems but not much else. How to not relegate exploratory analyses too far is also unclear to me.

DR: I’m much more on the ‘pro’ side of pre-registration and pre-analysis plans (PAPs). They also help deal with publication bias and file drawers. And p-hacking is a huge issue IMHO. But it is also good to have some consideration of the pros and cons, so this would be great.

BB: RE pre-reg: yes I think it is enough that it prevents p-hacking (there could be very little cost associated with pre-reg) but I fear that it could prevent other advancements if it relegates exploratory analyses too far.

DR: I don’t think it should be binary. Systems need to be worked out for adjusting the interpretation of reported estimates depending on whether they were preregistered, and on how many estimates were preregistered. While reported significance levels could be adjusted in the frequentist framework, this will all presumably be based on measures of the likelihood that such a result would have been estimated/reported. Thus I think this could most easily be incorporated into a Bayesian framework, but I’m not saying it would be easy. Still, there has been some good work on adjustments for ‘sequential designs.’

BB: I think that it could also stifle students a bit – it may reduce further the number of students who have access to funding that allows for experiments that will be able to be published if all experiments have to be high-powered.

DR: Statistical power is an important issue. I was skeptical at first about the ‘dangers of underpowered studies’ but maybe I’m coming around a bit.

My thinking was that ‘we can simply make downward adjustments to the estimates reported in underpowered studies.’ See the discussion under power calculations.
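To make the 'downward adjustment' idea concrete, here is a minimal simulation sketch of how much a statistically significant estimate from an underpowered study overstates the true effect. The true effect, standard error, and the function name `significance_filter` are hypothetical choices for illustration (a normal sampling distribution is assumed); this is not a method taken from the discussion above.

```python
import random
from statistics import NormalDist, mean


def significance_filter(true_effect, std_error, n_sims=100_000, alpha=0.05, seed=1):
    """Simulate many replications of one study and summarise the significant ones.

    Returns (power, exaggeration), where exaggeration is the average |estimate|
    among statistically significant results divided by |true_effect|.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(n_sims):
        # Assumption: the estimate is normally distributed around the true effect.
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate / std_error) > z_crit:
            significant.append(abs(estimate))
    power = len(significant) / n_sims
    exaggeration = mean(significant) / abs(true_effect)
    return power, exaggeration


# Hypothetical underpowered study: the true effect equals one standard error.
power, exaggeration = significance_filter(true_effect=1.0, std_error=1.0)
print(f"power = {power:.2f}, exaggeration ratio = {exaggeration:.2f}")
```

In principle the exaggeration ratio could serve as a deflator for significant estimates, but it depends on the true effect size, which is exactly what we do not know; that is one reason the adjustment is less 'simple' than it sounds.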

Anyway, we don’t want to put the cart before the horse: as Gelman said at a conference, we should be supporting science, not the careers of scientists. I tend to think there are strong arguments for more centralization in social science.

And my impression is that we actually have too many different studies and distinct research programs being run, and too many papers being published without being carefully brought together into a framework. Going through the studies on https://www.replicationmarkets.com/ reinforces this impression for me.

Still, I think there are ways around this to enable early career people. ‘Underpowered’ experiments could be registered as part of a longer/sequential research program, perhaps collaborative and enabling meta-analysis.

BB: I also don’t think it gets at publication bias very much unless pre-reg’ed studies are followed up on. Only then do you know why the study didn’t come out – and quite a lot of the time I think it will be attrition/inability to gather the necessary data. Someone could launch that journal though – the Journal of Failed Studies – so that there is a place to keep a record that these studies were run and of what happened. So I am pro pre-reg, I just think the system needs a bit of work.

DR: If preregistration is made public and well-organized, then the ‘failed’ exercises will be integrated into future meta-analyses; so that’s at least a partial solution here.

Agreed, we need to build better systems for incentivising pre-registration and careful data sharing. We need to give career credit to people for planning, designing, and reporting credible experiments and projects, even if they ‘fail.’ Part of this is publishing/rewarding tight null results, which actually do add a lot of value.

We might also consider offering some career reward for experiments that fail – in the sense of being deeply inconclusive – for some arbitrary or random reason, even though they were well-planned and executed. But I think it is hard to get the incentives right for the latter.

14.1.2 The hazards of specification-searching

14.2 Designs for decision-making

See ‘reinforcement learning,’ ‘lift,’ and multi-armed bandit explore/exploit tradeoffs (with emphasis on the explore part)

14.2.1 Notes on Bandit vs Exploration problems/Thompson vs Exploration sampling

From a conversation:

“Ah interesting, my understanding is that Thompson sampling is used to balance out statistical power with treatment effectiveness?”

Sort of. None of this is really isomorphic to ‘statistical power’ imho.

My impression is that:

Thompson sampling is used to optimise in a case where we simultaneously explore (learn what’s best) and exploit (use what’s best).
Kasy’s “exploration sampling” is used to optimise in a case where we have a defined exploration period (testing period) during which exploitation is not important.

Thompson sampling converges to a single treatment to optimise exploitative benefit. Exploration sampling converges to two treatments to maximise ‘learning for future benefit.’
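The contrast between the two rules can be sketched in a few lines. The sketch assumes binary outcomes with independent Beta(1, 1) priors, and assumes (from my reading of Kasy and Sautmann's exploration sampling) that the exploration shares are proportional to $p_d(1-p_d)$, where $p_d$ is the posterior probability that arm $d$ is best. The data and function names are hypothetical.

```python
import random


def prob_best(successes, failures, n_draws=20_000, seed=0):
    """Monte Carlo estimate of the posterior probability that each arm is best,
    under independent Beta(1 + successes, 1 + failures) posteriors."""
    rng = random.Random(seed)
    wins = [0] * len(successes)
    for _ in range(n_draws):
        draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        wins[draws.index(max(draws))] += 1
    return [w / n_draws for w in wins]


def thompson_shares(p):
    """Thompson sampling assigns each arm with its probability of being best."""
    return list(p)


def exploration_shares(p):
    """Assumed exploration-sampling rule: shares proportional to p_d * (1 - p_d)."""
    weights = [pd * (1 - pd) for pd in p]
    total = sum(weights)
    if total == 0:  # one arm is certain to be best; nothing left to explore
        return list(p)
    return [w / total for w in weights]


# Hypothetical data: three arms with 100 observations each.
p = prob_best(successes=[30, 45, 50], failures=[70, 55, 50])
ts = thompson_shares(p)
es = exploration_shares(p)
print("P(best):           ", [round(x, 3) for x in p])
print("Thompson shares:   ", [round(x, 3) for x in ts])
print("Exploration shares:", [round(x, 3) for x in es])
```

Under this rule the leading arm's share never exceeds one half, so the remaining assignment goes to its closest competitors rather than piling onto the current leader, which is the sense in which exploration sampling keeps learning about 'which is best' instead of exploiting.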

Sequential and adaptive designs

Sequential

Significance tests need to be adjusted when data are augmented, analysed sequentially, or ‘peeked’ at. See Sagarin et al. (2014), http://www.paugmented.com/, and http://andrewgelman.com/2014/02/13/stopping-rules-bayesian-analysis/ (notes from resubmit_letterJpube.tex).

Yet …

$p_{augmented}$ may overstate the type-1 error rate. From a response to referees: "

A process involving stopping “whenever the nominal $p < 0.05$” and gathering more data otherwise (even rarely) must yield a type-1 error rate above 5%. Even if the subsequent data suggested a “one in a million chance of arising under the null” the overall process yields a 5%+ error rate. The NHST frequentist framework cannot adjust ex-post to consider the “likelihood of the null hypothesis” given the observed data, in light of the shocking one-in-a-million result. While Bayesian approaches can address this, we are not highly familiar with these methods; however, we are willing to pursue this if you feel it is appropriate.

Considering the calculations in , it is clear that $p_{augmented}$ should overstate the type-1 error of the process if there is a positive probability that, after an initial experiment attains $p<0.05$, more data is collected. A headline $p<0.05$ does not imply that this result will enter the published record. Referees may be skeptical of other parts of the design, framework, or motivation. They may also choose to reject the paper specifically because of this issue: they believe the author would have continued collecting data had the result yielded $p>0.05$, and thus they think it is better to demand more evidence or a more stringent critical value. Prompted by the referee, the author may collect more data even though $p<0.05$. Or she may decide to collect more data even without a referee report/rejection demanding it, for various reasons (as we did after our Valentine’s experiment). Thus, we might imagine that there is some probability that after (e.g.) an initial experiment attaining $p<0.05$, more data is collected, implying that $p_{augmented}$ as calculated above overstates the type-1 error rate that would arise from these practices. As referees and editors, we should be concerned about the status of knowledge as accepted by the profession, i.e., in published papers. If we recognize the possibility of data augmentation after any paper is rejected, it might be a better practice to require a significance standard substantially below $p=0.05$, in order to attain a type-1 error rate of 5% or less in our published corpus."
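Both claims in the quoted passage can be checked with a small simulation under the null: peeking once at nominal $p<0.05$ pushes the type-1 error rate above 5%, and the inflation shrinks as it becomes more likely that data is augmented even after a significant first look. This is a simplified sketch (a two-look z-test with known unit variance, in which the final pooled test decides whenever more data is collected); it does not reproduce the $p_{augmented}$ formula, and the function name and sample sizes are hypothetical.

```python
import random
from statistics import NormalDist


def type1_rate(prob_continue_if_sig, n1=50, n2=50, n_sims=40_000, alpha=0.05, seed=2):
    """Share of null experiments that end with a 'significant' reported result.

    Look 1 uses n1 observations. If it is not significant, n2 more are always
    collected. If it is significant, more data is still collected with
    probability prob_continue_if_sig; otherwise the first-look result is reported.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        # Under the null, the sum of n unit-variance observations is N(0, n).
        sum1 = rng.gauss(0.0, n1 ** 0.5)
        sig1 = abs(sum1 / n1 ** 0.5) > z_crit
        if sig1 and rng.random() >= prob_continue_if_sig:
            rejections += 1  # stop and report the significant first look
            continue
        # Augment the data and test the pooled sample at the same nominal level.
        sum2 = sum1 + rng.gauss(0.0, n2 ** 0.5)
        if abs(sum2 / (n1 + n2) ** 0.5) > z_crit:
            rejections += 1
    return rejections / n_sims


r_never = type1_rate(0.0)   # always stop once p < 0.05 (classic peeking)
r_half = type1_rate(0.5)    # augment half the time even when p < 0.05
r_always = type1_rate(1.0)  # always augment: a fixed-sample test
print(f"type-1 error: never={r_never:.3f}, half={r_half:.3f}, always={r_always:.3f}")
```

The three rates should be ordered from most to least inflated, with the 'always augment' case returning to roughly the nominal 5%.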

14.2.2 Adaptive

See Max Kasy’s slides and articles on adaptive field experiments, particularly considering ‘exploration sampling.’

This also relates to ‘reinforcement learning.’

14.3 Efficient assignment of treatments

(Links back to power analyses)

14.3.1 See also multiple hypothesis testing

14.3.2 How many treatment arms can you ‘afford?’

A side conversation: who runs more treatments, academics or practitioners?

R: In a “policy” context, where we want to decide which message to use, and we are not necessarily trying to establish a very robust general result for scientific purposes, I would perhaps err on the side of more messages.

M: That’s very interesting. I may just defer to what you think, but my intuitions would have been the reverse in terms of the practical policy vs academic/generalisation.

OK it really depends on the context and the cost/benefit of having confidence in a particular result.

E.g., p<0.01 vs profitable Bayesian updating (‘lift’ etc.). In academia (Economics anyway) there is still the old-line insistence that ‘the result you present must be strongly statistically significant in a frequentist test or it is not publishable.’ In business practice a result may be highly valuable even if it is something like “there is an 80% chance that message A works better than message B, and the mean additional ‘lift’ of message A is +$50,000, with an 80% credible interval of (0, $50,000) and a 95% CI of (-$20k, +$70k).”
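A statement of this 'business' type can be produced directly from posterior draws. The following is a minimal sketch assuming binary conversions with independent Beta(1, 1) priors; the conversion counts, audience size, and value per conversion are made up for illustration, as is the function name `compare_messages`.

```python
import random


def compare_messages(conv_a, n_a, conv_b, n_b, value_per_conversion, audience,
                     n_draws=50_000, seed=3):
    """Posterior summary of the monetary lift from using message A instead of B."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(n_draws):
        # Independent Beta posteriors for each message's conversion rate.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        lifts.append((rate_a - rate_b) * audience * value_per_conversion)
    lifts.sort()

    def quantile(q):
        return lifts[min(int(q * n_draws), n_draws - 1)]

    return {
        "prob_a_better": sum(lift > 0 for lift in lifts) / n_draws,
        "mean_lift": sum(lifts) / n_draws,
        "ci80": (quantile(0.10), quantile(0.90)),
        "ci95": (quantile(0.025), quantile(0.975)),
    }


# Hypothetical test: 60/1000 conversions for A, 50/1000 for B,
# to be rolled out to 100,000 people at $20 per conversion.
summary = compare_messages(60, 1000, 50, 1000, value_per_conversion=20, audience=100_000)
print(f"P(A better than B) = {summary['prob_a_better']:.2f}")
print(f"mean lift = ${summary['mean_lift']:,.0f}")
print(f"80% credible interval = {summary['ci80']}, 95% = {summary['ci95']}")
```

With these made-up numbers the 95% interval includes zero, so a frequentist test would not 'get significance', yet the posterior still says A is probably the better message to send.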

But academia also cares about ‘deep’ and ‘fancy’ mechanisms stuff: academia asks ‘is this question interesting, can you rule out alternative hypotheses,’ etc. This motivates having more treatments/outcomes, as long as you can ‘get p<0.01’ for rejecting a (perhaps trivial) null. So academia compares A1, A2, A3, B1, B2, B3, C1, C2, … etc.

While business might find it more valuable to gain greater insight into ‘whether A outperforms B’ even if we don’t know why. It may not care about testing A1 vs A2 if there is little practical difference between the two.

And ‘academic publication incentives’ don’t reward precision much after we ‘get p<0.01.’ E.g., if my paper shows that some ‘early donation seed’ raises significantly (p<0.01) more funds than no such seed, ‘publication-wise’ I may not care much about bounding the size of this effect, nor about bounding a measure of the size of the optimal seed.

But as a business (or EA org etc.) I know these seeds are costly, and I may only want to use one if I have a certain level of confidence that the return will be substantially above the cost of the seed, perhaps considering the risk/return tradeoffs. I may also find it valuable to have an extremely precise measure of the optimal seed.
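One rough way to put a number on how many arms you can 'afford' is to ask how the minimum detectable effect (MDE) of each treatment-vs-control comparison grows as a fixed sample is split across more arms. The sketch below uses the standard two-sample normal approximation, $MDE = (z_{1-\alpha/2} + z_{power})\,\sigma\sqrt{2/n}$ with $n$ observations per arm; equal allocation across arms, the optional Bonferroni correction, and the total sample of 1,000 are all assumptions for illustration.

```python
from statistics import NormalDist


def mde_per_comparison(total_n, n_arms, sigma=1.0, alpha=0.05, power=0.80,
                       bonferroni=False):
    """Minimum detectable effect for one treatment-vs-control comparison
    when total_n is split equally across n_arms (control included)."""
    n_per_arm = total_n / n_arms
    comparisons = n_arms - 1  # each treatment arm against the control
    if bonferroni and comparisons > 1:
        alpha = alpha / comparisons
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sigma * (2 / n_per_arm) ** 0.5


# Hypothetical budget of 1,000 participants; effects in standard-deviation units.
for k in (2, 3, 5, 9):
    plain = mde_per_comparison(1000, k)
    adjusted = mde_per_comparison(1000, k, bonferroni=True)
    print(f"{k} arms: MDE = {plain:.3f} SD, Bonferroni-adjusted = {adjusted:.3f} SD")
```

An academic aiming at 'p<0.01 on each comparison' pays the full price shown here for every added arm; a decision-maker who only needs to pick a reasonably good message may accept a much larger MDE per comparison, which is one way to reconcile the two intuitions in the conversation above.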

14.3.3 Other notes and resources

ICYMI: Recording of my talk on experimental design in the Chamberlain seminar, with discussions by dmckenzie001 and Max Tabord-Meehan: https://t.co/xKtTrH8X1U https://t.co/e7Cq90D6Sl
— Maximilian Kasy (maxkasy) May 25, 2020