# 5 Robustness and diagnostics, with integrity; Open Science resources

## 5.1 (How) can diagnostic tests make sense? Where is the burden of proof?

Where a particular assumption is critical to identification and inference … iFailure to reject the violation of an assumption is not sufficient to give us confidence that it is satisfied and the results are credible. Authors frequently cite insignificant statistical tests as evidence in support of a substantive model, or of evidence that they do not need to worry about certain confounds. Although the problem of induction is difficult, I find this approach inadequate. Where a negative finding is given as an important result, the authors should also show that their parameter estimate is tightly bounded around zero. Where it is cited as evidence they can ignore a confound, they should provide evidence that they can statistically bound that effect is small enough that it should not reasonably cause an issue (e.g., as using Lee or McNemar bounds for selective attrition/hurdles).

I am concerned with the interpretation of diagnostic testing, both in model selection, and in the defense of the exclusion restrictions or identification assumptions. It is problematic, when the basic consistency of the estimator (or a main finding of the paper) critically depends on such tests failing to reject a null hypothesis, to merely state that the ‘test failed to reject, therefore we maintain the null hypothesis.’

How powerful are these tests?

I.e. what is the probability of a false negative Type II error?

How large a bias would be compatible with reasonable confidence intervals for these tests?

### 5.1.1 Further discussion: the DiD approach and ‘parallel trends’

BB 17/06:

[In general] economists do … suggest that they have evidence which allows them to accept null hypotheses. The motivating example would be in the case of DiD estimation, where we have to assume parallel trends. My hunch is that fairly frequently people just say “yes we have evidence the trends were parallel pre-intervention” and then point away to something that shows a \(p>0.05\) on pre-interenvetion differences in trend. Sometimes this is done better where they say ‘yeah here is a (relatively) tightly estimated zero effect.’ But what do we accept as good enough to show that?

Building on that point, it seems that we do need something to tell us when DiD is valid and yet don’t yet have something. My hunch is that the answer may well lie in Bayesian approaches. DR’s suggestion was, I think, that we could maybe simultaneously try and estimate the difference in trends both before and after intervention simultaneously.

DR: My idea was that a Bayesian approach would have involve a parameter for ‘differential trend,’ with a prior (and then posterior) distribution over both this parameter and the ultimate estimate. I’ll look to find comparable approaches in the literature.

BB: I was wondering about a less sophisticated approach in which we set up the prior to be not-parallel trends and see over what range of “not parallel” priors the posterior is pulled closer to the idea of there actually being a parallel trend.

DR: It’s an interesting reframe, to place the burden of proof on the researcher; sort of ‘rejecting the null of a non-parallel trend.’ But how can this work? And how to deal with the idea that the null could be a differential trend *in either direction*?

@BB: And perhaps we should try to have this discussion using the specific equations and assumptions of DiD?

BB: I think similar to what DR suggested above is what we used in the Bayesian approaches to do for Covid. Seems to me that the intervention in this case is the lockdown. There we assume that when a country locks down it reduces the R, and we estimate two different Rs.

DR: Can you define this with a few equations to fix ideas? It’s hard to follow.

BB: Thinking about that approach here, we could try and estimate a pre-intervention relationship between the two groups, and then introduce a post-intervention term which is asked to explain the residual of the variation in the intervention group post-treatment.

BB: Again, I think they key would be trying to establish ‘over what prior assumptions for the pre-treatment trend is the posterior outcome for post-treatment effect in a particular direction.’

DR: Yes, essentially, we need to consider the sensitivity of the estimates to the assumption!

But I think before we get into this we should definitely look at the tweet on DiD that you included as well as the drive folder of Patrick Button

Just to add that there seem to some further papers we should take a look at.

Specifically: This one and this one

## 5.2 Estimating standard errors

## 5.3 Sensitivity analysis: Interactive presentation

## 5.4 Supplement: open science resources, tools and considerations

I'm in psychology research (just finished PhD) and I want to get into #OpenScience to make sure I'm following best practices. But this is something that wasn't explicitly taught to me. What are some good resources? Thanks! @AcademicChatter @OSFramework @OpenAcademics @dsquintana

— Alessa Teunisse (@alessateunisse) April 21, 2020

A couple of months ago I made a guide on how to use Binder to make our #RStats code #reproducible. E.g., Binder will make your code runnable using the versions of R and r-packages used when you analyzed your data. https://t.co/srYNazwy0q #reproducibility #openscience pic.twitter.com/gcTlVFpaY5

— Erik Marsja (Emarsja) December 17, 2019

## 5.5 Diagnosing p-hacking and publication bias (see also meta-analysis)

Ever wonder “Were those results p-hacked?” Brodeur et al. propose a useful new check (“speccheck” on Stata. R/etc. coming soon). #ASSA2020 pic.twitter.com/NCZ1jZTaO5

— Eva Vivalt (evavivalt) January 4, 2020