4 Basic statistical inference and regressions: Common mistakes and issues

4.1 Basic regression and statistical inference: Common mistakes and issues briefly listed

Peer effects: Self-selection, Common environment, simultaneity/reflection (Manski paper) Identification

Random effects estimators show a lack of robustness Specification Clustering SE is more standard practice

OLS/IV estimators not ‘mean effect’ in presence of heterogeneity

Power calculations/underpowered

Selection bias due to attrition

Selection bias due to missing variables – impute these as a solution

Signs of p-hacking and specification-hunting

Weak diagnostic/identification tests

Dropping zeroes in a “loglinear” model is problematic

Random effects estimators show a lack of robustness

Dropping zeroes in a “loglinear” model is problematic

Random effects estimators show a lack of robustness

-(Some notes on multi-level modeling) and RE linked in this Twitter thread

With heterogeneity the simple OLS estimator is not the ‘mean effect’

P_augmented may overstate type-1 error rate

Impact size from regression of “log 1+gift amount”

Lagged dependent variable and fixed effects –> ‘Nickel bias’

Peer effects: Self-selection, Common environment, simultaneity/reflection (Manski paper)

Weak IV bias

Bias from selecting instruments and estimating using the same data

Failure to adjust for multiple-hypothesis testing

4.1.1 Bad control

From MHE:

some variables are bad controls and should not be included in a regression model even when their inclusion might be expected to change the short regression coefficients. Bad controls are variables that are themselves outcome variables in the notational experiment at hand. That is, bad controls might just as well be dependent variables too."

– They could also be interpreted as endogenous variables.

Example of looking at a regression of wages in schooling, controlling for college degree completion:

Once we acknowledge the fact that college affects occupation, comparison of wages by college degree status within occupation are no longer apples to apples, even if college degree completion is randomly assigned."

– The question here was whether to control for the category of occupation, not the college degree.

It is also incorrect to say that the conditional comparison captures the part of the effect of college that is ‘not explained by occupation’ … so we would do better to control only for variables that are not themselves caused by education."

4.1.2 Does ‘controlling for more’ increase the probability that a (controlled) difference between too groups represents a causal effect? Not really (Informal discussion in fold)

Claim: ‘controlling for more factors does not make it more likely that the (remaining) observed association between factors A and D represents a causal relationship’

Suppose we want to know ‘does being raised in Australian culture make you donate more?’ The ‘do this, change that’ hypothesis is ‘if we moved an individual, from birth, into an Australian culture, holding all else constant,’ he/she would donate more. Suppose that 90% share of Australians have a university degree, because Australian culture also encouraged education, while only 50% of non-Australians did. That raises the first issue: do we want to try to get at only ‘direct effects’ of Australian culture (e.g., on generous spirit) without considering the ‘indirect effects’ on education, which itself might increase concern for others and thus charitable giving? This is not clear. But suppose we only wanted to consider the former. And suppose that Australian culture in fact does increase generosity through this direct channel. But how can we measure it? We could compare the average donation of Australians and non-Australians overall. But here we would not be able to disentangle this from ’“the effect of education –> concern –> giving” alternatively 2. We could compare the average donation of Australians and non-Australians “with the same levels of education” i.e., down-weight the Australians with a university degree, counting them only 1/5 as much as Australians without a university degree. However, we would then be overweighting (fivefold!) Australians who, in spite of the big Australian cultural pressure towards university, nonetheless decided not to get a degree. These people may have other (potentially unmeasured) characteristics tied to their hatred for education, that caused them not to go to Uni. these characteristics may also make them less generous. The average donations from these ‘extreme anti-education’ Australians may in fact be lower than the the donations of the average non-Australian. This would then be masking the impact of Australian culture on donations, which (we assumed above) was positive,

4.1.3 “Bad control” (“colliders”)

Endogenous control: Are the control variables you use endogenous? (E.g., because FDI may itself affect GDP per capita)

4.1.4 Choices of lhs and rhs variables

  • Missing data
  • Choice of control variables and interactions
  • Which outcome variable/variables

4.1.5 Functional form

  • Logs and exponentials
  • Nonlinear modeling (and interpreting coefficients) ‘Testing for nonlinear terms’

Quadratic regressions are not diagnostic regarding u-shapedness: Simonsohn18


4.1.6 OLS and heterogeneity

OLS does not identify the ATE

In general, with heterogeneity, OLS does not identify the ATE. It weights observations from different parts of the sample differently. Parts with greater residual variation in the treatment (outcome) variable are more (less) heavily weighted.

E.g., if the treatment is binary, the estimator will most heavily weight those parts of the sample where the probability of treatment is closest to 1/2.

The formula is …

Some intuition

Why is this the case? The OLS type estimators we are taught in Econometrics are ‘BLUE’ under the assumption of a single homogenous ‘effect’ (the ‘slope’… although the discussion itself is often agnostic as to whether this represents a causal effect).

It is ‘best’ in a minimizing MSE sense under certain assumptions; in particular, we must also know the true functional form and the set of variables to be included. See ‘overfitting’ issues.

In order to have the estimate of the true slope that minimizes the squared errors, OLS (and related estimators like FGLS; as well as 2SLS in a more complicated sense) weights some observations more than others. The ‘influence’ of an observation on the estimated slope depends on the nature of the variation in the dependent and independent variables in the region that observation is drawn from. Think of drawing a line through a set of points that were drawn with some noise from the true distribution. If you drew it based on a bunch of points (from a region where) the treatment varies very little and the outcomes have a lot of noise, the line you draw will be very sensitive to the latter noise and thus unreliable. So, would optimally ‘down-weight’ these observations in drawing the line.

However, if the actual slope varies by region, this also means you are under-representing certain regions, and thus getting a biased estimate of the average slope.

How can we deal with this? If we think that the treatment effect varies with observable variables, we could include ‘interactions’; essentially making separate estimates of the slope for each share of the population (but potentially allowing other control variables to have a homogenous effects, and pooled or clustered estimation of underlying variance.)

…Although we may want to consider both overfitting here and the idea that there may be some shared component, so the fully-interacted model may be sub-optimal. See mixed modeling (?)

However, this does not tell us how to recover the average of these slopes (approximately, the ATE). Should we weight each of the slopes by the share of the population that this group represents?

Mechanically, the standard way of estimating and representing these interactions and economics has been with simple dummies (0,1) for each compared group. This yields a ‘base group’ (e.g., males aged 35-60) – this obviously does not recover the average slope– as well as the ‘adjustment’ coefficients.

Another way of expressing interactions, particularly helpful with multi-level interactions is called ‘effect coding’: each group is coded as a ‘difference from 0’ (e.g,. with bunarybinary gender data males are -1/2 and females +1/2), before doing the interactions. This could allow for a more straightforward interpretation: at each level, the uninteracted term represents the average treatment effects, and the interacted terms represent adjustments relative to this average. But under which conditions is this in fact the case?

[insert here].

WB blog - your-go-to regression-specification is -biased-here-s-simple-way-fix-it

A key paper: http://www.jcsuarez.com/Files/Suarez_Serrato-BFE.pdf

In particular, we compare treatment effect estimates using a fixed effects estimator (FE) to the average treatment effect (ATE) by replicating eight influential papers from the American Economic Review published between 2004 and 2009.1 Using these examples, we consider a randomized experiment in Section 1 as a case study and, in Section 3, we show generally that heterogeneous treatment effects are common and that the FE and ATE are often different in statistically and economically significant degrees. In all but one paper, there is at least one statistically significant source of treatment effect heterogeneity. In five papers, this heterogeneity induces the ATE to be statistically different from the FE estimate at the 5% level (7 of 8 are statistically different at the 10% level). Five of these differences are economically significant, which we define as an absolute difference exceeding 10%. Based upon these results, we conclude that methods that consistently estimate the ATE offer more interpretable results than standard FE models

By “FE” here I think they mean group dummies; they are focused on cross-sectional and not panel data!

While fixed effects permit different mean outcomesamong groups, the estimates of treatment effects are typically required to be the same; in more colloquial terms, the intercepts of the conditional expectation functions may differ, but not the slopes


\[y_i = x_i \beta_{g(i)} + \mathbf{z_i}' \gamma + \epsilon_i\]

where $y_i is the outcome for observation i among N [N what?], \(x_i\) is treatment or another variable of interest, and \(z_i\) contains control variables, including group-specific fixed effects.

The treatment effects aregroup-specific for each of the \(g=1,...,G\) groups, where group membership is known for each observation.

Defining ATE

\[\beta^{ATE}=\sum_g \pi_g \beta_g \]

where the \(\pi\) terms are population frequencies

The use of interaction terms is delicate…

  • Modeling heterogeneity: the limits of Quantile regression

4.1.7 “Null effects”

"While the classical statistical framework is not terribly clear about when one should ‘accept’ a null hypothesis, we clearly should distinguish strong evidence for a small or zero effect from the evidence and consequent imprecise estimates. If our technique and identification strategy is valid, and we find estimates with confidence intervals closely down around zero, we may have some confidence that any effect, if it exists, is small, at least in this context. To more robustly assert a ‘zero or minimal effect’ one would want to find these closely bounded around zero under a variety of conditions for generalizability.

In general it is important to distinguish a lack of statistical power from a “tight” and informative null result; essentially by considering confidence intervals (or Bayesian credible intervals). See, e.g., Harms and Lakens (2018), “Making ‘null effects’ informative: statistical techniques and inferential frameworks” (harmsMakingNullEffects2018?).

Psychologists must be able to test both for the presence of an effect and for the absence of an effect. In addition to testing against zero, researchers can use the two one-sided tests (TOST) procedure to test for equivalence and reject the presence of a smallest effect size of interest (SESOI). The TOST procedure can be used to determine if an observed effect is surprisingly small, given that a true effect at least as extreme as the SESOI exists. We explain a range of approaches to determine the SESOI in psychological science and provide detailed examples of how equivalence tests should be performed and reported. Equivalence tests are an important extension of the statistical tools psychologists currently use and enable researchers to falsify predictions about the presence, and declare the absence, of meaningful effects. (Lakens et al, 2018) Confidence intervals and Bayesian credible intervals Comparing relative parameters

E.g., “the treatment had a heterogeneous effect… we see a statistically significant positive effect for women but not for men.” This doesn’t cut it: we need to see a statistical test for the difference in these effects. (And also see caveat about multiple hypothesis testing and ex-post fishing).

See (Gelman and Stern 2006)

4.1.8 Multivariate tests and ‘tests for non-independence’

‘Multiple’ tests typically consider

H0: “None of the following coefficients are different from 0”

H1: “At least one of these coefficients are different from 0”

But of course, ‘rejecting H0’ doesn’t tell us which of these coefficients are distinct from 0, nor how different from zero we should expect the average coefficients to be

Other tests look for ‘associations’ between categorical variables, each potentially involving many categories.

From a discussion on the appropriate adjustment of Chi-Sq tests for categorical variables, and the possible adjustments to this test for multiple response categorical variables

If I am correct, I don’t see a huge value in doing/adjusting any ‘aggregate tests for association between (e.g.) ethnicity and involve mode.’ I say this because I think most people would have a strong belief that there will be at least some association here overall, even if it is a small one, and even if it isn’t ‘caused by ethnicity,’ whatever that could mean.o

The general idea that ‘we expect an association in most cases, so finding one is not particularly informative (without further results) … is mentioned HERE and discussed by Gelman here (context of ’Bayesian hypothesis testing’) and here (in terms of a new typeology of ‘errors’).

4.1.9 Multiple hypothesis testing (MHT)

A typical conversation (unfold)

Namely that when we have multiple hypotheses testing in a paper how do we control for that and what do our p-values really mean. Particulary, there is this paper by Alwyn Young which (I think) is basically saying that when we have a load of hypotheses within a study and then one has a significant p value we are quite a lot of the time picking up a false result. Obviously that isn’t very much of a surprise. But I think it also talks somewhat to the Steffano DellaVigna paper that was presented at the VAFE - that the publication bias really creeps in when we are publishing studies that have just one or two p values around 0.05. After all, it is those studies which have a real pay-off from getting that significant result.

From List et al (2019), discussing experiments reporting tests of multiple treatments, on multiple subgroups, and considering multiple outcomes:

it is uncommon for the analyses of these data to account for the multiple hypothesis testing. As a result, the probability of a false rejection may be much higher than desired. To illustrate this point, consider testing \(N\) null hypotheses simultaneously. Suppose that for each null hypothesis a p value is available whose distribution is uniform on the unit interval when the corresponding null hypothesis is true. Suppose further that all null hypotheses are true and that the p values are independent. In this case, if we were to test each null hypothesis in the usual way at level \(\alpha \in (0, 1)\), then the probability of one or more false rejections equals \(1 − (1 − \alpha)N\), which may be much greater than \(\alpha\) and in fact tends rapidly to one as \(N\) increases. For instance, with \(\alpha = 0.05\), it equals \(0.226\) when \(N = 5\) equals 0.401 when N = 10 and 0.994 when N = 100. In order to control the probability of a false rejection, it is therefore important to account appropriately for multiplicity of null hypotheses being tested.

Winston Lin’s response to an email:

On multiple comparisons corrections, there are some thorny philosophical issues and there isn’t one right answer:

mixed feelings about the recent popularity of multiplicity corrections in the social sciences. I think there’s a tendency to fetishize p-values and null-hypothesis significance testing, and multiplicity-corrected p-values can be hard to interpret. (In principle, one could also report multiplicity-corrected confidence intervals, but I haven’t seen people doing that.) To me, the real value of p-values and NHST (if there is any) is as a restraining device, and if this restraining device isn’t working because of the multiple comparisons issue, there’s more than one way to address the problem. In medical research, people often pre-specify one or two primary comparisons (which implies a commitment that these are the comparisons that’ll be highlighted in the abstract, and the rest are basically treated as exploratory) without doing multiplicity corrections.

On the other hand, maybe you have to do multiplicity corrections because your reviewers want them, or maybe you don’t have one or two primary comparisons between your 7 treatments. When I last followed the multiple comparisons literature (about 4 years ago), my feeling was that if I had to report multiplicity-corrected p-values, I’d use either the Westfall-Young or the Romano-Wolf procedure.

There are some explanations and references in section 7 and footnotes 5-12 of this EGAP guide that Al Fang, Don Green, and I worked on DR: I will try to start a hypothes.is or forum conversation in that Egap guide.


Frequentist NHST: Understated type-one error rate overall if no adjustment, overly narrow confidence intervals (coverage)

  • “Family-wise error rate” (FWER)

  • “False discovery rate” (FDR)


‘FWER Control methods’

  • Bonferroni correction: divides the threshold p-value by the number of tests (\(k\))
    • Extremely conservative, assumes independence of hypotheses/outcomes
  • Holm-Bonferroni
    • slightly less conservative than Bonferroni – orders p-values, applies Bonferroni to the “most significant” result (smallest p-value), and the correction reduces the denominator by 1 for each ‘remaining result’
  • Westfall-Young Step-down:
  • List, Shaikh, Xu (2019): “bootstrap-based procedure for testing these null hypotheses simultaneously using experimental data in which simple random sampling is used to assign treatment status to units.”

    (Does it apply to blocked sampling?)

our procedure (1) asymptotically controls the familywise error rate—the probability of one or more false rejections—and (2) is asymptotically balanced in that the the marginal probability of rejecting any true null hypothesis is approximately equal in large samples. Importantly, by incorporating information about dependence ignored in classical multiple testing procedures, such as the Bonferroni (1935) and Holm (1979) corrections, our pro- cedure has much greater ability to detect truly false null hypotheses. In the presence of multiple treatments, we additionally show how to exploit logical restrictions across null hypotheses to further improve power.

False discovery rate (FDR) Control

  • Benjamini and Hochberg … but this one seems to be conservative in the sense that it implicitly assumes independent hypotheses in making the correction (?)

Considering some of Nik Vetr’s work below (genetics-related)


rlkj <- function (K, eta = 1) {
  alpha <- eta + (K - 2)/2
  r12 <- 2 * rbeta(1, alpha, alpha) - 1 #generates n=1 random 'deviates' from the beta function with parameters alpha ... what does this mean in context?
  R <- matrix(0, K, K) #a K by K matrix of 0's
  R[1, 1] <- 1
  R[1, 2] <- r12
  R[2, 2] <- sqrt(1 - r12^2)
  if (K > 2)
    for (m in 2:(K - 1)) {
      alpha <- alpha - 0.5
      y <- rbeta(1, m/2, alpha)
      z <- rnorm(m, 0, 1)
      z <- z/sqrt(crossprod(z)[1])
      R[1:m, m + 1] <- sqrt(y) * z
      R[m + 1, m + 1] <- sqrt(1 - y)

d <- 500
n <- 100
FT_SE <- 1 / sqrt(n-3)
zscore_thresh_2t <- abs(qnorm(0.025, 0, 1)) #for 2-tailed test

corrmat <- rlkj(K = d, eta = 1)
x <- rmvnorm(n, sigma = corrmat)
obs_corrmat <- cor(x)

#let's just keep it in symmetric matrix form for clarity's sake
atanh_corrmat <- atanh_obs_corrmat <- diag(d)

atanh_corrmat[upper.tri(atanh_corrmat)] <- atanh(corrmat[upper.tri(corrmat)])
atanh_corrmat <- atanh_corrmat + t(atanh_corrmat)

atanh_obs_corrmat[upper.tri(atanh_obs_corrmat)] <- atanh(obs_corrmat[upper.tri(obs_corrmat)])
atanh_obs_corrmat <- atanh_obs_corrmat + t(atanh_obs_corrmat)

diag(atanh_corrmat) <- diag(atanh_obs_corrmat) <- NA

#compute sample z-scores
abs_zscores <- abs(atanh_corrmat - atanh_obs_corrmat) / FT_SE

#look at false positive rate
sum(abs_zscores > zscore_thresh_2t, na.rm = T) / 2 / choose(d, 2)
#huh look at that

#let's just more explicitly confirm in a linear modeling context
y <- x*0 + matrix(rnorm(d*n), nrow = nrow(x), ncol = ncol(x))
Bs <- sapply(1:d, function(i) (solve(t(cbind(1, x[,i])) %*% cbind(1, x[,i])) %*% t(cbind(1, x[,i])) %*% t(t(y[,i])))[2])
Bs_SEs <- sqrt(sapply(1:d, function(i) var(y[,i]) / var(x[,i]) * (1-cor(x[,i], y[,i])^2) / (n-3)))
Bs_Zs <- abs(Bs / Bs_SEs)
sum(Bs_Zs > zscore_thresh_2t) / d

#let's just do one more quick confirmation replicated at low d
#doing this cos lkj(eta=1) concentrates probability density around identity which might throw things off

low_d_replicate <- function(d, n, bonf = F){

  FT_SE <- 1 / sqrt(n-3)
  zscore_thresh_2t <- abs(qnorm(0.025, 0, 1)) #for 2-tailed test
  zscore_thresh_2t_pairwise_bonf <- abs(qnorm(0.025 / choose(d,2), 0, 1)) #for 2-tailed test
  zscore_thresh_2t_totalD_bonf <- abs(qnorm(0.025 / d, 0, 1)) #for 2-tailed test

  corrmat <- rlkj(K = d, eta = 1)
  x <- rmvnorm(n, sigma = corrmat)
  obs_corrmat <- cor(x)

  #let's just keep it in symmetric matrix form for clarity's sake
  atanh_corrmat <- atanh_obs_corrmat <- diag(d)

  atanh_corrmat[upper.tri(atanh_corrmat)] <- atanh(corrmat[upper.tri(corrmat)])
  atanh_corrmat <- atanh_corrmat + t(atanh_corrmat)

  atanh_obs_corrmat[upper.tri(atanh_obs_corrmat)] <- atanh(obs_corrmat[upper.tri(obs_corrmat)])
  atanh_obs_corrmat <- atanh_obs_corrmat + t(atanh_obs_corrmat)

  diag(atanh_corrmat) <- diag(atanh_obs_corrmat) <- NA

  #compute sample z-scores
  abs_zscores <- abs(atanh_corrmat - atanh_obs_corrmat) / FT_SE

  #let's just more explicitly confirm in a linear modeling context
  y <- x*0 + matrix(rnorm(d*n), nrow = nrow(x), ncol = ncol(x))
  Bs <- sapply(1:d, function(i) (solve(t(cbind(1, x[,i])) %*% cbind(1, x[,i])) %*% t(cbind(1, x[,i])) %*% t(t(y[,i])))[2])
  Bs_SEs <- sqrt(sapply(1:d, function(i) var(y[,i]) / var(x[,i]) * (1-cor(x[,i], y[,i])^2) / (n-3)))
  Bs_Zs <- abs(Bs / Bs_SEs)

    return(c(sum(abs_zscores > zscore_thresh_2t_pairwise_bonf, na.rm = T) / 2, sum(Bs_Zs > zscore_thresh_2t_totalD_bonf)))
  } else {
    return(c(sum(abs_zscores > zscore_thresh_2t, na.rm = T) / 2, sum(Bs_Zs > zscore_thresh_2t)))


nrep <- 1E4
d = 5
n = 100
apply(replicate(nrep, low_d_replicate(d, n)), 1, sum) / c(choose(d, 2) * nrep, nrep * d)
apply(replicate(nrep, low_d_replicate(d, n, T) > 0), 1, sum) / nrep
#yep yep seems to hold still! okie-dokie

4.1.10 Interaction terms and pitfalls

See also ‘effect coding’ or ‘contrast coding.’ ‘Moderators’ and their confusion with nonlinearity

Moderators: Heterogeneity mixed with nonlinearity/corners

In the presence of nonlinearity, e.g., diminishing returns, if outcome ‘starts’ at a higher level for one group (e.g., women), it is hard to disentangle a heterogeneous response to the treatment from ‘the diminishing returns kicking in.’ Related to https://datacolada.org/57 [57] Interactions in Logit Regressions: Why Positive May Mean Negative

4.1.11 Choice of test statistics (including nonparametric)

(Or get to this in the experimetrics section)

See also ‘Common statistical tests are linear models’

4.1.12 How to display and write about regression results and tests

4.1.13 Bayesian interpretations of results

4.2 Aside: effect and contrast coding of categorical variables

Suppose we want to do a simple ‘descriptive model of income.’ Suppose we have three groups, North, Central, and South (think US regions).

Comparing otherwise similar groups, auppose average income in the North is 130, Central is 80, and South is 60. Suppose equal group sizes, so mean is 90

There should be a way, in a (linear regression) model, to report coefficients as differences from overall means (in a multivariate context, ‘all else equal’) and get one for each . \(\beta_{North} = 40\)

\(\beta_{Central} = -10\) \(\beta_{South} = -30\)

And skip the intercept, obviously.

This seems most intuitive to me. But I can’t for the life of me figure out how to get this with R’s ‘contrast coding.’ (And also, that seems to mess up the variable names).

Set parameters

m_inc <- 90
b_n <- 40
b_c  <- -10
b_s <- -30

sd_prop <- 0.5 #sd as share of mean

pop_per <- 1000

Simulated data


#later, do this with a list for good code practice
n_income <- rnorm(pop_per, m_inc + b_n, (m_inc + b_n)*sd_prop)
c_income <- rnorm(pop_per, m_inc + b_c, (m_inc + b_s)*sd_prop)
s_income <- rnorm(pop_per, m_inc + b_s, (m_inc + b_s)*sd_prop)

noise_var <- rnorm(pop_per*3, 0, (m_inc + b_s)*sd_prop)

i_df <- tibble::tibble(
  region = rep( c("n", "c", "s"), c(pop_per, pop_per, pop_per) ),
  income = c(n_income, c_income, s_income),
) %>%
  mutate(region = as.factor(region))

i_df %>% group_by(region) %>%
    dplyr::summarise(n = n() ,
              mean = mean(income),
              sd = sd(income))
## # A tibble: 3 x 4
##   region     n  mean    sd
##   <fct>  <int> <dbl> <dbl>
## 1 c       1000  80.1  29.4
## 2 n       1000 131.   67.0
## 3 s       1000  59.6  31.5
i_df %>%                               # Summary by group using purrr
  split(.$region) %>%
## $c
##  region       income      noise_var   
##  c:1000   Min.   :-10   Min.   :-104  
##  n:   0   1st Qu.: 59   1st Qu.: -20  
##  s:   0   Median : 80   Median :   0  
##           Mean   : 80   Mean   :   0  
##           3rd Qu.: 99   3rd Qu.:  20  
##           Max.   :175   Max.   :  81  
## $n
##  region       income      noise_var  
##  c:   0   Min.   :-86   Min.   :-89  
##  n:1000   1st Qu.: 88   1st Qu.:-20  
##  s:   0   Median :132   Median :  0  
##           Mean   :131   Mean   :  0  
##           3rd Qu.:176   3rd Qu.: 20  
##           Max.   :345   Max.   : 88  
## $s
##  region       income      noise_var   
##  c:   0   Min.   :-36   Min.   :-101  
##  n:   0   1st Qu.: 39   1st Qu.: -17  
##  s:1000   Median : 60   Median :   1  
##           Mean   : 60   Mean   :   1  
##           3rd Qu.: 80   3rd Qu.:  21  
##           Max.   :158   Max.   :  90

Looks close enough.

Now I want to ‘model income’ to examine the differences by region ‘controlling for other factors’ (which here are noise, but let’s do it anyways; will make this exercise richer later).

Just to be difficult, let’s make the south the base group.

options(contrasts = rep ("contr.treatment", 2))

i_df <- i_df %>%   mutate(region = relevel(region, ref="s"))

  basic_lm <- i_df %>% lm(income ~ region + noise_var, .)
## Call:
## lm(formula = income ~ region + noise_var, data = .)
## Coefficients:
## (Intercept)      regionc      regionn    noise_var  
##    59.56067     20.56332     71.53226     -0.00566

The standard thing: intercept is (approximately) the mean for the ‘base group,’ the south, and the coefficients ‘regionc’ and ‘regionn’ represent the relative adjustments for these, roughly +20 and +70.

This is standard ‘dummy coding,’ i.e., ‘treatment coding.’ This is the default in R:

## $contrasts
## [1] "contr.treatment" "contr.treatment"

We can adjust this default (for unordered variables) to something called ‘sum contrast coding’ for both unordered and ordered

options(contrasts = rep ("contr.sum", 2))

Running the regression again

  basic_lm_cc <- i_df %>% lm(income ~ region + noise_var, .)
## Call:
## lm(formula = income ~ region + noise_var, data = .)
## Coefficients:
## (Intercept)      region1      region2    noise_var  
##    90.25920    -30.69853    -10.13521     -0.00566

Now this seems to get us the adjustment coefficients we are looking for, but

  1. The names of the regions are lost; how do I know which is which?
  2. It is apparently reporting the adjustment coefficients for s (south) and c (central). Not very intuitive.

This seems to be the case no matter how we relevel the region to set a particular base group (I tried it) … the coefficients don’t change.

What if we de-mean the outcome (income) variable, and force a 0 intercept?

i_df %>%
  mutate(m_inc = mean(income)) %>%
 lm(income - m_inc ~ 0  + region + noise_var, .)
## Call:
## lm(formula = income - m_inc ~ 0 + region + noise_var, data = .)
## Coefficients:
##   regions    regionc    regionn  noise_var  
## -30.69581  -10.13249   40.83645   -0.00566

Yay! This is what I wanted, and miraculously the variable names are preserved. But it seems a weird way to do it. Also note that, with the above code, this set of coefficients appears no matter which contrast matrix I force, whether sum or treatment.

However, this does not carry over easily to multiple dimensions of contrast coding (illustrate).

List of references

Gelman, Andrew, and Hal Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” The American Statistician 60 (4): 328–31.