10 Econometric, statistical, and data issues

10.1 Discussion

This chapter is a work in progress. I am also considering how deeply to delve into it, as much of this material is decently covered in the resources mentioned below.

At present, my aim is to give highlights and to focus on the tips and insights that are most relevant. I will try to address the problems, mistakes, issues, and points of confusion that come up most often in student work.

Many (most?) projects are empirical, involving econometrics/statistics. It doesn’t make sense to give an entire course on Statistics, Econometrics, and Data Science here. However, I think it’s helpful to give somewhat of an applied overview, to:

  • link to key resources,
  • refresh your memory of these topics,
  • help you consider how to do this stuff in a real project (rather than in exams and problem sets),
  • give a sense of one economist’s (my) impressions of how to choose an approach, and
  • point out some common misunderstandings/mistakes students make in this area.

As this cannot be comprehensive, I suggest referring to other resources (texts etc) for more detailed considerations.

10.2 Some recommended applied econometrics and statistics resources

Causal Inference: The Mixtape v 1.8: From what I have seen, this looks great. This is a highly recommended resource for students from a variety of backgrounds and levels. It covers concepts from Econometrics from the ground-up, (re-)covering basic principles. The text focuses on ‘causal inference’ (rather than descriptive analysis, calibrating models, or prediction). Causal inference is at the heart of most (but not all) modern empirical work in microeconomics.

And it’s (for now) FREE and open-source. Look on Professor Cunningham’s page for most updated (free online) version. There are numerous examples and example code in Stata. Future editions will use R, they tell me.

It often takes an intuitive or example-based approach. There are few proofs or derivations (for better or for worse).

  • Core econ: Doing Economics

  • Angrist and Pischke: Mastering metrics

  • Angrist, J. D., and J. S. Pischke. Mostly Harmless Econometrics: An Empiricist’s Companion.

  • Peter Kennedy’s “A Guide to Econometrics”: Highly recommended guide; slightly older, but great intuition and practical tips exceeding what I have found in comparable guides.

  • “Introductory Econometrics” by Wooldridge: A very good textbook with all the proofs etc.

Time series

Economics 452 time series with stata econ.queensu.ca/faculty/gregory/econ452/manual.pdf

Working with economic and financial data in Stata (Chris Baum)

Kennedy, A Guide to Econometrics, Chapter 18, “Time Series Econometrics”

In Stata

Stata Web Books: Regression with Stata

See also Causal Inference: The Mixtape v 1.8, mentioned above – look for most updated (free online) version

With R

R for data science

10.3 The ‘ideal Econometric approach’*

I sketch my impression of the ideal process for Econometric modeling and ‘identification’ of (causal) effects.

We rarely attain this ideal.

Note that in principle, all of this “should” be done before you actually analyze your data! This is not a process of ‘iteration’ between modeling and estimation. (Although there are arguments for different approaches…)*

* There are alternative paradigms; see, e.g., ‘Statistical Rethinking’, as well as arguments for more exploratory and flexible Econometric approaches.

1. Economic theory: Present/outline a theoretical (optimization, game theoretic, market equilibrium,… ) model of the relationships between the ‘fundamentals of interest’ and how these lead to observed outcomes (and data).

Outcomes in this model should depend on

  • the unknown parameters to be estimated and/or

  • the fundamental hypothesized model being tested (versus alternative model) *

* Contrast: maintained versus tested assumptions.

2. Data and estimator:

Data generating process: ‘How does the (Economic theory) model generate the observable data?’ You may need to add (and justify) additional ‘statistical’ assumptions about (e.g.) the error structure/distribution.

Estimator: Formally specify your estimator (e.g., OLS, 2SLS, DiD, VECM)

Justification: Prove that, under the maintained assumptions, your estimator is unbiased (or consistent) for the ‘fundamental parameter of interest’.*

*It is more common, and perhaps advisable, to adopt an approach that has been justified in prior work, and to demonstrate that the justification should apply equally to your case.

10.4 Regression analysis, regression logic and meaning

Below, I briefly outline some key issues in applied (practical) Econometrics. These are better-covered in more detail in the references listed above.

We need to consider:

  • What is a regression? When should you use one?

  • How to specify the regression?

  • Which dependent variable do we use?

  • Which right-hand side variables?

    • Which is/are the focal variable(s) and which are ‘control variables’?
  • Endogeneity and identification

  • Other statistical issues (e.g., functional form, error structure)

How will you interpret your results?

How to create a regression table and put it in your paper.

Writing about regression (and statistical) analysis; yours and others’.

What is a regression? When should you ‘run’ one?

A way of fitting a line (plane) through a bunch of dots.

  • In multiple dimensions

  • It may have a causal interpretation (or not)

Classical Linear Model (CLM): Population model is linear in parameters: \[y = \beta_0+\beta_1x_1 +\beta_2x_2+...+\beta_kx_k+u.\]

OLS: Estimating Actual Linear Relationship?

  • Best linear approximation; ‘average slopes’

  • Causal or not

Identifying restrictions; CLM model assumptions

Some coefficients/tests depend on normality; others rely on an “asymptotic” justification with a large enough sample.

“… Regression coefficients have an ‘average derivative’ interpretation. In multivariate regression models this interpretation is unfortunately complicated by the fact that the OLS slope vector is a matrix-weighted average of the gradient of the CEF. Matrix-weighted averages are difficult to interpret except in special cases (see Chamberlain and Leamer, 1976).”

(Angrist and Pischke 2008)
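To make the “fitting a line through dots” idea concrete: with a single regressor, the OLS slope is just \(Cov(x,y)/Var(x)\). A minimal illustrative sketch in Python (my own toy numbers; the resources above do this in Stata or R):

```python
# OLS with a single regressor: fit a line through a cloud of dots.
# slope = Cov(x, y) / Var(x);  intercept = mean(y) - slope * mean(x)

def ols_slope_intercept(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - mx) ** 2 for xi in x) / n
    slope = cov_xy / var_x
    return slope, my - slope * mx

# Dots lying exactly on y = 2 + 3x are fit exactly:
x = [0, 1, 2, 3, 4]
y = [2, 5, 8, 11, 14]
print(ols_slope_intercept(x, y))  # (3.0, 2.0)
```

With real (noisy) data the fitted line is the best linear approximation to the cloud; whether the slope has a causal interpretation is a separate question.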

10.5 How to specify a regression – some considerations

Regression fishing, overfitting, p-hacking, researcher degrees of freedom, and pre-analysis plans.

For most work, you are not merely trying to ‘best fit the data’, and you are surely not trying to just ‘best fit the sample of data you are using.’ You are trying to estimate one or more ‘parameters’ for the overall population, expressing a fundamental economic relationship (as well as the heterogeneity in this relationship). You often want to measure causal relationships in particular.

It may be tempting to continue to vary the regression specification and details: the choice of variables and their functional form, the part of the data you choose to look at or discard, etc., until you find a ‘significant result’ or ‘maximize R-squared.’ But this should not be your goal.8 You want to find the true parameters and measure the actual relationship in the population, not just gain a best-fit in your sample.

These are deep issues that require greater consideration. But a reasonable way to avoid falling into these pitfalls may be to specify a ‘pre-analysis plan.’ Once you have a basic idea of what your data consists of, but before you have ‘peeked at the results of regressions’, it may be advisable to specify (and write down) an approximate or exact plan. Which functional form(s) will you run and report, which control variables, what is your dependent variable, will you look at particular interaction relationships, etc. If you write down and justify such a plan at the beginning, you can ‘tie your hands’ to do an honest analysis.

This can give your work greater scientific credibility, particularly if you can specify and ‘register’ such a plan (e.g., at osf.io or aspredicted.org) in advance of obtaining your data, and you can demonstrate this to others. E.g., this is increasingly becoming a strong norm in experimental work.

Some more resources on this:

Functional form?

You are estimating a relationship between two or more variables. You might have some hypotheses about which variables should ‘matter’ in causing or predicting other variables. But you must also answer another question: what ‘functional form’ should your regression take?

For example…

  • Should it be linear? … This would lead to statements such as ‘each year of schooling increases annual wage income by $5000 on average.’

  • Should it be proportional, as in a log-log specification? … leading to statements such as “each 1% increase in ‘years of schooling’ increases annual wage income by 2/10 of a percent on average.”

Should there be quadratic relationships? Should we allow ‘floor and ceiling effects’… where the impact of any variable diminishes as the outcome approaches certain values? Some specifications can allow flexible functional form, and there are even fully ‘nonparametric’ techniques, allowing any possible relationship, although these have their limitations.

Traditionally, economists have advocated using fundamental theory to determine which functional forms to use (as well as which ‘restrictions’ to impose on parameters).9

However, in most practical cases, at least in microeconomics and applied work, there is no ‘known’ functional form. We may use intuition (e.g., it is often reasonable to expect effects to be proportional to the level of an outcome variable) or rely on previous evidence.

Still, in many cases there is a reasonable argument for fitting the ‘best linear approximation’ of the relationship [ref: Angrist and Pischke, page…].

Impose restrictions?

Which dependent variable?

  • Is this meaningful to your question and interpretable?

  • Is it relevant to what you are looking for (e.g., available for right years and countries)?

  • Is it reliably collected?

  • Specified variables in logs? Linearly? Categorically?

  • Aggregated at what level?

Which right-hand side (rhs) variables? The focal variables and control variables

Typically, you care about:

The effect of one (or a few) independent variable on the dependent variable,

e.g., education on wages.

(Although you might have more complicated hypotheses/relationships to test, involving differences between coefficients etc.)

\(\rightarrow\) You should focus on credibly identifying this relationship.

Other rhs variables are typically controls (e.g., control for parent’s education, control for IQ test scores).

Be careful not to include potentially “endogenous” variables as controls, as this can bias all coefficients (more on this later).

Be careful about putting variables on the right hand side that are determined after the outcome variable (Y, the dependent variable).

10.6 Endogeneity

You care about estimating the impact of a variable \(x_1\), on \(y\).

Consider the example of regressing income at age 30 on years of education to try to get at the effect of education on income.

\(x_1\): years of education

\(x_2 ... x_k\): set of “control” variables

\(y\): income at age 30

You regress: \[y = \beta_0+\beta_1x_1 +\beta_2x_2+...+\beta_kx_k+u.\]

Suppose the true relationship (which you almost never know for sure in Economics) is: \[y = \beta_0+\beta_1x_1 +\beta_2x_2 + v.\]

For unbiasedness/consistency of all your estimated terms, the key requirement is: \[E(u|x_1, x_2, \ldots, x_k) = 0,\] which implies that all of the explanatory variables are exogenous.

Alternatively, your estimates are still ‘consistent’ under the weaker conditions \(E(u) = 0\) and \(cov(x_j,u) = 0\), for \(j = 1, 2, \ldots, k\).

There are various reasons why the above assumption might not hold; various causes of what we call “endogeneity”. Two examples are reverse causality and omitted variable bias.

Does Economics tell us anything about the ‘true model’… (unfold)

There are few practical cases in Economics where we can confidently assert either (1) the functional form of a relationship (linear, nonlinear, etc.) or (2) the ‘variables’ that enter into this relationship, i.e., the arguments to the function. How could theory tell us for sure what all the meaningful factors are that affect a person’s income?

Theory does sometimes provide ‘exclusion restrictions’, essentially, variables that are asserted to not have an impact on the outcome. Theory may also suggest functional forms (perhaps linear or proportional relationships); but this is relatively rare, and not always robust to behavioral-economics-driven relaxations of assumptions.

For example, simple consumer optimisation implies that the amount of a good an individual consumes is ‘homogeneous of degree zero’ in prices and income: if both double, the consumption of goods must stay the same. This could allow us to impose a restriction if we are estimating a demand system. However, some behavioral economic evidence suggests that even the most basic axioms of micro theory, such as transitivity, may not always hold. We may have violations of the ‘axioms of revealed preference’.

Theory may also suggest which variables should be present, i.e., which variables are likely to have an impact on the outcome. E.g., in most models of labor markets an individual’s marginal productivity will impact her wage, but the nature of this relationship may depend on the market power of employers in her industry, on the nature of search costs, etc. Education may affect wages both through the productivity (‘human capital’) channel and through its signaling value; thus we might expect obtaining a degree to have a greater impact than an additional year of schooling alone.

More simply, we expect prices and income to be important determinants of quantity demanded (or expenditure) in any market and for any individual. However, we can rarely simply ‘regress quantity demanded on price’ to estimate a demand relationship. Remember, prices and quantities are themselves endogenous variables, determined by a system of supply and demand functions.

Reverse causality

Education may affect income at age 30, but could income at age 30 also affect years of education?

This is probably not a problem for this example, because the education is usually finished long before age 30 (even I finished at age 30 on the nose).

However, in other examples it is an issue (e.g., consider regressing body weight on income, or vice versa).

Also, if the measure of education were determined years later, this might be a problem. For example, if your measure of years of education was based on self-reports at age 30, maybe those with a lower income would under-report, e.g., if they were ashamed to be waiting tables with a Ph.D.

or a third, omitted factor may affect both

Intelligence may affect both the education obtained and income at age 30

Macro/aggregate: With variation across time, there may be a common trend. E.g., suppose I were to regress “average income” on “average education” for the UK, using only a time series with one observation per year. A “trend term”, perhaps driven by technological growth, may be leading to increases in education as well as increased income.

The omitted variable bias formula; interpreting/signing the bias

You care about estimating the impact of a variable \(x_1\), on y, e.g.,

\(x_1\): years of education

y: income at age 30

You estimate

\[y = \beta_0+\beta_1x_1 + u\]

But the true relationship is:

\[y = \beta_0+\beta_1x_1 +\beta_2x_2 + v,\]

Where \(x_2\) is an unobserved or unobservable variable, say “intelligence” or “personality”.

Your estimate of the slope is likely to be biased (and “inconsistent”).

The “omitted variable (asymptotic) bias” is:

\[\text{plim } \hat{\beta}_1 = \beta_1 + \beta_2\delta,\]

where

\[\delta = Cov(x_1,x_2)/Var(x_1)\]

In other words, the coefficient you estimate will “converge to” the true coefficient plus a bias term.

The bias is the product:

[Effect of the omitted variable on the outcome] \(\times\) [“effect” of omitted variable on variable of interest]

E.g., [effect of intelligence on income] \(\times\) [“effect” of intelligence on years of schooling]

This can be helpful in understanding whether your estimates may be biased, and if so, in which direction!

This is also a helpful mechanical relationship between “short” and “long” regressions, whether or not there is a causal relationship.
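This short-vs-long regression identity can be checked numerically. A sketch in Python with hypothetical, noiseless data (all parameter values and variables below are made up for illustration):

```python
# The short-vs-long regression identity: regressing y on x1 alone
# yields beta1 + beta2 * delta, where delta = Cov(x1, x2) / Var(x1).
# The data below are hypothetical and noiseless, so the identity holds exactly.

def slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var = sum((a - mx) ** 2 for a in x) / n
    return cov / var

beta0, beta1, beta2 = 1.0, 2.0, 5.0   # 'true' parameters (made up)
x1 = [1, 2, 3, 4, 5]                  # e.g., years of education
x2 = [1, 1, 2, 3, 5]                  # omitted factor, correlated with x1
y = [beta0 + beta1 * a + beta2 * b for a, b in zip(x1, x2)]

short = slope(x1, y)                  # 'short' regression: y on x1 only
delta = slope(x1, x2)                 # 'effect' of x1 on x2
print(short, beta1 + beta2 * delta)   # the two coincide: bias = beta2 * delta
```

Since the omitted \(x_2\) rises with \(x_1\) and has a positive effect, the short-regression slope overstates \(\beta_1\), exactly by \(\beta_2\delta\).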

Control strategies

Control for “\(x_2-x_k\)” variables that have direct effects on y; this will reduce omitted variable bias (if these variables are correlated to your “\(x_1\)” of interest)

Including controls can also make your estimates more precise.

If you put in an “\(x_k\)” variable that doesn’t actually have a true effect on y, it will make your estimates less precise. However, it will only lead to a bias if it is itself endogenous (and correlated with your \(x_1\) of interest).

If you can’t observe these, you may use “proxies” for these to try to reduce omitted variable bias. E.g., IQ-test scores may be used as proxies for intelligence. Housing value might be used as a proxy for wealth.

“Bad control”

some variables are bad controls and should not be included in a regression model even when their inclusion might be expected to change the short regression coefficients. Bad controls are variables that are themselves outcome variables in the notional experiment at hand. That is, bad controls might just as well be dependent variables too. - (Angrist and Pischke 2008)

– They could also be interpreted as endogenous variables.

Once we acknowledge the fact that college affects occupation, comparisons of wages by college degree status within occupation are no longer apples to apples, even if college degree completion is randomly assigned.

– The question here was whether to control for the category of occupation, not the college degree.

It is also incorrect to say that the conditional comparison captures the part of the effect of college that is ‘not explained by occupation’,

so we would do better to control only for variables that are not themselves caused by education. - Angrist and Pischke

Fixed effects estimators (de-meaning)

The net effect of omitted variables and the truly random term may have “fixed and varying components”. There may be a term “\(c_i\)” that is specific to an individual or “unit” but does not change over time (a “fixed effect”). For example, an individual may be more capable of earning, a firm may have a particularly good location, and a country may have a particularly high level of trust in institutions. There may also be a term that varies across units and over time: an individual may experience a particular negative shock to her income, a firm may be hit by a lawsuit, and a country may have a banking scandal.

If this \(c_i\) part of the “error term” may be correlated with the independent variable of interest, \(x_1\), it may help to “difference it out” by doing a Fixed Effects regression. This essentially includes a dummy variable for each individual (or “unit”), but these dummies are usually not reported. The resulting coefficients are the same ones you would get if you “de-meaned” every x and y variable before running the regression.

When we demean our equation it becomes: \[y_{it}-\bar{y_i} = \beta_1(x_{it}-\bar{x_i}) + u_{it}-\bar{u_i}\]

where the bars indicate “the mean of this variable for individual i”. Due to the fact that \(c_i\) does not change over time, when we demean our equation this fixed effect drops out.
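A toy Python illustration of this within transformation (a hypothetical two-unit, three-period panel with no noise; in practice you would use a canned fixed-effects routine in Stata or R):

```python
# The within (fixed-effects) transformation: de-mean x and y within each
# unit, then run pooled OLS; the unit-specific c_i drops out.

def slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var = sum((a - mx) ** 2 for a in x) / n
    return cov / var

beta = 1.5
panel = {
    "A": {"c": 10.0, "x": [1.0, 2.0, 3.0]},   # c_i: unit A's fixed effect
    "B": {"c": -4.0, "x": [2.0, 4.0, 6.0]},
}
x_dm, y_dm = [], []
for unit in panel.values():
    y = [unit["c"] + beta * xi for xi in unit["x"]]   # y_it = c_i + beta*x_it
    mx = sum(unit["x"]) / len(unit["x"])
    my = sum(y) / len(y)
    x_dm += [xi - mx for xi in unit["x"]]
    y_dm += [yi - my for yi in y]

print(round(slope(x_dm, y_dm), 3))   # recovers beta = 1.5 despite the c_i
```

Here the \(c_i\) are negatively correlated with x (the high-fixed-effect unit has low x), so pooled OLS on the raw data would be badly biased; the within estimator is not.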

Instrumental variables

A variable Z that satisfies both:

  1. Instrument exogeneity: Z “causes” the \(x_1\) variable of interest but has no independent effect on Y: \[Cov(z,u) = 0\]

  2. Instrument relevance: Z is correlated with the \(x_1\) variable of interest such that, \[Cov(z,x) \neq 0\]

… may be used as an “instrument”.

For example, it might be argued (debatably) that if one’s parents had a job near a good university, this would increase one’s chances of going to a good university. To use “distance to nearest university” as an instrument you would have to argue that:

  1. there is no direct effect of living near a good university on later income.

  2. The probability of living near a good university is not caused by a third unobserved factor (e.g., parent’s interest in children’s success) that might also affect later income.

As suggested above, it is hard to find a convincingly ‘valid’ instrument. This “exclusion restriction” cannot itself be easily tested, and is largely justified theoretically. (If you have multiple instruments there is something called an ‘overidentification test’ but it is controversial).

On the ‘test of overidentifying restrictions’ (unfold)…

If you have multiple instruments there is the ‘J’ test from Hansen and Sargan. Essentially, this tests whether the estimates are substantially different when you use the different instruments separately. If all the instruments are valid, the estimates should all be approximately the same. But there are issues here: if some instruments are ‘weak’, or the sample is small relative to the underlying variation, each estimate may be imprecise. This makes the ‘test for a significant difference between the estimates’ unlikely to reject the null hypothesis of no difference even when there is a difference; it becomes a low-powered test.

Perhaps even more critically, the test is only reasonable if you already know one of these instruments is valid. Intuitively, if I have two non-valid instruments that are both biased in similar ways, they will yield similar (but incorrect) estimates.

However, it seems plausible to me that if this test does reject the null hypothesis (that all instruments are valid), this is indeed a cause for concern.

In addition, there are other issues with IV techniques that some argue make them unreliable. In particular, consider (and read about) issues of

  • weak instruments

  • heterogeneous effects (heterogeneity), differential ‘compliance’, and the ‘Local Average Treatment Effect’ (‘LATE’).


One form of instrumental variables (IV) technique is called “two stage least squares” (2SLS). This essentially involves regressing \(x_1\) on Z (and other controls) and obtaining a predicted value of \(x_1\) from this equation, and then regressing Y on this (and the same set of other controls) but “excluding” Z from this second-stage regression.

You should generally report both the first and second stages in a table, and “diagnostics” of this instrument.
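With a single instrument and no other controls, 2SLS reduces to the simple ‘Wald’/IV ratio \(Cov(z,y)/Cov(z,x)\). A sketch with hypothetical numbers, constructed so that the error is correlated with x (endogeneity) but not with the instrument z:

```python
# IV ('Wald') estimator with one instrument: beta_IV = Cov(z, y) / Cov(z, x).
# Toy data built so that Cov(z, u) = 0 (exogeneity) and Cov(z, x) != 0
# (relevance), while Cov(x, u) != 0 makes OLS biased.

def cov(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

z = [1.0, 2.0, 3.0, 4.0]       # instrument
u = [1.0, -1.0, -1.0, 1.0]     # error, built to be uncorrelated with z
x = [zi + ui for zi, ui in zip(z, u)]   # x is endogenous: it moves with u
beta = 3.0
y = [beta * xi + ui for xi, ui in zip(x, u)]

beta_ols = cov(x, y) / cov(x, x)   # biased: picks up Cov(x, u)
beta_iv = cov(z, y) / cov(z, x)    # recovers the true beta
print(round(beta_ols, 3), beta_iv)  # 3.444 3.0
```

The first-stage relevance here is simply \(Cov(z,x) \neq 0\); in a real application you would report the first-stage regression and its diagnostics as described above.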

Some other issues to consider and read about

  • Time series (and panel) data: issues of autocorrelation, lag structure, trends, non-stationarity

  • Non-normal error terms, small samples (beware of results that require ‘asymptotics’)

  • Categorical dependent variable: consider Logit/Probit if binary, Multinomial logit if categorical, Poisson if ‘count’ data; other variants/models

  • Bounded/censored dependent variable: Consider Tobit and other models

  • Sample selection issues; self-selection, selectivity, attrition, etc.

  • Missing values/variables and Imputation

  • Errors in variables (classical, otherwise)

  • The meaning of R-squared; when it is useful/important? (Hint: a ‘high R-sq’ is not always good, nor vice versa)

  • Reporting meaningful estimates with nonlinear functional forms

I maintain a further listing of issues, both simple and advanced, with references and discussion… Here

You can see a list of some of my common critiques of empirical work in the folds below (with some emphasis on experimental work)…

  1. Identification issues (unfold)

Endogenous control (Angrist and Pischke)… “Colliders” is the new hip word for this, I believe (see Judea Pearl)

IV not credible (see discussion above)

Control strategy not credible: In the causal inference context, a ‘control strategy’ is “control for all or most of the reasonable determinants of the independent variable so as to make the remaining unobservable component very small, minimizing the potential for bias in the coefficient of interest”. All of the controls must still be exogenous; otherwise this itself can lead to a bias. There is some discussion of how to validate this approach; see, e.g., (Oster 2019).

Weak diagnostic/identification tests:

Where a particular assumption is critical to identification and inference, failure to reject the violation of that assumption is not sufficient to give us confidence that it is satisfied and the results are credible. At several points the authors cite insignificant statistical tests as evidence in support of a substantive model, or as evidence that they do not need to worry about certain confounds. Although the problem of induction is difficult, I find this approach inadequate. Where a negative finding is given as an important result, the authors should also show that their parameter estimate is tightly bounded around zero. Where it is cited as evidence that they can ignore a confound, they should provide evidence statistically bounding that effect, showing it is small enough that it should not reasonably cause an issue (e.g., using Lee or Manski bounds for selective attrition/hurdles).

“Conditional on positive”/“intensive margin” Analysis ignores selection: See (Angrist and Pischke 2008) on ‘Bad CoP’. See also bounding approaches such as (Lee 2018)

Mediators: Heterogeneity mixed with nonlinearity/corners: In the presence of nonlinearity, e.g., diminishing returns, if the outcome ‘starts’ at a higher level for one group (e.g., women), it is hard to disentangle a heterogeneous response to the treatment from ‘the diminishing returns kicking in’.

Related to DataColada 57 Interactions in Logit Regressions: Why Positive May Mean Negative

FE/DiD does not rule out a correlated dynamic unobservable, causing a bias

Selection bias due to attrition or censored outcome (differential by treatment) One response: bounding approach (Lee, 2009)

Selection bias due to missing variables – imputation is one possible response

Lagged dependent variable and fixed effects –> ‘Nickell bias’

Peer effects: Self-selection, Common environment, simultaneity/reflection (Manski paper)

Random effects estimators show a lack of robustness. Clustering SE is more standard practice

  2. ‘Treatment effects’ issues (unfold)

Ignores “LATE” nature of IV estimator in presence of heterogeneity

See various discussions, e.g., in (Angrist and Pischke 2008)

With heterogeneity the simple OLS estimator is not the ‘mean effect’

See (Angrist and Pischke 2008), also World bank blog discussion here

  3. Basic statistical concepts and issues (unfold)

Imprecise/weak “Null result”

While the classical statistical framework is not terribly clear about when one should “accept” a null hypothesis, we clearly should distinguish strong evidence for a small or zero effect from a mere lack of evidence and the consequent imprecise estimates. If our technique and identification strategy are valid, and we find estimates with confidence intervals tightly bounded around zero, we may have some confidence that any effect, if it exists, is small, at least in this context. To more robustly assert a “zero or minimal effect”, one would want to find these tightly bounded around zero under a variety of conditions, for generalizability.


Failure to adjust significance for multiple-hypothesis tests

See ‘Bonferroni corrections’ and other adjustments.

A recent discussion and approach, for the experimental context, can be found in (List, Shaikh, and Xu 2019).
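As a sketch of what such adjustments do, here are the standard Bonferroni and Holm (step-down) corrections in plain Python (toy p-values; statistical packages provide these built in):

```python
# Bonferroni and Holm p-value adjustments for m hypothesis tests.
# Toy p-values, for illustration only.

def bonferroni(pvals):
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, pvals[i] * (m - rank)))
        adjusted[i] = running_max       # enforce monotonicity in rank
    return adjusted

p = [0.01, 0.04, 0.03]
print([round(v, 4) for v in bonferroni(p)])  # [0.03, 0.12, 0.09]
print([round(v, 4) for v in holm(p)])        # [0.03, 0.06, 0.06]
```

Holm is uniformly less conservative than Bonferroni while still controlling the familywise error rate, which is why it is often preferred.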

Mediators and interactions: failure to test for difference in effect

E.g., “the treatment had a heterogeneous effect… we see a statistically significant positive effect for women but not for men”. This doesn’t cut it: we need to see a statistical test for the difference in these effects. (And also see above caveat about multiple hypothesis testing).
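For coefficients estimated on independent samples (an assumption; within a single regression, the t-test on the interaction term does this directly), a common back-of-the-envelope test of the difference is \(z = (b_1 - b_2)/\sqrt{se_1^2 + se_2^2}\). An illustrative sketch with made-up estimates:

```python
# z-test for the difference between two coefficients estimated on
# independent samples: z = (b1 - b2) / sqrt(se1^2 + se2^2).
# The estimates below are made up for illustration.
import math

def z_diff(b1, se1, b2, se2):
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)

b_women, se_women = 0.50, 0.20   # t = 2.5: 'significant' at 5%
b_men, se_men = 0.20, 0.25       # t = 0.8: 'not significant'
z = z_diff(b_women, se_women, b_men, se_men)
print(round(z, 2))   # 0.94: the *difference* itself is far from significant
```

This is precisely the trap: one estimate clears the significance bar and the other does not, yet the difference between them is nowhere near significant.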

Functional form not appropriate

**Dropping zeroes in a “loglinear” model is problematic**

Failure to justify modeling choices: You need to justify your modeling choices; do not merely cite ‘precedent’.

  4. Robustness, replicability, and the ‘New Statistics’ ((Cumming 2014)) (unfold)

Signs of p-hacking and specification-hunting

See (Simonsohn, Nelson, and Simmons 2014).

Power calculations/underpowered

Especially relevant to experimental work.

One worries about underpowered tests. Your result may have relatively large effect sizes that are still insignificant, which makes one wonder whether the study has low power. Low-powered studies undermine the reliability of our results.

(Button et al. 2013) point out that running lower-powered studies reduces the positive predictive value—the probability that a “positive” research finding reflects a true effect—of a typical study reported to find a statistically significant result. In combination with publication bias, this could lead to a large rate of type-1 error in our body of scientific knowledge (false-positive cases, where the true effect was null and the authors had a very “lucky” draw). True non-null effects will be underrepresented, as underpowered tests will too often fail to detect (and publish) these. Furthermore, in both cases (true null, true non-null), underpowered tests will be far more likely to find a significant result when they have a random draw that estimates an effect size substantially larger than the true effect size. Thus, the published evidence base will tend to overstate the size of effects.
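A quick Monte Carlo sketch of this point (illustrative only; it uses the normal critical value 1.96 rather than the exact t cutoff): the share of simulated samples in which a true effect is detected rises sharply with sample size.

```python
# Monte Carlo sketch of statistical power: the share of simulated samples
# in which a true mean effect (in SD units) is detected at the 5% level.
import random

def power(n, effect, sims=2000, seed=7):
    rng = random.Random(seed)          # seeded for reproducibility
    hits = 0
    for _ in range(sims):
        draws = [rng.gauss(effect, 1.0) for _ in range(n)]
        mean = sum(draws) / n
        var = sum((d - mean) ** 2 for d in draws) / (n - 1)
        t = mean / (var / n) ** 0.5    # one-sample t statistic
        if abs(t) > 1.96:              # normal cutoff, an approximation
            hits += 1
    return hits / sims

# A 'medium' effect of 0.5 SD: often missed at n = 15,
# nearly always detected at n = 100.
print(power(15, 0.5), power(100, 0.5))
```

With n = 15 roughly half the simulated studies miss a genuine medium-sized effect; and (per Button et al.) the ones that do “find” it will tend to have drawn an overestimated effect size.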

Quadratic regressions are not diagnostic regarding u-shapedness

See data-colada 62 and (Simonsohn 2018).

Needs to adjust significance tests for augmenting data/sequential analysis/peeking

(Especially relevant to experimental work.)

See (Sagarin, Ambler, and Lee 2014), p-augmented, and Gelman’s blog on stopping rules


Heteroskedasticity

Under heteroskedasticity:

  • OLS coefficients are still unbiased/consistent but maybe not efficient

  • Estimated standard errors of estimator/tests are not unbiased/consistent

(Autocorrelation: similar considerations, but it can be a sign of a mispecified dynamic model)

Responses (to heteroskedasticity and “simple” autocorrelation)

“Feasible” GLS (only consider doing with lots of data) or

Regular OLS with robust standard errors (or clustered in a certain way); this has now become the norm.

My general critique of how diagnostic tests are sometimes used…

You “test” for heteroskedasticity; if you fail to reject homoskedasticity, do you say “whew, I can ignore this”?

DR: Would you say “I fail to strongly statistically reject the possibility that my car’s brakes are not working, therefore I will drive the car on a mountain pass?”

Controversial; I don’t like this because the test may not be powerful enough. So use ‘robust’ anyway.

This same issue applies to the use of many diagnostic tests. If you cannot meaningfully bound the extent to which a potential violation of the assumptions is biasing your estimate, you should use the more robust procedure (imho).

Interpreting your results 1: test for significance

Simple differences (not in a regression): A variety of parametric, nonparametric and “exact” tests

Regression coefficients: t-tests

  • Difference from zero (usually 2-sided)

  • Difference from some hypothesis (e.g., difference from unit)

  • Joint test of coefficients

Evidence for ‘small or no effect’: one-sided t-test of, e.g.,

\(H_0: \beta \geq 10\) vs \(H_A: \beta < 10\); where 10 is a ‘small value’ in this context.

Joint significance of a set of coefficients: F-tests

\(H_0\): all tested coefficients are truly \(= 0\)

\(H_A\): at least one coefficient has a true value \(\neq 0\)

Interpreting results 2: magnitudes & sizes of effects

In a linear model in levels-on-levels the coefficients on continuous variables have a simple “slope interpretation”

Note: assuming a homogenous effect, otherwise it gets complicated.

Dummy variables have a “difference in means, all else equal” interpretation.

But be careful to describe, understand, and explain the estimated effects (or “linear relationships”) in terms of the units of the variables (e.g., the impact of years of education on thousands of pounds of salary at age 30, pre-tax)

Transformed/nonlinear variables

When some variables are transformed, e.g., expressed in logarithms, interpretation is a little more complicated (but not too difficult). Essentially, impacts of/on logged variables represent “proportional” or “percentage-wise” impacts. Look this up and describe the effects correctly.
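As a quick reference, the standard approximate interpretations (stated here without derivation) are:

  • Level-log (\(y = \alpha + \beta \ln x + u\)): a 1% increase in \(x\) is associated with a change in \(y\) of about \(\beta/100\) units.

  • Log-level (\(\ln y = \alpha + \beta x + u\)): a one-unit increase in \(x\) is associated with approximately a \(100\beta\)% change in \(y\); for discrete changes such as dummies, the exact figure is \(100(e^{\beta} - 1)\)%.

  • Log-log (\(\ln y = \alpha + \beta \ln x + u\)): \(\beta\) is approximately the elasticity of \(y\) with respect to \(x\).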

In nonlinear models

(e.g., Logit, Tobit, Poisson/Exponential) the marginal effect of a variable is not constant; it depends on the other variables and on the error/unobservable term. But you can report things like the “marginal effect averaged over the observed values” or (for some models) a “proportional percentage effect.”
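To see what “averaged over the observed values” means, here is a minimal Python sketch of the average marginal effect (AME) of a regressor in a logit model, with made-up coefficients and data:

```python
from math import exp

# Average marginal effect (AME) of x in a logit model with made-up
# coefficients: P(y=1|x) = Lambda(a + b*x), Lambda = logistic CDF.
# The marginal effect at a given x is b * Lambda(z) * (1 - Lambda(z)),
# so it varies across observations; the AME averages it over the sample.
a, b = -1.0, 0.5                 # hypothetical logit coefficients
xs = [0.0, 1.0, 2.0, 3.0, 4.0]  # made-up observed values of x

def logistic(z):
    return 1.0 / (1.0 + exp(-z))

effects = [b * logistic(a + b * x) * (1 - logistic(a + b * x)) for x in xs]
ame = sum(effects) / len(effects)
print(round(ame, 4))
```

Note that each individual effect is largest where the predicted probability is near 0.5 (it can never exceed \(b/4\)), which is exactly why a single “the effect is \(b\)” statement is wrong for a logit.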

Interpreting results 3: interaction terms

You may run a regression such as:

\[INCOME = \alpha + \beta_1YEARS\_EDUC + \beta_2FEMALE \cdot YEARS\_EDUC + \beta_3FEMALE + u\]

where FEMALE is a dummy variable that equals 1 if the observed individual is a woman and 0 if a man.

How do you interpret each coefficient estimate?

\(\alpha\): a constant “intercept”; fairly meaningless by itself, unless the other variables are expressed as deviations from their means, in which case it represents mean income.

\(\beta_1\) : “Effect” of years of education on income (at age 30, say) for males

\(\beta_2\) : “Additional Effect” of years of education on income for females relative to males

\(\beta_3\) : “Effect” of being female on income, holding education constant

What about \(\beta_1 + \beta_2\)?

= “Effect” of years of education on income for females
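To make this algebra concrete, here is a short Python sketch with made-up coefficient values (not real estimates), showing that the education slope for men is \(\beta_1\) while for women it is \(\beta_1 + \beta_2\):

```python
# Predicted income from the interacted model, with made-up coefficients:
# INCOME = alpha + b1*YEARS_EDUC + b2*FEMALE*YEARS_EDUC + b3*FEMALE
alpha, b1, b2, b3 = 10.0, 2.0, -0.5, 1.0  # hypothetical values

def predicted_income(years_educ, female):
    return alpha + b1 * years_educ + b2 * female * years_educ + b3 * female

# Slope of income in education, by gender: the change in the prediction
# from one extra year of education.
slope_male = predicted_income(13, 0) - predicted_income(12, 0)    # = b1
slope_female = predicted_income(13, 1) - predicted_income(12, 1)  # = b1 + b2
print(slope_male, slope_female)
```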

Other concepts I hope to add or integrate into the above

  • Simple statistics and simple tests

  • Hypothesis testing

  • (OLS vs) Probit, Tobit, and Nonlinear Specifications

  • Clustered standard errors

  • Time series data issues

Missing data mechanisms (this section is a work in progress)

We can think of each observation as having an underlying probability of being missing. Rubin (1976) introduced mechanisms by which to classify this underlying probability, which is helpful for understanding the effect of missing data on an analysis, as well as for evaluating the strength of missing-data solutions.

Despite being such an important topic, missing data is not always covered in undergraduate Economics courses.

This section will introduce the various types of missing data, drawing from multiple sources (Allison 2002; Soley-Bori 2013; Little and Rubin, n.d.).

Missing Completely at Random (MCAR)

A data point may be missing completely at random (MCAR). This means that the missingness does not depend on any observed or unobserved variables: the probability of a data point being missing is the same across all data points.

\[P(m \mid x_o, x_u) = P(m)\] where \(m\) indicates that a data point is missing, \(x_o\) are the observed variables, and \(x_u\) are the unobserved variables.

Adapted from Little and Rubin (n.d.).

MCAR is the most desirable missingness classification, as the only consequence is a loss of precision. However, MCAR may not be all that common in data sets relevant to Economics. Although MCAR is a strong assumption, it can occur, for example when observations are missing by experimental design, or when a survey response is lost in the post.

Missing at Random (MAR)

When a data point is MAR this means that the missingness of the data depends only on observed (rather than unobserved) variables: \[P(m \mid x_o,x_u) = P(m \mid x_o)\]

Unfortunately, MAR is not an assumption we can verify empirically; we must resort to intuition or economic theory to justify it.

For example, men might be less likely than women to fill in a survey about alcohol consumption, but their probability of response might be unrelated to alcohol consumption after conditioning on gender.

This may still be implausible: intuitively, heavy drinkers within each gender might be less willing to fill out such a survey. Furthermore, we might expect that, because of social stereotypes, female heavy drinkers may be particularly unwilling to respond.

In other words, under MAR the probability of data being missing given both observed and unobserved data is the same as the probability of data being missing given the observed data alone.

Not Missing at Random (NMAR)

NMAR occurs when the missingness is neither MCAR nor MAR. Instead, the missingness depends on the value of the variable that is itself missing, even after conditioning on the observed data. Analysis based on a sample containing NMAR data points may be biased. In our example above, this would mean that the lack of response to the alcohol-consumption question is driven by the level of alcohol consumption itself, rather than by the respondent’s gender. \[P(m \mid x_o, x_u) \neq P(m \mid x_o)\]

Impact on analysis If the MAR assumption is fulfilled, the missingness can be said to be Ignorable, meaning that extra steps to model the missingness need not be taken.
Otherwise, the missingness is said to be Nonignorable. In this case, obtaining adequate estimates requires an understanding of the underlying structure of the missingness, and of the various imputation solutions.

Methods for handling missing data There are many methods for handling missing data. Some are very simple (such as listwise deletion or mean imputation) and some are far more advanced (such as multiple imputation and maximum-likelihood estimation). As this is a brief introduction, see the sources referenced above for further information.
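A small simulation (a Python sketch with made-up data, not from any real survey) illustrates why the mechanism matters for the simplest method, listwise deletion: under MCAR the complete-case mean only loses precision, while under NMAR (here, high values are more likely to be missing) it is biased downward:

```python
import random

# Complete-case (listwise-deletion) means under MCAR vs. NMAR missingness,
# on simulated data with a known true mean of 5.
random.seed(12345)
true_values = [random.gauss(5.0, 2.0) for _ in range(20000)]

# MCAR: every observation is missing with the same probability, 0.3.
mcar_observed = [v for v in true_values if random.random() > 0.3]

# NMAR: missingness depends on the (unobserved) value itself --
# observations above 6 are never reported.
nmar_observed = [v for v in true_values if v <= 6.0]

true_mean = sum(true_values) / len(true_values)
mcar_mean = sum(mcar_observed) / len(mcar_observed)
nmar_mean = sum(nmar_observed) / len(nmar_observed)
print(round(true_mean, 2), round(mcar_mean, 2), round(nmar_mean, 2))
```

The MCAR complete-case mean lands close to the true mean (with a larger standard error), while the NMAR complete-case mean is systematically too low, no matter how large the sample.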

10.7 Formatting figures and tables

  1. Summary statistics

  2. Simple tests

  3. Graphs and figures

  4. Regression tables (small)

  5. Regression tables (many columns or rows)

Works Cited

Allison, Paul David. 2002. Missing Data. Thousand Oaks, Calif.; London: SAGE.

Angrist, Joshua D., and Jörn-Steffen Pischke. 2008. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.

Button, Katherine S, John P A Ioannidis, Claire Mokrysz, Brian A Nosek, Jonathan Flint, Emma S J Robinson, and Marcus R Munafò. 2013. “Power Failure: Why Small Sample Size Undermines the Reliability of Neuroscience.” Nature Reviews Neuroscience 14 (5): 365–76. https://doi.org/10.1038/nrn3475.

Cumming, Geoff. 2014. “The New Statistics: Why and How.” Psychological Science 25 (1): 7–29. https://doi.org/10.1177/0956797613504966.

Lee, David S. 2018. “Training, Wages, and Sample Selection: Estimating Sharp Bounds on Treatment Effects,” 1071–1102.

List, John A, Azeem M Shaikh, and Yang Xu. 2019. “Multiple Hypothesis Testing in Experimental Economics.” Experimental Economics 22 (4): 773–93.

Oster, Emily. 2019. “Unobservable Selection and Coefficient Stability: Theory and Evidence.” Journal of Business & Economic Statistics 37 (2): 187–204.

Rubin, Donald B. 1976. “Inference and Missing Data.” Biometrika 63 (3): 581–92.

Sagarin, Brad J., James K. Ambler, and Ellen M. Lee. 2014. “An Ethical Approach to Peeking at Data.” Perspectives on Psychological Science 9 (3): 293–304. https://doi.org/10.1177/1745691614528214.

Simonsohn, Uri. 2018. “Two Lines: A Valid Alternative to the Invalid Testing of U-Shaped Relationships with Quadratic Regressions.” Advances in Methods and Practices in Psychological Science 1 (4): 538–55.

Simonsohn, Uri, Leif D Nelson, and Joseph P Simmons. 2014. “P-Curve: A Key to the File-Drawer.” Journal of Experimental Psychology: General 143 (2): 534. https://doi.org/10.1037/a0033242.

Soley-Bori, Marina. 2013. “Dealing with Missing Data: Key Assumptions and Methods for Applied Analysis.” Boston University 4: 1–19.

Little, Roderick J. A., and Donald B. Rubin. n.d. Statistical Analysis with Missing Data. Wiley Series in Probability and Statistics. https://onlinelibrary.wiley.com/doi/book/10.1002/9781119013563.

  1. In ‘prediction problem’ cases, you may be seeking the best out-of-sample prediction; there are special techniques that aim at these, often using what is called ‘machine learning’ approaches.↩︎

  2. For example, we expect a demand function to be homogeneous of degree zero in income and all prices. But even this restriction may be questioned in light of ‘behavioral economics’ evidence of framing effects, nontransitive choices, etc.↩︎