# Predictors, responses and residuals: What really needs to be normally distributed?

Introduction

Many scientists are concerned about normality or non-normality of variables in statistical analyses. The following and similar sentiments are often expressed, published or taught:

• "If you want to do statistics, then everything needs to be normally distributed."
• "We normalized our data in order to meet the assumption of normality."
• "We log-transformed our data as they had strongly skewed distribution."
• "After we fitted the model we checked homoscedasticity of the residuals."
• "We used non-parametric test as our data did not meet the assumption of normality."

And so on. I know, it is more complex than that, but still, it seems like normal distribution is what people want to see everywhere, and that having things normally distributed opens the door to the clean and powerful statistics and strong results. Many people that I know regularly check if their data are normally distributed prior to the analysis, and then they either try to "normalize" them by e.g. log-transformation, or they adjust the statistical technique accordingly based on the frequency distribution of their data. Here I will examine this more closely and I will show that there may be less assumptions of normality than one might think.

Example of some widely used models

The most general way to find out what needs to be the distribution of your data and residuals is to look at the complete and formal definition of the model that you are fitting - I mean both its deterministic and its stochastic components. Take the example of Normal linear regression of response $y_i$ and predictor $x_i$ (function lm(y~x) or glm(y~x, family="Gaussian") in R):

$y_i\sim Normal(\hat{\mu}_i,\sigma^2)$
$\hat{\mu_i}=\alpha+\beta \times x_{i}$
$i\in1:N$

which can be rewritten to

$y_i=\alpha+\beta \times x_{i} + \epsilon_i$
$\epsilon_i\sim Normal(0,\sigma^2)$
$i\in1:N$

where $\sigma^2$ is variance, $\alpha$ and $\beta$ are intercept and slope respectively, and $N$ is number of observations. Strictly speaking, $\alpha$ and $\beta$ also have their own distributions (e.g. normal), but let's not go there for now. If you write such definition of the model, you have it all at one place, all assumptions are written here. And you can see clearly that the only assumption of "normality" in this model is that the residuals (or "errors" $\epsilon_i$) should be normally distributed. There is no assumption about the distribution of the predictor $x_i$ or the response variable $y_i$. So there is no justification for "normalizing" any of the variables prior to fitting the model. Your predictors and/or response variables can have distributions skewed like hell and yet all is fine as long as the residuals are normal. The same logic is valid for ANOVA - a skewed and apparently non-normal distribution of the response variable does not necessarily imply that one has to use non-parametric Kruskall-Wallis test. It is only the distribution of ANOVA residuals which needs to meet the assumption of normality.

Now take an example of another widely used model - the log-linear Poisson regression model of count observations $y_i$ and predictors $x_i$ (function glm(y~x, family="Poisson") in R):

$y_i\sim Poisson(\lambda_i)$
$log\lambda_i=\alpha+\beta \times x_{i}$
$i\in1:N$

where $\alpha$ and $\beta$ are model parameters, and $N$ is number of observations. This is the complete description of the model. All of its assumptions are laid here. There is no other hidden assumption somewhere in the literature. And you can see clearly: nothing has to be normally distributed here. Not the predictors, nor the response, nor the residuals.

Finally, let's examine another hugely popular statistical model, the logistic regression of binary response (or proportion) $y_i$ against predictor $x_i$ (function glm(y~x, family="Binomial") in R):

$y_i\sim Bernoulli(p_i)$
$logit (p_i)=\alpha+\beta \times x_{i}$
$i\in1:N$

Again, the complete model is here, with all its assumptions, and there is not a single Normally distributed thing.

This is hardly surprising to anyone already familiar with the inner structure of Generalized Linear Models (GLM) - a category that encompasses all of the above mentioned models. It is why GLMs were invented at the first place, i.e. in order to cope with non-normal residual ("error") structures. It is well known that Poisson and logistic regressions have heteroscedastic residuals (Kutner et al. 2005). Moreover, it is also known that exactly for this reason it is problematic to use the residuals for model diagnostics or to judge if the model is appropriate (Kutner et al. 2005). It has been advised to transform the residuals to something called Pearson residuals (in order to normalize them?), but I still do not understand why one would want to do that - as I mentioned, there is no assumption of normality anywhere in Poisson or logistic regressions. Yet still, there is the R function plot.glm() which one can apply to any fitted glm object and it will pop out the residual diagnostic plots. What are these good for in the case of logistic and Poisson regressions?

Why do people still normalize data?

Another puzzling issue is why do people still tend to "normalize" their variables (both predictors and responses) prior to model fitting. Why this practice emerged and prevailed even when there is no assumption that would call for it? I have several theories for this: ignorance, tendency to follow statistical cookbooks, error propagation etc.
Two of the explanations seem more plausible: First, people normalize the data in order to linearize relationships. For example, by log-transforming the predictor one can fit an exponential function using the ordinary least squares machinery. This may seem sort of ok, but then why not specify the non-linear relationship directly in the model (through an appropriate link function for instance)? Also, the practice of log-transforming of the response can introduce serious artifacts, e.g. in the case of count data with zero counts (O'Hara & Kotze 2010).
A second plausible reason for the "normalizing" practice was suggested by my colleague Katherine Mertes-Schwartz: Maybe it is because researchers try to resolve the problem that their data were collected in a highly clumped and uneven way. In other words, it is very common that one works with data that have high number of observations aggregated at particular part of a gradient, while another part of the gradient is relatively under-represented. This results in skewed distributions. Transforming of such distributions then results in seemingly regular spread of observations along the gradient and elimination of outliers. This can be actually done with a good intention in mind. However, it is also fundamentally incorrect.

Summary

I would like to finish this post by a couple of summarizing advices to anyone who is troubled by skewed frequency distributions of the data, by heteroscedastic residuals or by the violations of the assumption of normality.

• If you are unsure about the assumptions of the predefined model that you are just about to use, try to write down its complete formal definition (especially its stochastic part). Try to understand what the model actually really does, dissect it and have a look at its components.
• Learn your probability distributions. Bolker (2008) offers a brilliant bestiary of the most commonly used distributions from which you can learn what are they good for and in which cases do they emerge. My opinion is that such knowledge is an alternative and reliable way to judge if the model is appropriate.
• Learn to specify your models in a Bayesian way, e.g. in BUGS language (used by OpenBUGS or JAGS). This will give you very clear insight into how the models work and what are their assumptions.
• Think twice before transforming your data in order to normalize their distributions (e.g. see O'Hara & Kotze 2010). The only semi-valid reason for transforming you data is to linearize the relationships. However, make sure that you really need to do that as it is usually possible to specify the non-linear relationship directly within the model (e.g. through the log link function in the Poisson regression).

I am aware that my understanding may be limited. Hence, if you have any comments, opinions or corrections, do not hesitate to express them here. Thank you!

References

Bolker, B. (2008) Ecological models and data in R. Princeton University Press.
Kutner, M.H. et al. (2005) Applied linear statistical models. Fifth edition. McGraw & Hill.
O'Hara & Kotze (2010) Do not log-transform count data. Methods in Ecology and Evolution, 1, 118-122.

### 16 thoughts on “Predictors, responses and residuals: What really needs to be normally distributed?”

1. Your write: "Strictly speaking, \alpha and \beta also have their own distributions (e.g. normal), but let's not go there for now."

It doesn't look like you are taking a Bayesian approach so, strictly speaking, \alpha and \beta do NOT have their own distributions, as they are parameters. Now, \hat\alpha and \hat\beta do have distributions, but there is a world of difference between \alpha,\beta and \hat\alpha,\hat\beta.

2. Thanks for the post 🙂

I too am aware that my understanding is limited. I am not a statistician, but an early career ecologist (just started my PhD).

Just one minor addition. You mention getting familiar with Bayesian inference in order to better understand how the model works and what its assumptions are. Great tip, but I would suggest that with enough confidence in maximum likelihood estimation, this can be done effectively in a frequentist way too, by writing out the statistical model, then the likelihood function, and maximising the likelihood. This can be done in R by writing an R function for the likelihood and running optim over the model parameters. You can get a clear insight into these things, regardless of statistical disposition, just by looking 'under the hood' a bit.

The following R code does just this for a normal linear regression (eq1 above):
# Generate some data
a<- 5
b<- 2
sigma<- 0.5
x<- runif(1000,0,20)
y<- a+(b*x)+rnorm(1000)*sigma
dat<- data.frame(x=x,y=y)
plot(dat$x,dat$y)

# Write the likelihood function for the statistical model (in this case, normal linear regression)
loglike<- function(par,dat)
{
a<- par['a']
b<- par['b']
sigma<- par['sigma']

return(sum(dnorm(dat$y-a-(b*dat$x),mean=0,sd=sigma,log=T)))
}

# Maximise the log likelihood
res<- optim(par=c(a=1.5,b=1.5,sigma=0.6),fn=loglike,dat=dat,method='L-BFGS-B',lower=rep(0.2,3),upper=c(a=5,b=5,sigma=2),control=list(fnscale=-1))
res$value # Plot the fitted result over the data to see if it looks like a good fit (which we know it will be). x<-seq(from=min(dat$x),to=max(dat$x),by=0.1) y<-res$par['a']+(res$par['b']*x) plot(dat$x,dat\$y,type='p')
points(x,y,type='l',col='red')
#####

• This is a brilliant point! Very true. The thing is that I now work more with the Bayesian MCMC methods and I have got some very valuable insights from them. So that is why I have been advising it. But sure, if you dive into likelihood optimization, then you also need to understand the structure of your model very well (so that you can code your model correctly), which leads to equivalent insights.

3. Hi,
I like your post very much. actually I want to write something like yours, but now it's no need, I just cite your post!

Great work!

Huidong

4. Just flipping through the amazon preview of the cited Bolker book, which in the preface seems to take the contrary view about normally distributed response variables and glm: "correlation analysis assumes the data are bivariate normally distributed, while linear regression assumes a linear relationship between a predictor variable and a normally distributed response variable. Although one can sometimes transform data to satisfy these assumptions, Sandin and Pacala took a more powerful approach: they built explicit models...."

The assumption of normal distribution of response variables seems very widespread? I admit this is what I was "taught", but probably more accurate to say this is what I memorized.

5. I'm curious as to your opinion of the following situation: y ~ x, and y is highly skewed towards zero (but with only positive values). Thus, the fitting of a Normal linear regression dominated by the upper values of the relationship. I'm wondering if you might elaborate on why a transformation of clumped data (e.g. via a log-transformation) is an incorrect approach.
My hunch is that log-transformation also changes the assumption of the error in estimating y ; I believe that a linear model of a log-transformed response variable will also assume that the residuals exponentially increase with y rather than have a constant spread. To the contrary, a GLM with Gaussian distribution and a log-link function maintains that the spread of residuals are constant throughout the range of y. Does one have to know something about the error in estimating y in order to use such transformations correctly, or is their another reason for your opinion that this is incorrect?

• Hi Marc, you hit an important and intriguing problem. My opinion is that yes, you should at least ASSUME, EXPECT or HYPOTHESIZE what the errors will be (this is why statistical models have assumptions). For example, you can hypothesize that the variance of y has two components: the deterministic component caused by (exponential) effect of x, and the stochastic part (the error) not caused by x. The usual expectation is that normally distributed noise is a satisfactory and natural stochastic part (because of central limit theorem). In that case the way is to fit Gaussian GLM with log link.

If you for some reasons EXPECT that your residuals should not have a constant variance, then you may think about log-transforming your response prior to the model fitting. However, I would probably rather prefer to think about some alternative error structure specified directly within the model. For example, the Poisson log-linear model has non-constant variance (as it is equal to the mean, which is what you deterministically model), it is designed to cope with positive count variables (which is what you seem to have in your y).

Also, it would be good to know exactly what your y is. For gamma or negative binomial error structures may also be useful for positive-only and skewed data.

Finally, AIC may give you a hint on which model (which error structure) is more appropriate. But here I am not 100% sure if it is a correct approach. But people do it.

6. Nicely crafted rant against the prevailing (in some disciplines) assumption of normal errors and the urge to normalise distributions.
As an aside, linear regression still requires conditional normality (equivalent to normally distributed errors) for a simple derivation of standard errors. If your Y is highly skewed, sparse etc., conditional normality is not normally plausible. The same goes for ANOVA of highly non-normal Ys - it messes with the classic formulas for the SEs.
Finally, estimating Structural Equation Models via Maximum Likelihood relies on a multivariate normal distribution of the data (in my humble experience highly implausible outside the realm of artificial psychometric data), although there are (asymptitically) distribution free alternatives.

• Thanks. Although my rand is not against the assumption of normal errors. Nothing wrong with that, the assumption is reasonable (in normal linear regression or ANOVA). My point is that the assumption is precisely only about the errors (residuals), not about the frequency distribution of variables that go into the model.

7. The title also says something about the distribution of predictors, but you did not follow that up. The reason why I transform my predictors is to achieve as even a spread over the data range as possible. Many predictors are very skewed (think: island area in a biogeographic context) and thus large islands will have a large influence on the regression parameters. Giving a data point a lot of weight only because we measure things on a square-meter scale rather than a log-square-meter scale seems wrong to me (in physics this would be considered a violation of scale invariance, I guess).
The "idea" distribution of a predictor in a linear model would be bimodal, with most data points in the low and high range (since the regression will go through \bar{x}, \bar{y} anyway). For non-linear functions (think Michaelis-Menton) I would like to see most data points at the point where the function has the largest slope to reduce its estimation uncertainty there.
Since I do not know the functional relationship before, I aim for a uniform distribution as a compromise.
In practice, fiddling around with these transformation often leads to give-or-take normally distributed predictors. It's not what I aim for, it's not what I think is good or assumed or required, but it's what the Central Limit Theorem does to your data. But to make that point clear again: I love my predictors to be uniformly distributed and I use any transformation I can get to make them that.

Ah, one last point. Sometimes predictors are, say, lognormal "with a spike at 0". An example (in biogeography) is the percentage forest cover of a 1° grid cell. I treat those typically as a mixture of two predictors and hence create from the original two new ones, one bernoulli (forest yes/no); and one lognormal (percentage forest, with missing data where there is no forest, which I then log-transform). Apparently my behaviour is not uncommon, but I haven't bothered (yet) to look up any statistical literature on this approach.

And a final, last point: If you use regression trees, all the above rambling about predictor transformation becomes obsolete ...

• Hi Carsten,

Transforming predictors in order to achieve even spread: You seem to mix two things. Achieving even spread of data is certainly important, but you should make this during the process of data collection by sampling the predictor space evenly, not by ad-hoc transformation of unevenly sampled predictors. Log-transforming areal units is really a matter of scale - and you are right. Here you just flip between multiplicative and additive effect of area. Finally, I need to notify you that having data condensed close to one part of the axis does NOT mean that you put high weights (statistically sensu stricto) on them. You again confuse two terms.

Your example with forest cover seems very interesting. The issue of gridded data and what it does with areal coverages of land cover types is very interesting. I think it deserves a dissection. I usually do what you just described too, but I am not entirely happy with it.

Final one: aah, regression trees. Well, they are not parametric, they are not models. I prefer hypotheses and models. Regression trees are a good data exploratory technique, they show patterns, some may consider them useful for predictions, but they are not any close to statistical models, or to real explanations. I do not deal with them here.

8. Your write: "Strictly speaking, \alpha and \beta also have their own distributions (e.g. normal), but let's not go there for now."

If the residuals of the models are normally distributed and they are composed of separate and additive (normal) variables, then the distributions of the coefficients alpha and beta are normally distributed as well. That is primarily the reason for which we assume normal residuals, so that you know the underlying distribution of the coefficients and can perform t and f tests.

9. I agree with you 100%, I see researchers transforming predictors so they become 'normal' even though one of the assumptions of standard MLR is that the predictors have very little measurement errors. I can understand transforming predictors to achieve linearity though.

10. I guess your post is polemical on purpose. My experience with regression is that log-tranformation often helps (for prediction) especially when there are many predictors. I agree that we do not care about normality per se (linear regression can work with binary data!) but if the data are not in a good scale, standard statistics can be problematic. Think of computing standard deviation when the data cover different scales. It is not going to be meaningful. Put it differently, normality is not important but standard sanity checks concerning the scales of variables or the validity of a linear relationship can help. Checking for goodness-of-fit (g.o.f.) can provide information about the data we might miss if we do not look at g.o.f..
All the best and happy new year.
Michael Blum