AIC & BIC vs. Crossvalidation

Model selection is the process of finding, within a set of candidate models, the one that best balances model fit against complexity (Burnham & Anderson 2002). I have always used AIC for that, but you can also do it by cross-validation: Stone (1977) showed that AIC and leave-one-out cross-validation are asymptotically equivalent. I wanted to see this for myself through a simple exercise.

My goal was to (1) generate artificial data from a known model, (2) fit models of increasing complexity to the data, and (3) see whether AIC and cross-validation would both correctly identify the underlying model. Out of curiosity I also included BIC (Bayesian Information Criterion).
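For reference, these are the standard definitions of the three scores. For a model with k estimated parameters, maximized likelihood \hat{L}, and n observations:

 AIC = 2k - 2 \ln \hat{L}
 BIC = k \ln(n) - 2 \ln \hat{L}
 CV = \sum_{i=1}^{n} (y_i - \hat{y}_{-i})^2

where \hat{y}_{-i} is the prediction of the i-th observation from the model refitted without it; lower values are better for all three.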

Here is the model that I used to generate the data:

 y = 5 - 2x + x^2 + 2x^3 + \varepsilon
 \varepsilon \sim \mathrm{Normal}(\mu = 0, \sigma^2 = 1)

I then fitted seven polynomials to the data, starting with a line (1st degree) and going up to 7th degree:

Figure 1 | The dots are artificially generated data (by the model specified above). The lines are seven fitted polynomials of increasing degree, from 1 (red straight line) to 7.

My next step was to find which of the seven models is the most parsimonious. I calculated AIC and BIC (R functions AIC() and BIC()) and the leave-one-out cross-validation error for each of the models. This is the function that I used to do the cross-validation:

crossvalidate <- function(x, y, degree)
{
  preds <- numeric(length(x))
  for(i in 1:length(x))
  {
    # leave out the i-th observation
    x.in <- x[-i]
    y.in <- y[-i]
    x.out <- x[i]
    # fit the polynomial to the remaining observations
    m <- lm(y.in ~ poly(x.in, degree=degree))
    # predict the left-out observation
    preds[i] <- predict(m, newdata=data.frame(x.in=x.out))
  }
  # the sum of squared prediction errors
  return(sum((y - preds)^2))
}
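As an aside, for lm() fits the same leave-one-out sum of squared errors can be obtained in closed form from the leverages (hat values), without refitting the model n times; this is the classical PRESS statistic, also available ready-made as MPV::PRESS or qpcR::PRESS (see the comments below). A minimal sketch, with press() as an illustrative helper name:

press <- function(model)
{
  # leverage (hat value) of each observation
  h <- lm.influence(model)$hat
  # sum of squared deleted residuals = leave-one-out squared error
  sum((resid(model) / (1 - h))^2)
}

# e.g. press(lm(y ~ poly(x, 3))) matches crossvalidate(x, y, 3)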

Here are the results:

Figure 2 | Comparison of the effectiveness of AIC, BIC and cross-validation in selecting the most parsimonious model (black arrow) from the set of 7 polynomials that were fitted to the data (Fig. 1).

All three methods correctly identified the 3rd-degree polynomial as the best model, so the approach works. Interestingly, all three methods penalize lack of fit much more heavily than redundant complexity. I knew this about AIC, which is notorious for penalizing overly complex models too little. BIC should penalize complexity more than AIC does (Hastie et al. 2009); with n = 100 observations its per-parameter penalty is ln(100) ≈ 4.6 against AIC's 2, and Fig. 2 shows this clearly. But still, the difference is not that pronounced. I was surprised to see that cross-validation is also quite lenient towards extra complexity. Perhaps this really reflects the equivalence of cross-validation and AIC, although the equivalence is only asymptotic and concerns which model gets chosen rather than the values of the two criteria, so the curves in Fig. 2 need not look identical.

References
1. Burnham K. P. & Anderson D. R. (2002) Model selection and multimodel inference: A practical information-theoretic approach. Springer.
2. Hastie T., Tibshirani R. & Friedman J. (2009) The elements of statistical learning: Data mining, inference, and prediction. Springer.
3. Shao J. (1993) Linear model selection by cross-validation. Journal of the American Statistical Association, 88, 486-494.
4. Stone M. (1977) An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society Series B, 39, 44-47.

R code for the whole exercise

# the figures require the ggplot2 library (and all packages
# it depends on) plus grid, which provides unit() and arrow()
library(ggplot2)
library(grid)
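# note: the data are generated randomly, so the exact numbers
# (and occasionally the selected degree) will differ between runs;
# uncomment the next line for a reproducible draw
# set.seed(1)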

# generate the x predictor
x <- runif(100, -2, 2)
# generate the y response
y <- 2*x^3 + x^2 - 2*x + 5 + rnorm(100)
xy <- data.frame(x=x, y=y)
# specify the maximum polynomial degree that will be explored
max.poly <- 7

# creating a data.frame which will store model predictions
# that will be used for the smooth curves in Fig. 1
x.new <- seq(min(x), max(x), by=0.1)
degree <- rep(1:max.poly, each=length(x.new))
predicted <- numeric(length(x.new)*max.poly)
new.dat <- data.frame(x=rep(x.new, times=max.poly),
                      degree,
                      predicted)

# fitting lm() polynomials of increasing complexity
# (up to max.poly) and storing their predictions
# in the new.dat data.frame
for(i in 1:max.poly)
{
  new.dat[new.dat$degree==i, "predicted"] <- predict(lm(y~poly(x, i)),
                                                     newdata=data.frame(x=x.new))
}

# plotting the data and the fitted models
p <- ggplot() +
  geom_point(aes(x, y), data=xy, colour="darkgrey") +
  geom_line(aes(x, predicted, colour=as.character(degree)),
            data=new.dat) +
  scale_colour_discrete(name = "Degree")
p

# creating empty data.frame that will store
# AIC and BIC values of all of the models
AIC.BIC <- data.frame(criterion=c(rep("AIC",max.poly),
                                  rep("BIC",max.poly)),
                      value=numeric(max.poly*2),
                      degree=rep(1:max.poly, times=2))

# calculating AIC and BIC values of each model
for(i in 1:max.poly)
{
  AIC.BIC[i,2] <- AIC(lm(y~poly(x,i)))
  AIC.BIC[i+max.poly,2] <- BIC(lm(y~poly(x,i)))
}

# function that will perform the "leave one out"
# crossvalidation for a y~poly(x, degree) polynomial
crossvalidate <- function(x, y, degree)
{
  preds <- numeric(length(x))
  for(i in 1:length(x))
  {
    # leave out the i-th observation
    x.in <- x[-i]
    y.in <- y[-i]
    x.out <- x[i]
    # fit the polynomial to the remaining observations
    m <- lm(y.in ~ poly(x.in, degree=degree))
    # predict the left-out observation
    preds[i] <- predict(m, newdata=data.frame(x.in=x.out))
  }
  # the sum of squared prediction errors
  return(sum((y - preds)^2))
}

# crossvalidating all of the polynomial models
# and storing their squared errors in
# the "a" object
a <- data.frame(cross=numeric(max.poly))
for(i in 1:max.poly)
{
  a[i,1] <- crossvalidate(x, y, degree=i)
}

# plotting AIC and BIC against model complexity
# (which is the polynomial degree)
AIC.plot <- qplot(degree, value, data=AIC.BIC,
                  geom="line", linetype=criterion) +
                  xlab("Polynomial degree") +
                  ylab("Criterion value") +
                  labs(title="Information theory & Bayes")+
                  geom_segment(aes(x=3, y=400,
                                   xend=3, yend=325),
                  arrow = arrow(length = unit(0.3, "cm"),
                  angle=20, type="closed")) +
                  theme(legend.position=c(0.8,0.5))
AIC.plot

# plotting crossvalidated squared errors against
# model complexity
cross.plot <- qplot(1:max.poly,cross, data=a, geom=c("line"))+
                    xlab("Polynomial degree") +
                    ylab("Squared error") +
                    geom_segment(aes(x=3, y=400,
                                     xend=3, yend=200),
                    arrow = arrow(length = unit(0.3, "cm"),
                    angle=20, type="closed")) +
                    labs(title="Crossvalidation")
cross.plot

9 thoughts on “AIC & BIC vs. Crossvalidation”

  1. Your full example has a couple of minor issues:
    - It needs library(grid) for the arrow function
    - when building the initial plot of data + models, each line should start with
    p <- p + ...
    as you build the ggplot.
    - the crossvalidate function has an extra, confusing line. y.out is not needed and if it were it should be y.out <- y[i].

    Thanks for an interesting experiment!

  2. There is a function in the MPV library called PRESS that will do the same thing as your crossvalidate function, i.e. MPV::PRESS(lm(y~poly(x, 3))) will give the same result as crossvalidate(x,y,3).

    The fun part is that it's a one-liner!

    > MPV::PRESS
    function (x)
    {
    sum(resid(x)^2/(1 - lm.influence(x)$hat)^2)
    }

    • There is also a function 'PRESS' in my qpcR package that works on nonlinear models obtained by 'nls':

      library(qpcR)
      ## example for linear regression
      x <- 1:10
      y <- rnorm(10, x, 0.1)
      mod <- lm(y ~ x)
      PRESS(mod)

      ## example for NLS fitting
      DNase1 <- subset(DNase, Run == 1)
      fm1DNase1 <- nls(density ~ SSlogis(log(conc), Asym, xmid, scal), DNase1)
      PRESS(fm1DNase1)

      Cheers,
      Andrej

  3. Very useful demonstration, thanks for it. It would be even better to do it with non-normally distributed data, though... do the conclusions still hold?

  4. Quite an interesting piece of work. Thanks for doing this. However, I encountered problems running lines 38 and 39:
    > p + scale_colour_discrete(name = "Degree")
    Error: No layers in plot
    > p
    Error: No layers in plot

    I tried to implement the suggestion given by Kent (p <- p + ...) but got no result.
    However, I implemented it at lines 38 and 39 and it got rid of the error message, but I did not see the effect and could not reproduce any further graph.

    The legend label appeared as as.character(degree) in my plot whereas it showed as Degree in yours. I guess this is because line 38 did not run. Another difference I noticed was that your plot showed the plotted points and polynomial lines together. The points and the lines appeared on different plots in mine.

    On a final note, I also got this error running "..+ arrow = arrow(length = unit(0.3, "cm"),..":
    Error in do.call("layer", list(mapping = mapping, data = data, stat = stat, :
    could not find function "arrow"

    I changed it to "arrows" and got the following error message as well:
    Error in arrows(length = unit(0.3, "cm"), angle = 20, type = "closed") :
    one of 'x1' and 'y1' must be given

    It would be appreciated if you could get back to me on this. Once again, thanks for the good work.

    • Encountered the same problem. Modified the code to below. It works for me.

      # plotting the data and the fitted models
      new.dat$degree <- as.character(new.dat$degree)
      p <-
      ggplot(xy, aes(x, y)) +
      geom_point(colour="darkgrey") +
      geom_line(data=new.dat, aes(x=x, y=predicted, colour=degree)) +
      scale_colour_discrete(name="degree")
      p

  5. Nice example and tutorial. However, when rerunning the code I find several cases where a larger model is picked as best by both AIC and crossvalidation (usually 5th degree), clearly due to the stochastic process. Yet in many of these cases BIC still identifies the correct model, so the stronger penalizing of complexity seems to make a difference. Personally I've not used BIC before, but will definitely do so in the future.

  6. Very nice examples. Thanks for sharing! Do you know if there are any cross validation packages for "loess"? Brute force methods are too slow. I did some search and didn't find anything.
