The man in the academic arena

Lately I went through a couple of ordinary academic failures. I had one manuscript rejected by three statistical journals in a row. I had another one rejected by Science, PNAS and PLoS Biology in quick succession. Interestingly, of all six submissions, only Science actually sent the manuscript out for review (and then rejected it). Dammit.

Today, I gave a Monday seminar on "Processes vs events" at CTS, and it was rejected... by the audience. I thought I had prepared everything well, but I messed up some key definitions at the beginning, and after four slides the audience could not stand it any longer and I could not continue. There I was, a self-taught statistical enthusiast speaking to professional theoretical physicists and mathematicians, and I had wasted their time.

The obligatory lesson is: I should work harder, think smarter, explain myself more clearly, and learn from my errors. But that does not cheer one up. I need something that will raise my spirits, so that tomorrow I can work with an extra bit of energy to make my work shine. Something cheesy with plenty of pathos, something so optimistic that only an American could have written it. I need the excerpt from Theodore Roosevelt's speech "Citizenship In A Republic", delivered at the Sorbonne, Paris, France, on 23 April 1910:

It is not the critic who counts; not the man who points out how the strong man stumbles, or where the doer of deeds could have done them better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood; who strives valiantly; who errs, who comes short again and again, because there is no effort without error and shortcoming; but who does actually strive to do the deeds; who knows great enthusiasms, the great devotions; who spends himself in a worthy cause; who at the best knows in the end the triumph of high achievement, and who at the worst, if he fails, at least fails while daring greatly, so that his place shall never be with those cold and timid souls who neither know victory nor defeat.

Yeah! I feel much better now. Back to writing papers.

Center for Theoretical Study, Prague: more intense than the Ivy League

I have recently been lucky to relocate from Yale to the Center for Theoretical Study (CTS) in Prague, Czech Republic. The institute brings together philosophers, mathematicians, physicists, sociologists, economists, biologists and others; it is similar to the Santa Fe Institute or the Institute for Advanced Study in Princeton, and its aim is to stimulate interdisciplinary approaches to science, encouraging new ways of interaction and cooperation between disciplines.

The institute hosts the toughest kind of seminar I have ever seen, called the Thursday seminar. The guest speaker has one hour to present, but the audience reserves the right to interrupt at any time. And people use that right a lot – sometimes the speaker does not even get through the first half before his or her basic assumptions and premises are in ruins. Then follows another hour of discussion in which whatever remains is disintegrated and buried under monologues from the audience. The guest is then taken to a pub, filled up with heavy Czech food and beer (yes, we drink beer for lunch), and subjected to another round of academic inspection.

Apart from the Thursday seminars, irregular lunches, and occasional random encounters in the coffee corner, we all meet again at the Monday seminars, where we contemplate a common unifying theme. Since we are in a center for theoretical study, the theme is anything but applied.

This year's theme is “Process vs. event”.

When I joined, the Monday seminar series was already in progress, and once I had figured out what was going on, I immediately got my usual but deceptive feeling that “I've got it right and they are all confused” (blame it on my youth), and I felt the rising urge to bring out my most beloved heavy caliber: statistics. Here is how I very briefly presented it:

Statistics clearly distinguishes models (theories, hypotheses), which are what we think, from data, which are what we see. Models are the processes, or ideas in the Platonic sense, further classifiable as stochastic or deterministic. Data are the events, sometimes also called outcomes of the processes. Statistics tells the story of how the data came to be by making statements about P(model|data) or P(data|model). That's it.

The discussion that followed was passionate and direct, with hard edges and egos smoking hot. After a few minutes I lost track of the fate of what remained of my own contribution, and of whether anybody actually cared, but I had learned that some people are definitely less confused than I initially thought.

Here are some notable ideas:

Philosopher Alexandr Matoušek noted that seeing is a peculiar term – our mind can see an idea (model, theory) more clearly than the actual event (data). This resonates with my recent realization that a theory is not necessarily a way of knowing how the world is, but rather a way of looking at the world (I have this from Marquet et al.'s recent piece in BioScience). I need to explore these connections more.

Alexandr Matoušek and mathematician Petr Kůrka also pointed to the problem that, ultimately, we will have to deal with stochastic processes in the light of the problem of free will (and the problem of whether it exists). This is a key unresolved issue, and it can generate some intense discussions in the future.

Novelist and philosopher Michal Ajvaz presented an attempt to outline the dichotomy between processes vs. events using dimensions (time, space, horizontal, vertical) and scale. To unify this view with the statistical view will be a major and important undertaking, and I am really curious to see what comes out of it.

Paleoecologist Petr Pokorný pointed to the potential connection between rarity of an event and its significance (I think that there is no such connection). He also brought up the butterfly effect story, which turned out to be thought-provoking, but it also generated considerable misunderstanding. I reckon that this story can serve as a particularly useful illustration of our ideas, if interpreted carefully.

Sociologist Zdeněk Konopásek reminded me that quantitative science, and especially statistics, can generate substantial suspicion among some social scientists – for them, statistics potentially reduces complex stories to mere counting, without an understanding of what is really going on. I tend to defend statistics. First, no matter how deep an understanding is, it is worthless if it is not formalized and communicated, and statistics (and mathematics) is a formal way to represent and communicate understanding. Second, statistics enables us to separate general principles from anecdote, and while anecdotes can be entertaining, they are hard to build on. Finally, modern statistics is not only inductive and reductionist – it can also relate an individual observation to an existing model and investigate the probability (likelihood) of that observation given the model; this guarantees that a single unusual observation can shoot down the whole model.
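To make that last point concrete, here is a toy sketch in R; the model and the numbers are entirely hypothetical:

```r
# A hypothetical model claims that a measurement follows Normal(50, 5).
# We then record a single observation of 90.
observation <- 90

# Likelihood of the observation, given the model:
lik <- dnorm(observation, mean = 50, sd = 5)

# Probability of a value at least this extreme, given the model:
p.extreme <- pnorm(observation, mean = 50, sd = 5, lower.tail = FALSE)

lik        # vanishingly small likelihood
p.extreme  # one observation this unusual discredits the whole model
```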

Mathematician Marie Větrovcová brought me to the notion of complex conditional processes in which models are conditional on the data, parts of the models are conditional on other parts of the model, but data are never conditional on other data. I'd like to explore this idea in a more formal way.

The seminar left me excited and in a state of intellectual discomfort, which I guess is a good sign.


Species Distribution Models on the right track. Finally.

Species Distribution Models (SDM), a.k.a. Niche Models, have always been a busy pile of confusion, ideology and misguided practices, with the real mess being the “presence-only” SDMs. Interestingly, when you go to conservation or biogeography symposia, you can hear the established SDM gurus open their talks with: “During the last ten years SDMs have greatly matured”, or “SDMs are a useful tool to guide conservation management and to predict responses to climate change”, and so on. You can even read this in the introductions of many papers.

I come from a post-communist country and hence these phrases remind me of political propaganda that I used to hear as a kid; aparatchiks used to declare that everything works great and everybody is happy. But just because someone says so, or because there is a crowd of SDM users and thousands of papers on SDM, it does not mean that things work well.

So what exactly is wrong with current SDMs?

  • First, with a few exceptions, the use of pseudo-absences can be justified neither in model fitting nor in model evaluation. Pseudo-absence is then just a code word for data fabrication.

  • Second, geographic bias in sampling effort must be dealt with. More specifically, even when we use a model that is suited for presence-only data (such as a point process model), the results will be heavily biased unless we account for the fact that species are more often observed around places accessible to humans (e.g. roads).

  • Third, the detectability of species must be dealt with. If it is not accounted for, the results will always be compromised by species being cryptic and difficult to observe (most species are), which leads to underestimated probability of occurrence or habitat suitability. To complicate things even more, some species are easily observable in some environments and cryptic in others.
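The detectability problem can be made concrete with a back-of-the-envelope calculation based on a standard occupancy-modelling identity; the numbers below are hypothetical:

```r
# If a site is occupied with probability psi, and an occupied site yields
# a detection on a single visit with probability p, then the probability
# of at least one detection in J visits is psi * (1 - (1 - p)^J).
# Naive detection/non-detection data estimate this quantity, not psi itself,
# so occurrence of cryptic species is systematically underestimated.
psi <- 0.6   # true occupancy (hypothetical)
p   <- 0.3   # per-visit detectability (hypothetical)
J   <- 3     # number of visits
naive <- psi * (1 - (1 - p)^J)
naive  # 0.3942 -- well below the true occupancy of 0.6
```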

Statisticians have been aware of these problems (e.g. Phillips et al. 2008, Yackulic et al. 2013), but because there have been only partial or ad-hoc solutions, and because the SDM crowd is comfortable with whatever technique is user-friendly and R-packaged, everybody has been hammering “GBIF-like” data with “dismo-like” techniques (MaxEnt, random forests, neural networks, GAMs, ...) without a second thought.

To name some of the bits and pieces that do try to address some of the problems: To deal with the geographic bias in presence data, Phillips et al. (2008) suggest selecting the background “absence” data with the same geographic bias as in the presence data – a solution that makes some sense, but it is not an integral part of a statistical model, and hence it ignores the uncertainty that comes with it. So-called occupancy modelling (e.g. Royle & Dorazio's book, MacKenzie et al.'s book) has a whole set of statistical tools to deal with the detectability of species, but the field has rarely touched the large-scale presence-only SDMs. Finally, it has been shown that the popular presence-only MaxEnt technique is at its core equivalent to point process models (Renner & Warton 2013 and a related blog post) – a finding that paves the road to the use of presence-only data in statistical parametric models, and subsequently in occupancy models.

And into all this, finally, comes Robert M. Dorazio with his new paper in Global Ecology and Biogeography. Dorazio puts all the pieces together to introduce a general model that addresses all three of the SDM problems outlined above, and he does it in a statistically rigorous and model-based (parametric) way. In short:

  • Dorazio's model is suitable for presence-only data because of the use of a point process component.

  • The model can explicitly incorporate information on distances to the nearest road (or other correlates of sampling effort).

  • The model separates the observation process from the latent ecological process, addressing the issue of detectability of species.

  • The technique allows incorporation of data from systematic surveys. I have always felt that a dozen systematically surveyed locations are worth hundreds of presence-only incidental observations. Now we have a way to benefit from both kinds of data in a single model.

Other highlights of the approach are:

  • Because it is parametric, we can estimate full posterior distributions of the parameters, enabling probabilistic statements about ecological hypotheses and the use of likelihood-based model selection techniques (such as AIC or DIC).

  • The approach is flexible and can be extended. The model outlined by Dorazio can indeed fail in many cases simply because it omits some important ecological processes (e.g. species interactions, dispersal limitations, non-linear responses to environment, spatial autocorrelation, ...). However, these can easily be incorporated on top of the core structure.

  • Because the model can be fitted in a Bayesian framework, prior information on known habitat preferences can be used to improve the model (see my paper here for an example).

To summarize: Thanks to Dorazio, we finally have the core of a rigorous statistical tool that overcomes many of the critical and previously ignored problems with SDMs. We had to endure almost 15 years of confusing literature to get here, and I feel that the field now has the potential to be on the right track to maturity. And so it is up to us to test the approach, improve it, and write R packages.


Is my brilliant idea any good? I am not sure, so I've pre-printed it on PeerJ

As a scientist, what should I do when I encounter a seemingly fundamental problem that also seems strangely unfamiliar? Is it unfamiliar because I am up to something really new, or am I re-discovering something that has been around for centuries, and I have just missed it?

This is a short story about an exploration that began with such a problem, and led to this manuscript. It began one day when I was pondering the idea of probability:

Scientists often estimate the probability (P) of an event, where the event can be a disease transmission, a species extinction, a volcano eruption, a particle detection, you name it. If P is estimated, there has to be some uncertainty about the estimate -- every estimate is imperfect, otherwise it is not an estimate. I felt that the uncertainty about P has to be bounded: I could not imagine a high estimate, say P=0.99, associated with high uncertainty about it. The reason is that increased uncertainty about P means a broader probability density of P, which pushes the mean (expected value) of that density towards P=0.5. Conversely, as P approaches 0 or 1, the maximum possible uncertainty about P must decrease. Is there a way to calculate the bounds of the uncertainty exactly? Has anybody already calculated this?
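For one particular measure of uncertainty, the variance, my hunch can be checked directly: any random variable confined to [0, 1] with mean P has variance at most P(1 - P), a bound that shrinks as P approaches 0 or 1. A quick numerical check with Beta distributions in R:

```r
# For any distribution on [0,1] with mean P, the variance cannot
# exceed P*(1-P). Check with Beta(a, b) distributions:
# mean P = a/(a+b), variance = a*b / ((a+b)^2 * (a+b+1)).
beta.check <- function(a, b) {
  P <- a / (a + b)
  v <- a * b / ((a + b)^2 * (a + b + 1))
  c(mean = P, variance = v, bound = P * (1 - P))
}
beta.check(0.5, 0.5)  # broad U-shaped density: variance 0.125, bound 0.25
beta.check(99, 1)     # mean near 1: both variance and bound become tiny
```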

First, I did some research. Wikipedia: nothing. I hit the books: Jaynes's Probability theory, Johnson et al's Continuous univariate distributions, statistical textbooks, all the usual suspects, nothing. Web of Science: A bunch of seemingly related papers, but no exact hit. ArXiv: More seemingly related semi-legible papers by physicists and statisticians, but not exactly what I was looking for. After some time I put the search on hold.

Then one day, for a completely different reason, I was skimming through John Harte's Maximum Entropy and Ecology, and it struck me. Maybe there is a MaxEnt function that gives the maximum uncertainty about P, given that we know only the value of P and nothing else. Back to the web, and finally I found something useful by UConn mathematician Keith Conrad: the MaxEnt probability density function on a bounded interval with known mean.
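As I understand Conrad's result, the MaxEnt density on [0, 1] with a prescribed mean mu (other than 1/2) is a truncated exponential, p(x) proportional to e^(theta*x), with theta set by the mean constraint. Here is my own rough numerical sketch, not Conrad's code; I deliberately keep uniroot() away from the removable singularity at theta = 0 by assuming mu is well above 1/2:

```r
# MaxEnt density on [0,1] with mean mu:
#   p(x) = theta * exp(theta*x) / (exp(theta) - 1).
# Its mean is m(theta) = 1/(1 - exp(-theta)) - 1/theta, so we solve
# m(theta) = mu for theta. (mu = 1/2 corresponds to theta -> 0,
# i.e. the uniform density.)
maxent.theta <- function(mu) {
  f <- function(theta) 1/(1 - exp(-theta)) - 1/theta - mu
  uniroot(f, lower = 0.1, upper = 100)$root  # assumes mu well above 0.5
}

theta <- maxent.theta(mu = 0.8)
dmaxent <- function(x, theta) theta * exp(theta * x) / (exp(theta) - 1)
integrate(dmaxent, 0, 1, theta = theta)$value             # integrates to 1
integrate(function(x) x * dmaxent(x, theta), 0, 1)$value  # mean comes out ~0.8
```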

I adjusted the function for my problem and started to implement it in R, drafting some documentation on the side, when I realized that I had actually started working on a paper!

Then the doubts came. I discussed the problem with my colleagues, but I did not learn much -- many well-intentioned suggestions, but no advice hit the core of the problem, which is: Is the idea any good? And anyway, as a wee ecologist with little formal mathematical education, do I actually dare to enter the realm of probability theory, where the demi-gods have spoken pure math for centuries?

Then I summoned the courage and sent my draft to two sharp minds whom I admire for the depth of their formal thinking (and for their long beards): Arnost Sizling the mathematician and Bob O'Hara the biostatistician. For a week nothing happened; then I received two quite different and stimulating opinions: a semi-sceptical review from Arnost and a semi-optimistic email from Bob. I have tried to incorporate their comments, but I am still unsure about the whole thing.

I don't have a clue whether, and where, such a thing should be published. It feels too long and complex for a blog post, yet somehow trivial for a full-length research paper; it is perhaps too simple for a statistical journal, yet too technical for a biological journal.

And so I have decided to pre-print it on PeerJ for everybody to see. I guess it could be the perfect place for it. If the idea is really new, then it is citable and I can claim primacy. If it is all bullshit, then people will (hopefully) tell me, or perhaps they will collaborate and help me improve the paper, becoming co-authors. It is an experiment, and I am curious to see what happens next.

Tailoring univariate probability distributions

This post shows how to build a custom univariate distribution in R from scratch, so that you end up with the four essential functions: a probability density function, a cumulative distribution function, a quantile function and a random number generator.

In the beginning all you need is the equation of the probability density function, from which everything else can be derived, sometimes analytically, and always numerically. The analytical solutions may require some math skills, or may be impossible to find, but the numerical solutions are always feasible, require no math literacy, and coding them in R is easy with the uniroot() and integrate() functions.

Throughout the post I will use a simple exponential distribution as an example.

Probability density function (PDF)

You will need to know what a probability density (or mass) function is, and what the difference is between probability density and probability. I found the best intro to be this Khan Academy video (10 min). Here are a couple of additional points:

  • Probability density is relative probability.
  • Probability density function evaluated at a given data point is the point's likelihood.
  • Probability density can be greater than one.

Example: Exponential PDF

The formula for the exponential probability density function (PDF) is:

 p(x) = \lambda e ^ {-\lambda x}, \qquad x \in [0, \infty)

In the literature, small  p usually denotes probability density, while capital  P is used for probability.

We can code it in R:

  my.dexp <- function(x, lambda) lambda*exp(-lambda*x)
  my.dexp(x=0:5, lambda=2)
## [1] 2.0000000 0.2706706 0.0366313 0.0049575 0.0006709 0.0000908

And compare it with R's native PDF (it uses rate instead of  \lambda ):

  dexp(x=0:5, rate=2)
## [1] 2.0000000 0.2706706 0.0366313 0.0049575 0.0006709 0.0000908

Here is a plot of my PDF using the R's built-in function curve():

  curve(my.dexp(x, lambda=2), from=0, to=2, main="Exponential PDF")

[Figure: Exponential PDF]

Cumulative distribution function (CDF) - analytical solution

This function gives, for a given point  x , the area under the PDF curve to the left of  x . This area is the probability that an outcome  X will have a value lower than or equal to  x :

 P(X \le x) = \int_0^x \! p(\pi) \, \mathrm{d}\pi

I liked stepbil's video (9 min), which shows how to flip between the PDF and CDF, and why we need the  \pi in the integral. I am really lame at calculus, and what really works for me is Sage – it is free, it finds analytical solutions of integrals, and it may be worth a shot before diving into numerical integration.

For the exponential distribution the CDF is:

 P(X \le x) = 1 - e ^ {- \lambda x}, \qquad x \in [0, \infty)

In R it is:

  my.pexp.anal <- function(x, lambda) 1- exp(-lambda*x)
  my.pexp.anal(x=0:5, lambda=2)
## [1] 0.0000 0.8647 0.9817 0.9975 0.9997 1.0000

Cumulative distribution function (CDF) - numerical solution

For most practical purposes, numerical solutions of one- or two-dimensional problems are as good as the analytical ones; the only disadvantage is the computational cost. Here I use R's native function integrate() to calculate the exponential CDF numerically:

  my.pexp.numerical <- function(x, lambda) {
    # this function allows my CDF to run over vectors: <- function(x, lambda)
      integrate(my.dexp, lambda=lambda, lower=0, upper=x)$value
    # apply the to each element of x
    sapply(x,, lambda)
  }
  my.pexp.numerical(x=0:5, lambda=2)
## [1] 0.0000 0.8647 0.9817 0.9975 0.9997 1.0000

And let's plot the numerical solution over the analytical one:

  curve(my.pexp.anal(x, lambda=2), from=0, to=2, 
        col="green", main="exponential CDF")
  curve(my.pexp.numerical(x, lambda=2), from=0, to=2, 
        lty=2, col="red", add=TRUE)
  legend("bottomright", lty=c(1,2), 
         legend=c("Analytical", "Numerical"),        
         col=c("green", "red"), lwd=2)

[Figure: Exponential CDF, analytical vs numerical]

Practically identical.

Quantile function (QF) - analytical solution

Quantile functions are useful for two things: (1) assessing statistical significance, and (2) generating random numbers (see below). The quantile function ( P^- ) is the inverse of the cumulative distribution function: you have to find the value of  x for a given  P(x) , which is the inverse of what we usually do with functions. Finding an analytical solution may be quite tricky, and again, if you don't have a mathematician at your service, Sage can help.

In case of our exponential distribution, the quantile function is:

 P^{-}(q) = \frac{-\ln(1-q)}{\lambda}, \qquad q \in [0,1)

In R:

  my.qexp.anal <- function(q, lambda) (-log(1-q))/lambda 
  my.qexp.anal(0.9, 2)
## [1] 1.151

Quantile function (QF) - numerical solution

The inverse may be hard to derive for many CDFs, and hence one may need a numerical solution. I found that the easiest way to invert a univariate CDF in R is the uniroot() function. Alternatives are optimize() or optim(), but they can be unstable. In contrast, uniroot() is simple, fast, and stable.

Here is my numerical solution for the exponential QF:

  my.qexp.numerical <- function(q, lambda) {
    # function to be solved for 0
    f <- function(P, fixed) {
      lambda <- fixed$lambda
      q <- fixed$q
      # this is the criterion to be driven to zero by uniroot():
      criterion <- q - my.pexp.numerical(P, lambda)
      return(criterion)
    }

    # for each element of vector q (the required quantiles)
    P <- numeric(length(q))
    for(i in 1:length(q)) {
      # parameters that stay fixed
      fixed <- list(lambda=lambda, q=q[i])
      # solving f for 0 numerically by uniroot():
      root.p <- uniroot(f, lower=0,
                        upper=100, # may require adjustment
                        fixed=fixed)
      P[i] <- root.p$root
    }
    return(P)
  }
  my.qexp.numerical(0.9, 2)
## [1] 1.151

Let's plot the numerical solution over the analytical one:

  q <- seq(0.01, 0.9, by=0.01)
  plot(q, my.qexp.anal(q, lambda=2), type="l", col="green", lwd=2)
  lines(q, my.qexp.numerical(q, lambda=2), col="red", lty=2)
  legend("bottomright", lty=c(1,2), 
         legend=c("Analytical", "Numerical"),        
         col=c("green", "red"), lwd=2)

[Figure: Exponential quantile function, analytical vs numerical]

Random number generator - the inverse transform sampling

Have you ever wondered how R generates random draws from its distributions? How does it convert uniformly distributed pseudo-random numbers to, e.g., normally distributed numbers? For some time I naively thought that R uses some giant built-in tables. Then I found out that it is really simple.

The easiest way is inverse transform sampling. It works like this: you generate a random number from  Uniform(0,1) and plug it into your quantile function (see above). That's it, and that's also why quantile functions can be so useful. Here is how to do it in R with the exponential distribution:

  my.rexp.inverse <- function(N, lambda) {
    U <- runif(N, min=0, max=1)
    rnd.draws <- my.qexp.anal(U, lambda)
    return(rnd.draws)
  }
  my.rexp.inverse(10, 2)
##  [1] 0.10087 0.56874 0.79258 0.01962 1.28166 0.39451 2.17646 0.48650
##  [9] 0.97453 0.33054

Here is a histogram of 10,000 random numbers generated by the inverse transform sampling. The solid smooth line is the exponential PDF from which the random numbers were drawn:

  hist(my.rexp.inverse(10000,2), freq=FALSE, breaks=20,
       xlab="x", main=NULL, col="grey")
  #hist(rexp(10000, rate=2), freq=FALSE, breaks=20)
  curve(dexp(x, rate=2), add=TRUE)

[Figure: Histogram of inverse transform samples with the exponential PDF]

Random number generator - rejection sampling

The advantage of rejection sampling is that you don't need the quantile function.
Notable texts on rejection sampling are provided by Jay Emerson, on the Quantitations blog and on tsperry's blog. I will try to provide my own description and notation, which will hopefully complement those posts.

In short, you will need: (1) a so-called proposal probability density function, which I will call  f(x) ; (2) the PDF from which you want to sample, here called  p(x) ; and (3) a constant  m chosen so that  f(x) \times m \ge p(x)  everywhere. In other words, the curve  f(x) \times m must lie entirely above the  p(x) curve, and ideally as close to  p(x) as possible (you will waste fewer samples). The  f(x) can be any PDF from which you are able to sample.

For my exponential example I will use simple uniform proposal density  f(x) . The following figure illustrates my exponential  p(x) , my proposal  f(x) , and  f(x) \times m :

  x <- seq(0, 5, by=0.01)
  # my exponential PDF:
  p <- dexp(x, rate=2)
  # my PROPOSAL distribution:
  f <- dunif(x, min=0, max=5) # my proposal is uniform and hence
                              # I need to choose an arbitrary upper bound 5
  # the CONSTANT m
  m <- 11
  fm <- f*m

  plot(c(0,5), c(0,2.5), type="n", xlab="x", ylab="density")
  lines(x, fm, type="l", col="red")
  lines(x, f, col="red", lty=2)
  lines(x, p)

  legend("right", lty=c(1,2,1), 
         legend=c("f(x)*m", "f(x)", "p(x)"), 
         col=c("red", "red", "black"), lwd=2)

[Figure: Proposal density f(x), scaled proposal f(x)*m, and target p(x)]

My uniform proposal distribution is not optimal – it leads to a high rejection rate and hence a lot of unnecessary computation. The uniform proposal also truncates my samples at the upper bound of the uniform distribution. However, it is good enough for illustration purposes here.

In rejection sampling the algorithm of one sampling step is:

  • 1. Sample a point  r_x from  f(x) .
  • 2. Draw a vertical line from  r_x all the way up to  m \times f(x) .
  • 3. Sample a point  r_v from a uniform density along the vertical line.
  • 4. If  r_v \leq p(r_x) , accept  r_x as a sample; otherwise go to 1.

This step is repeated until the desired number of accepted samples is reached.

Here is my rejection sampler for the exponential PDF:

  my.rexp.reject <- function(N, lambda, max) {
    samples <- numeric(N)
    m <- 12
    fxm <- dunif(0, 0, max)*m
    for (i in 1:N) {
      repeat {
        # 1. sample from the proposal distribution:
        rx <- runif(1, 0, max)
        # 2. sample a point along a vertical line 
        #    at the rx point from Uniform(0, f(x)*m):
        rv <- runif(1, 0, fxm)
        # 3. evaluate the density of p(x) at rx
        dx <- dexp(rx, lambda)
        # 4. accept if appropriate, otherwise try again
        if (rv <= dx) {
          samples[i] <- rx
          break
        }
      }
    }
    return(samples)
  }

This is what 10,000 samples look like, with the original PDF superimposed:

  x <- my.rexp.reject(N=10000, lambda=2, max=4)
  hist(x, freq=FALSE, breaks=30, col="gray", main=NULL)
  curve(dexp(x, 2), add=TRUE)

[Figure: Histogram of rejection samples with the exponential PDF]

A suggestion to Windows-based users of R: It may be time to relocate

Do you remember the time when you switched from graphical statistical software to R? I did it eight years ago, and I had a hard time doing even a simple regression analysis without constantly searching for help; it was a pain. In desperation I frequently cheated and went back to Statistica for the familiar window-ish feeling. But my skills developed, and with my first 'for' loop came the first euphoria; it was liberating. I no longer had to follow predefined structures; I could juggle my data, explore, play, create.

I think that this is the kind of experience that many R users had at some point.

A couple of years ago I started to notice that some R users (including myself) got so proficient and comfortable with R that they started to do almost everything in it, from file manipulations, to web downloads, complex analyses, graphics editing, presentations and report writing. It's as if R is another operating system running on top of Windows, but it's more useful – unlike Windows, everything has a common structure, it's free, open, and everything can be used as a building part of something else. All work can be scripted and revoked any time, and there is the huge R community constantly working on a vast selection of R packages.

Now here is a secret: There is a real operating system that was here long before R, and it has all of the properties above that make R so awesome. In fact, R sort of copycats the structure of it, and adopts some of its syntax. Imagine that you get the same empowering feeling that R gives, but now you get it from the operating system itself, and from all of the software that you have on it: everything is a package, everything is free, well documented, open, and 'scriptable'.

I believe that if you are comfortable with R and it liberates you, and yet you still dwell in Windows, then it is likely that you will be liberated by the operating system that I have in mind. And after the obligatory bumpy start, you will look back at Windows with the same sentiment with which you, the R user, now look back at graphical statistical software. And you will wonder: Bloody hell, I used to use THAT?

The system that I have in mind is Linux. It goes really nicely with R.

Our results will be relevant for policy and decision-making and biodiversity management

I used to be involved in a large EU-funded collaborative project. It gave me an opportunity to do fun basic science, and also an opportunity to see the kind of lingo that is used in order to get a big EU project funded. The project promised: "Our results will be relevant for policy and decision-making and biodiversity management."

I was part of a sub-group that worked on fundamental explorations of large-scale biodiversity patterns. For the most part I doubted that this was relevant for policy and management; I have always been suspicious of efforts to force basic science (macroecology) into self-proclaimed applicability. Besides, macroecology operates at very large scales, which further complicates its use in policy.

Yet recently I was asked to stitch together a summary of the policy relevance of our results for the European Commission. This is what I came up with:

Biodiversity scaling for nature monitoring and conservation

Our key message is that biodiversity depends on spatial scale (grain, resolution), and that this dependence should be reflected by EU's monitoring and conservation efforts. Specifically:

(1) Assessments of temporal declines or increases of biodiversity are impossible without explicit reference to spatial scale. Biodiversity monitoring should combine local surveys with broad-scale county-, country- and EU-wide assessments. Examples of problems that need special attention: Measuring the effectiveness of EU agri-environment schemes, monitoring spread and numbers of invasive species, monitoring trends in diversity of pollinator species.

(2) So far EU conservation policy has focused mostly on individual species and their numbers. We propose that it should also consider species turnover (beta diversity), which measures the homogeneity or heterogeneity of species composition across space. Such heterogeneity at one scale is not necessarily maintained at a coarser scale. Hence, if heterogeneity is to be promoted, it needs to be promoted at several scales. This could be achieved by a certain degree of contextualization, decentralization and asynchrony of conservation policies at regional and local administrative levels, as well as at the level of specific habitats and microhabitats.

(3) We found that dispersal limitations play a major role in maintaining heterogeneity of species composition at large scales. In the face of climate change, and because of their limited ability to move (disperse, migrate), species may not be able to track suitable environmental conditions; they will most likely require our assistance, otherwise they will perish. In contrast, dispersal limitations and barriers may serve well as a protection against the spread of invasive species, and against the resulting biotic homogenization.

You see? All the big words are there. Climate change, extinctions, biodiversity declines, EU, invasive species. I think that it will make the commissioners happy, and it is (mostly) based on real scientific results; the only problem is that it feels a bit too general and broad. But then the EU is big and general and centralized; it is a broad-scale institution. Supplying the EU with this kind of recommendation might work. Or maybe it will just fade as an echo in the vast labyrinths of the Brussels administration. I don't know.

Would you save the young or the ancient species?

Yesterday I gave a seminar at John Harte's lab at UC Berkeley. It was a joy. There must be something in the lush Californian climate that makes people nice.

During the discussion John Harte and Andy Rominger jointly pointed me to a problem: Imagine that you have two species. The first one is a young species that originated recently. The second is a very old species. Now both of them are threatened by extinction, but you have resources to save only one of them. Which one would you choose to save?

I replied that it does not matter how old a species is. What matters is how many close extant relatives it has -- if the species has many relatives, you can sacrifice it, because its evolutionary history will be preserved through the relatives. And then Andy brought up this killer dichotomy:

  • Yes, you may prefer to save the ancient species with few extant relatives (e.g. the platypus) because it represents a rare and 'evolutionarily distinct' lineage. If rarity and uniqueness are what you value, you save the ancient species.
  • But you may prefer the young species with many close extant relatives, for the whole clade has been diversifying rapidly. Hence, each species in the clade has the potential to diversify further. In contrast, the ancient and lonely species apparently has not diversified. If diversity is what you value, you save the young species.

There may be a catch somewhere. I am not sure. Interestingly, the otherwise thorough review by Winter et al. (2012) in TREE does not mention this dichotomy. It mentions the 'evolutionary potential' of a species, but only as the potential to adapt to environmental change, not as the potential to diversify.

I wonder if this problem is widely known in the conservation community, or whether there is some literature on it. Any ideas? And which species would you save?

Do 'macrosystems ecologists' know about macroecology?

A paper by Levy et al. in Frontiers in Ecology and the Environment announces the emergence of a new ecological discipline called macrosystems ecology (MSE). The authors define MSE like this:
MSE studies explore how broad-scale variation in fine-scale characteristics – such as organismal behavior and fitness, nutrient transformations, and water-use efficiency – relate to broad-scale spatial and temporal processes and patterns such as climate change, landscape alteration, and topography. Because MSE research questions are defined at fine-to-broad spatial and temporal scales, the data used to examine such questions must also be at such scales.

They also give an "In a nutshell" summary of what they mean by macrosystems ecology:

  • Macrosystems ecology uses new approaches and applies existing methods in novel ways to study ecological processes interacting within and across scales
  • These approaches often include multiple scales, diverse data objects, data-intensive methods, cross-scale interactions, and hierarchical relationships
  • These studies require large volumes and diverse types of data from many sources, encouraging ecologists to build field and laboratory methods, database objects, and the data infrastructure capable of the joint analysis of multiple large data streams
  • Scientists use powerful statistical methods, such as Bayesian hierarchical models, machine learning, and simulations, to find and explain important patterns in complex, multi-scale datasets

In the whole paper there is not a single reference to macroecology. Yet it is macroecology that has been dealing precisely with the cross-scale ecological processes that the authors mention. Spatial scale and spatial scaling relationships have been the central topic of macroecology ever since it was established 20 years ago. See Rosenzweig's and Brown's classics of the field, and Storch et al.'s recent treatment of scale in ecology.

But macroecology went beyond simple scaling and pattern recognition. West, Brown & Enquist's Metabolic theory of ecology, Hubbell's Unified neutral theory of biodiversity and Harte's Maximum entropy theory of ecology are examples of theories that describe processes and/or make predictions across scales and across levels of biological organisation (some of the predictions suck, I admit). Plus, there are hundreds of macroecological studies that haven't produced such stand-alone theories, yet their contribution has been substantial. There are even macroecological journals, some of them among the most influential in the fields of ecology and physical geography. How could Levy et al. miss this?

Maybe it is the stated character of the data, i.e. the 'diverse data objects', 'complex and multi-scale datasets' and 'data-intensive methods', that makes MSE distinct from macroecology. Well, I am afraid that this is just another manifestation of the 'big data' buzzword frenzy that is all around right now. The problem is that, without good theory, any 'big data research' reduces to data mining, exploratory analysis and pattern description. Nothing wrong with that, but we had plenty of it in macroecology 15 to 5 years ago, and the outcome is sobering: we now have plenty of data and heaps of patterns, but we often don't know what they mean or how they came to be.

One of the consequences is that old-school macroecologists have also turned to Bayesian inference and hierarchical models (similarly to Levy et al.), and to methods such as path analysis, structural equation models and so on. But, regardless of the amazing size of the data, any such analysis is only as good as the model or theory behind it. People do wonders even with relatively small data when they have a good theory. Macroecology still needs more good theory. In the case of MSE, Levy et al. do not point out any notable theory (I suspect that there is none yet).

A note: hierarchical models really do not work very well as a data exploration tool -- those who have tried know.

To conclude: the Levy et al. paper is an interesting probe into the desires of a bunch of academics who got high on the 'big data' dope (it is all around) and now would like to make everything big and hierarchical and cross-scale and integrated. I can just wave to them from a distance and shout: "Hey guys, why don't you come over and join our party!? We are getting high on macroecology here; it is a bit messy, but there are some valuable lessons for you, and it will save you some time. And while you are here, maybe we can help each other with the theory, because we also lack it!"


Spatial autocorrelation of errors in JAGS


At the core of kriging, Generalized Least Squares (GLS) and geostatistics lies the multivariate normal (MVN) distribution – a generalization of the normal distribution to two or more dimensions, with the option of having non-independent variances (i.e. autocorrelation). In this post I will show:

  • (i) how to use exponential decay and the multivariate normal distribution to simulate spatially autocorrelated random surfaces (using the mvtnorm package)
  • (ii) how to estimate (in JAGS) the parameters of the decay and the distribution, given that we have a raster-like surface structure.

Both procedures are computationally challenging once the total data size exceeds roughly 2000 pixels (in the case of random data generation) or 200 pixels (in the case of the JAGS estimation).
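To see why things get heavy so quickly, here is a back-of-the-envelope calculation (my addition, plain base R): a raster of side s has n = s^2 pixels, the covariance matrix then has n^2 entries, and dense MVN operations (e.g. a Cholesky factorization) cost on the order of n^3.

```r
# My illustration of why the MVN approach scales badly: the covariance
# matrix of a side*side raster has (side^2)^2 entries, and factorizing
# it costs roughly (side^2)^3 operations.
sides <- c(10, 20, 50, 100)
n <- sides^2
data.frame(side = sides,
           pixels = n,
           sigma.entries = n^2,   # entries of the n x n covariance matrix
           flops.order = n^3)     # rough cost of a dense factorization
```

So already at side = 50 (a 2500-pixel raster) the covariance matrix has over six million entries, which matches the practical limits mentioned above.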

These are the packages that I need:

  library(mvtnorm)   # to draw multivariate normal outcomes
  library(raster)    # to plot stuff
  library(rasterVis) # to plot fancy stuff
  library(ggplot2)   # more fancy plots
  library(ggmcmc)    # fancy MCMC diagnostics
  library(R2jags)    # JAGS-R interface

Here are some auxiliary functions:

# function that makes a distance matrix for a side*side 2D array  
  dist.matrix <- function(side)
  {
    row.coords <- rep(1:side, times=side)
    col.coords <- rep(1:side, each=side)
    row.col <- data.frame(row.coords, col.coords)
    D <- dist(row.col, method="euclidean", diag=TRUE, upper=TRUE)
    D <- as.matrix(D)
    return(D)
  }

# function that simulates the autocorrelated 2D array with a given side,
# and with exponential decay given by lambda
# (the mean is constant over the array, it equals global.mu)
  cor.surface <- function(side, global.mu, lambda)
  {
    D <- dist.matrix(side)
  # scaling the distance matrix by the exponential decay
    SIGMA <- exp(-lambda*D)
    mu <- rep(global.mu, times=side*side)
  # sampling from the multivariate normal distribution
    M <- matrix(nrow=side, ncol=side)
    M[] <- rmvnorm(1, mu, SIGMA)
    return(M)
  }

# function that converts a matrix to raster and scales its sides to max.ext
  my.rast <- function(mat, max.ext)
  {
      rast <- raster(mat)
      rast@extent@xmax <- max.ext
      rast@extent@ymax <- max.ext
      return(rast)
  }

The Model

I defined the model like this: The vector of data  y is drawn from a multivariate normal distribution

 y \sim MVN(\mu, \Sigma)

where  \mu is a vector of  n pixel means and  \Sigma is the symmetrical  n \times n covariance matrix. The (Euclidean) distances between pixels (sites, locations) are stored in a symmetrical  n \times n distance matrix  D . The value of covariance between two locations is obtained by scaling the distance matrix by a negative exponential function:

 \Sigma_{ij} = exp(-\lambda D_{ij})

where  i \in 1:n and  j \in 1:n . I chose the exponential decay because it fits many empirical data well, and because it is simple, with just one parameter  \lambda .

Simulating random normal surfaces with autocorrelated errors

First, I explored how tweaking  \lambda affects the distance decay of covariance and the resulting spatial patterns. I examined  \lambda=0.01 ,  \lambda=0.1 and  \lambda=1 :

    Distance <- rep(seq(0,20, by=0.1), times=3)
    Lambda <- rep(c(0.01, 0.1, 1), each=201)
    Covariance <- exp(-Lambda*Distance) 
    xy <- data.frame(Distance, Covariance, Lambda=as.factor(Lambda))

    ggplot(data=xy, aes(Distance, Covariance)) +
      geom_line(aes(colour=Lambda))


Second, I simulated the surface for each of the  \lambda values. I also added one pattern of completely uncorrelated white noise (all covariances are 0). I set the mean in each grid cell to 0.

  side <- 50       # side of the raster
  global.mu <- 0   # mean of the process

  M.01    <- cor.surface(side=side, lambda=0.01, global.mu=global.mu)
  M.1     <- cor.surface(side=side, lambda=0.1,  global.mu=global.mu)
  M1      <- cor.surface(side=side, lambda=1,    global.mu=global.mu)
  M.white <- matrix(rnorm(side*side), nrow=side, ncol=side)

  M.list <- list(my.rast(M.01, max.ext=side), 
                 my.rast(M.1, max.ext=side), 
                 my.rast(M1, max.ext=side), 
                 my.rast(M.white, max.ext=side))
  MM <- stack(M.list)
  names(MM) <- c("Lambda_0.01", "Lambda_0.1", "Lambda_1", "White_noise")

  levelplot(MM) # fancy plot from the rasterVis package


Fitting the model and estimating  \lambda in JAGS

This is the little raster that I am going to use as the data:

# parameters (the truth) that I will want to recover by JAGS
side <- 10
global.mu <- 0
lambda <- 0.3  # let's try something new

# simulating the main raster that I will analyze as data
M <- cor.surface(side = side, lambda = lambda, global.mu = global.mu)
levelplot(my.rast(M, max.ext = side), margin = FALSE)


# preparing the data for JAGS
y <- as.vector(as.matrix(M)) <- list(N = side * side, D = dist.matrix(side), y = y)

And here is the JAGS code. Note that in OpenBUGS you could use the spatial.exp distribution from the GeoBUGS module, and your life would be much easier. It is not available in JAGS, so here I have to do it manually:

cat("
  model
  {
    # priors
      lambda ~ dgamma(1, 0.1)
      global.mu ~ dnorm(0, 0.01)
      global.tau ~ dgamma(0.001, 0.001)
      for(i in 1:N)
      {
        # vector of mvnorm means mu
        mu[i] ~ dnorm(global.mu, global.tau)
      }

    # derived quantities
      for(i in 1:N)
      {
        for(j in 1:N)
        {
          # turning the distance matrix to covariance matrix
          D.covar[i,j] <- exp(-lambda*D[i,j])
        }
      }
      # turning covariances into precisions (that's how I understand it)
      D.tau[1:N,1:N] <- inverse(D.covar[1:N,1:N])

    # likelihood
      y[1:N] ~ dmnorm(mu[], D.tau[,])
  }
", file="mvnormal.txt")

And let's fit the model:

fit <- jags(data =,
            model.file = "mvnormal.txt",
   = c("lambda", "global.mu"),
            n.chains = 3, n.iter = 2000)

## module glm loaded

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
##    Graph Size: 10216
## Initializing model

This is how the posteriors of  \lambda and global.mu look:





The results are not entirely satisfactory.  \lambda converges nicely around the true value of 0.3. But global.mu, which should converge around 0, is totally wobbly. The whole thing seems to be heading in the right direction, but for some reason it cannot get there fully.

I made this post during a flight from Brussels to Philadelphia on the 7th February 2014. There may be errors.

Bayesian Biostatistics

This post contains materials for a Bayesian stats course that I taught on 2-4 Feb 2014 at the Faculty of Science, Charles University, Prague, Czech Republic. There were around 40 participants. The complete materials and their source codes (Markdown and R) are in a GitHub repository. The lectures can also be accessed directly as follows (I recommend Chrome):

Day 1
  1. Introduction: Course contents, pros and cons of Bayes, necessary skills.
  2. Normal distribution: Introducing likelihood and deviance on the Normal example.
  3. Poisson distribution: The didactic simplicity of Poisson and its likelihood. Likelihood maximization.
  4. Doing it the Bayesian way: Elements of conditional probability, Bayes theorem, and MCMC sampling.
  5. Bayesian resources: Overview of software, books and on-line resources.

Day 2
  1. t-test: First model with 'effects', hypotheses testing, derived variables.
  2. Linear regression: Extracting credible and prediction intervals.
  3. ANOVA: Fixed effects, random effects, the effect of small sample, and shrinkage.
  4. Time series analysis: Fitting more complex function. Auto-regression.

Day 3
  1. Site occupancy model: accounting for imperfect detection.
  2. The rest of probability distributions (binomial, beta, gamma, multivariate normal, negative binomial).
  3. Convergence diagnostics, publishing Bayesian papers.
  4. Concluding remarks.

Poisson regression fitted by glm(), maximum likelihood, and MCMC

The goal of this post is to demonstrate how a simple statistical model (Poisson log-linear regression) can be fitted using three different approaches. I want to demonstrate that both frequentists and Bayesians use the same models, and that it is the fitting procedure and the inference that differ. This is also for those who understand likelihood methods and do not have a clue about MCMC, and vice versa. I use an ecological dataset for the demonstration.

The complete code of this post is available here on GitHub.

The data

I will use data on the distribution of 3605 individual trees of Beilschmiedia pendula in the 50-ha (500 x 1000 m) forest plot on Barro Colorado Island (Panama). The dataset is freely available as part of R's spatstat library.

First, I will load the necessary libraries:

library(spatstat)  # contains the bei dataset
library(raster)    # for the raster manipulations below
Let's plot the spatial distribution of the individuals in the plot:

plot(bei$x, bei$y, pch = 19, cex = 0.5, main = "Spatial distribution of individuals in the 50-ha Barro Colorado plot", 
    xlab = "x coordinate [m]", ylab = "y coordinate [m]", frame = FALSE)
abline(h = 0, col = "grey")
abline(h = 500, col = "grey")
abline(v = 0, col = "grey")
abline(v = 1000, col = "grey")


The dataset also comes with two rasterized environmental layers: elevation and slope. My goal will be to model the density of tree individuals as a function of elevation [meters above sea level]. I am interested in predicting the density of the trees (i.e. the number n of individuals per unit area). Hence, I will resample the data into a grid of 50 x 50 m:

# coarsening the predictor data into the 50 x 50 m grid by taking the mean
# of the 5 x 5 m grid cells:
elev <- raster(bei.extra[[1]])
# cropping the data so that they have exactly 500 x 1000 cells
ext <- extent(2.5, 1002.5, 2.5, 1002.5)
elev <- crop(elev, ext)
# aggregating the elevation data
elev50 <- aggregate(elev, fact = 10, fun = mean)

# fitting the point data into the 50 x 50 m grid
xy <- data.frame(x = bei$x, y = bei$y)
n50 <- rasterize(xy, elev50, fun = "count")
# replacing the NA values by 0
n50[] <- 0

Initial data visualization

Initial plotting of the data is a necessary first step in any data analysis. So let's first plot the gridded data:

plot(stack(elev50, n50), main = c("Predictor: Mean Elevation in 50x50 m cells", 
    "Response: # of Individuals in 50x50 m cells"), axes = FALSE)


Now let's see how the predictor and the response look plotted against each other.

plot(elev50[], n50[], cex = 1, pch = 19, col = "grey", ylab = "# of Individuals", 
    xlab = "Mean Elevation [m]")


There seems to be a unimodal response of the number of individuals to elevation. For this reason I will use a polynomial function rather than the simplest (linear) function to model the response. Also, you can see that the variability of the data increases at intermediate elevations, and I also note that these are count data – that makes them an excellent candidate for a Poisson error structure (the larger the mean, the larger the variance), or maybe even a negative binomial error structure (not considered in this post).
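As a quick aside (my own sketch, not part of the original analysis), the "larger mean, larger variance" property of the Poisson distribution is easy to verify by simulation:

```r
# My sketch: for Poisson counts the variance tracks the mean, which is why
# a fan-shaped scatter of counts suggests a Poisson (or negative binomial)
# error structure.
set.seed(1)
lambdas <- c(2, 10, 50)   # three hypothetical cell means
sims <- sapply(lambdas, function(l) {
    x <- rpois(10000, lambda = l)
    c(mean = mean(x), variance = var(x))
})
round(sims, 1)  # in each column, mean and variance are roughly equal
```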

Centering and standardization

I find it necessary to center (to mean 0) and standardize (to variance 1) variables for MCMC simulations and for likelihood optimization. For models with a log link function it really is essential – it makes any algorithm operating in log-space much more effective. Here I will define my own function scale2(), but you can also use R's native scale():

scale2 <- function(x) {
    sdx <- sqrt(var(x))
    meanx <- mean(x)
    return((x - meanx)/sdx)
}

elev50 <- scale2(elev50[])

Finally, some minor tweaks:

pow.elev50 <- elev50^2  # (I will be fitting a polynomial)
n50 <- n50[]

The model

This is the formal definition of the model that I am going to use:

 \log \lambda_i = \beta_0 + \beta_1 Elevation_i + \beta_2 Elevation_i^2

 n_i \sim Poisson(\lambda_i)

The index  i identifies each grid cell (data point).  \beta_0 -  \beta_2 are model coefficients, and  n_i is the observed number of individuals in each grid cell.

The notation roughly reads as: The logarithm of  \lambda_i is a function of the elevation and the regression coefficients. The observed numbers of individuals in each grid cell is an outcome of a Poisson-distributed random process with cell-specific parameter  \lambda_i .

I recommend writing down such a formal definition for any statistical model that you are going to use. It will tell you everything about the model's assumptions, and the fitted model will be easier to interpret.

Fitting the model using glm()

Fitting my model with the glm() function is easy. You just need to specify that the data are drawn from the Poisson distribution and that  \lambda_i is modelled in logarithmic space. Specifying family = "poisson" will do exactly that:

m.glm <- glm(n50 ~ elev50 + pow.elev50, family = "poisson")

## Call:
## glm(formula = n50 ~ elev50 + pow.elev50, family = "poisson")
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -6.98   -3.28   -1.40    1.32   17.87  
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.19436    0.02032   157.2   <2e-16 ***
## elev50       0.00441    0.02255     0.2     0.84    
## pow.elev50  -0.42186    0.02085   -20.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for poisson family taken to be 1)
##     Null deviance: 3953.0  on 199  degrees of freedom
## Residual deviance: 3391.3  on 197  degrees of freedom
## AIC: 4170
## Number of Fisher Scoring iterations: 5

I will then use the fitted model to make a smooth prediction curve of  \lambda_i :

elev.seq <- seq(-3, 2, by = 0.05) <- data.frame(elev50 = elev.seq, pow.elev50 = elev.seq^2)
new.predict <- predict(m.glm, newdata =, type = "response")

And here I plot the data and the predicted  \lambda_i (red line):

plot(elev50, n50, cex = 1, col = "lightgrey", pch = 19, ylab = "# of Individuals", 
    xlab = "Scaled Mean Elevation")
lines(elev.seq, new.predict, col = "red", lwd = 2)


Advantages of glm()

  • Fast.
  • Simple.
  • It immediately gives you AIC, SEs, R2 and the other cool stuff.
  • It works well even on relatively big data.

Disadvantages of glm()

  • Not very flexible.
  • It is tricky to pull out prediction intervals. In my case I could use some combination of bootstrap and qpois(), but it would get quite messy in any case.
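To show how messy it gets, here is one way the bootstrap + qpois() combination could look. This is my own sketch on simulated data with the same polynomial structure (the variables x, y and x.seq are hypothetical, not the bei dataset):

```r
# My sketch of a bootstrap + qpois() prediction interval for a Poisson GLM,
# on simulated data of the same polynomial form as the model above.
set.seed(1)
x <- rnorm(200)
y <- rpois(200, lambda = exp(1 + 0.5 * x - 0.3 * x^2))
dat <- data.frame(x = x, y = y)
x.seq <- seq(-2, 2, by = 0.1)

# resample the data, refit the model, predict lambda along x.seq:
boot.lambda <- replicate(200, {
    b <- dat[sample(nrow(dat), replace = TRUE), ]
    m <- glm(y ~ x + I(x^2), family = "poisson", data = b)
    predict(m, newdata = data.frame(x = x.seq), type = "response")
})

# combine estimation uncertainty (bootstrap quantiles of lambda)
# with Poisson sampling noise (qpois):
lo <- qpois(0.025, apply(boot.lambda, 1, quantile, probs = 0.025))
hi <- qpois(0.975, apply(boot.lambda, 1, quantile, probs = 0.975))
```

Note how the interval has to be assembled by hand from two separate sources of uncertainty; the MCMC approach below gets the same thing almost for free.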

Fitting the model by maximum likelihood

First, I will define the log-likelihood function for the polynomial Poisson regression:

LogLike <- function(dat, par) {
    beta0 <- par[1]
    beta1 <- par[2]
    beta2 <- par[3]
    # the deterministic part of the model:
    lambda <- exp(beta0 + beta1 * dat$x + beta2 * (dat$x^2))
    # and here comes the negative log-likelihood of the whole dataset, given the
    # model:
    LL <- -sum(dpois(dat$y, lambda, log = TRUE))
    return(LL)
}

Then I need to set the initial values for the optimization procedure:

beta0 <- rnorm(1)
beta1 <- rnorm(1)
beta2 <- rnorm(1)
par <- c(beta0, beta1, beta2)

I will coerce my data for my LogLike() function:

dat <- data.frame(y = n50, x = elev50)

And now I can run the likelihood maximization using the optim() function: <- optim(par = par, fn = LogLike, dat = dat)

## $par
## [1]  3.194276  0.004546 -0.421969
## $value
## [1] 2082
## $counts
## function gradient 
##      120       NA 
## $convergence
## [1] 0
## $message
## NULL

Note: I am using the scaled (zero mean, unit variance) predictor elev50. This is vital if you are using a GLM with a log link function. If you try to run the optim() function on the raw (non-scaled) data, it won't work.

And finally, plotting the data and the fitted model:

plot(elev50, n50, cex = 1, col = "lightgrey", pch = 19, ylab = "# of Individuals", 
    xlab = "Scaled Mean Elevation")
new.predict <- exp($par[1] +$par[2] * elev.seq +$par[3] * 
    elev.seq^2)
lines(elev.seq, new.predict, col = "red", lwd = 2)


Advantages of likelihood optimization

  • More flexible than glm() - you can modify your models as much as you want and you will be able to fit them.
  • Often faster than MCMC.

Disadvantages of likelihood optimization

  • The optimization algorithm may crash, or it can get stuck at a local optimum.
  • Difficult to get prediction intervals (or any measure of uncertainty).

Fitting the model by MCMC in JAGS

MCMC stands for Markov Chain Monte Carlo sampling. It can be used to estimate posterior distributions of model parameters (i.e. to “fit a model”) in a Bayesian setting. The most common flavors of MCMC are Metropolis-Hastings algorithm and Gibbs sampling.
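To demystify the sampling itself, here is a toy random-walk Metropolis sampler written from scratch. This is my own illustration of the general idea, not what JAGS does internally; it estimates the log-mean b of some Poisson counts, where y ~ Poisson(exp(b)):

```r
# A toy Metropolis sampler (my illustration of MCMC, not JAGS internals):
# sample the posterior of b, where y ~ Poisson(exp(b)), vague normal prior.
set.seed(1)
y <- rpois(100, lambda = 5)
log.post <- function(b) {
    sum(dpois(y, lambda = exp(b), log = TRUE)) + dnorm(b, 0, 10, log = TRUE)
}

n.iter <- 5000
chain <- numeric(n.iter)
chain[1] <- 0                                    # arbitrary starting value
for (i in 2:n.iter) {
    proposal <- chain[i - 1] + rnorm(1, 0, 0.1)  # random-walk proposal
    # accept with probability min(1, posterior ratio):
    if (log(runif(1)) < log.post(proposal) - log.post(chain[i - 1])) {
        chain[i] <- proposal
    } else {
        chain[i] <- chain[i - 1]                 # otherwise stay put
    }
}
exp(mean(chain[1001:n.iter]))  # posterior mean of lambda, close to the true 5
```

The accepted values of b form the Markov chain; after discarding the burn-in, the chain is a sample from the posterior. JAGS automates exactly this kind of machinery (and much smarter variants of it).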
I will use the MCMC sampler in JAGS to fit the model, which in R is accessed conveniently through the rjags library:

library(rjags)
Now I will create the list data for JAGS: <- list(N.cells = length(n50), n50 = n50, elev50 = elev50)

And this is the model written in the JAGS (BUGS) language, which is similar to R but not identical:

cat("
  model
  {
        # priors
        beta0 ~ dnorm(0,0.001)
        beta1 ~ dnorm(0,0.001)
        beta2 ~ dnorm(0,0.001)

        # likelihood
        for(i in 1:N.cells)
        {
          n50[i] ~ dpois(lambda[i])
          log(lambda[i]) <- beta0 + beta1*elev50[i] + beta2*pow(elev50[i],2)
          # this part is here in order to make nice prediction curves:
          prediction[i] ~ dpois(lambda[i])
        }
  }
  ", file="model.txt")

I have actually dumped the code into a file.

Here I specify the parameters that will be monitored during the MCMC sampling:

params <- c("beta0", "beta1", "beta2", "prediction")

Compiling the model:

jm <- jags.model("model.txt", data =, n.chains = 3, n.adapt = 1000)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
##    Graph Size: 1607
## Initializing model

You usually need to throw away the initial samples (the so-called “burn-in” phase):

update(jm, n.iter = 1000)

And here I am sampling from the posteriors and I am saving the samples for inference:

jm.sample <- jags.samples(jm, variable.names = params, n.iter = 1000, thin = 1)

I can plot the Markov chains of the three regression coefficients, and their posterior density plots which are marginal distributions of the chains:

plot(as.mcmc.list(jm.sample$beta0), main = "Beta_0")


plot(as.mcmc.list(jm.sample$beta1), main = "Beta_1")


plot(as.mcmc.list(jm.sample$beta2), main = "Beta_2")


Here I pull out a summary for an individual parameter, e.g.  \beta_2 :

summary(as.mcmc.list(jm.sample$beta2))
## Iterations = 1:1000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 1000 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
##           Mean             SD       Naive SE Time-series SE 
##      -0.422169       0.021159       0.000386       0.000675 
## 2. Quantiles for each variable:
##   2.5%    25%    50%    75%  97.5% 
## -0.465 -0.436 -0.422 -0.408 -0.382

I pull out the predictions and the 95% Prediction Intervals:

predictions <- summary(as.mcmc.list(jm.sample$prediction))
prds <- data.frame(sc50 = scale2(elev50), predictions$quantiles)
prds <- prds[order(prds[, 1]), ]

And here I plot it all:

plot(scale2(elev50), n50, cex = 1, col = "lightgrey", pch = 19, ylab = "# of Individuals", 
    xlab = "Scaled Mean Elevation")
lines(prds[, 1], prds[, 2], lwd = 2)
lines(prds[, 1], prds[, 4], lwd = 2, col = "red")
lines(prds[, 1], prds[, 6], lwd = 2)

legend("topleft", legend = c("95% P.I.", "lambda_i"), col = c("black", "red"), 
    lwd = c(2, 2))


You can see that the estimated parameter values match those from glm() and from the ML optimization very well. The striking result is that the data are clearly over-dispersed. Prediction intervals are really good at showing that – the data simply spread far beyond the black P.I. boundaries.

Advantages of MCMC

  • Flexible - you can modify your models as much as you want and still effectively fit them.
  • Reliable. It will never get stuck on a local optimum.
  • Great in pulling out uncertainties of all kinds (e.g. in the form of Prediction Intervals).
  • Even though the MCMC procedure is complicated, the inference based on the posterior distributions is very easy and intuitive.

Disadvantages of MCMC

  • Often slow. For more complex models or large datasets it can be a pain.
  • It may be tedious to code and debug.


The three approaches gave roughly the same mean predicted values and the same mean estimates of the model parameters. In contrast to glm() and ML optimization, MCMC enabled me to monitor the full posterior distribution of predictions, which includes both the uncertainty of the model estimation (given mostly by sample size) and the uncertainty given by the variance of the Poisson distribution.

The model obviously is not ideal: the data are clearly over-dispersed. Negative binomial or quasi-Poisson models would probably be more appropriate.
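For completeness, here is how those two alternatives could be fitted. This is a sketch of mine on simulated, deliberately overdispersed counts (not the bei data), using glm.nb() from the MASS package:

```r
# My sketch: fitting quasi-Poisson and negative binomial models to counts
# that are deliberately overdispersed (variance much larger than the mean).
library(MASS)  # for glm.nb()
set.seed(1)
x <- rnorm(200)
y <- rnbinom(200, mu = exp(2 + 0.5 * x), size = 1)

m.quasi <- glm(y ~ x, family = "quasipoisson")
m.nb <- glm.nb(y ~ x)

# the quasi-Poisson dispersion parameter; values much larger than 1
# indicate overdispersion relative to the Poisson:
summary(m.quasi)$dispersion
```

The quasi-Poisson model keeps the Poisson mean structure but inflates the standard errors by the estimated dispersion; the negative binomial model instead adds an explicit extra-variance parameter.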

An additional next thing to explore would be spatial dependence (spatial autocorrelation).

Making high-resolution biodiversity maps from low-res maps

This post advertises our new Ecological Applications paper which is in press.

Imagine that there were a tool that could make hi-res images out of low-res ones, just like this:

Such a tool would be really useful for creating maps of things for which we have only very crude (low-res or coarse-grain) spatial information, like this:

In geographical ecology most maps are low-res. Atlases of species distributions are compiled at grains of roughly 10 x 10 km (country-level atlases) or 50 x 50 km (EU-wide atlases), or they have the form of scale-free blob maps that become reliable only after they are fitted to a grid no finer than 100 x 100 km. Then there are the compiled point record data (e.g. GBIF). Ten years ago the founding fathers of "niche modelling" established the dubious paradigm that point records are fine-grain data. This has led to a diarrhoea of studies using the twisted combo of presence/pseudo-absence modelling, probability thresholding and pattern-recognition techniques. They have ignored the fact that point records can be scaled up to areal units that correctly represent the uncertainty about the position of presences and absences of a species. But then there is a problem: the resulting maps are too coarse.

So can we refine the coarse-grain maps?

Two years ago I started to work on it. I knew about the paper by McInerny & Purves (2011), who came up with a statistically correct way to infer species' niches from unobserved (latent) fine-grain variation in coarse-grain environmental data. I felt that there had to be a way to apply the approach the other way round: one should be able to make the unobserved (latent) fine-grain species distribution conditional on the observed fine-grain environment and the observed coarse-grain presences and absences, so that we can get this (the maps show the distribution of the American three-toed woodpecker):


With the help of Jonathan Belmaker, Adam Wilson and Walter Jetz I found the solution. We have published it here. We also have a study in review which we did over a larger span of scales, and which adds spatial autocorrelation, informative priors and prediction uncertainty into the picture.

But then I thought: wait, if it is possible to do this with individual species distributions, then maybe... maybe we can refine biodiversity per se. We more or less know how the number of species scales with area. Maybe we can use the species-area relationship (SAR) to map the (unobserved) fine-grain patterns of species richness conditional on the (observed) fine-grain environmental conditions, and through SAR also conditional on the observed coarse-grain richness.

And I did it. This is how it looks for the map of species richness of South African birds (from the paper in press in Ecological Applications):


It works. Of course, the approach now needs to be refined, critically examined, tested and improved by the broader community. It could be implemented in a more user-friendly software tool (for now you need to know OpenBUGS, JAGS, or likelihood optimization). Yet the most exciting prospect is that there are other phenomena that in theory could be "downscaled", especially outside of ecology. And maybe someone with a sense for business (not me) can commercialize the method.

I am open to suggestions and collaboration.



Overview of alternative approaches to downscaling:

  • McPherson, J. M., W. Jetz, and D. J. Rogers. 2006. Using coarse-grained occurrence data to predict species distributions at finer spatial resolutions: possibilities and limitations. Ecological Modelling 192: 499–522.

Example uses of the alternative approaches:

  • Araújo, M. B., W. Thuiller, P. H. Williams, and I. Reginster. 2005. Downscaling European species atlas distributions to a finer resolution: implications for conservation planning. Global Ecology and Biogeography 14:17–30.
  • Bombi, P., and M. D’Amen. 2012. Scaling down distribution maps from atlas data: a test of different approaches with virtual species. Journal of Biogeography.
  • Niamir, A., A. K. Skidmore, A. G. Toxopeus, A. R. Muñoz, and R. Real. 2011. Finessing atlas data for species distribution models. Diversity and Distributions 17:1173–1185.


  • Azaele, S., S. J. Cornell, and W. E. Kunin. 2012. Downscaling species occupancy from coarse spatial scales. Ecological Applications 22:1004–1014.
  • Hurlbert, A. H., and W. Jetz. 2007. Species richness, hotspots, and the scale dependence of range maps in ecology and conservation. PNAS 104:13384–13389.
  • Kunin, W. E. 1998. Extrapolating species abundance across spatial scales. Science 281:1513–1515.
  • McInerny, G., and D. Purves. 2011. Fine-scale environmental variation in species distribution modelling: regression dilution, latent variables and neighbourly advice. Methods in Ecology and Evolution 2:248–257.

Self citations:

  • Keil, P., J. Belmaker, A. M. Wilson, P. Unitt, and W. Jetz. 2013. Downscaling of species distribution models: a hierarchical approach. Methods in Ecology and Evolution 4:82–94.
  • Storch, D., P. Keil, and W. Jetz. 2012. Universal species-area and endemics-area relationships at continental scales. Nature 488:78–81.
  • Keil, P., and W. Jetz. In press. Downscaling the environmental associations and spatial patterns of species richness. Ecological Applications.

Do simple models lead to generality in ecology? Opinion of a simpleton

Evans et al. have a paper in Trends in Ecology and Evolution with this abstract:

Modellers of biological, ecological, and environmental systems cannot take for granted the maxim ‘simple means general means good’. We argue here that viewing simple models as the main way to achieve generality may be an obstacle to the progress of ecological research. We show how complex models can be both desirable and general, and how simple and complex models can be linked together to produce broad-scale and predictive understanding of biological systems.

I noticed the paper because of Bob O'Hara's post. It took Bob three pages of dense text about giraffes and parrots to deal with it - the TREE paper is iffy and inspiring at the same time. But what surprises me is the complete absence of any statistical perspective, both in the original paper and in the post. So here is my take:

Of course simple models are not always more general. To get the obvious extremes and eccentricities out of the way: the simplest model is a constant (an 'invariant'), surely not very useful in ecology, which is nothing but variation. Also, some simple models are simply out of context, which I guess is Bob's example of modelling the head-perching behavior of parrots with a model of giraffe (bi)cycling.

In statistical practice, models can indeed rely on simplifications that are too radical: fitting a line through clearly hump-shaped data will almost always lead to wrong predictions, and such a model cannot be generalized. A more complex polynomial would be better. In contrast, making a model overly complex is called overfitting, and it also leads to poor generality. An example is when one tries to capture stochastic variation with a deterministic model.
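To illustrate with simulated data (a Python sketch; all numbers are made up): in-sample fit always improves as the polynomial degree grows, which is exactly why fit alone says nothing about generality:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 40)
y = 1 - x**2 + rng.normal(0, 0.3, x.size)   # clearly hump-shaped data with noise

mse = {}
for degree in (1, 2, 10):                   # a line, a parabola, a wiggly polynomial
    coefs = np.polyfit(x, y, degree)
    mse[degree] = np.mean((y - np.polyval(coefs, x)) ** 2)
    print(f"degree {degree}: in-sample MSE = {mse[degree]:.3f}")
```

The in-sample error keeps shrinking with complexity, yet the degree-10 polynomial is clearly the least general of the three: it models the noise, not the hump.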

Importantly, neither simplicity alone nor model fit alone guarantees generality, but both together do. Evans et al. do not mention it (for unknown reasons), but statistics has a direct way to measure the presence of this simplicity-fit tandem. It is called out-of-sample prediction performance, and it can be calculated by cross-validation or approximated by AIC, BIC, DIC and the like. In short, it measures how well a model can be generalized to data that were not used to fit it. Incidentally, I happen to have a very simple post on that.

So to answer the question that Evans et al. pose in their title: the generality of any model, measured by its out-of-sample prediction performance, may follow various relationships with complexity, measured as the number of estimated parameters. What I usually observe is a hump-shaped relationship between generality and complexity. Simple models may not always be more general, but above a certain complexity threshold generality can only decrease.
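This hump can be produced directly with a minimal leave-one-out cross-validation sketch (Python, simulated hump-shaped data; the specific degrees and noise level are illustrative assumptions): the straight line underfits, the high-degree polynomial overfits, and out-of-sample error is lowest somewhere in between:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 40)
y = 1 - x**2 + rng.normal(0, 0.3, x.size)   # hump-shaped data with noise

def loo_cv_mse(x, y, degree):
    """Leave-one-out cross-validation: fit on n-1 points, predict the held-out one."""
    errors = []
    for i in range(x.size):
        mask = np.arange(x.size) != i
        coefs = np.polyfit(x[mask], y[mask], degree)
        errors.append((y[i] - np.polyval(coefs, x[i])) ** 2)
    return np.mean(errors)

cv = {d: loo_cv_mse(x, y, d) for d in (1, 2, 5, 10)}
for d, e in sorted(cv.items()):
    print(f"degree {d}: out-of-sample MSE = {e:.3f}")
```

Out-of-sample error, unlike in-sample fit, punishes both the too-simple and the too-complex model, which is precisely the simplicity-fit tandem at work.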

I believe that Albert Einstein got it right:

Make everything as simple as possible, but not simpler.

The last three words are absolutely crucial.


Evans MR, Grimm V, Johst K, Knuuttila T, de Langhe R, Lessells CM, Merz M, O’Malley MA, Orzack SH, Weisberg M, Wilkinson DJ, Wolkenhauer O, & Benton TG (2013). Do simple models lead to generality in ecology? Trends in Ecology & Evolution. DOI: 10.1016/j.tree.2013.05.022

The joy and martyrdom of trying to be a Bayesian

Some of my fellow scientists have it easy. They use predefined methods like linear regression and ANOVA to test simple hypotheses; they live in the innocent world of bivariate plots and lm(). Sometimes they notice that the data have odd histograms and they use glm(). The more educated ones use generalized linear mixed effect models.

A complete workflow, from the initial data massage to model fitting and output of results, can easily fit on one page of an R script. And even when more is needed, it rarely takes more than a second to run.

Now here is a warning: Do not go beyond that if you plan to be happy. Stay with lm() and glm(), and use mixed-effect models only when the referees bully you too much. If the referees of your paper start to use buzzwords such as 'hierarchical', 'prior' or 'uncertainty', withdraw and promptly send the manuscript to another journal. You will publish faster and more, and you will get cited more because more people will actually understand your results.

I was a fool to go beyond that. Initially, I just wanted to understand what mixed effect models REALLY are. Maybe you know the confusion: What the hell is the difference between random and fixed effects? How on earth am I supposed to ever confidently specify a correct mixed effect model structure using the bloody formula representation in lme4 and nlme libraries in R? Why is it so difficult to get expert advice on that? Why are all the relevant textbooks written for statisticians and not for humans? How should I interpret the results?

I ended up crunching through book after book. And I found myself studying probability distributions, figuring out what exactly their role is in statistical models. I came to understand the difference between data and models, I finally got a grip on likelihood, and I realized that there are at least two definitions of probability itself.

At a certain moment I had a satori. There was a little click and it all unfolded backwards. And I started to see things with new eyes. It was like being on a trip. I saw t-test, I saw ANOVA, I saw linear regression, Poisson log-linear regression, logistic regression, and finally I saw what some call mixed-effect models. But I saw them as arbitrary and modifiable categories, building blocks, templates. They were all just probability distributions connected by algebra.
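A small example of what I mean by "probability distributions connected by algebra" (a Python sketch with simulated data): ordinary linear regression is just y_i ~ Normal(a + b·x_i, σ), and the least-squares line is exactly the line that maximizes that Normal likelihood. Swap the Normal for a Poisson with a log link and you have Poisson regression; the templates differ only in the distribution and the algebra:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, x.size)  # simulated data

def loglik(a, b, sigma=1.0):
    """Linear regression written as a probability model: y_i ~ Normal(a + b*x_i, sigma)."""
    mu = a + b * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2))

b_hat, a_hat = np.polyfit(x, y, 1)  # ordinary least-squares estimates
print(f"OLS fit: a = {a_hat:.2f}, b = {b_hat:.2f}")
print("perturbing the OLS line only lowers the Normal log-likelihood:",
      loglik(a_hat, b_hat) > loglik(a_hat + 0.3, b_hat) > loglik(a_hat + 0.6, b_hat))
```

For a fixed σ the log-likelihood is a concave quadratic in (a, b), so the least-squares solution and the maximum-likelihood solution are one and the same thing seen through two lenses.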

The world of my ideas started to beg for unrestricted intercourse with the world of the data on my screen. I felt liberated, able to translate any hypothesis into an exact formal representation, and these representations could be properly parameterized and fitted to the data, because there is MCMC.

Once I was able to see probability as intensity of belief, I no longer saw any controversy in the use of informative priors. So straightforward! P-values, test statistics, null hypotheses, randomization tests, standard errors and the bootstrap ended up in the garbage - not for being incorrect, they just seemed too ad hoc. Machine learning, neural networks, regression trees, random forests, copulas and data mining seemed like primitive (and dangerous) black boxes used by those who had not yet seen the light.
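MCMC plus an informative prior in a nutshell: a minimal random-walk Metropolis sketch in Python (toy numbers, illustrative only; real analyses would go through OpenBUGS, JAGS or similar), estimating a proportion from 7 successes in 20 trials with an informative Beta(4, 4) prior centred on 0.5:

```python
import numpy as np

rng = np.random.default_rng(4)
k, n = 7, 20                      # 7 successes in 20 trials
a, b = 4.0, 4.0                   # informative Beta(4, 4) prior centred on 0.5

def log_post(p):
    """Log posterior (up to a constant): Binomial likelihood times Beta prior."""
    if not 0 < p < 1:
        return -np.inf
    return (k * np.log(p) + (n - k) * np.log(1 - p)
            + (a - 1) * np.log(p) + (b - 1) * np.log(1 - p))

# random-walk Metropolis sampler
p, samples = 0.5, []
for _ in range(50_000):
    prop = p + rng.normal(0, 0.1)                      # propose a nearby value
    if np.log(rng.uniform()) < log_post(prop) - log_post(p):
        p = prop                                       # accept, else keep current p
    samples.append(p)
post = np.array(samples[5_000:])                       # drop burn-in
print(f"posterior mean {post.mean():.3f} vs analytic {(k + a) / (n + a + b):.3f}")
```

Because the Beta prior is conjugate here, the sampler can be checked against the exact posterior mean (7 + 4)/(20 + 4 + 4) ≈ 0.393; note how the informative prior pulls the estimate away from the raw proportion 7/20 = 0.35 toward 0.5.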

It was the time of the joy of boundless possibilities.

This was all several manuscripts, four conferences, heaps of papers and thousands of lines of code ago. Almost two years of reality ago.

Recently, I have realized that:
  • I spend days massaging the data so that they fit the OpenBUGS or JAGS requirements.
  • I spend weeks to months thinking about how exactly I should implement my models.
  • I spend days to weeks trying to make OpenBUGS or JAGS run without crashes - JAGS at least gives me a hint of what went wrong, OpenBUGS is almost impossible to debug.
  • It costs me more effort to explain my models and results to people. In manuscripts, where I used to write "I fitted a generalized linear model (Poisson family, log link) to the data using glm() in R", I now have a page describing the model structure, an extra paragraph describing how my chains converged, and I plague my manuscripts with equations. Even if the model is not that complex, it puts readers off, including my co-authors.
  • Referees who have never used latent variables and hierarchical models have a hard time seeing through it all, and I have to spend a lot of energy and time explaining my methods in responses to them.
  • Even a simple re-run or correction of my analyses can take days or weeks. As a consequence, the publishing process is slower, dissemination of my results is less effective, and I expect to be cited less.
  • Oh, and the use of informative priors seems suspicious to almost everybody.

But don't get me wrong - I still love it! The joy of seeing my little posterior distributions popping out is enormous. I still think that it is all worth it: it is the price I pay for having everything exactly defined (hopefully) and transparent (hopefully), with the uncertainty properly quantified (hopefully). And since I have always been an idealist, it comforts me that I have at least ideological superiority. Pragmatically and practically speaking, though, it has been a martyrdom.

I guess that all emerging technologies and paradigms are like that: the beginnings are clumsy and bumpy. When computers appeared, they were one big pain to work with: you had to translate your code into holes in a punched card, you mailed (not e-mailed!) it to a computing center, and after several weeks you would receive an envelope with a punched card representing the results (perhaps a least-squares solution of a simple linear regression). Imagine debugging that! And you know what computers are today. With Bayesian hierarchical modelling it seems similar. Stan is a promising next step. I believe that the martyrdom is only temporary, and that it will pay off in the long run.