Species Distribution Models on the right track. Finally.

By Petr Keil | September 2, 2014

Species Distribution Models (SDMs), a.k.a. Niche Models, have always been a busy pile of confusion, ideology and misguided practices, with the real mess being the “presence-only” SDMs. Interestingly, when you go to conservation or biogeography symposiums, you can hear the established SDM gurus starting their talks with: “During the last ten years SDMs have greatly matured” or “SDMs are a useful tool to guide conservation management and to predict responses to climate change” and so on. You can even read this in the introductions of many papers.

I come from a post-communist country and hence these phrases remind me of the political propaganda that I used to hear as a kid; apparatchiks used to declare that everything worked great and everybody was happy. But just because someone says so, or because there is a crowd of SDM users and thousands of papers on SDM, it does not mean that things work well.

So what exactly is wrong with current SDMs?

  • First, with a few exceptions, the use of pseudo-absences can be justified neither in model fitting nor in model evaluation. Pseudo-absence is then just a code word for data fabrication.

  • Second, geographic bias in sampling effort must be dealt with. More specifically, even when we use a model that is suited for presence-only data (such as a point process model), the results will be heavily biased unless we account for the fact that species are more often observed around places accessible to humans (e.g. roads).

  • Third, detectability of species must be dealt with. If not accounted for, the results will always be compromised by the species being cryptic and difficult to observe (most species are), which will lead to underestimated probability of occurrence or habitat suitability. To complicate things even more, some species are easily observable in some environments, and cryptic in others.
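The detectability problem is easy to see in a toy simulation. The following is only a minimal sketch, not taken from any of the cited papers: the function name `naive_occupancy` and all the numbers (true occupancy 0.6, per-visit detection probability 0.3, a single visit per site) are assumptions chosen for illustration. It compares the true fraction of occupied sites with the "naive" fraction of sites where the species was actually recorded.

```python
import random


def naive_occupancy(psi, p, n_sites, n_visits, seed=0):
    """Simulate imperfect detection; return (true, naive) occupied fractions.

    psi: probability that a site is truly occupied (assumed value)
    p:   probability of detecting the species on one visit, given presence
    """
    rng = random.Random(seed)
    # Latent ecological state: is the species really there?
    occupied = [rng.random() < psi for _ in range(n_sites)]
    # Observation process: a site counts as "present" only if the species
    # is there AND it was detected on at least one visit.
    detected = [
        occ and any(rng.random() < p for _ in range(n_visits))
        for occ in occupied
    ]
    return sum(occupied) / n_sites, sum(detected) / n_sites


true_frac, naive_frac = naive_occupancy(psi=0.6, p=0.3, n_sites=5000, n_visits=1)
print(f"true occupancy: {true_frac:.2f}, naive estimate: {naive_frac:.2f}")
```

With a single visit the naive estimate is close to psi × p rather than psi, so the species looks much rarer than it is. Raising `n_visits` narrows the gap, which is the intuition behind the repeat-visit designs used in occupancy modelling.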

Statisticians have been aware of these problems (e.g. Phillips et al. 2008, Yackulic et al. 2013), but because there had been only partial or ad-hoc solutions, and because the SDM crowd is comfortable with whatever technique is user-friendly and R-packaged, everybody has been hammering “GBIF-like” data with “dismo-like” techniques (MaxEnt, random forests, neural networks, GAMs, ...) without a second thought.

To name some of the bits and pieces that do try to address some of the problems: To deal with the geographic bias in presence data, Phillips et al. (2008) suggest selecting the background “absence” data with the same geographic bias as in the presence data – a solution that makes some sense, but it is not an integral part of a statistical model, and hence it ignores the uncertainty that comes with it. The so-called occupancy modelling (e.g. Royle & Dorazio's book, MacKenzie et al.'s book) has had a whole set of statistical tools to deal with detectability of species, but the field has rarely touched the large-scale presence-only SDMs. Finally, it has been shown that the popular presence-only MaxEnt technique is at its core equivalent to point process models (Renner & Warton 2013 and a related blog post) – a finding that paves the road to the use of presence-only data in statistical parametric models, and subsequently in occupancy models.

And into all this, finally, comes Robert M. Dorazio with his new paper in Global Ecology and Biogeography. Dorazio puts all the pieces together to introduce a general model that addresses all three of the SDM problems outlined above, and he does it in a statistically rigorous and model-based (parametric) way. In short:

  • Dorazio's model is suitable for presence-only data because of the use of a point process component.

  • The model can explicitly incorporate information on distances to the nearest road (or other correlates of sampling effort).

  • The model has separated the observation process from the latent ecological processes, addressing the issue of detectability of species.

  • The technique allows incorporation of data from systematic surveys. I have always felt that a dozen systematically surveyed locations were worth hundreds of presence-only incidental observations. Now we have a way to benefit from both kinds of data in a single model.
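To make the first three points more tangible, here is a minimal simulation sketch of a "thinned" Poisson point process, in the spirit of the model described above but not Dorazio's actual code. Everything in it is an assumption for illustration: space is cut into cells, `x` is a hypothetical habitat covariate, `w` is a hypothetical scaled distance to the nearest road, and the coefficients `beta` and `alpha` are made up. The true number of individuals per cell is Poisson with intensity exp(beta0 + beta1·x), and each individual ends up in the presence-only data with a probability that declines with distance to the road.

```python
import math
import random


def poisson(rng, mean):
    """Draw one Poisson variate (Knuth's method; fine for small means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate(n_cells=2000, beta=(1.0, 1.0), alpha=(1.0, -3.0), seed=1):
    """Return one (w, n_true, n_obs) tuple per cell; all inputs are assumed."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_cells):
        x = rng.uniform(-1, 1)  # habitat covariate (hypothetical)
        w = rng.uniform(0, 1)   # scaled distance to nearest road (hypothetical)
        # Latent ecological process: intensity of individuals in the cell.
        lam = math.exp(beta[0] + beta[1] * x)
        # Observation process: probability that an individual gets recorded.
        b = 1 / (1 + math.exp(-(alpha[0] + alpha[1] * w)))
        n_true = poisson(rng, lam)
        # Independent thinning: keep each individual with probability b.
        n_obs = sum(rng.random() < b for _ in range(n_true))
        rows.append((w, n_true, n_obs))
    return rows


def recorded_share(cells):
    """Fraction of the true individuals that made it into the records."""
    return sum(r[2] for r in cells) / sum(r[1] for r in cells)


rows = simulate()
near = [r for r in rows if r[0] < 0.5]
far = [r for r in rows if r[0] >= 0.5]
print(f"recorded near roads: {recorded_share(near):.2f}")
print(f"recorded far from roads: {recorded_share(far):.2f}")
```

The habitat is the same near and far from roads in this sketch, yet the records are concentrated near roads. A model that ignores the thinning step would read that pattern as a habitat preference. Fitting the model means going the other way, from `n_obs` back to `beta` and `alpha`, and as discussed in the comments below, presence-only data alone cannot pin down the absolute intensity without extra information on the observation process.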

Other highlights of the approach are:

  • Because it is parametric, we can estimate full posterior distributions of the parameters, enabling probabilistic statements about ecological hypotheses, and enabling the use of model selection techniques based on likelihood (such as AIC, DIC).

  • The approach is flexible and can be extended. The model outlined by Dorazio can indeed fail in many cases just because it omits some important ecological processes (e.g. species interactions, dispersal limitations, non-linear responses to environment, spatial autocorrelation …). However, these can easily be incorporated on top of the core structure.

  • Because the model can be fitted in a Bayesian framework, prior information on known habitat preferences can be used to improve the model (see my paper here for example).

To summarize: Thanks to Dorazio, we finally have the core of a rigorous statistical tool that overcomes many of the critical and previously ignored problems with SDM. We had to endure almost 15 years of confusing literature to get here, and I feel that now the field has the potential to be on the right track to maturity. And so it is up to us to test the approach, improve it, and write R packages.


4 thoughts on “Species Distribution Models on the right track. Finally.”

  1. Adam Wilson

    The only new thing there, I think, is that he combined two data types, one of which had enough information to estimate p(observation|presence), which is almost the same as requiring absences... But this is the direction we need to go (using all the data we have), so it's a nice step forward. hSDM (http://cran.r-project.org/web/packages/hSDM/index.html) also has a Poisson point process model degraded by multiple imperfect observations (hSDM.ZIP.iCAR, which also estimates spatial effects and handles habitat transformation), but I'm not sure how it would perform if you had no zero observations. I think it would have the same unidentifiability problems as Dorazio explains in his paper (you can't get absolute intensities if you don't know p(observation|presence)). But the paper exaggerates a bit to call the model "presence only" when it also requires data collected using "a protocol that is informative of both abundance and detectability", which is to say, a form of absence data. I also found the paper a bit enigmatic, for example: "I have established that Pr(N(B) = n) = exp(−μ(B)) (μ(B))^n / n!" which sounds like something exotic and new on first pass, when it could have said "I modelled the number of individuals as a Poisson process" or even "n ~ Poisson(μ(B))". This forced me to confirm that he was actually just talking about a standard distribution that most simply call by name...

    But I do like the combining data bit, that's new, and I think it's where we should be headed...

    1. Petr Keil Post author

      Hi Adam,
      I am not that familiar with hSDM, and I can see that I should be. Great tip, thanks!

  2. Vitezslav Moudry

    Hi Petr,
    great post. I started to think that there is "nobody" with your opinion and that "everybody" is good at overlooking the shortcomings. But I am not that optimistic. For example, there is still no way to evaluate models (see Meynard and Kaplan 2012 - The effect of a gradual response to the environment on species distribution modeling performance).


