Simple template for scientific manuscripts in R markdown

By | March 12, 2015

I've made a really simple template for the classical manuscript format for R markdown and knitr. Here are the resulting .pdf and .html.

The template contains the four usual components of any scientific manuscript:

  • equations (using LaTeX syntax)
  • table with caption (done with the kable function from knitr, but you can also use xtable)
  • figure with caption
  • citations and references (done with the knitcitations package)

The template uses the Methods in Ecology and Evolution reference style, which is stored in the mee.csl file.

The template has neither line numbers nor wide line spacing. To add these you will have to edit the .tex file, i.e. you will need to learn a little bit of LaTeX.
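
For example, adding something like the following to the preamble of the .tex file should do the trick (a minimal sketch using the standard lineno and setspace LaTeX packages):

  \usepackage{lineno}    % line numbers
  \linenumbers
  \usepackage{setspace}  % wide line spacing
  \doublespacing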

How to use the template?

  1. Go to the GitHub repository.
  2. Save the .Rmd and .csl files to your local folder.
  3. Open the .Rmd file with RStudio.
  4. Edit freely.
  5. Hit the Knit HTML or Knit PDF button.

GAM splines now easy in JAGS and OpenBUGS. An example on 2D spatial data

By | March 11, 2015

Last week I met Simon Wood, creator of the mgcv package, which is THE tool for fitting Generalized Additive Models (GAMs) in R.

Simon brought my attention to the function jagam, which he has just added to mgcv. The function translates the ‘spline’ or ‘smooth’ component of a GAM formula into BUGS code, meaning that the flexibility of GAMs is now available for routine MCMC model fitting.

I often deal with geographically structured (spatially explicit) data, and so I am excited by the prospect of using jagam to build spatially explicit hierarchical models.

Why should spatial statisticians care about jagam?

Because fitting spatial splines in JAGS and OpenBUGS has so far been a pain. Yet from my experience, and from personal communication with others (e.g. C. Dormann, B. O’Hara, C. Beale, S. Wood), splines are a well-behaved way to model spatial autocorrelation that can be easily examined and visualized separately from the rest of the model.

Splines can also be a handy alternative to the popular Conditional Autoregressive Models (CAR) that are available in OpenBUGS, but unavailable in JAGS. Hence, the clarity and portability of JAGS is now available to spatial modellers.

The aim of this post

Here I will demonstrate the jagam function in action. I will fit a simple Poisson GAM (with X and Y coordinates as predictors) to spatially explicit count data. I will also check if I get the same expected values from mgcv and JAGS, given the same model.

These are the packages that I will need:

  library(mgcv)   # fits GAMs
  library(spatstat) # the source of the example data
  library(raster) # for operations with rasters
  library(R2jags) # interface between R and JAGS

The data

I will use the example dataset bei from the spatstat package. The data are positions of 3605 individual trees of Beilschmiedia pendula (Lauraceae) in a 1000 by 500 metre rectangular sampling region in the tropical rainforest of Barro Colorado Island. The data are stored in a planar point pattern (ppp) object.

Let’s plot the data:

  par(mai=c(0.5,0.3,0.3,0))
  plot(bei, cex=0.1, main=NULL)

I will aggregate the data into a raster of 25 x 50 grid cells; each grid cell gives the count of individual trees that fall within the cell:

  # setting up an empty raster covering the 1000 x 500 m sampling region
  ext <- extent(0, 1000, 0, 500)         # spatial extent of the raster
  empty <- raster(ext, nrow=25, ncol=50) # empty raster
  
  # aggregating the point data into the raster
  xy <- data.frame(x = bei$x, y = bei$y)
  rst <- rasterize(xy, empty, fun = "count")
  
  # replacing the NA values by 0
  rst[is.na(rst)] <- 0
  
  # extracting the cell values and their coordinates to a data.frame
  coord <- xyFromCell(rst,1:ncell(rst))
  count <- extract(rst, 1:ncell(rst))
  all.data <- data.frame(coord, count=count)

This is the resulting rasterized dataset, with point locations of the trees plotted on top. The color gradient shows counts of trees in each grid cell.

  plot(rst, axes=FALSE)
  points(xy, cex=0.1)

Standard GAM in mgcv

This is the standard way to fit a 2D spline of the X and Y coordinates in mgcv:

  # the gam model with s() indicating that I fit splines
  space.only <- gam(count~s(x, y), data=all.data, family = "poisson")
  # extraction of the predictions
  preds.mgcv <- as.vector(predict(space.only, type = "response"))

  # putting the predictions into a raster
  rst.mgcv <- rst
  rst.mgcv[] <- preds.mgcv

This is the predicted surface on a map:

  plot(rst.mgcv, axes=FALSE)
  points(xy, cex=0.1)

The jagam function in action

The main point here is that the new jagam function takes the GAM formula and converts it into a piece of BUGS code. The resulting code can then be run in JAGS, or even in OpenBUGS. Here it is in action:

  jags.ready <- jagam(count~s(x, y), 
                      data=all.data, 
                      family="poisson", 
                      file="jagam.bug")

The jagam function does two things: (1) it creates an object that contains the data in a list format that can be passed directly to the jags function (package R2jags), and (2) it writes the BUGS model definition into a file. That makes it really easy to fit GAM splines in JAGS. The idea is that more complex hierarchical structures can then be added directly into the BUGS code.
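
For orientation, this is roughly what the returned object contains (the element names in the comments are what recent versions of mgcv return; they may differ slightly in other versions):

  names(jags.ready)            # expected: "pregam", "jags.data", "jags.ini"
  names(jags.ready$jags.data)  # should include the objects referenced in jagam.bug: y, n, X, S1, zero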

Let’s have a look into the jagam.bug file:

  readLines("jagam.bug")
##  [1] "model {"                                                        
##  [2] "  eta <- X %*% b ## linear predictor"                           
##  [3] "  for (i in 1:n) { mu[i] <-  exp(eta[i]) } ## expected response"
##  [4] "  for (i in 1:n) { y[i] ~ dpois(mu[i]) } ## response "          
##  [5] "  ## Parameteric effect priors CHECK tau is appropriate!"       
##  [6] "  for (i in 1:1) { b[i] ~ dnorm(0,0.001) }"                     
##  [7] "  ## prior for s(x,y)... "                                      
##  [8] "  K1 <- S1[1:29,1:29] * lambda[1]  + S1[1:29,30:58] * lambda[2]"
##  [9] "  b[2:30] ~ dmnorm(zero[2:30],K1) "                             
## [10] "  ## smoothing parameter priors CHECK..."                       
## [11] "  for (i in 1:2) {"                                             
## [12] "    lambda[i] ~ dgamma(.05,.005)"                               
## [13] "    rho[i] <- log(lambda[i])"                                   
## [14] "  }"                                                            
## [15] "}"

Fitting the model in JAGS

Here I fit the Bayesian model by calling the jags function from the R2jags package. I will monitor the expected values (mu) in each grid cell of the raster.

  model.fit <- jags(data=jags.ready$jags.data, 
               model.file="jagam.bug",
               parameters.to.save=c("mu"),
               n.chains=3,
               n.iter=1000,
               n.burnin=500)
## module glm loaded
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
##    Graph Size: 42982
## 
## Initializing model
  # extracting the fitted means
  preds.mu <- as.vector(model.fit$BUGSoutput$mean$mu)

  # inserting the fitted means to the raster
  rst.jags <- rst
  rst.jags[] <- preds.mu

Let’s plot the JAGS results:

  plot(rst.jags, axes=FALSE)
  points(xy, cex=0.1) # adding the positions of individual trees

It looks almost identical to the mgcv output. Let’s have a closer look.

mgcv vs JAGS

Here I compare the predicted counts from the two approaches.

  plot(preds.mgcv, preds.mu,
       xlab="Counts from mgcv",
       ylab="Counts from JAGS")
  abline(a=0, b=1, col="red", lwd=2)

The predicted counts are not identical; JAGS predicts somewhat higher counts than mgcv. I am not sure if this is important – maybe it has something to do with my priors, I don’t know. But I guess that I don’t care that much, as long as the predictions are roughly similar.
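
For a quick numerical check of that rough similarity, something like this could be run on the objects created above (a hypothetical follow-up, not part of the original analysis):

  cor(preds.mgcv, preds.mu)        # should be close to 1
  summary(preds.mu / preds.mgcv)   # ratio of the JAGS to the mgcv predictions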

12 nifty tips for scientists who use computers

By | February 16, 2015

Simple things are good. Here is a list of 12 things that I find simple and useful, yet not many of my colleagues use them. The list is R-biased.

Knitr. An intuitive tool to integrate R and text to make reports with fancy fonts, figures, syntax-highlighted R code and equations. If you use RStudio, then learning knitr amounts to remembering a bunch of simple commands and knowing where the ‘knit’ button is. My dream is that, in the near future, every published paper will have an appendix consisting of an .html file produced by knitr, containing all of the data processing and analytical steps. Look for the .Rmd file format in RStudio.

Focus. A distraction-free text editor – there is nothing but you and the text on your screen. Great if you need to focus 100% on writing.

Solarized color theme. Some combinations of colors on the screen are unhealthy. A black font on a white background is about as healthy as gazing into a light bulb; imagine doing that for several hours almost every day for the rest of your life. Luckily, there is a color theme that is designed to be gazed into: the Solarized theme. It has several unique properties such as selective contrast, symmetric lightness, and both dark and light versions, and it is available for Firefox, Visual Studio, RStudio, Emacs, Vim, Gnome Terminal, Gedit and others.

Rcpp. As Hadley Wickham puts it in his excellent post, sometimes R is just not fast enough and you can’t speed things up with apply-like functions and vectorization (e.g. in stochastic simulations). Rcpp is an R-to-C++ interface that speeds things up considerably, and it only requires elementary knowledge of C++. Check out the post, and you will see that Rcpp’s “syntactic sugar” and its data formats are accessible even to C++ dummies. Rcpp has solved all of my remaining issues with R's speed.

Remarkable. Markdown editor. Simple.

R presentations. Yes, you can make presentations in R (example here). They are minimalistic, can be saved as .html (making them portable), and they can easily be provided on-line; useful for R-related teaching. Look for the .Rpres file format in RStudio.

Google URL shortener. Have you ever wrestled with an inconveniently long URL, trying to fit it into a manuscript, into a comment in MS Word, teaching materials, or just into an e-mail? This on-line utility from Google can help.

LibreOffice Draw. It is like Inkscape, CorelDraw or Adobe Illustrator, but it is easier to use, and it is free. Although lightweight, it has exactly the functionality one needs for drawing and editing scientific schemes, plots and illustrations. You can learn it in 5 minutes.

LaTeX equations. LaTeX is great, but it can also be a nightmare, especially when used in collaborative projects. However, the LaTeX equation syntax has potential even if you don't write your manuscripts in LaTeX. LaTeX equations can be used in Knitr (see above), and there is a plug-in for LibreOffice that supports LaTeX equations. I also use uncompiled LaTeX equations, e.g. when commenting on collaborative manuscripts, in reviews for journals, or in e-mails – whenever I need to fit a mathematical idea into a piece of plain text.
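
For example, a discrete-time population-growth model typed as plain, uncompiled LaTeX in an e-mail or a review might look like this (my own illustration):

  N_{t+1} = N_t \exp[r(1 - N_t / K)]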

Dropbox file history. Did you know that Dropbox saves the complete history of every file in the Dropbox folder for up to a month? It is a kind of backup that has already saved my day several times after I accidentally overwrote the contents of an important file. It is also quite handy for collaborative projects – it works as a primitive version control system, without the technical challenge of Git.

Nature and Science yearly subscription. Nature is $119 for post-docs and $199 for humans; Science is $50 for post-docs and $99 for humans; EU prices are slightly higher. I think it’s a good deal, considering that you get a printed issue every week. I love leafing through an actual paper issue of Science or Nature when I have a spare 10 minutes. The first part of the journal is news and popular science, keeping me up to date with what's hot.

SSH. An easy and safe way to connect computers, which effectively eliminates the need to buy expensive laptops. Here is a visionary suggestion: buy only cheap laptops, and if you need computational power, connect the laptop through SSH to a second-hand, bulky, noisy, and powerful server in your closet. Or actually, how about dumping the laptop and using a Raspberry Pi instead? Or even better, you can dump your server too, or make it more useful by converting it into a grow-box for hemp; then you can buy a dozen second-hand PlayStation 3s, use their GPUs to make a PlayStation cluster with 96 cores, and access it through SSH from your Raspberry Pi – all for $1000.

Bayesian Biostatistics 2015

By | February 1, 2015

Authors: Petr Keil, Jan Smyčka

This post contains materials for a Bayesian stats course (2-4 Feb 2015 at the Faculty of Science, Charles University, Prague, Czech Republic).

The complete materials and their source codes (Markdown and R) are on my GitHub repository. The lectures can also be accessed directly as follows:

DAY 1

  1. Introduction: Course contents, pros and cons of Bayes, necessary skills.
  2. Normal distribution: Introducing likelihood and deviance on the Normal example.
  3. Poisson distribution: The didactic simplicity of Poisson and its likelihood. Likelihood maximization. Probability mass function.
  4. How to calculate posterior probability (by Jan Smyčka): The principle of MCMC sampling.

DAY 2

  1. Bayesian resources: Overview of software, books and on-line resources.
  2. First real model in JAGS: Fitting Poisson distribution to forest larvae count data.
  3. T-test: First model with 'effects', hypothesis testing, derived variables.
  4. Linear regression: Extracting credible and prediction intervals.
  5. Time series analysis: Fitting a more complex function. Auto-regression.

DAY 3

  1. ANOVA part 1 and part 2: Fixed effects, random effects, the effect of small sample size, and shrinkage.
  2. Site occupancy model: accounting for imperfect detection.
  3. Other probability distributions (binomial, beta, gamma, multivariate normal, negative binomial).
  4. Convergence diagnostics, publishing Bayesian papers.
  5. Concluding remarks.

Bayesian PCA

By | January 5, 2015

Authors: Jan Smycka, Petr Keil

This post introduces the experimental R package bPCA, which I developed with Jan Smycka, who actually came up with the idea. We do not guarantee that the idea is correct, and there certainly are bugs – we invite anyone to prove us wrong, or to contribute.

Rationale of bPCA

Here is a summary of classical (‘non-Bayesian’) PCA from Wikipedia:

PCA can be thought of as fitting an n-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipse is small, then the variance along that axis is also small, and by omitting that axis and its corresponding principal component from our representation of the dataset, we lose only a commensurately small amount of information.

To find the axes of the ellipse, we must first subtract the mean of each variable from the dataset to center the data around the origin. Then, we compute the covariance matrix of the data, and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then, we must orthogonalize the set of eigenvectors, and normalize each to become unit vectors. Once this is done, each of the mutually orthogonal, unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data.

The key terms above (mean, covariance matrix, eigenvalues, eigenvectors) should evoke the Multivariate Normal (MVN) distribution. Jan's idea was that we can think of the data as being MVN-distributed. If we make such an assumption, we can fit the MVN to the data using MCMC, get posterior distributions of means and covariances, and subsequently also posterior distributions of PCA loadings, scores and eigenvalues.
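
A rough, self-contained illustration of this idea (my own sketch: it uses direct sampling under a noninformative conjugate prior instead of the JAGS sampler that bPCA actually uses, and it anticipates the iris data used in the example below):

  X <- as.matrix(iris[, 1:4])
  n <- nrow(X)
  A <- crossprod(scale(X, scale = FALSE))   # centred sums-of-squares matrix
  eig.draws <- replicate(1000, {
    # one (approximate) posterior draw of the covariance matrix (inverse Wishart)
    Sigma <- solve(rWishart(1, df = n - 1, Sigma = solve(A))[, , 1])
    eigen(Sigma)$values                     # -> one posterior draw of the PCA eigenvalues
  })
  # each column of eig.draws is one posterior draw of the four eigenvalues
  apply(eig.draws, 1, quantile, probs = c(0.025, 0.5, 0.975))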

Potential advantages

  • Prior information about associations between variables can be provided (through the covariance matrix).
  • We can assess the stability (uncertainty) of a PCA, especially when only small sample sizes are available.
  • The posterior distributions of PCA scores and loadings can be extracted for further use, e.g. wherever propagation of uncertainty is of interest.

Potential limitations

  • The algorithm is extremely slow for data with many variables (over 100).
  • The assumption of the underlying MVN can pose a problem.
  • In the summary and plotting functions we may be breaking some important ties between values in the MCMC chains. The cleanest solution (which we avoid) would perhaps be to calculate the eigenvalues, loadings and scores all in the JAGS sampler.

Installing the bPCA package

These two commands will install the package from the GitHub repository directly to your computer:

library(devtools)
install_github("bPCA", username="petrkeil")

To load the package:

library(bPCA)

Note: bPCA depends on R2jags, MASS, Matrix, coda, reshape, lattice.

Example use of bPCA package

Here we apply bPCA to the first four columns in the built-in dataset iris.

  summary(iris[,1:4])
##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5

We fit the MVN distribution to the data using the sim.bPCA function. This function is essentially a wrapper around a piece of JAGS code. Each element of the mean vector and the covariance matrix is represented by an MCMC chain. Note: The function sim.bPCA also provides an option to supply a prior covariance matrix.

So here we go:

bPCA.fitted <- sim.bPCA(iris[,1:4], n.chains=3, n.iter=1000, n.burnin=500)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
##    Graph Size: 195
## 
## Initializing model

We can extract posterior distributions of eigenvalues and percentages of explained variance, and show them as boxplots:

eigenvalplots.bPCA(bPCA.fitted, iris[,1:4])


## $Eigenvalues
##        V1              V2              V3               V4        
##  Min.   : 2.98   Min.   :0.166   Min.   :0.0495   Min.   :0.0169  
##  1st Qu.: 3.93   1st Qu.:0.222   1st Qu.:0.0711   1st Qu.:0.0214  
##  Median : 4.25   Median :0.240   Median :0.0765   Median :0.0231  
##  Mean   : 4.40   Mean   :0.247   Mean   :0.0773   Mean   :0.0233  
##  3rd Qu.: 4.61   3rd Qu.:0.261   3rd Qu.:0.0828   3rd Qu.:0.0249  
##  Max.   :71.65   Max.   :2.521   Max.   :0.1158   Max.   :0.0336  
## 
## $Exp.var
##        V1             V2             V3              V4        
##  Min.   :88.6   Min.   :2.90   Min.   :0.103   Min.   :0.0335  
##  1st Qu.:91.9   1st Qu.:4.73   1st Qu.:1.501   1st Qu.:0.4513  
##  Median :92.5   Median :5.24   Median :1.672   Median :0.5065  
##  Mean   :92.5   Mean   :5.30   Mean   :1.688   Mean   :0.5091  
##  3rd Qu.:93.2   3rd Qu.:5.81   3rd Qu.:1.855   3rd Qu.:0.5596  
##  Max.   :96.9   Max.   :8.93   Max.   :2.760   Max.   :0.8391

We can see that the boxplots are pretty narrow, especially for the explained variability, indicating stability and robustness (low uncertainty) of the results, which comes with the large sample size of the iris dataset.

Now it is getting a bit wild: we can make a PCA biplot for the 2.5%, 50% and 97.5% quantiles of the posterior distributions:

biplots.bPCA(bPCA.fitted, iris[,1:4], axes.to.plot=1:2, scale=0.1)

You can try to play around with the scale parameter, which controls the relative width of the plotted arrows.

We can also extract the MCMC chains (posterior distributions) of the PCA loadings from the bPCA.fitted object:

loadings.chain <- get.loadings.chain.bPCA(bPCA.fitted, iris[,1:4])

And we can summarize the posterior distributions of the PCA loadings by quantiles (2.5%, 50% and 97.5%):

summary.loadings.bPCA(loadings.chain, vars.to.get=1:4, axes.to.get=1:2)


## $`0.025`
##               Comp.1  Comp.2
## Sepal.Length -0.3778 -0.7076
## Sepal.Width  -0.1170 -0.8034
## Petal.Length -0.8538 -0.2152
## Petal.Width  -0.3600 -0.1539
## 
## $`0.5`
##                Comp.1   Comp.2
## Sepal.Length  0.35944 -0.61570
## Sepal.Width  -0.08496 -0.70617
## Petal.Length  0.85619  0.15163
## Petal.Width   0.35708  0.03422
## 
## $`0.975`
##               Comp.1 Comp.2
## Sepal.Length 0.38875 0.7299
## Sepal.Width  0.06304 0.7700
## Petal.Length 0.86535 0.2088
## Petal.Width  0.37403 0.1411

And finally, here we extract and summarize (by quantiles) the MCMC chains for PCA scores from the bPCA.fitted object:

scores.chain <- get.scores.chain.bPCA(bPCA.fitted, iris[,1:4])
summary.scores.bPCA(scores.chain, axes.to.get=1:2)
## $`0.025`
##      Comp.1   Comp.2
## 1   -2.9945 -0.42636
## 2   -3.0220 -0.26703
## 3   -3.1968 -0.24904
## 4   -3.0526 -0.41919
## 5   -3.0390 -0.43360
## 6   -2.5890 -0.83825
## 7   -3.1278 -0.19983
## 8   -2.9350 -0.26937
## 9   -3.1923 -0.68180
## 10  -2.9799 -0.20667
## 11  -2.8150 -0.74704
## 12  -2.9195 -0.13083
## 13  -3.0930 -0.32803
## 14  -3.5298 -0.62226
## 15  -2.9551 -1.28148
## 16  -2.6988 -1.43877
## 17  -2.9336 -0.91308
## 18  -2.9587 -0.41733
## 19  -2.5089 -0.96688
## 20  -2.8954 -0.61933
## 21  -2.6200 -0.48977
## 22  -2.8515 -0.53664
## 23  -3.5246 -0.26053
## 24  -2.6117 -0.19862
## 25  -2.6615 -0.13543
## 26  -2.8144 -0.23504
## 27  -2.7785 -0.23201
## 28  -2.8721 -0.47107
## 29  -2.9495 -0.41841
## 30  -2.9388 -0.29616
## 31  -2.8948 -0.30064
## 32  -2.7204 -0.51132
## 33  -2.9557 -0.92059
## 34  -2.9089 -1.19445
## 35  -2.9444 -0.21309
## 36  -3.1756 -0.18107
## 37  -2.9330 -0.70611
## 38  -3.1107 -0.37718
## 39  -3.2867 -0.59402
## 40  -2.8992 -0.33360
## 41  -3.0811 -0.37141
## 42  -3.1590 -1.03161
## 43  -3.3032 -0.45204
## 44  -2.7161 -0.29379
## 45  -2.5167 -0.53797
## 46  -3.0221 -0.34264
## 47  -2.8453 -0.60550
## 48  -3.1465 -0.32943
## 49  -2.8514 -0.68201
## 50  -3.0119 -0.21535
## 51  -1.3544 -0.77324
## 52  -0.9839 -0.39528
## 53  -1.5238 -0.59575
## 54  -0.2482 -0.88436
## 55  -1.1261 -0.15818
## 56  -0.6631 -0.48704
## 57  -1.1463 -0.37089
## 58  -1.0575 -1.06439
## 59  -1.0924 -0.31648
## 60  -0.3286 -0.78917
## 61  -0.8199 -1.31909
## 62  -0.5461 -0.17291
## 63  -0.2857 -0.61527
## 64  -1.0173 -0.19642
## 65  -0.4827 -0.31457
## 66  -0.9862 -0.54663
## 67  -0.6861 -0.43422
## 68  -0.2783 -0.39260
## 69  -0.9584 -0.61611
## 70  -0.2919 -0.64001
## 71  -1.1538 -0.18503
## 72  -0.3963 -0.13155
## 73  -1.3208 -0.40556
## 74  -0.9519 -0.25286
## 75  -0.7605 -0.22630
## 76  -0.9509 -0.40839
## 77  -1.3793 -0.34063
## 78  -1.6020 -0.36191
## 79  -0.8444 -0.23378
## 80  -0.6128 -0.42286
## 81  -0.3763 -0.75971
## 82  -0.4977 -0.73371
## 83  -0.2528 -0.37217
## 84  -1.4004 -0.50287
## 85  -0.6105 -0.57284
## 86  -0.8563 -0.29098
## 87  -1.2754 -0.49006
## 88  -0.8350 -0.44652
## 89  -0.2866 -0.33572
## 90  -0.2512 -0.74019
## 91  -0.4798 -0.73345
## 92  -0.9275 -0.10595
## 93  -0.2729 -0.46296
## 94  -1.0123 -1.06837
## 95  -0.3746 -0.56662
## 96  -0.3625 -0.27651
## 97  -0.4028 -0.35773
## 98  -0.6848 -0.09396
## 99  -1.2136 -0.81660
## 100 -0.3258 -0.40892
## 101 -2.5713 -0.18013
## 102 -1.4324 -0.67011
## 103 -2.6626 -0.46756
## 104 -2.0020 -0.28095
## 105 -2.3867 -0.15590
## 106 -3.4503 -0.69482
## 107 -0.5311 -1.28247
## 108 -2.9790 -0.49010
## 109 -2.3466 -0.35346
## 110 -2.9888 -0.92759
## 111 -1.7069 -0.34849
## 112 -1.8304 -0.31074
## 113 -2.2086 -0.32870
## 114 -1.3618 -0.86831
## 115 -1.6072 -0.66105
## 116 -1.9468 -0.24438
## 117 -1.9890 -0.15275
## 118 -3.5840 -1.32729
## 119 -3.8369 -0.41337
## 120 -1.3120 -0.83754
## 121 -2.4772 -0.50360
## 122 -1.2186 -0.71013
## 123 -3.5489 -0.61123
## 124 -1.4162 -0.28498
## 125 -2.3237 -0.45995
## 126 -2.6697 -0.68501
## 127 -1.2880 -0.26053
## 128 -1.3251 -0.20711
## 129 -2.1520 -0.31536
## 130 -2.4393 -0.58100
## 131 -2.8874 -0.50626
## 132 -3.3338 -1.50967
## 133 -2.1877 -0.32492
## 134 -1.4748 -0.22420
## 135 -1.8005 -0.59277
## 136 -3.1378 -0.82452
## 137 -2.1887 -0.28358
## 138 -1.9458 -0.16154
## 139 -1.2017 -0.25808
## 140 -2.1556 -0.48238
## 141 -2.3572 -0.30983
## 142 -1.9740 -0.51809
## 143 -1.4324 -0.67011
## 144 -2.6091 -0.41233
## 145 -2.4668 -0.44260
## 146 -1.9861 -0.29961
## 147 -1.5468 -0.45892
## 148 -1.8046 -0.18531
## 149 -1.9449 -0.25447
## 150 -1.4185 -0.38366
## 
## $`0.5`
##        Comp.1    Comp.2
## 1   -2.671377 -0.259318
## 2   -2.699948  0.127838
## 3   -2.875261  0.091951
## 4   -2.732388  0.265896
## 5   -2.715839 -0.267358
## 6   -2.267067 -0.689012
## 7   -2.807008  0.038656
## 8   -2.612510 -0.104769
## 9   -2.871063  0.523541
## 10  -2.658185  0.066666
## 11  -2.492835 -0.587827
## 12  -2.599748 -0.002079
## 13  -2.772043  0.185107
## 14  -3.209090  0.452340
## 15  -2.629543 -1.115099
## 16  -2.371597 -1.283516
## 17  -2.609313 -0.751890
## 18  -2.635087 -0.252319
## 19  -2.184339 -0.817121
## 20  -2.574345 -0.458948
## 21  -2.296517 -0.332557
## 22  -2.530829 -0.378571
## 23  -3.202559 -0.070955
## 24  -2.289342 -0.045918
## 25  -2.342584  0.012187
## 26  -2.491869  0.099518
## 27  -2.455699 -0.075516
## 28  -2.549565 -0.309779
## 29  -2.626040 -0.251855
## 30  -2.618391  0.146485
## 31  -2.573522  0.156408
## 32  -2.396037 -0.351815
## 33  -2.633389 -0.757759
## 34  -2.584196 -1.035418
## 35  -2.622615  0.073934
## 36  -2.851514 -0.016561
## 37  -2.611149 -0.532321
## 38  -2.786797 -0.209118
## 39  -2.966101  0.432322
## 40  -2.575980 -0.170671
## 41  -2.756894 -0.202845
## 42  -2.831526  0.888092
## 43  -2.983116  0.283589
## 44  -2.392183 -0.135986
## 45  -2.196245 -0.388486
## 46  -2.700490  0.200079
## 47  -2.524641 -0.449335
## 48  -2.826260  0.173801
## 49  -2.529763 -0.521761
## 50  -2.689465 -0.048874
## 51   1.268302 -0.633451
## 52   0.915357 -0.280192
## 53   1.448554 -0.454749
## 54   0.171293  0.794701
## 55   1.074534 -0.030987
## 56   0.628135  0.378864
## 57   1.077573 -0.249134
## 58  -0.733562  0.968358
## 59   1.029810 -0.179587
## 60  -0.007162  0.680486
## 61  -0.491335  1.230741
## 62   0.497407  0.064422
## 63   0.253865  0.509018
## 64   0.971504  0.085857
## 65  -0.161857  0.219421
## 66   0.912530 -0.421342
## 67   0.645744  0.305669
## 68   0.223734  0.299880
## 69   0.932890  0.504053
## 70   0.042361  0.550371
## 71   1.101161  0.031869
## 72   0.343859  0.037803
## 73   1.284913  0.287485
## 74   0.908184  0.146444
## 75   0.700927 -0.104657
## 76   0.885841 -0.283379
## 77   1.318325 -0.189180
## 78   1.541962 -0.222255
## 79   0.799263  0.125325
## 80  -0.292353  0.334558
## 81  -0.058123  0.669880
## 82  -0.175684  0.646515
## 83   0.125068  0.281464
## 84   1.366494  0.377788
## 85   0.573948  0.431905
## 86   0.789565 -0.161653
## 87   1.204981 -0.362714
## 88   0.803343  0.331304
## 89   0.232008  0.228170
## 90   0.155279  0.646494
## 91   0.453024  0.633965
## 92   0.876799  0.009900
## 93   0.218582  0.370715
## 94  -0.687917  0.975420
## 95   0.344313  0.467736
## 96   0.318666  0.174674
## 97   0.362647  0.255025
## 98   0.628375 -0.001028
## 99  -0.891548  0.720374
## 100  0.285907  0.312411
## 101  2.513993 -0.017827
## 102  1.401073  0.525489
## 103  2.599619 -0.290050
## 104  1.957259  0.130347
## 105  2.335410  0.003607
## 106  3.379995 -0.480774
## 107  0.507080  1.137062
## 108  2.916433 -0.288510
## 109  2.307418  0.190578
## 110  2.897674 -0.730553
## 111  1.644758 -0.201914
## 112  1.789552  0.169510
## 113  2.149511 -0.171880
## 114  1.332655  0.727648
## 115  1.570282  0.475636
## 116  1.886920 -0.076666
## 117  1.935312 -0.018567
## 118  3.466451 -1.114330
## 119  3.780691 -0.178488
## 120  1.287930  0.719659
## 121  2.410029 -0.330961
## 122  1.184991  0.548114
## 123  3.484630 -0.383773
## 124  1.374922  0.162619
## 125  2.257178 -0.288434
## 126  2.596037 -0.502984
## 127  1.244322  0.138039
## 128  1.276839  0.070291
## 129  2.108444  0.157925
## 130  2.371822 -0.402584
## 131  2.827070 -0.305660
## 132  3.211227 -1.310769
## 133  2.144511  0.164440
## 134  1.430821  0.101343
## 135  1.767732  0.451554
## 136  3.059497 -0.620381
## 137  2.126724 -0.095723
## 138  1.889082 -0.025442
## 139  1.154406  0.117955
## 140  2.090047 -0.327417
## 141  2.296565 -0.136267
## 142  1.904761 -0.364884
## 143  1.401073  0.525489
## 144  2.545625 -0.228936
## 145  2.401245 -0.257039
## 146  1.928156 -0.144625
## 147  1.514905  0.333636
## 148  1.749335 -0.041428
## 149  1.883553 -0.075459
## 150  1.375362  0.230825
## 
## $`0.975`
##     Comp.1  Comp.2
## 1   2.7209 0.44289
## 2   2.7630 0.30172
## 3   2.9357 0.27139
## 4   2.7973 0.44087
## 5   2.7638 0.44578
## 6   2.3039 0.84233
## 7   2.8652 0.21497
## 8   2.6664 0.28450
## 9   2.9451 0.70411
## 10  2.7218 0.23763
## 11  2.5352 0.76290
## 12  2.6552 0.14467
## 13  2.8370 0.35944
## 14  3.2795 0.64695
## 15  2.6549 1.29832
## 16  2.3885 1.43890
## 17  2.6436 0.92330
## 18  2.6840 0.43231
## 19  2.2207 0.98106
## 20  2.6166 0.62115
## 21  2.3481 0.51324
## 22  2.5738 0.54063
## 23  3.2517 0.26344
## 24  2.3428 0.20624
## 25  2.4008 0.15423
## 26  2.5562 0.26648
## 27  2.5074 0.24240
## 28  2.5985 0.48893
## 29  2.6773 0.43585
## 30  2.6809 0.31586
## 31  2.6366 0.32441
## 32  2.4444 0.53355
## 33  2.6688 0.92609
## 34  2.6093 1.20232
## 35  2.6847 0.24085
## 36  2.9070 0.20396
## 37  2.6566 0.72740
## 38  2.8375 0.39013
## 39  3.0357 0.61786
## 40  2.6294 0.35106
## 41  2.8060 0.38525
## 42  2.9177 1.07874
## 43  3.0494 0.47405
## 44  2.4403 0.29198
## 45  2.2399 0.54073
## 46  2.7639 0.37436
## 47  2.5681 0.61354
## 48  2.8887 0.35392
## 49  2.5725 0.69657
## 50  2.7442 0.23392
## 51  1.6259 0.77721
## 52  1.2699 0.38337
## 53  1.8026 0.59511
## 54  0.5120 0.90289
## 55  1.4236 0.15780
## 56  0.9728 0.49476
## 57  1.4335 0.34394
## 58  0.8205 1.08165
## 59  1.3783 0.31694
## 60  0.3240 0.79577
## 61  0.5895 1.34399
## 62  0.8498 0.17806
## 63  0.5962 0.66153
## 64  1.3227 0.20592
## 65  0.2651 0.32794
## 66  1.2657 0.54676
## 67  0.9975 0.43328
## 68  0.5748 0.41577
## 69  1.2721 0.63917
## 70  0.3759 0.66086
## 71  1.4552 0.17508
## 72  0.6934 0.14742
## 73  1.6267 0.42088
## 74  1.2575 0.26630
## 75  1.0498 0.22889
## 76  1.2360 0.40861
## 77  1.6658 0.34556
## 78  1.8936 0.35100
## 79  1.1518 0.23905
## 80  0.3614 0.45265
## 81  0.2846 0.78422
## 82  0.2713 0.76330
## 83  0.4751 0.39045
## 84  1.7108 0.50471
## 85  0.9241 0.56841
## 86  1.1480 0.24912
## 87  1.5581 0.48878
## 88  1.1443 0.48011
## 89  0.5863 0.34242
## 90  0.4964 0.75552
## 91  0.7947 0.74466
## 92  1.2280 0.11246
## 93  0.5684 0.48309
## 94  0.7767 1.08867
## 95  0.6907 0.57685
## 96  0.6710 0.28806
## 97  0.7150 0.36698
## 98  0.9787 0.09422
## 99  0.9692 0.83704
## 100 0.6379 0.42038
## 101 2.8701 0.15128
## 102 1.7488 0.66326
## 103 2.9513 0.45023
## 104 2.3069 0.27417
## 105 2.6861 0.14832
## 106 3.7321 0.68620
## 107 0.8508 1.27931
## 108 3.2659 0.48569
## 109 2.6512 0.35755
## 110 3.2577 0.87889
## 111 2.0014 0.31212
## 112 2.1367 0.30647
## 113 2.5004 0.30524
## 114 1.6769 0.86390
## 115 1.9210 0.64564
## 116 2.2409 0.19808
## 117 2.2829 0.12614
## 118 3.8236 1.29230
## 119 4.1255 0.41425
## 120 1.6293 0.85344
## 121 2.7664 0.46661
## 122 1.5331 0.69862
## 123 3.8341 0.61355
## 124 1.7216 0.28982
## 125 2.6141 0.41909
## 126 2.9529 0.66579
## 127 1.5925 0.25969
## 128 1.6272 0.20080
## 129 2.4590 0.30765
## 130 2.7253 0.58384
## 131 3.1748 0.50864
## 132 3.5691 1.48892
## 133 2.4953 0.31995
## 134 1.7774 0.23394
## 135 2.1071 0.60132
## 136 3.4170 0.81262
## 137 2.4814 0.22744
## 138 2.2398 0.12679
## 139 1.5062 0.24873
## 140 2.4452 0.46039
## 141 2.6505 0.27505
## 142 2.2603 0.48980
## 143 1.7488 0.66326
## 144 2.9007 0.36825
## 145 2.7576 0.39275
## 146 2.2785 0.26892
## 147 1.8586 0.46212
## 148 2.0979 0.15503
## 149 2.2382 0.19798
## 150 1.7287 0.37263

The cathedral, the bazaar, and the SNAFU principle

By | November 21, 2014

I've recently been exploring foundational ideas of Free and Open-Source Software (FOSS) culture, and I've found them relevant not only for software development, but also for academia. Here is something that I picked up for inspiration:

If you have no idea who Richard Stallman is, I recommend his TEDx talk on Free software, free society. I especially appreciate his point that teaching kids proprietary software is like teaching them to smoke – I arrived at the same conclusion years ago when I learned to use ESRI ArcGIS (a very expensive piece of proprietary software), only to realize that after leaving university I couldn't afford to use it any more. And only then did I realize that there are FOSS alternatives such as QGIS that are free and actually work better! So I wasted all that time learning something expensive that made me dependent, when I should have been learning something free and open that makes me independent.

I also recommend Eric S. Raymond's classic “The Cathedral and the Bazaar”. The piece is a bit dated, but it was very influential at the time, and for a reason. It offered a fresh view of how creative people can cooperate in a non-hierarchical, non-controlled way (the bazaar style), and how this can effectively produce good stuff. Raymond presents it as an alternative to the cathedral-building corporate style of management, and he illustrates it with credible real-world examples.

I immediately saw the university as a place where the bazaar style of cooperation should be especially encouraged, as long as the goal is to raise students to independence and to their full creative potential.

Also in Raymond's book, I found the idea of the SNAFU principle (standing for “Situation Normal, All Fucked Up”) exceptionally perceptive:

True communication is possible only between equals, because inferiors are more consistently rewarded for telling their superiors pleasant lies than for telling the truth. Creative teamwork utterly depends on true communication and is thus very seriously hindered by the presence of power relationships. The open-source community, effectively free of such power relationships, is teaching us by contrast how dreadfully much they cost in bugs, in lowered productivity, and in lost opportunities.

Quoting Wikipedia:

In any hierarchy (business, government, military etc) people and employees inevitably distort the truth of reports when dealing with their superiors, in order to avoid any punishment for relaying bad news. As a result, the superiors often operate from a distorted view of the situation, sometimes resulting in poor results.

I don't think that society can be made hierarchy-free – this is something we have already learned the hard way. However, Raymond's book made me think that people engaged in any creative cooperation, or in any cooperation that relies on accurate information transfer, should treat each other as equals. But do they? And do they in academia? Do professors treat PhD students as equals when working on a paper together? Do undergrads feel equal when asking a professor for advice? Do principal investigators give unbiased reports of their achievements to grant agencies? And what is the best way to equalize power-biased communication channels in academia?

On Theory in Ecology – Reading Marquet et al. (2014)

By | October 27, 2014

Marquet et al. have an essay in BioScience entitled “On theory in ecology”, with the main message being that we need more good theory; I agree 100%. The paper also presents an overview of important ecological theories and makes some good points about why theory is important. A notable one:

“Theory, etymologically, comes from the ancient Greek theoria, which means “contemplation” or “a viewing”. In that sense, a theory is a way of looking at the world and not necessarily a way of knowing how the world is.”

This sounds like a fundamental insight, yet it would perhaps deserve a deeper critical elaboration, maybe with an example – I'd really like to understand exactly what the ancient Greeks were up to.

Another important point:

“Efficient theories in ecology provide a known standard against which to measure natural phenomena. By a standard we mean a prediction of how the world would work if only the first principles of the theory are at work... without standards, no deviations or gaps in knowledge would be apparent, which would lead to scientific stagnation.”

All reviewers of scientific manuscripts should bear this in mind before dismissing a theory just because it seems simplistic, naïve, or ignorant of local specifics – maybe it is still valuable as a standard.

I was disappointed by the first two sections of the paper where theory is defined; specifically, the central definition on the first page is:

“We define a theory as a hierarchical framework that contains clearly formulated postulates, based on minimal set of assumptions, from which a set of predictions logically follows.”

First, it is not clear what is meant by “prediction”. I know that we can make “predictions” by (1) calculating the probability of a hypothetical observation given the theory, (2) simulating new data by “drawing” them as artificial outcomes of the theoretical process, or (3) taking some part of the theory and deducing (predicting) other parts of it. Am I the only one who finds the concept of prediction ambiguous?

Second, it is not clear what a “framework” is. I asked one of the authors (David Storch), and he explained that a theory is not just its formal definition; it is also the interpretation of that definition, and the formal definition and its interpretation together form the framework of the theory. Pity that this is not in the paper.

After a lengthy argument with David, we discovered a third aspect of theory: there is also the dark unknown area beyond each theory, where observations and phenomena are not yet captured by it. Even there we often have some intuition about the processes at work; for example, we can assume (by induction) that some unspecified additive random processes generate the observed pattern, leading to a normal distribution of measurements. Or we have a set of candidate explanations that have not yet been tested and compared. All of these guesses and intuitions should also be included as parts of the theoretical framework.

I disagree with:

“Ecologists and other scientists often use the terms model and theory indistinguishably … but they are fundamentally different.”

and

“Inductively revealed patterns do not, themselves, constitute a theory, and neither do statistical representations of data or model-fitting exercises.”

This is very 1990s. In the 1990s the toolset of available statistical models was limited to general or generalized linear models (t-test, regression, ANOVA, logistic regression, …), and of course these can only serve as crippled representations of ecological theories. Crippled, but not fundamentally different. It certainly is possible to derive a statistical representation of data from underlying principles; moreover, since the end of the 1990s it has actually been possible to derive very complex statistical representations of data.

What has changed since the 1990s? There has been a development (Clark, 2007; Bolker, 2008; Royle & Dorazio, 2008) in which ecological models have been liberated from simplistic structures such as bivariate regression; these structures are now used as building blocks of what can be called complex models … or simple theories. Statistics has been transformed from a set of disparate numerical recipes into a universal construction set of probability distributions and linear algebra. Models of this new generation are often hierarchical (sometimes also called multilevel), right in line with Marquet et al.'s (2014) concept of theory being implicitly hierarchical.
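
As a toy illustration of such building-block construction (my own sketch, not an example from the paper), a hierarchical Poisson model written in JAGS/BUGS notation can be assembled from nothing but probability distributions and a linear predictor:

  model {
    for (i in 1:n) {
      y[i] ~ dpois(lambda[i])                          # observation model
      log(lambda[i]) <- alpha[group[i]] + beta * x[i]  # linear predictor
    }
    for (j in 1:n.groups) {
      alpha[j] ~ dnorm(mu.alpha, tau.alpha)            # group-level (random) intercepts
    }
    mu.alpha ~ dnorm(0, 0.001)                         # hyper-priors
    tau.alpha ~ dgamma(0.001, 0.001)
    beta ~ dnorm(0, 0.001)
  }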

I propose that this new statistics has the potential to bring ecology closer to the state that physics has seen for ages: In physics, projects of any size, whether theoretical or experimental, require custom software, and scientists achieve a high level of quantitative expertise and everyone is partly a programmer and partly a mathematician (Bialek & Botstein, 2004) – quantitative skills are the essence of science, because they are THE way to formalize and test theories.

This contrasts with biology, where most researchers have only basic quantitative skills, test mostly verbal theories using pre-defined methods in commercial software packages, and require consultations with biostatisticians. Because of the lack of formal mathematical representations, the theories are fragmented and anecdotal, and people outside the specific fields hardly understand them. Sadly, biologists who diverge from this norm and like to demonstrate their quantitative skills can even be viewed as statistical machos or academic hipsters.

To conclude, I am with Marquet et al. (2014) concerning the importance of quantitative theories based on first principles. However, I think that the authors have missed important recent developments in statistics that can help expand the role of quantitative theory in ecology. There is an emerging new generation of scientists who are fluent in this new statistics, and the next step is to bring them together with world-class theoreticians such as Marquet et al. (2014).


PS: A couple of days after the publication of this post, Jeremy Fox wrote an exhaustive treatment of the Marquet et al. paper. Check it out!

PPS: I now have a personal definition of theory which I am happy with: "Theory is a set of models plus their relationships with each other, with other theories, and with observations. A model is a representation of an idea about how the observations came to be" (thanks to David and an anonymous physicist for inspiration).

Acknowledgement

I am grateful to David Storch for friendly review and discussions.

References

  • Bialek, W. & Botstein, D. (2004) Introductory science and mathematics education for 21st-century biologists. Science, 303, 788–790.
  • Bolker, B.M. (2008) Ecological models and data in R. Princeton University Press, Princeton.
  • Clark, J.S. (2007) Models for ecological data: An Introduction. Princeton University Press, Princeton.
  • Marquet, P.A., Allen, A.P., Brown, J.H., Dunne, J.A., Enquist, B.J., Gillooly, J.F., Gowaty, P.A., Green, J.L., Harte, J., Hubbell, S.P., O’Dwyer, J., Okie, J.G., Ostling, A., Ritchie, M., Storch, D. & West, G.B. (2014) On theory in ecology. BioScience, biu098.
  • Royle, J.A. & Dorazio, R.M. (2008) Hierarchical modelling and inference in ecology, Academic Press.

The man in the academic arena

By | October 20, 2014

Lately I went through a couple of ordinary academic failures. I had one manuscript rejected by three statistical journals in a row. I had another one rejected by Science, PNAS and PLoS Biology in quick succession. Interestingly, of all six submissions only Science actually sent the manuscript out for review (and then rejected it). Dammit.

Today, I gave a Monday seminar on "Processes vs events" at CTS, and it was rejected... by the audience. I thought I'd prepared everything well, but I messed up some key definitions at the beginning, and after four slides the audience just could not stand it and I couldn't continue. There I was, a self-taught statistical enthusiast speaking to professional theoretical physicists and mathematicians, and I wasted their time.

The obligatory lesson is: I should work harder, think smarter, explain myself more clearly, and learn from my errors. But that does not cheer one up. I need something that will raise my spirits, so that I can work tomorrow with an extra bit of energy to make my work shine. Something cheesy with plenty of pathos, something so optimistic that only an American could have written it. I need the excerpt from Theodore Roosevelt's speech "Citizenship in a Republic", delivered at the Sorbonne in Paris, France, on 23 April 1910:

It is not the critic who counts; not the man who points out how the strong man stumbles, or where the doer of deeds could have done them better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood; who strives valiantly; who errs, who comes short again and again, because there is no effort without error and shortcoming; but who does actually strive to do the deeds; who knows great enthusiasms, the great devotions; who spends himself in a worthy cause; who at the best knows in the end the triumph of high achievement, and who at the worst, if he fails, at least fails while daring greatly, so that his place shall never be with those cold and timid souls who neither know victory nor defeat.

Yeah! I feel much better now. Back to writing papers.

Center for Theoretical Study, Prague: more intense than ivy league

By | October 8, 2014

I have recently been lucky enough to relocate from Yale to the Center for Theoretical Study in Prague, Czech Republic. The institute brings together philosophers, mathematicians, physicists, sociologists, economists, biologists and others; it is similar to the Santa Fe Institute or the Princeton Institute for Advanced Study, and its aim is to stimulate interdisciplinary approaches to science, encouraging new ways of interaction and cooperation between disciplines.

The institute hosts the toughest kind of seminar that I have seen; it is called the Thursday seminar. The guest speaker has one hour to present, but the audience reserves the right to interrupt at any time. And people use it a lot – sometimes the speaker doesn't even get beyond the first half before his or her basic assumptions and premises are in ruins. Then follows another hour of discussion in which whatever remains is disintegrated and buried under monologues from the audience. The guest is then taken to a pub, filled up with heavy Czech food and beer (yes, we drink beer at lunch), and subjected to another round of academic inspection.

Apart from the Thursday seminars, irregular lunches, and occasional random encounters in the coffee corner, we all meet again at the Monday seminars, where we contemplate a common unifying theme. Since we are in a center for theoretical study, the theme is anything but applied.

This year's theme is “Process vs. event”.

When I joined, the Monday seminar series had already been in progress, and after I'd figured out what was going on, I immediately got my usual but deceptive feeling that “I've got it right and they are all confused” (blame it on my youth), and I could feel the rising urge to come out with my most beloved heavy caliber: statistics. Here is how I very briefly presented it:

Statistics clearly distinguishes models (theories, hypotheses), which are what we think, from data, which are what we see. Models are the processes, or ideas in the Platonic sense, further classifiable as stochastic or deterministic. Data are the events, sometimes also called outcomes of the processes. Statistics tells the story of how the data came to be by making statements about P(model|data) or P(data|model). That's it.

The discussion that followed was passionate and direct, with hard edges and egos smoking hot. After a few minutes I lost track of the fate of what remained of my own contribution, and of whether anybody actually cared, but I learned that some people are definitely less confused than I had initially thought.

Here are some notable ideas:

Philosopher Alexandr Matoušek noted that seeing is a peculiar term – our mind can see an idea (model, theory) more clearly than the actual event (data). This somehow resonates with my recent realization that theory is not necessarily a way of knowing how the world is, but rather a way of looking at the world (I have this from Marquet et al.'s recent piece in BioScience). I need to explore these connections more.

Alexandr Matoušek and mathematician Petr Kůrka also pointed to the problem that, ultimately, we will have to deal with stochastic processes in the light of the problem of free will (and the problem of whether it exists). This is a key unresolved issue, and it can generate some intense discussions in the future.

Novelist and philosopher Michal Ajvaz presented an attempt to outline the dichotomy between processes and events using dimensions (time, space, horizontal, vertical) and scale. Unifying this view with the statistical view will be a major and important undertaking, and I am really curious to see what comes out of it.

Paleoecologist Petr Pokorný pointed to the potential connection between the rarity of an event and its significance (I think that there is no such connection). He also brought up the butterfly effect story, which turned out to be thought-provoking, but it also generated considerable misunderstanding. I reckon that this story can serve as a particularly useful illustration of our ideas, if interpreted carefully.

Sociologist Zdeněk Konopásek reminded me that quantitative science, and especially statistics, can generate substantial suspicion among some social scientists – for them, statistics potentially reduces complex stories to mere counting, without any understanding of what's really going on. I tend to defend statistics. First, no matter how deep an understanding is, it is worthless if it is not formalized and communicated, and statistics (and mathematics) is a formal way to represent and communicate understanding. Second, statistics enables us to separate general principles from anecdotes, and while anecdotes can be entertaining, they are hard to build on. Finally, modern statistics is not only inductive and reductionist – it can also relate an individual observation to an existing model and investigate the probability (likelihood) of this observation given the model; this preserves the possibility that a single unusual observation can shoot down the whole model.

Mathematician Marie Větrovcová pointed me to the notion of complex conditional processes in which models are conditional on the data, parts of the models are conditional on other parts of the model, but data are never conditional on other data. I'd like to explore this idea in a more formal way.

The seminar left me excited and in a state of intellectual discomfort, which I guess is a good sign.

 

Species Distribution Models on the right track. Finally.

By | September 2, 2014

Species Distribution Models (SDMs), a.k.a. Niche Models, have always been a busy pile of confusion, ideology and misguided practice, with the real mess being the “presence-only” SDMs. Interestingly, when you go to conservation or biogeography symposia, you can hear established SDM gurus starting their talks with “During the last ten years SDMs have greatly matured” or “SDMs are a useful tool to guide conservation management and to predict responses to climate change”, and so on. You can even read this in the introductions of many papers.

I come from a post-communist country, and hence these phrases remind me of the political propaganda that I used to hear as a kid: apparatchiks used to declare that everything works great and everybody is happy. But just because someone says so, or because there is a crowd of SDM users and thousands of papers on SDMs, does not mean that things work well.

So what exactly is wrong with current SDMs?

  • First, with a few exceptions, the use of pseudo-absences cannot be justified either in model fitting or in model evaluation. Pseudo-absence is then just a code word for data fabrication.

  • Second, geographic bias in sampling effort must be dealt with. More specifically, even when we use a model that is suited for presence-only data (such as a point process model), the results will be heavily biased unless we account for the fact that species are more often observed around places accessible to humans (e.g. roads).

  • Third, detectability of species must be dealt with. If not accounted for, the results will always be compromised by species being cryptic and difficult to observe (most species are), which leads to underestimated probabilities of occurrence or habitat suitability. To complicate things even more, some species are easily observable in some environments and cryptic in others.

Statisticians have been aware of these problems (e.g. Phillips et al. 2008, Yackulic et al. 2013), but because there have been only partial or ad hoc solutions, and because the SDM crowd is comfortable with whatever technique is user-friendly and R-packaged, everybody has been hammering “GBIF-like” data with “dismo-like” techniques (MaxEnt, random forests, neural networks, GAMs, ...) without a second thought.

To name some of the bits and pieces that do try to address some of the problems: To deal with the geographic bias in presence data, Phillips et al. (2008) suggest selecting the background “absence” data with the same geographic bias as the presence data – a solution that makes some sense, but it is not an integral part of a statistical model, and hence it ignores the uncertainty that comes with it. So-called occupancy modelling (e.g. Royle & Dorazio's book, MacKenzie et al.'s book) has a whole set of statistical tools to deal with detectability of species, but the field has rarely touched large-scale presence-only SDMs. Finally, it has been shown that the popular presence-only MaxEnt technique is at its core equivalent to point process models (Renner & Warton 2013 and a related blog post) – a finding that paves the road to the use of presence-only data in parametric statistical models, and subsequently in occupancy models.

And into all this, finally, comes Robert M. Dorazio with his new paper in Global Ecology and Biogeography. Dorazio puts all the pieces together to introduce a general model that addresses all three of the SDM problems outlined above, and he does it in a statistically rigorous, model-based (parametric) way. In short (a simplified code sketch follows the list below):

  • Dorazio's model is suitable for presence-only data because of the use of a point process component.

  • The model can explicitly incorporate information on distances to the nearest road (or other correlates of sampling effort).

  • The model separates the observation process from the latent ecological processes, addressing the issue of detectability of species.

  • The technique allows incorporation of data from systematic surveys. I have always felt that a dozen systematically surveyed locations were worth hundreds of incidental presence-only observations. Now we have a way to benefit from both kinds of data in a single model.
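
To make the point-process-with-bias idea more concrete, here is a deliberately simplified, simulated sketch of my own (NOT Dorazio's actual model, which also handles detectability and is fitted hierarchically): gridded counts approximate an inhomogeneous Poisson point process, and the log-intensity contains both an environmental covariate and a sampling-bias covariate (distance to the nearest road).

  set.seed(1)
  n.cells   <- 1000
  env       <- rnorm(n.cells)                      # an environmental covariate
  dist.road <- rexp(n.cells)                       # a proxy for sampling effort
  lambda    <- exp(-1 + 1.5*env - 1*dist.road)     # true intensity, including the bias
  n.points  <- rpois(n.cells, lambda)              # simulated "presence-only" points per cell

  m <- glm(n.points ~ env + dist.road, family = poisson)
  # a crude bias correction: predict with the sampling-bias covariate held at zero
  eco.intensity <- exp(coef(m)["(Intercept)"] + coef(m)["env"] * env)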

Other highlights of the approach are:

  • Because it is parametric, we can estimate full posterior distributions of the parameters, enabling probabilistic statements about ecological hypotheses, and enabling the use of model selection techniques based on likelihood (such as AIC, DIC).

  • The approach is flexible and can be extended. The model outlined by Dorazio can indeed fail in many cases just because it omits some important ecological processes (e.g. species interactions, dispersal limitations, non-linear responses to environment, spatial autocorrelation …). However, these can easily be incorporated on top of the core structure.

  • Because the model can be fitted in a Bayesian framework, prior information on known habitat preferences can be put to good use to improve the model (see my paper here for an example).

To summarize: Thanks to Dorazio, we finally have the core of a rigorous statistical tool that overcomes many of the critical and previously ignored problems with SDMs. We had to endure almost 15 years of confusing literature to get here, and I feel that the field now has the potential to be on the right track to maturity. So it is up to us to test the approach, improve it, and write R packages.

 

Is my brilliant idea any good? I am not sure, so I've pre-printed it on PeerJ

By | July 24, 2014

As a scientist, what should I do when I encounter a seemingly fundamental problem that also seems strangely unfamiliar? Is it unfamiliar because I am onto something really new, or am I re-discovering something that has been around for centuries and that I have simply missed?

This is a short story about an exploration that began with such a problem, and led to this manuscript. It began one day when I was pondering the idea of probability:

Scientists often estimate the probability (P) of an event, where the event can be a disease transmission, species extinction, volcano eruption, particle detection, you name it. If P is estimated, there has to be some uncertainty about the estimate – every estimate is imperfect, otherwise it is not an estimate. I felt that the uncertainty about P has to be bounded: I couldn't imagine a high estimate of P, say P = 0.99, associated with high uncertainty. The reason is that increased uncertainty about P means a broader probability density of P, which pulls the mean (expected value) of that density towards P = 0.5. Conversely, as P approaches 0 or 1, the maximum possible uncertainty about P must decrease. Is there a way to calculate the bounds of this uncertainty exactly? Has anybody already calculated it?

First, I did some research. Wikipedia: nothing. I hit the books: Jaynes's Probability theory, Johnson et al.'s Continuous univariate distributions, statistical textbooks, all the usual suspects: nothing. Web of Science: a bunch of seemingly related papers, but no exact hit. ArXiv: more seemingly related, semi-legible papers by physicists and statisticians, but not exactly what I was looking for. After some time I put the search on hold.

Then one day, for a completely different reason, I was skimming through John Harte's Maximum Entropy Theory and Ecology, and it struck me. Maybe there is a MaxEnt function that can give the maximum uncertainty about P, given that we only have the value of P and nothing else. Back to the web and, finally, I found something useful by UConn mathematician Keith Conrad: a MaxEnt probability density function on a bounded interval with known mean.
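
To give the flavour of that result (this is my own rough sketch of the general idea, assuming the standard truncated-exponential form of the solution, not the exact notation of Conrad's note): the MaxEnt density on [0, 1] with a prescribed mean m has the form  p(x) \propto e^{\theta x} , and the parameter  \theta is pinned down by m, so it can be found numerically:

  # mean of the density proportional to exp(theta*x) on [0,1];
  # the limit at theta = 0 is 0.5 (the uniform case)
  maxent.mean <- function(theta)
  {
    ifelse(abs(theta) < 1e-8, 0.5, exp(theta)/(exp(theta)-1) - 1/theta)
  }
  # MaxEnt density on [0,1] with mean m
  dmaxent <- function(x, m)
  {
    # solve numerically for the theta that gives mean m
    theta <- uniroot(function(th) maxent.mean(th) - m, lower=-100, upper=100)$root
    if(abs(theta) < 1e-8) return(rep(1, length(x))) # m = 0.5 gives the uniform density
    theta*exp(theta*x)/(exp(theta)-1)
  }
  curve(dmaxent(x, m=0.9), from=0, to=1, ylab="density")  # mean close to 1
  curve(dmaxent(x, m=0.5), from=0, to=1, lty=2, add=TRUE) # mean of 0.5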

I adjusted the function for my problem and started to implement it in R, drafting some documentation on the side, and I realized that I had actually started working on a paper!

Then the doubts came. I discussed the problem with my colleagues, but I didn't learn much -- many well-intentioned suggestions, but no advice hit the core of the problem, which is: Is the idea any good? And anyway, as a wee ecologist with little formal mathematical education, do I even dare to enter the realm of probability theory, where the demi-gods have spoken pure math for centuries?

Then I gathered my courage and sent the draft to two sharp minds that I admire for the depth of their formal thinking (and for their long beards): Arnost Sizling the mathematician, and Bob O'Hara the biostatistician. For a week nothing happened; then I received two quite different and stimulating opinions: a semi-sceptical review from Arnost and a semi-optimistic email from Bob. I tried to incorporate their comments, but I am still unsure about the whole thing.

I don't have a clue whether and where such a thing should be published. It feels too long and complex for a blog post, yet somehow too trivial for a full-length research paper; it is perhaps too simple for a statistical journal, yet too technical for a biological journal.

And so I've decided to pre-print it on PeerJ for everybody to see. I guess that it could be the perfect place for it. If it is really new then it is citable and I can claim my primacy. If it is all bullshit, then people will tell me (hopefully), or they will perhaps collaborate and help me to improve the paper, becoming co-authors. It is an experiment, and I am curious to see what happens next.

Tailoring univariate probability distributions

By | June 26, 2014

This post shows how to build a custom univariate distribution in R from scratch, so that you end up with the essential functions: a probability density function, cumulative distribution function, quantile function and random number generator.

In the beginning all you need is the equation of the probability density function, from which everything else can be derived, sometimes analytically and always numerically. The analytical solutions may require some math skills, or may be impossible to find, but the numerical solutions are always feasible, require hardly any math, and coding them in R is easy with the uniroot() and integrate() functions.

Throughout the post I will use a simple exponential distribution as an example.

Probability density function (PDF)

You will need to know what a probability density (or mass) function is, and what the difference is between probability density and probability. I found the best intro to be this Khan Academy video (10 min). Here are a couple of additional points:

  • Probability density is relative probability.
  • Probability density function evaluated at a given data point is the point's likelihood.
  • Probability density can be greater than one (see the quick check below).
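
A quick numeric check of the last two points, using R's built-in dnorm() and dexp() (the three data points are made up just for the illustration):

  # probability density can exceed one:
  dnorm(0, mean=0, sd=0.1)  # about 3.99
  # density evaluated at observed data points gives their likelihoods,
  # and the product over independent observations is the likelihood of the sample:
  obs <- c(0.2, 0.5, 1.1)
  prod(dexp(obs, rate=2))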

Example: Exponential PDF

The formula for the exponential probability density function (PDF) is:

 p(x) = \lambda e ^ {-\lambda x}, \qquad x \in [0, \infty)

In the literature, small  p usually denotes probability density, while capital  P is used for probability.

We can code it in R:

  my.dexp <- function(x, lambda) lambda*exp(-lambda*x)
  my.dexp(x=0:5, lambda=2)
## [1] 2.0000000 0.2706706 0.0366313 0.0049575 0.0006709 0.0000908

And compare it with R's native PDF (it uses rate instead of  \lambda ):

  dexp(x=0:5, rate=2)
## [1] 2.0000000 0.2706706 0.0366313 0.0049575 0.0006709 0.0000908

Here is a plot of my PDF using R's built-in function curve():

  curve(my.dexp(x, lambda=2), from=0, to=2, main="Exponential PDF")

[Figure: the exponential PDF plotted with curve()]

Cumulative distribution function (CDF) - analytical solution

This function gives, for a given point  x , the area under the PDF curve to the left of that point. This area is the probability that an outcome  X will take a value lower than or equal to  x :

 P(X \le x) = \int_0^x \! p(\pi) \, \mathrm{d}\pi

I liked this video by stepbil (9 min), which shows how to flip between the PDF and the CDF, and why we need the dummy variable  \pi in the integral. I am really lame at calculus, and what really works for me is Sage – it is free, it finds analytical solutions of integrals, and it may be worth a shot before diving into numerical integration.

For the exponential distribution the CDF is:

 P(X \le x) = 1 - e ^ {- \lambda x}, \qquad x \in [0, \infty)

In R it is:

  my.pexp.anal <- function(x, lambda) 1- exp(-lambda*x)
  my.pexp.anal(x=0:5, lambda=2)
## [1] 0.0000 0.8647 0.9817 0.9975 0.9997 1.0000

Cumulative distribution function (CDF) - numerical solution

For most practical purposes, numerical solutions of one- or two-dimensional problems seem to be as good as the analytical ones; the only disadvantage is the extra computational demand. Here I use R's native function integrate() to calculate the exponential CDF numerically:

  my.pexp.numerical <- function(x, lambda)
  {
    # this my.int function allows my CDF to run over vectors
    my.int <- function(x, lambda) integrate(my.dexp, lambda=lambda, lower=0, upper=x)$value
    # apply the my.int to each element of x
    sapply(x, FUN=my.int, lambda)
  }
  my.pexp.numerical(x=0:5, lambda=2)
## [1] 0.0000 0.8647 0.9817 0.9975 0.9997 1.0000

And let's plot the numerical solution over the analytical one:

  curve(my.pexp.anal(x, lambda=2), from=0, to=2, 
        col="green", main="exponential CDF")
  curve(my.pexp.numerical(x, lambda=2), from=0, to=2, 
        lty=2, col="red", add=TRUE)
  legend("bottomright", lty=c(1,2), 
         legend=c("Analytical", "Numerical"),        
         col=c("green", "red"), lwd=2)

[Figure: the analytical and numerical exponential CDFs overlaid]

Practically identical.

Quantile function (QF) - analytical solution

Quantile functions are useful for two things: (1) assessing statistical significance, and (2) generating random numbers (see below). The quantile function ( P^{-} ) is the inverse of the cumulative distribution function: you have to find the value of  x for a given  P(x) , which is the reverse of what we usually do with functions. Finding an analytical solution may be quite tricky, and again, if you don't have a mathematician at your service, Sage can help.

In the case of our exponential distribution, the quantile function is:

 P^{-}(q) = \frac{-\ln(1-q)}{\lambda}, \qquad q \in [0,1)

In R:

  my.qexp.anal <- function(q, lambda) (-log(1-q))/lambda 
  my.qexp.anal(0.9, 2)
## [1] 1.151

Quantile function (QF) - numerical solution

The inverse may be hard to get for many CDFs, and hence one may need to go for a numerical solution. I found that the easiest way to solve for inverse of univariate CDFs in R is the uniroot() function. Alternatives are optimize() or optim(), but they can be unstable. In contrast, uniroot() is simple, quick, and stable.

Here is my numerical solution for the exponential QF:

  my.qexp.numerical <- function(q, lambda)
  { 
    # function to be solved for 0
    f <- function(P, fixed)
    {
      lambda <- fixed$lambda
      q <- fixed$q
      # this is the criterion that uniroot() will drive to zero:
      criterion <- q - my.pexp.numerical(P, lambda)
      return(criterion)
    }

    # for each element of vector P (the quantiles)
    P <- numeric(length(q))
    for(i in 1:length(q))
    {
      # parameters that stay fixed
      fixed <- list(lambda=lambda, q=q[i])
      # solving the f for 0 numerically by uniroot():
      root.p <- uniroot(f, 
                        lower=0, 
                        upper=100, # may require adjustment
                        fixed=fixed)
      P[i] <- root.p$root
    }
    return(P)
  }

  my.qexp.numerical(0.9, 2)
## [1] 1.151

Let's plot the numerical solution over the analytical one:

  q <- seq(0.01, 0.9, by=0.01)
  plot(q, my.qexp.anal(q, lambda=2), type="l", col="green", lwd=2)
  lines(q, my.qexp.numerical(q, lambda=2), col="red", lty=2)
  legend("bottomright", lty=c(1,2), 
         legend=c("Analytical", "Numerical"),        
         col=c("green", "red"), lwd=2)

[Figure: the analytical and numerical quantile functions overlaid]

Random number generator - the inverse transform sampling

Have you ever wondered how R generates random draws from its distributions? How does it convert uniformly distributed pseudo-random numbers into, e.g., normally distributed numbers? For some time I naively thought that there were some giant built-in tables that R uses. Then I found out that it's really simple.

The easiest way is the inverse transform sampling. It works like this: you generate a random number from  Uniform(0,1) and plug it into your quantile function (see above). That's it, and that's also why quantile functions can be really useful. Here is how you do it in R with the exponential distribution:

  my.rexp.inverse <- function(N, lambda)
  {
    U <- runif(N, min=0, max=1)
    rnd.draws <- my.qexp.anal(U, lambda)
    return(rnd.draws)
  }
  my.rexp.inverse(10, 2)
##  [1] 0.10087 0.56874 0.79258 0.01962 1.28166 0.39451 2.17646 0.48650
##  [9] 0.97453 0.33054

Here is a histogram of 10,000 random numbers generated by the inverse transform sampling. The solid smooth line is the exponential PDF from which the random numbers were drawn:

  hist(my.rexp.inverse(10000,2), freq=FALSE, breaks=20,
       xlab="x", main=NULL, col="grey")
  #hist(rexp(10000, rate=2), freq=FALSE, breaks=20)
  curve(dexp(x, rate=2), add=TRUE)

[Figure: histogram of the inverse-transform samples with the exponential PDF overlaid]
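
As an extra sanity check, we can compare quantiles of draws from my.rexp.inverse() against R's native rexp() with a simple Q-Q plot; if the generator works, the points should lie close to the 1:1 line:

  qqplot(my.rexp.inverse(10000, 2), rexp(10000, rate=2),
         xlab="my.rexp.inverse", ylab="rexp")
  abline(0, 1) # the 1:1 line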

Random number generator - rejection sampling

The advantage of rejection sampling is that you don't need the quantile function.
Notable texts on rejection sampling are provided by Jay Emerson, on the Quantitations blog, and on tsperry's blog. I will try to provide my own description and notation, which will hopefully complement those posts.

In short, you will need: (1) the so-called proposal probability density function, which I will call  f(x) , (2) the PDF from which you want to sample, here called  p(x) , and (3) a constant  m that satisfies  f(x) \times m > p(x) . In other words, the curve  f(x) \times m must lie entirely above the  p(x) curve, and ideally as close to  p(x) as possible (you will waste fewer samples). The  f(x) can be any PDF from which you are able to sample.

For my exponential example I will use a simple uniform proposal density  f(x) . The following figure illustrates my exponential  p(x) , my proposal  f(x) , and  f(x) \times m :

  x <- seq(0, 5, by=0.01)
  # my exponential PDF:
  p <- dexp(x, rate=2)
  # my PROPOSAL distribution:
  f <- dunif(x, min=0, max=5) # my proposal is uniform and hence
                              # I need to choose an arbitrary upper bound 5
  # the CONSTANT m
  m <- 11
  fm <- f*m

  plot(c(0,5), c(0,2.5), type="n", xlab="x", ylab="density")
  lines(x, fm, type="l", col="red")
  lines(x, f, col="red", lty=2)
  lines(x, p)

  legend("right", lty=c(1,2,1), 
         legend=c("f(x)*m", "f(x)", "p(x)"), 
         col=c("red", "red", "black"), lwd=2)