This post is inspired by a question that Dylan Craven raised during my Bayesian stats course.
My aim here is to demonstrate that, in a Bayesian setting, one can make powerful inference about within-group ANOVA means even for groups with sample size n = 1, i.e. groups with only a single measurement. The main reason this is possible is the assumption of constant variance (σ²) among groups: groups with sufficiently large n increase the credibility of the σ estimate, which limits the credible interval of the mean μ in the group with n = 1. Conversely, when the assumption is relaxed and σ is allowed to vary among groups, groups with a sample size of 1 are no longer that useful.
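The intuition can be sketched with a back-of-the-envelope calculation (the numbers below are made up purely for illustration): if σ is well estimated from the data-rich groups and treated as roughly known, the uncertainty of a group mean scales as σ/√n, so even n = 1 yields a finite interval.

```r
sigma <- 3            # residual SD, assumed shared among groups (made-up value)
n     <- c(1, 5, 20)  # hypothetical group sample sizes
# approximate half-width of a 95% interval for a group mean when sigma is known
round(1.96 * sigma / sqrt(n), 2)
```

The n = 1 group still gets an interval of finite width (about 1.96σ), which is exactly the "borrowing of strength" exploited below.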
I will use modified artificially-generated data from the example in Marc Kéry's Introduction to WinBUGS for Ecologists, page 119 (Chapter 9 - ANOVA). The (artificial) data are supposed to describe snout-vent lengths in five populations of the Smooth snake (Coronella austriaca); I modified the data so that the first population is represented by only a single observation, as indicated by the arrow in the figure below.
Loading the data from the web:
snakes <- read.csv("http://www.petrkeil.com/wp-content/uploads/2017/02/snakes_lengths.csv")
Plotting the data:
par(mfrow=c(1,2), mai=c(0.8,0.8, 0.1, 0.1))
plot(snout.vent ~ population, data=snakes, ylab="Length [cm]")
arrows(x1=1.2, y1=58, x0=2, y0=61, col="red", lwd=2, angle=25, length=0.2)
boxplot(snout.vent ~ population, data=snakes, ylab="Length [cm]", xlab="population", col="grey")
First, I will model the data using a traditional fixed-effect ANOVA model. For snake i in population j the model can be written as:

y_ij ~ Normal(μ_j, σ²)

where μ_j is the mean of the j-th population (j = 1, ..., 5), and i identifies the individual measurements.
I am interested in a simple question: Is there a statistically significant difference between populations 1 and 2? Will I be able to infer the difference even when population 1 has a sample size of 1?
I will fit the model using MCMC in JAGS. Hence, I will prepare the data in a list format:
snake.data <- list(y=snakes$snout.vent,
x=snakes$population,
N=nrow(snakes),
N.pop=5)
Loading the library that enables R to communicate with JAGS:
library(R2jags)
JAGS Model definition:
cat("
model
{
# priors
sigma ~ dunif(0,100) # (you may want to use a more proper gamma prior)
tau <- 1/(sigma*sigma)
for(j in 1:N.pop)
{
mu[j] ~ dnorm(0, 0.0001)
}
# likelihood
for(i in 1:N)
{
y[i] ~ dnorm(mu[x[i]], tau)
}
# the difference between populations 1 and 2:
delta12 <- mu[1] - mu[2]
}
", file="fixed_anova.txt")
And here I fit the model:
model.fit.1 <- jags(data = snake.data,
model.file = "fixed_anova.txt",
parameters.to.save = c("mu", "delta12"),
n.chains = 3,
n.iter = 20000,
n.burnin = 10000,
DIC = FALSE)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 41
## Unobserved stochastic nodes: 6
## Total graph size: 107
##
## Initializing model
Plotting parameter estimates with mcmcplots:
library(mcmcplots)
plot(snakes$snout.vent ~ snakes$population, ylim=c(35, 65), col="grey",
xlab="Population", ylab="Body length")
caterplot(model.fit.1, parms="mu", horizontal=FALSE,
reorder=FALSE, add=TRUE, labels=FALSE, cex=2, col="red")
The red dots and bars show posterior densities of μ_j. Grey dots are the data.
So what is the difference between population 1 and 2?
denplot(model.fit.1, parms="delta12", style="plain", mar=c(5,5,2,1), main="",
xlab="Difference between means of populations 1 and 2",
ylab="Posterior density")
There is a clear non-zero difference (the posterior density does not overlap zero). So we can conclude that there is a statistically significant difference between populations 1 and 2, and we can conclude this despite having only n = 1 for population 1.
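Beyond eyeballing the density plot, the evidence can be summarized directly from the posterior draws. The sketch below assumes the model.fit.1 object produced by the jags() call above; BUGSoutput$sims.list is where R2jags stores the monitored draws:

```r
# posterior draws of the difference between populations 1 and 2
delta12 <- model.fit.1$BUGSoutput$sims.list$delta12

quantile(delta12, probs = c(0.025, 0.975))  # 95% credible interval
mean(delta12 > 0)                           # posterior probability of a positive difference
```

If the credible interval excludes zero, that matches the visual conclusion from the density plot.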
We can have a look at what happens when the assumption of constant σ is relaxed. In other words, now every population has its own σ_j, and the model is:

y_ij ~ Normal(μ_j, σ_j²)

Here is the JAGS definition of such a model:
cat("
model
{
# priors
for(j in 1:N.pop)
{
mu[j] ~ dnorm(0, 0.0001)T(0,) # Note that I truncate the priors here
sigma[j] ~ dunif(0,100) # again, you may want to use proper gamma prior here
tau[j] <- 1/(sigma[j]*sigma[j])
}
# likelihood
for(i in 1:N)
{
y[i] ~ dnorm(mu[x[i]], tau[x[i]])
}
# the difference between populations 1 and 2:
delta12 <- mu[1] - mu[2]
}
", file="fixed_anova_relaxed.txt")
Let’s fit the model:
model.fit.2 <- jags(data = snake.data,
model.file = "fixed_anova_relaxed.txt",
parameters.to.save = c("mu", "delta12"),
n.chains = 3,
n.iter = 20000,
n.burnin = 10000,
DIC = FALSE)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 41
## Unobserved stochastic nodes: 10
## Total graph size: 140
##
## Initializing model
Here are the parameter estimates plotted with mcmcplots:
library(mcmcplots)
plot(snakes$snout.vent ~ snakes$population, ylim=c(0, 100), col="grey",
xlab="Population", ylab="Body length")
caterplot(model.fit.2, parms="mu", horizontal=FALSE, reorder=FALSE, add=TRUE,
labels=FALSE, cex=2, col="red")
Clearly, the posterior density of μ_1 is essentially the prior. In other words, the single observation has little power on its own to override the prior.
We can plot the posterior density of the difference:
denplot(model.fit.2, parms="delta12", style="plain", mar=c(5,5,2,1), main="",
xlab="Difference between means of populations 1 and 2",
ylab="Posterior density")
And we see that it is all over the place: huge uncertainty, and no clear inference is possible.
I have shown that one can make powerful inference about population means even if the sample size for a population is very small (e.g. n = 1). This is possible when all of the following conditions are met:
Bayesian approach to ANOVA must be used.
We must have sufficient data from other populations.
We must be confident that the assumption of constant variance among the groups is justified.
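The same borrowing of strength can also be illustrated outside of JAGS, with a quick simulation and a classical fixed-effect ANOVA fit in which the pooled residual SD serves all groups. This is only a sketch with made-up parameter values, not part of the snake analysis above:

```r
set.seed(42)
mu <- c(45, 55, 50, 52, 48)   # made-up population means
n  <- c(1, 10, 10, 10, 10)    # group 1 has only a single observation
g  <- factor(rep(1:5, times = n))
y  <- rnorm(sum(n), mean = rep(mu, times = n), sd = 3)

fit <- lm(y ~ g - 1)          # one mean per group, shared residual variance
confint(fit)["g1", ]          # a finite interval for group 1, despite n = 1
```

The interval for group 1 is wider than for the well-sampled groups, but it is finite, because the residual SD is estimated from all 41 observations.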
The course is full. Here is the syllabus with instructions. For the complete raw code (Markdown and R) and materials, see the course's GitHub repository.
Introduction: Course contents, pros and cons of Bayes, necessary skills.
Normal distribution: Introducing likelihood on the Normal example.
Poisson distribution: Likelihood maximization. Probability mass function. AIC and deviance.
The Bayesian way and the principle of MCMC sampling.
Simplest model in JAGS: Estimating lambda parameter of a simple Poisson model.
Bayesian resources: Overview of software, books and on-line resources.
t-test: First model with 'effects', hypotheses testing, derived variables.
Linear regression part 1: Traditional GLM, MLE, and manual estimation.
Linear regression part 2: Bayesian version in JAGS, credible and prediction intervals.
ANOVA part 1: Model definition.
ANOVA part 2: Fixed vs. random-effect. Effect of small sample, shrinkage.
Site occupancy model: logistic regression, imperfect observation, latent variables.
Useful probability distributions (binomial, beta, gamma, multivariate normal, negative binomial).
Publishing Bayesian papers.
Concluding remarks.
My take:
Things get trickier when statements get specific. Example: "Our planet has warmed up during the last 100 years." Is this a fact? The first complication is the multitude of meanings that hide behind such a simple statement. How exactly do we measure global temperature? Where? And how do we aggregate it from local measurements to the global scale? Different approaches will give different quantitative results. Second, measurements are not always precise. Yet even if I account for the multitude of meanings and the measurement errors, given all I've read so far, I'd assign the statement a probability above 0.999: this is my subjective confidence that the planet has warmed up. It is quite high, so perhaps I might be tempted to speak about a fact. On the other hand, it is less of a fact than the previous statements. Also, this fact is politically irrelevant.
How about a more relevant one: "Humans have caused global warming by excessive exploitation of fossil fuels, and the subsequent increased concentration of greenhouse gases in the atmosphere." There are surely many assumptions, definitions, interpretations and imprecisions that will influence my confidence in this statement. Given what I know (I am an ecologist, not a climate expert), I would give the statement a probability of 0.95. Now is this a fact? Would 0.85 also be a fact? Personally, I'd say that labelling this statement as a fact is unnecessarily radical.
Finally, here is something from the core of the current political debate: "If all countries dramatically reduce greenhouse gas emissions right now, climate change will slow down." And here I have doubts, since this requires forecasting, and forecasts tend to be messy. I lean towards thinking that the statement is somewhat reasonable, so I give it a probability of 0.7, yet I would definitely not label it a fact. If I were a politician, I would probably be willing to bet my career on it, but partly because I think that career choices are always a bit of a gamble, not because of a strong conviction (plus, I think that switching to clean energy could be fun since it would piss off Putin).
To summarize, science does not exactly need facts; there are always better words to use. And in some cases science should actually avoid facts as something unhelpful or exaggerated. The language of probability is what we should use instead.
However, one argument for working with facts is that the general public doesn't care about how academics think – facts are a useful simplification for the masses, executives, and politicians. You scientists can do your smarty-pants gibberish, but you must also give us some hard facts that we can work with.
Thoughts on that: First, asking scientists to provide facts just shifts the responsibility for decisions from executives and politicians to scientists. Second, I doubt that labelling arguments as scientific facts makes them more persuasive – a discussion where one side claims to own the facts is, unfortunately, prone to end up as a fight. It is better to persuade with logos, ethos, and pathos, rather than with labels. Third, demagogues, ideologues and populists have their alternative facts; a considerable part of the global population is willing to kill for all kinds of random bullshit, that is how certain they are about their facts. Feeding them stuff labelled as scientific facts is like pouring gasoline on a fire.
I suggest a solution, although a time-consuming one: we need to serve the scientific method, critical thinking, and probability theory to the masses in smaller doses, from an early age. If pupils are able to get algebra and reading, they are able to get probability. If high school students are able to read Shakespeare, they are able to understand what a peer-reviewed article is. The scientific method must be put on the same level as scientific findings, languages, history, math or literature. Scientists must do more to popularize not only their results (or facts), but also the way they work and think. It is the only way the general public, and subsequently the politicians and decision makers, will ever take us seriously.
PS (27/2/2017): In response to this post, Andy Gonzales pointed me to this article in The New Yorker.
PPS (27/2/2017): Uncertainty does not end with probability. There can actually be uncertainty about the probability itself -- we can put credible intervals around a probability: I can say that the probability of something is 0.8, plus or minus 0.1.
Experimental macroecology needs better justification
One entire morning was dedicated to experimental macroecology. Presented were results from small-grain manipulative experiments, sometimes replicated over large extents, sometimes not. However, it all felt like a session at a regular ecological (or botanical) meeting: I missed connections to large-scale biogeography and to the currently unresolved issues in macroecology.
One of the justifications for experimental macroecology was that it directly tests mechanisms, which can improve macroecological predictions. However, I haven’t seen a plausible demonstration that this really works, and I remain skeptical – I still believe that macroecological patterns emerge as a result of a whole array of idiosyncrasies, and I find it more useful to study large-scale (and especially large-grain) behavior of ecological systems using state variables and statistical models, rather than using reductionism and experiments.
I am not saying that experimental macroecology is bullshit. I just think that it needs to more directly demonstrate (theoretically and empirically) that it can actually deliver.
Dynamic macroecology takes off, but it needs better terminology
Another major block of talks involved dynamic macroecology. John Harte announced a new dynamic version of maximum entropy theory of ecology, there were talks on both present and deep time extinctions, on invasions, and on local-scale community dynamics. There was a lot of emphasis on processes and mechanisms.
In his earlier talk Brian McGill tried to clarify the role of grain and extent in experimental macroecology. This distinction is absolutely crucial for the progress of the field, and we need to get this right in order to understand each other. I suggest that similar service should be done to the terms process and mechanism, which are key for dynamic macroecology. We should learn the exact difference between processes and mechanisms; plus, when thinking along the temporal dimension, we should learn how processes and mechanisms link to concepts such as pattern, event, causality, correlation, and drivers.
There is always excitement when temporal macroecological data are presented – such data are rare and there is the promise that they will get us close to processes and mechanisms, which ecologists consider more important than patterns or correlations. But are they really?
To me, a process is, quite simply, a temporal change of any kind. A process can be deterministic or stochastic; it can involve causal chains, but it does not have to. Moreover, having a process captured at a single location does not necessarily enable us to map it in space (similarly, a spatial pattern may indicate little about the temporal process behind it). Hence, I don't see processes as more important or fundamental than, for example, static spatial gradients. They seem similarly important. I also don't see much added value in process-based models over statistical models.
This contrasts with mechanisms (“machines”), which somehow invoke the notion of causality – in a machine you push a piston, it turns the crankshaft, which turns the wheel, action, reaction. In ecology, a population hits carrying capacity, this reduces individual fecundity and increases mortality, and population growth slows down. Hence, mechanisms seem to have an added value over processes: mechanisms always involve causes and forces.
Yet even mechanisms are potentially treacherous: one person’s mechanism is another person’s correlation, or pattern. For example, macroecologists may perceive a tendency of a species to maintain viable populations only in a given temperature as a mechanism driving species' distribution, but for physiologists it’s a correlation (between temperature and population viability).
All in all, it’s a bit of a mess, and a clarification is due.
Acknowledgements: Some of the ideas presented here are based on discussions with David Currie, and with members of Center for Theoretical Study in Prague.
In the attic of my grandmother's house there is a box with my old personal stuff. Inside there is a smaller box with a bunch of 3.5 floppy disks. One of them has a fading handwritten label "Bachelor thesis - data".
To get to the data I could use the 3.5 drive in a 90's minitower resting in my wife's parents' cellar. The only problem is that it won't boot. Even if I fix that, will the 3.5 disk still be readable? Further, if my children, 20 years into the future, wanted to have a look at the data, will they even know what a minitower is? For them the 3.5 floppy disk will perhaps be as impenetrable as the Egyptian hieroglyphs were before the discovery of the Rosetta Stone.
I guess that almost all science that has ever been stored on floppy disks, magnetic tapes, or punched cards, is now practically inaccessible.
In this context, Jeremy Fox asks:
… in what concrete ways do you feel worse off because you can’t access the data that past ecologists stored in a now-unreadable format on 5.25″ inch floppies? Do you ever have occasion to curse their lack of foresight?
Well, I don’t feel worse off, and I don’t curse past ecologists for using floppies. I am ok today because most of the 20th century’s science was actually printed, and hard copies have one advantage over binary data: they are readable without any special device – humans can access them directly. I can still get Taylor & Woiwod’s 1982 estimates of mean-variance scaling of insect populations from their grand table printed over several pages of the Journal of Animal Ecology.
Today we consider electronic data to be somehow more secure and accessible than in the floppy nineties. We are so confident about the solidity of the cloud that we have been moving the entire scientific infrastructure to it. Countless papers are published, heaps of data are deposited online, new journals emerge, and it all exists solely as electromagnetic bits that degrade over time, or on flash drives that degrade over writing cycles. And in order to read the stuff, we always need a rather sophisticated device: a personal computer with up-to-date software.
But keeping software up-to-date can be a problem when hardware ages. Operating systems change, and there is no guarantee that our current hardware will be operable in 50 years. Example: proprietary operating systems (such as MS Windows) tend to give up on support for old hardware and do not enable downgrading to older versions. Further, file formats evolve. There was no .xlsx when I was a kid; now everybody uses it and nobody cares about .dbf. Although this does not seem like an issue now (we can still open .dbf with Excel and other programs), I can imagine that Excel2050 will drop .dbf support because it will be of little commercial value, or just to make things simpler.
In the past, libraries received hard copies of the journals that they subscribed to, creating a globally distributed hard-copy backup of scientific knowledge, with no restriction on future use. The stuff was accessible even when the subscription ended. In contrast, the current trend is towards online access only; libraries pay for temporary access to remote repositories on publishers’ servers. When an institution can no longer afford access to Elsevier’s journals, everything is all of a sudden inaccessible.
It bothers me that nobody seems to care about what happens, in the long run, to all the electronic science produced today. Will my grandchildren be able to open the data that have been uploaded to the clouds? Will the clouds still be there? Will people have access to Wiley's online library or to JSTOR? Will these companies still exist? Will future generations be able to read .pdf or .xlsx formats? Will they, in a hypothetical post-nuclear (post-global warming, post-biodiversity loss) future have computers at all? And how about grandchildren of our grandchildren?
Empires fall, climate changes, catastrophes happen. That is the historical experience. Americans are just about to vote on whether to give access to their nuclear arsenal to a maniac. And the saying goes: just because you’re paranoid doesn’t mean they aren’t after you. Sometimes I even imagine a planet-of-the-apes future where a laptop is found, covered with moss, mysterious, with whole libraries stored (but probably corrupted and fragmented) on a tiny little object inside, invisible and inaccessible.
Maybe I am exaggerating this. Jeremy Fox writes:
The “needs” and “wants” of long-distant future people aren’t well-defined today. What they “need”, and “want”, and what they do to meet those needs and wants, will depend on what we do today – but in complex and unpredictable ways. At the timescales you’re talking about, there’s nothing but Rosetta Stones. Nothing but sheer luck.
Perhaps. We can’t foresee the distant future. But I still believe that we can make some relatively small and easy precautionary steps, which can (with a bit of luck) help our grandchildren to access our science.
I suggest:
Motivation for this post came from an exchange under a post by Brian McGill on Dynamic Ecology.
License: This is a public domain work. Feel free to do absolutely whatever you want with the code or the images, there are no restrictions on the use.
Figure 1A:
Figure 1B:
Figure 2A:
Figure 2B:
I haven’t found a simple ‘one-liner’ that’d do such plots in R. In fact, I have always found R’s treatment of logarithmic axes a bit dull - I want the fancy gridlines!
loglog.plot
To provide the log-linear gridlines and tickmarks, I have written the function loglog.plot.
To load the function from my GitHub repository:
source("https://raw.githubusercontent.com/petrkeil/Blog/master/2016_07_05_Log_scales/loglogplot.r")
Arguments:
xlim, ylim - numeric vectors of length 2, giving the x and y coordinate ranges on the linear scale.
xlog, ylog - logical values indicating whether the x and y axes should be logarithmic (TRUE) or linear (FALSE). If the linear scale is chosen, no gridlines are drawn.
xbase, ybase - base of the logarithm of the respective axis. Ignored if a linear axis is specified.
... - further arguments passed to the generic R function plot.
Value:
An empty R base graphics plot, ready to be populated using lines, points and the like.
Here I plot three power functions: one sub-linear (exponent = 0.8), one linear (exponent = 1) and one supra-linear (exponent = 1.2).
par(mfrow=c(1,2))
x <- seq(1, 1000, by=10)
# left panel - both axes linear
plot(x, x, ylim=c(0,4000))
points(x, x^0.8, col="blue")
points(x, x^1.2, col="red")
# right panel - loglog plot
loglog.plot(xlab="x", ylab="y", ylim=c(1, 10000))
points(log10(x), log10(x))
points(log10(x), 0.8*log10(x), col="blue")
points(log10(x), 1.2*log10(x), col="red")
In this example I plot a lognormal probability density function, and I only plot the tickmarks and gridlines along the x-axis. The y-axis is linear.
par(mfrow=c(1,2))
x <- 1:1000
# left panel - linear
plot(x, dlnorm(x, meanlog=4), ylim=c(0, 0.012),
col="red", ylab="probability density")
# right panel - loglog plot
loglog.plot(ylog=FALSE, ylim=c(0,0.012), xlim=c(0.1, 1000),
            xlab="x", ylab="probability density")
points(log10(x), dlnorm(x, meanlog=4), col="red")
The course is full.
We are organizing a 3 day intensive course on open-source GIS high-performance analytical methods, with Giuseppe Amatulli (Yale University) as the main teacher, and Petr Keil (iDiv) as a teaching assistant. Date and place: 29 June - 1 July 2016, 'Red Queen' room, iDiv, Leipzig, Germany.
Over the past decade there has been an explosion in the availability of geographic data for environmental research, for both static and temporal analyses. Examples are remotely sensed data or large biodiversity databases. We are now able to tackle key ecological and environmental questions with unprecedented rigor and generality. Leveraging these data streams requires new tools and increasingly complex workflows. The course introduces a set of free and open-source software (BASH, AWK, GDAL, GRASS, R, Python, PKTOOLS, OFGT) for performing spatio-temporal analyses of big environmental data in a Linux environment. We also introduce multi-core, cloud and cluster computation procedures.
See website of a sister course for more information.
Max 15
The course will take place in the 'Red Queen' room at iDiv (Deutscher Platz 5e, Leipzig, Germany).
The price is 350 EUR per participant, and it covers: 20 hours of class training, student supervision, course auditing, course material, lectures, sample data and a USB flash drive inclusive of Linux Virtual Machine with all the software installed.
The price does not cover food and drinks -- the reason is that we prefer to offer a cheaper course with less catering, rather than a fully catered but more expensive one (we'd have to raise the price to ~400 EUR to have catering).
A quote Hutchinson wrote in 1975 about the importance of natural history may shed some light on why I think that, more fundamental than statistics, any good quantitative ecologist needs to be a good natural historian first. The quote is as relevant today as (or more so than) when he wrote it 40 years ago:
Modern biological education, however, may let us down as ecologists if it does not insist…that a wide and quite deep understanding of organisms, past and present, is as basic a requirement as anything else in ecological education. It may be best self-taught, but how often is this difficult process made harder by a misplaced emphasis on a quite specious modernity. Robert MacArthur really knew his warblers.
I replied that dismissing statistics as specious modernity is unjust. Yes, there are bad statistical applications out there, and statistics can be used to conceal, divert attention, exaggerate, or just make things too complex, but that is a problem of bad statisticians, not an inherent flaw of statistics.
However, neither of us (including Hutchinson) provided good arguments for why natural history or statistics should really be that fundamental. We haven’t continued the conversation, but the issue kept my brain busy nevertheless. Now I have this to say:
Any understanding of organisms begins with observation of nature or experimentation with it, or both. The observations and experiments then reveal rules (patterns, trends, laws) and exceptions to the rules, we are surprised or bored, we learn, we do more observations and experiments, we are surprised or bored, we learn, and so on. Children do it. People who do this professionally are scientists.
Now, in this process, where exactly is natural history and where is statistics? I’d say that natural history, the deep understanding of nature, is the pile of rules and exceptions that one accumulates during the cycle of observation, experimentation and learning. I’d also say that statistics is the way to identify the rules and exceptions among all the observations, and it does not really matter whether this happens in a child’s brain, or in R.
From this perspective, it seems that both Jon and I were kind of right, because both statistics and natural history are really important; however, their relative importance is perhaps a matter of taste. But are they also fundamental (= foundational)? Or is there something even more fundamental that keeps them both alive? Some infinite and renewable source of energy driving the cycle of observation, experimentation and learning?
I propose that the fundamental thing that keeps both natural history and statistics alive is curiosity. Simple curiosity. Maybe curiosity is the most fundamental skill that we, as scientists, should really profess and master (with a pinch of imagination and critical thinking). Deep knowledge and the ability to separate rules from exceptions will then come as a byproduct.
PS: In a recent post, Brian McGill argues that we are scientists because we count. I think that it mixes nicely with my proposition that we are scientists because we are curious.
PPS: Yesterday, Jon added that he did not want to question the relevance of statistics – rather, he has been worried about the lack of hypotheses and actual thinking in many of the modern statistical analyses. I partly second that worry. However, I’d also be cautious here: hypothesis-free exploratory analysis does have its merits, if done right.
The course is full.
DAY 1
DAY 2
DAY 3
RELATED TOPICS AND POSTS