Not all proportion data are binomial outcomes

It really is trivial: not every proportion is a frequency. Some quantities take values bounded between 0 and 1 and yet are neither probabilities nor frequencies. Why do I even bother to write this? Because some kinds of proportions should be treated as unbounded continuous variables and analyzed with the appropriate statistical machinery (e.g. assuming a normal error structure). This may not be entirely clear after reading the chapter on proportions in Michael Crawley's The R Book (2007) (Chapter 16: "Proportion data"), which focuses exclusively on proportions that are frequencies.

A proportion is a frequency when we count the binary outcomes of a Bernoulli-distributed random process (e.g. a coin toss). A frequentist can say that the proportion (or frequency) of heads in the total number of flips estimates the bias of the coin, or can directly link the long-run frequency to the probability that the coin lands heads. Coin tosses are a dull example, so here are other kinds of data in which proportions are frequencies and which follow the same distribution: percentage mortalities, infection rates of diseases, proportions of patients responding to treatments, sex ratios and so on (examples taken from Crawley, 2007).
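To make the link between frequency and probability concrete, here is a minimal simulation sketch. The function name `heads_frequency`, the bias of 0.3 and the seed are assumptions chosen purely for illustration.

```python
import random


def heads_frequency(bias, n_flips, seed=0):
    """Simulate n_flips Bernoulli trials and return the proportion of heads."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < bias for _ in range(n_flips))
    return heads / n_flips


# Hypothetical coin with P(heads) = 0.3: the observed frequency
# settles near the bias as the number of flips grows.
for n in (10, 1_000, 100_000):
    print(n, heads_frequency(0.3, n))
```

The count of heads here is binomial, which is why a binomial error structure is the natural choice for this kind of proportion.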

These data should be modeled assuming a binomial error structure, for example with logistic regression. Here is an example of such data (black dots; the data are artificial) and a fitted model (red line):

[Figure: prop_recovered]
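For readers who want to see what fitting such a model involves, here is a minimal sketch of logistic regression on grouped binomial counts, fitted by Newton-Raphson using only the Python standard library. It is not the code behind the figure; the doses, group sizes and recovery counts are invented, and `fit_logistic` is a hypothetical helper name. In practice one would use an established GLM routine.

```python
import math


def fit_logistic(x, successes, trials, n_iter=25):
    """Fit logit(p) = b0 + b1*x to grouped binomial counts by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, successes, trials):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            resid = yi - ni * p      # observed minus expected successes
            w = ni * p * (1.0 - p)   # binomial variance of the count
            g0 += resid
            g1 += resid * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Invented data: number of patients recovered out of 20 at each dose.
dose = [0, 1, 2, 3, 4, 5]
recovered = [2, 4, 8, 12, 16, 18]
n_patients = [20] * 6

b0, b1 = fit_logistic(dose, recovered, n_patients)
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * d))) for d in dose]
print(b0, b1)
print(fitted)
```

Note that the response is the pair (successes, trials), not the bare proportion: the group size carries information about how precisely each proportion is known.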

A proportion is not a frequency when we use it to standardize and relativize continuous data. For example, the length of a male leg makes up a lower proportion of total body height than the length of a female leg. Or: percentage weight gains or losses after a medical treatment. Or: the proportional decrease in the population of an endangered species resulting from the proportional destruction of an area of rain forest. And so on.

Interestingly, these proportions can sometimes have interpretable negative values (e.g. a negative percentage weight loss is a weight gain). Also, it is not as clear as in the previous case what error structure we should assume. I would guess that in most cases it would be the distribution of the original, non-proportional and "non-standardized" variable.

Here is an example of the proportional weight loss of patients (black dots; the data are artificial) after a drug treatment. In this case a normal linear regression model is fitted:

[Figure: prop_drugs]
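A minimal sketch of the corresponding fit, assuming a straight-line relationship and normal errors. Again, this is not the code behind the figure; the doses and proportional weight losses are invented, and `fit_line` is a hypothetical helper name. Note that one of the responses is negative (a weight gain), which a binomial model could not represent.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (normal error structure)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Invented data: drug dose and proportional weight loss per patient.
drug_dose = [0, 10, 20, 30, 40, 50]
prop_loss = [-0.03, 0.01, 0.04, 0.06, 0.11, 0.13]

intercept, slope = fit_line(drug_dose, prop_loss)
residuals = [y - (intercept + slope * x) for x, y in zip(drug_dose, prop_loss)]
print(intercept, slope)
```

Nothing in this model constrains the fitted values to lie between 0 and 1, which is exactly the point: the response is treated as an unbounded continuous variable.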

As I've said, it is quite trivial. However, do let me know if I am trivially mistaken here.

 

5 thoughts on “Not all proportion data are binomial outcomes”

  1. One alternative model for such bounded responses -- e.g., proportions or concentrations etc. -- is assuming a beta-distributed response. Such beta regressions also use a link function to the unit interval -- e.g., logit, probit, log-log etc. -- and can hence be almost linear around the center and non-linear in the tails. Furthermore, the beta-distribution can be skewed or almost symmetric and heteroskedasticity can be naturally incorporated. In R, the "betareg" package provides the model (see http://www.jstatsoft.org/v34/i02/ and http://www.jstatsoft.org/v48/i11/). Extensions to additive models are also provided in "gamlss" and "gamboostLSS".

  2. Perhaps you mean that one should carefully consider whether a random variable is discrete or continuous? The length of the leg as a proportion of total height is a continuous variable and even though it is expressed as a ratio, there is no frequency proportion involved. The probabilities associated with discrete random variables can be directly expressed as (limits of) proportions while those of continuous random variables cannot unless we are dealing with finite intervals. I disagree with you on the point that coin tossing is a "dull example". See Feller, "An Introduction to Probability Theory and Its Applications", vol. 1, Chs. III & VIII (1968 edition).

  3. Your examples of non-frequency proportions from standardizing are technically ratios, not proportions. Ratios can have very messy error distributions because they are functions of numerator variance, denominator variance, and covariance between num & denom. Unless the numerator variance >> denominator variance and covariance, Gaussian can be a very bad assumption.
    One can often avoid the problems with ratios as dependent variables by moving the denominator to the right side as a covariate. For counts of individuals under varying effort, instead of analyzing a density (count divided by effort), quadrat size or observation time becomes an offset on the right side of the formula with a forced coefficient==1, and then the count response might be Poisson (technically the offset is log(QuadSize) or log(ObsTime) as there's a log-link). You might still have overdispersion, but you don't have the ratio problem. Similarly, you might want to analyze leg length by including total length as a covariate on the right side (whether with a fixed coefficient, an estimated linear coefficient, or a smoothed GAM).
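The offset idea from the last comment can be sketched as follows: a Poisson regression in which log(effort) enters the linear predictor with its coefficient fixed at 1, fitted here by Newton-Raphson with the standard library only. The quadrat sizes, habitat indicator and counts are invented, and `fit_poisson_offset` is a hypothetical helper name; a real analysis would use an established GLM routine and check for overdispersion.

```python
import math


def fit_poisson_offset(x, counts, effort, n_iter=25):
    """Fit log E[count] = log(effort) + b0 + b1*x by Newton-Raphson."""
    b0 = math.log(sum(counts) / sum(effort))  # start at the pooled rate
    b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ei in zip(x, counts, effort):
            # The offset: effort multiplies the rate, i.e. log(effort)
            # is added to the linear predictor with coefficient 1.
            mu = ei * math.exp(b0 + b1 * xi)
            resid = yi - mu
            g0 += resid
            g1 += resid * xi
            h00 += mu
            h01 += mu * xi
            h11 += mu * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Invented data: counts of individuals in quadrats of varying size,
# in two habitats coded 0 and 1.
habitat = [0, 0, 0, 1, 1, 1]
quad_size = [1, 2, 4, 1, 2, 4]
count = [3, 5, 12, 6, 13, 23]

b0, b1 = fit_poisson_offset(habitat, count, quad_size)
print(math.exp(b0))  # estimated density per unit area in habitat 0
print(math.exp(b1))  # estimated density ratio, habitat 1 vs habitat 0
```

The response stays a count with a plausible error distribution, and the density is recovered from the coefficients rather than being analyzed directly as a ratio.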
