
It took me a while to come back ...

Re: "even spread of x-values": It may well be that I am conflating things, but I stick to my claims. As you write yourself, the scaling of variables (x and y) is a convention, not god-given. So transforming X is just rescaling to different units (e.g. from proton concentration to pH). The same goes for any other transformation (think Fahrenheit and Celsius, although those are linear transformations and will not change the shape of the distribution). So, if my X are (for whatever reason) not uniformly distributed, I can re-scale them to my convenience (e.g. by a square root).
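A minimal R sketch of that point (toy data of my own, not from the post): a square-root rescaling evens out a right-skewed predictor without adding or removing information, since the transformation is monotone.

```r
# Toy data (my own illustration): right-skewed "island areas"
set.seed(7)
area <- rlnorm(50, meanlog = 2, sdlog = 1)   # heavily skewed predictor
x1 <- area
x2 <- sqrt(area)                             # same information, different scale

# sample skewness drops markedly after the transformation
skew <- function(z) mean((z - mean(z))^3) / sd(z)^3
skew(x1)
skew(x2)
```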

And, as you know as well as I, sampling cannot guarantee uniformity (as my example with island areas was supposed to indicate). Would you rather ignore all islands in an archipelago for which you have y-values just to make sure that the areas are uniform? Of course you (and I) would not.

You are of course completely right that such transformations change the relationship (from additive to multiplicative).

Re: "clumping does increase weights of points outside": It sure does. An example may say more than 1000 words:

```r
x <- c(1:10, 42, 35)
y <- c(9, 9, 10, 7, 7, 7, 7, 6, 7, 5, 33, 40)

fm2 <- glm(y ~ log(x), family = poisson)
plot(x, y, las = 1, pch = 16, cex.lab = 1.5, log = "x")

influence.measures(fm2)
# or separately:
cooks.distance(fm2)
hatvalues(fm2)
```

Clearly the far-away points have a higher weight (e.g. hat value). That is all I meant. In a linear regression the line goes through the centroid (x̄, ȳ), and the further a data point lies from this centroid, the larger its leverage (sic!). Clumping moves the centroid towards the cluster, giving more weight to the out-of-cluster points.
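The leverage argument can be made explicit in the simple linear case (my own illustration, reusing the x and y from above): the hat value of point i is 1/n + (x_i − x̄)² / Σ(x_j − x̄)², so leverage grows with the squared distance from the mean of x.

```r
x <- c(1:10, 42, 35)          # clumped values plus two far-away points
y <- c(9, 9, 10, 7, 7, 7, 7, 6, 7, 5, 33, 40)
fm <- lm(y ~ x)

# hat values "by hand": 1/n plus the squared distance from the centre,
# scaled by the total sum of squares of x
h.manual <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

all.equal(unname(hatvalues(fm)), h.manual)   # TRUE
which.max(h.manual)                          # 11 - the farthest point, x = 42
```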

Re: "regression trees are not parametric": People seem to use the word "non-parametric" in (at least) two senses: 1. when the RESPONSE doesn't follow a specific distribution, and 2. when the model doesn't return "obvious" parameters. Trees do have parameters, so by my definition they could well be seen as parametric. Also, depending on which criterion you (not you really, but the algorithm) use to decide where to put the split, you can very well take a likelihood-based approach. If you think of CARTs with variance as the splitting criterion, you imply a normal distribution.
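To sketch that variance-criterion point (toy data of my own): a single regression-tree split chosen by minimizing the residual sum of squares, which is the Gaussian log-likelihood up to a constant.

```r
# Toy step function with noise; true threshold at x = 5
set.seed(3)
x <- sort(runif(40, 0, 10))
y <- ifelse(x < 5, 2, 8) + rnorm(40, sd = 0.5)

# candidate splits: every x value except the last (so both sides are non-empty);
# for each, the pooled residual sum of squares of the two resulting means
sse <- sapply(x[-length(x)], function(s) {
  left  <- y[x <= s]
  right <- y[x >  s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
})

best <- x[which.min(sse)]   # should land near the true threshold at 5
```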

Also, I think even you would call a threshold model a "model", and a CART is just a recursive threshold model. I share your unease with press-the-button-hey!-machine-learning approaches, but I would not go so far as to exclude them from the club of "models".

Thanks for writing your great blogs!

Carsten


Oh come on: if only the means differ among groups, while the variances (and other parameters) are the same (and there is enough data to estimate the within-group variance elsewhere; if all populations are of size one, you are doomed, of course), you do not need to simulate and sample at all. This is why we call these "parametric tests" in the frequentist world. 🙂
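For illustration (toy data of my own, not from the blog): with equal variances assumed, the two-group comparison needs no resampling at all. The pooled-variance t statistic and its reference distribution are known in closed form, and matching them against `t.test` shows why no simulation is required.

```r
# Two groups with a common variance (the "parametric" assumption)
set.seed(1)
g1 <- rnorm(20, mean = 10, sd = 2)
g2 <- rnorm(20, mean = 12, sd = 2)
n1 <- length(g1); n2 <- length(g2)

# pooled variance and the classic t statistic, "by hand"
sp2 <- ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)
t.stat <- (mean(g1) - mean(g2)) / sqrt(sp2 * (1/n1 + 1/n2))
p.val  <- 2 * pt(-abs(t.stat), df = n1 + n2 - 2)  # "look into the tables"

# identical to the built-in equal-variance test
all.equal(t.stat, unname(t.test(g1, g2, var.equal = TRUE)$statistic))
```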

With a good command of calculus, you can derive the probability of overlaps yourself; others (like me) just "look it up in the tables" (for the simple case). That was low-hanging fruit.

Challenge: with more than one point per population you can aim at estimating differences in, let us say, within-population kurtosis (e.g. stabilizing vs. disruptive selection). For such problems the tools in the frequentists' drawer are not that sharp (or not that well known in general), while (I suppose) it may be relatively straightforward under a Bayesian framework.

Bonus/malus point for the frequentist approach in the case from the blog: you do not get the shrinkage caused by the (wannabe-uninformative, yet still present) prior.

```r
summary(lm(snout.vent ~ as.factor(population), data = snakes))
```

```
Call:
lm(formula = snout.vent ~ as.factor(population), data = snakes)

Residuals:
   Min     1Q Median     3Q    Max
-6.334 -1.466  0.176  1.784  6.906

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)              57.100      3.114  18.338  < 2e-16 ***
as.factor(population)2  -15.814      3.266  -4.842 2.43e-05 ***
as.factor(population)3  -11.180      3.266  -3.423  0.00156 **
as.factor(population)4   -2.636      3.266  -0.807  0.42488
as.factor(population)5    1.949      3.266   0.597  0.55438
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.114 on 36 degrees of freedom
Multiple R-squared:  0.8507,  Adjusted R-squared:  0.8341
F-statistic: 51.29 on 4 and 36 DF,  p-value: 2.208e-14
```
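To sketch the shrinkage that the fixed-effects fit does not give you (simulated data of my own, not the snakes data above): a crude empirical-Bayes style partial pooling pulls each raw group mean towards the grand mean by a factor built from the variance components, roughly what a hierarchical model's prior does.

```r
# Simulated groups with known structure (my own toy data)
set.seed(42)
k <- 5; n <- 8
pop <- rep(1:k, each = n)
mu  <- rnorm(k, 50, 5)                    # true population means
y   <- rnorm(k * n, mu[pop], sd = 3)

m.group <- tapply(y, pop, mean)           # raw (fixed-effects) group means
grand   <- mean(y)

# method-of-moments variance components
s2.w <- mean(tapply(y, pop, var))         # within-group variance
s2.b <- max(var(m.group) - s2.w / n, 0)   # between-group variance

# shrinkage factor in [0, 1]: the smaller the between-group variance
# relative to the noise, the stronger the pull towards the grand mean
B <- s2.b / (s2.b + s2.w / n)
m.shrunk <- grand + B * (m.group - grand)

# every shrunken mean lies between its raw mean and the grand mean
cbind(raw = m.group, shrunk = m.shrunk)
```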