## Journal of Theoretical Biology vs. People magazine: controversy, scandal, and newborn gender ratios

When reading of the transgressions of other researchers, it is always tempting to smirk and glibly assume you would never have committed so egregious an error. That was precisely my reaction to the brouhaha and heated exchange of letters that followed a series of reports in the Journal of Theoretical Biology by Satoshi Kanazawa of the London School of Economics (after all, a well-respected institution, as is the journal in question, with an impact factor of 2.3). However, Andrew Gelman recently wrote an eloquent commentary on this controversy, with a useful discussion of the statistical interpretation of ‘small effects’, which makes me wonder seriously about how often this type of error arises (American Scientist, Jul-Aug 2009). I confess that my initial reaction was in part a response to the all-too-obvious publicity-mongering in Kanazawa’s series of articles, with pithy titles like “Big and tall parents have more sons” (2005) or the almost People-esque “Beautiful parents have more daughters” (2007), the latter followed by a book entitled “Why beautiful people have more daughters”.

Having duly disclosed my own prejudices, I must concede that Gelman raises some important questions about the interpretation of ‘small effects’, which even when real are likely to be on the order of 1%. Kanazawa’s study on the association between parental attractiveness and newborn gender is a case in point. Using data from the National Longitudinal Study of Adolescent Health for a group of 2972 parents, the study analyzed the relationship between the sex of respondents’ children and the interviewer’s subjective assessment of parental ‘beauty’ on a five-point scale. In their analysis, the authors chose to compare children born to parents in the highest attractiveness category (52% female) to those born to parents in the lower four categories (44% female), concluding that there was a statistically significant 8% difference.
This was cited by Kanazawa (an evolutionary psychologist) as evidence in favor of the Trivers-Willard hypothesis, a theoretical model which postulates that genetic traits benefiting one sex will lead to a larger number of that sex being born.

Most of us are bored to tears when statisticians discuss type I or type II errors, so let me emphasize that these are not at issue here. The errors in statistical reasoning which slipped past the reviewers in fact go by the less familiar names of type M (magnitude) and type S (sign) errors. They are, however, related to the more familiar type I and type II errors, so let me briefly refresh your memory. In almost any statistical comparison, we are interested in comparing an observed outcome with that expected under some particular null hypothesis (e.g. attractive and unattractive parents are equally likely to have daughters), to quantify the probability that the observed outcome could arise purely by random chance. With typical random variation, we know that a result roughly 2 standard errors from the true value will arise by chance in 5% of trials, and this is our usual cut-off (p=0.05) for declaring statistical significance (the standard error is just the standard deviation of the parameter being estimated, here the proportion of female births). It also means that we will incorrectly reject our null hypothesis 5% of the time simply because of random variation. This is the so-called type I error, which we all know and dread: the cut-off value (p=0.05) is just the probability of rejecting the null hypothesis when it is in fact true. In contrast, a type II error arises if we accept our null hypothesis when it is in fact false, which is bound up in the notion of statistical power (the probability of correctly rejecting the null hypothesis when it is false). Most of us are aware of power calculations, since we perform them routinely to determine the numbers required to observe an effect we wish to study, usually insisting on 90-95% power to detect an effect when present, which means that we are willing to accept a type II error of 5-10%.
As grant and journal reviewers, we often admonish each other on the subject of underpowered studies that risk missing real effects because of inadequate study numbers.
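The interplay between the significance cut-off and power is easy to see in a short simulation. The sketch below is purely illustrative (the 4.3% standard error comes from Gelman's analysis of the beauty study, and the 1% true effect is the plausible magnitude mentioned above, not an estimate from any data):

```python
import random

random.seed(1)

def simulate(true_diff, se, trials=100_000):
    """Fraction of trials where a normally distributed estimate
    (mean = true_diff, sd = se) falls beyond the +/-1.96*SE cut-off."""
    hits = sum(abs(random.gauss(true_diff, se)) > 1.96 * se
               for _ in range(trials))
    return hits / trials

# Under the null (no real difference), ~5% of estimates are "significant":
print(simulate(true_diff=0.0, se=0.043))   # the type I error rate, ~0.05

# With a small but real effect of 1%, power is dismal -- nowhere near
# the 90-95% we would insist on when designing a study:
print(simulate(true_diff=0.01, se=0.043))
```

With a 1% true effect and a 4.3% standard error, only a few percent of replications would ever reach significance.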

So what did Kanazawa do to provoke such ire? One difficulty is obvious on casual review. The decision to compare one category of parental attractiveness with the average of the other four is reasonable, but arbitrary. Many other comparisons could be done, each with a 5% chance of spurious conclusions. This raises the familiar multiple comparisons problem, which needs to be accounted for, since the overall probability of a spurious conclusion is now more than 5%. This is just old-fashioned type I error. But underpowered studies have another serious problem. With inadequate study numbers, the precision of the estimates will be poor, with larger standard errors (SE). For example, if we are measuring blood pressure in a group of school children, our sample mean has a standard error of σ/√n, where σ is the standard deviation of blood pressure in the population and n is the size of the sample. In general, the precision of any estimate will improve (i.e. smaller SE) as sample size increases. In designing studies, we usually accept a type I error risk of 5% (p=0.05), which means that estimates differ from true values by more than 2 SE about 5% of the time. If the standard errors are larger because the study was underpowered, our errors will be all the more glaring, which is a so-called type M (magnitude) error. Put simply, we will still be wrong 5% of the time, but our mistakes will be all the more embarrassing, since any result that makes it past the usual “statistical significance filter” (i.e. p=0.05 or ±2 SE) will be inflated by lack of power. In the case of the attractive parents, Gelman puts the standard error at 4.3%. Even if the null hypothesis is true (no effect), we will wrongly observe an effect larger than 8.6% about 5% of the time. Another way of expressing this danger is in terms of the 95% confidence interval on the estimate, which is here given by [-3.9, 13.3].
Since any real effect is likely to be small (1% or less), this study sheds little light on either its magnitude or sign, with negative or positive values both eminently plausible.
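The “statistical significance filter” can be made concrete with a small simulation. Assuming (as above) a real effect of 1% and Gelman's 4.3% standard error, we can replicate the study many times and look only at the replications that reach significance:

```python
import random

random.seed(42)

TRUE_EFFECT = 0.01   # a plausibly real effect of 1%, per the text
SE = 0.043           # Gelman's standard error for the beauty study

# Simulate many replications of the study, then keep only the
# estimates that pass the significance filter (|estimate| > 1.96*SE).
estimates = [random.gauss(TRUE_EFFECT, SE) for _ in range(200_000)]
significant = [e for e in estimates if abs(e) > 1.96 * SE]

mean_sig = sum(abs(e) for e in significant) / len(significant)
print(f"True effect: {TRUE_EFFECT:.3f}")
print(f"Mean |estimate| among significant results: {mean_sig:.3f}")
```

Any estimate surviving the filter must exceed 1.96 × 4.3% ≈ 8.4% in magnitude, so the “significant” results overstate a 1% true effect by roughly an order of magnitude: the type M error in action.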

The fact that we cannot even make reasonable inferences on the sign of the observed effect is known formally as a type S (sign) error. While perhaps obvious from consideration of the 95% confidence interval, it is worth mentioning, since some authors and textbooks do appear to believe that we are on safe ground if we ignore the magnitude of the estimate and interpret only its direction. As Gelman points out, for a small effect size (say 0.3%), any result making it through the significance filter has a 3% chance of being a significant positive effect and a 2% chance of being a significant negative effect. So there is still a 2/5 chance of getting the sign wrong even when the effect is real (type S error = 40%), and the direction of the estimate has provided almost no useful information. Even a true difference of 3% (which is not in keeping with the literature on the subject) will have a type S error rate of 24%. And due to the lack of power, there is still only a 10% chance of achieving statistical significance even with a real effect of this magnitude. For those who believe that Bayesian analysis will overcome paradoxes like this arising from classical statistical models, Gelman (a noted Bayesian analyst) demolishes that shibboleth, too. Regardless of the prior distribution chosen to encode previous knowledge on the subject, the errors differed little from those just outlined.
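Gelman's 3%/2% figures for the 0.3% effect can be reproduced directly from the normal distribution; the short calculation below uses only the numbers already quoted (0.3% true effect, 4.3% standard error):

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF, built from the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

SE = 0.043           # standard error from the beauty study
CUTOFF = 1.96 * SE   # the significance filter
TRUE_EFFECT = 0.003  # a small real effect of 0.3%

# Probability that a replication is significant in each direction:
p_pos = 1.0 - norm_cdf((CUTOFF - TRUE_EFFECT) / SE)   # ~3% (right sign)
p_neg = norm_cdf((-CUTOFF - TRUE_EFFECT) / SE)        # ~2% (wrong sign)

# Given a significant result, the chance its sign is wrong:
type_s = p_neg / (p_pos + p_neg)                      # ~40%
print(p_pos, p_neg, type_s)
```

The conditional probability of a wrong sign among significant results is about 0.02/(0.02 + 0.03) = 2/5, exactly the 40% type S error quoted above.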

While this example may appear trite, the article is worth reading and quite entertaining (his follow-up study on the sexes of children born to couples making the “most beautiful people” list from People magazine leaves me green with envy for not having thought of it first :). He rightly points out several important issues that we do not often consider. Firstly, it is true that our research efforts are focusing increasingly on effects of smaller magnitude, as an inevitable consequence of progress. However, while most of us are familiar with the phenomenon of large epidemiologic studies that generate results which are statistically but not biologically significant, we put nowhere near the same emphasis in our teaching on the dangers of underpowered studies, particularly the way in which our usual statistical-significance criteria become what Gelman characterizes as ‘a machine for exaggerating claims’. The media will (as here) inevitably distort such exaggerated claims, effectively undermining the credibility of the entire scientific enterprise in the public eye.

A. Sharma MD, FRCP(C), with comments from

C. Rodd MD, FRCP(C)

McGill University, Montreal


## Comments

## underpowered studies

A real-world example?

## underpowered studies - an example for the sceptical

Sorry for the delay in replying to your post. I hope you will forgive a nephrologist for citing a nephrologic example, but the July 2 NEJM had a good illustration. In it, Mauer et al. have made an important contribution to our understanding of the role of renin-angiotensin blockade in the treatment of diabetic microvascular complications (N Engl J Med 2009; 361: 40-51). They have also been candid in acknowledging the extent to which their study was underpowered, in part because it was not appreciated at the time the study was designed that patients selected on the basis of normoalbuminuria and normal GFR despite an average diabetes duration of 10-11 years might in fact be less susceptible to complications. To account for any ‘real effects’ on renal function that might have been missed (type II error), they take great care to calculate bounds on any differences that might have failed to achieve significance due to lack of power. Nevertheless, as discussed in the aforementioned review that appeared in the same week (Gelman and Weakliem, American Scientist, 97: 310-6), lack of power has other consequences that should be considered, particularly when the expected effects are small.

The so-called type M (magnitude) error is not particularly difficult to understand: the precision of any estimate is compromised by insufficient study numbers, resulting in inflated standard errors (SE). For example, if measuring blood pressure (BP), the SE of the sample mean is simply σ/√n, where σ is the BP standard deviation in the population and n the sample size. In general, the precision of the estimate will improve as n increases, with the SE going to zero for arbitrarily large n. The usual type I error rate of 5% (p=0.05) means that 5% of findings will be spurious; that is a given. However, in an underpowered study with inflated standard errors, any result that makes it past the usual ‘statistical-significance filter’ (i.e. falls more than 1.96 SE from the null value) will be magnified. Gelman and Weakliem quite rightly characterize this as ‘a machine for exaggerating claims’.
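The σ/√n behaviour is easy to verify empirically. A minimal sketch, using hypothetical blood-pressure values (a mean of 100 and a standard deviation of 12, chosen purely for illustration):

```python
import random
import statistics

random.seed(0)
MU, SIGMA = 100.0, 12.0   # hypothetical BP mean and SD -- illustrative only

results = {}
for n in (25, 100, 400):
    # Empirical SE: the spread of many sample means, each from n subjects
    means = [statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))
             for _ in range(2000)]
    results[n] = (statistics.stdev(means), SIGMA / n ** 0.5)
    print(f"n={n}: empirical SE {results[n][0]:.2f}, "
          f"theoretical sigma/sqrt(n) {results[n][1]:.2f}")
```

Quadrupling the sample size halves the standard error, so small studies pay a steep price in precision.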

In this instance, the authors define their benchmark as a 2-step progression in the retinopathy index, observed in 37.8% of those on placebo. The imprecision is reflected in the relatively wide 95% CI for the odds ratio for progression vs. placebo (0.14-0.85 for enalapril, p=0.02). By analogy with the bounds they calculated for differences that might have been missed, 0.85 is the bounding value for the benefits they describe, which means that the probability of progression on enalapril may be as high as 34.1%, which is not nearly as dramatic as the reported figure. The inherent imprecision might have been more readily apparent to the casual reader had the authors defined success in terms of ‘halting progression’, in which case the confidence interval for the odds ratio would be given by the reciprocals as 1.17-7.1, i.e. a lower bound barely greater than 1 with a great deal of uncertainty.
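The arithmetic behind that 34.1% bound is just a conversion from the placebo rate and the odds-ratio confidence bound back to a probability (all figures taken from the comment above):

```python
# Figures quoted above for the enalapril retinopathy comparison
p_placebo = 0.378          # 2-step progression on placebo
or_upper = 0.85            # upper 95% CI bound of the odds ratio
or_lower = 0.14            # lower 95% CI bound of the odds ratio

# Odds on placebo, scaled by the worst-case (upper) odds ratio,
# then converted back to a probability of progression on enalapril:
odds_placebo = p_placebo / (1 - p_placebo)
odds_enalapril = or_upper * odds_placebo
p_enalapril = odds_enalapril / (1 + odds_enalapril)
print(f"Progression on enalapril could be as high as {p_enalapril:.1%}")

# Recasting success as 'halting progression' simply inverts the OR,
# so the CI becomes the reciprocals (roughly 1.17-7.1):
ci = (1 / or_upper, 1 / or_lower)
print(f"Reciprocal CI: {ci[0]:.2f} to {ci[1]:.1f}")
```

Framed this way, the lower confidence bound sits just above 1, which makes the uncertainty much harder to overlook.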

Obviously, such caveats are not unique to this study. With progress, it is inevitable that we will be interested in smaller effects. As this happens, we will have to be more aware of type M and S errors and accord them the same care and attention now given to the danger of missed effects in underpowered studies.

Atul Sharma