Monday, January 5, 2009

Voodoo Correlations in Social Neuroscience



The end of 2008 brought us the tabloid headline, Scan Scandal Hits Social Neuroscience. As initially reported by Mind Hacks, a new "bombshell of a paper" (Vul et al., 2009) questioned the implausibly high correlations observed in some fMRI studies in Social Neuroscience. A new look at the analytic methods revealed that over half of the sampled papers used faulty techniques to obtain their results.

Edward Vul, the first author, deserves a tremendous amount of credit (and a round of applause) for writing and publishing such a critical paper under his own name [unlike all those cowardly pseudonymous bloggers who shall go unnamed here]. He's a graduate student in Nancy Kanwisher's Lab at MIT. Dr. Kanwisher1 is best known for her work on the fusiform face area.

Credit (of course) is also due to the other authors of the paper (Christine Harris, Piotr Winkielman, and Harold Pashler), who are at the University of California, San Diego. So without further ado, let us begin.

A Puzzle: Remarkably High Correlations in Social Neuroscience

Vul et al. start with the observation that the new field of Social Neuroscience (or Social Cognitive Neuroscience) has garnered a great deal of attention and funding in its brief existence. Many high-profile neuroimaging articles have been published in Science, Nature, and Neuron, and have received widespread coverage in the popular press. However, all may not be rosy in paradise:2
Eisenberger, Lieberman, and Williams (2003), writing in Science, described a game they created to expose individuals to social rejection in the laboratory. The authors measured the brain activity in 13 individuals at the same time as the actual rejection took place, and later obtained a self-report measure of how much distress the subject had experienced. Distress was correlated at r=.88 with activity in the anterior cingulate cortex (ACC).

In another Science paper, Singer et al. (2004) found that the magnitude of differential activation within the ACC and left insula induced by an empathy-related manipulation was correlated between .52 and .72 with two scales of emotional empathy (the Empathic Concern Scale of Davis, and the Balanced Emotional Empathy Scale of Mehrabian).
Why is a correlation of r=.88 with 13 subjects considered "remarkably high"? For starters, it exceeds the ceiling imposed by the reliability of the hemodynamic and behavioral (social, emotional, personality) measurements:
The problem is this: It is a statistical fact... that the strength of the correlation observed between measures A and B reflects not only the strength of the relationship between the traits underlying A and B, but also the reliability of the measures of A and B.
Evidence from the existing literature suggests that the test-retest reliability of personality rating scales is .7-.8 at best, and that the reliability of the BOLD (Blood-Oxygen-Level Dependent) signal is no higher than .7. Even if the correlation between the underlying traits were [impossibly] perfect, the highest observable correlation would be sqrt(.8 * .7), or about .74.
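To make that arithmetic concrete, here is a minimal sketch of the attenuation ceiling in Python. The two reliability figures are the ones quoted above; everything else is illustrative:

```python
import math

# Classical attenuation logic: the correlation observable between two
# noisy measures A and B is
#     r_observed = r_true * sqrt(reliability_A * reliability_B)
# so even a perfect underlying relationship (r_true = 1.0) is capped
# by the reliabilities of the two measurements.

reliability_behavioral = 0.8  # best-case test-retest reliability, personality scales
reliability_bold = 0.7        # best-case reliability of the BOLD signal

ceiling = 1.0 * math.sqrt(reliability_behavioral * reliability_bold)
print(f"maximum observable correlation: {ceiling:.3f}")  # 0.748 -- the ~.74 ceiling
```

An observed r=.88 from 13 subjects therefore sits well above what the measurements could deliver even if the underlying traits were perfectly correlated.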

This observation prompted the authors to conduct a meta-analysis of the literature. They identified 54 papers that met their criteria for fMRI studies reporting correlations between the BOLD response in a particular brain region and some social/emotional/personality measure. In most cases, the Methods sections did not provide enough detail about the statistical procedures used to obtain these correlations. Therefore, a questionnaire was devised and sent to the corresponding authors of all 54 papers:
APPENDIX 1: fMRI Survey Question Text

Would you please be so kind as to answer a few very quick questions about the analysis that produced, i.e., the correlations on page XX. We expect this will just take you a minute or two at most.

To make this as quick as possible, we have framed these as multiple choice questions and listed the more common analysis procedures as options, but if you did something different, we'd be obliged if you would describe what you actually did.

The data plotted reflect the percent signal change or difference in parameter estimates (according to some contrast) of...

1. ...the average of a number of voxels.
2. ...one peak voxel that was most significant according to some functional measure.
3. ...something else?

etc.....

Thank you very much for giving us this information so that we can describe your study accurately in our review.
They received 51 replies. Did these authors suspect the final product could put some of their publications in such a negative light?

SpongeBob: What if Squidward’s right? What if the award is a phony? Does this mean my whole body of work is meaningless?

After providing a nice overview of fMRI analysis procedures (beginning on page 6 of the preprint), Vul et al. present the results of the survey, and then explain the problems associated with the use of non-independent analysis methods.
...23 [papers] reported a correlation between behavior and one peak voxel; 29 reported the mean of a number of voxels. ... Of the 45 studies that used functional constraints to choose voxels (either for averaging, or for finding the ‘peak’ voxel), 10 said they used functional measures defined within a given subject, 28 used the across-subject correlation to find voxels, and 7 did something else. All of the studies using functional constraints used the same data to select voxels, and then to measure the correlation. Notably, 54% of the surveyed studies selected voxels based on a correlation with the behavioral individual-differences measure, and then used those same data to compute a correlation within that subset of voxels.
Therefore, for these 28 papers, voxels were selected because they correlated highly with the behavioral measure of interest. Using simulations, Vul et al. demonstrate that this glaring "non-independence error" can produce significant correlations out of noise!
This analysis distorts the results by selecting noise exhibiting the effect being searched for, and any measures obtained from such a non-independent analysis are biased and untrustworthy (for a formal discussion see Vul & Kanwisher, in press, PDF).
And the problem is magnified in correlations that used activity in one peak voxel (out of a grand total of between 40,000 and 500,000 voxels in the entire brain) instead of a cluster of voxels that passed a statistical threshold. Papers that used non-independent analyses were much more likely to report implausibly high correlations, as illustrated in the figure below.
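To see the non-independence error in action, here is a minimal simulation in the spirit of the paper's argument. It is an illustrative sketch, not Vul et al.'s actual code: the subject count, voxel count, and selection threshold are assumptions chosen for demonstration. Every voxel is pure noise, yet selecting voxels by their correlation with behavior and then measuring the correlation on those same data yields an impressively high value, while an independent split-half analysis does not:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters -- not taken from any of the surveyed papers.
n_subjects = 16
n_voxels = 10_000   # a toy "brain" of pure-noise voxels
threshold = 0.5     # select voxels whose r with behavior exceeds this

# Pure noise: there is NO true brain-behavior relationship here.
behavior = rng.standard_normal(n_subjects)
voxels = rng.standard_normal((n_subjects, n_voxels))

def corr_with_behavior(data, scores):
    """Pearson r between each voxel (a column of data) and the scores."""
    zd = (data - data.mean(0)) / data.std(0)
    zs = (scores - scores.mean()) / scores.std()
    return zd.T @ zs / len(scores)

# Non-independent analysis: select voxels AND measure the correlation
# on the same data.
r_all = corr_with_behavior(voxels, behavior)
roi = voxels[:, r_all > threshold].mean(axis=1)
r_biased = np.corrcoef(roi, behavior)[0, 1]

# Peak-voxel version: report the single best voxel out of thousands.
r_peak = r_all.max()

# Independent analysis: select voxels on one half of the subjects,
# then measure the correlation on the held-out half.
half = n_subjects // 2
r_train = corr_with_behavior(voxels[:half], behavior[:half])
roi_test = voxels[half:, r_train > threshold].mean(axis=1)
r_independent = np.corrcoef(roi_test, behavior[half:])[0, 1]

print(f"non-independent ROI average: r = {r_biased:.2f}")       # spuriously high
print(f"non-independent peak voxel:  r = {r_peak:.2f}")         # also spuriously high
print(f"independent split-half:      r = {r_independent:.2f}")  # hovers near zero
```

This is the contrast the figure below draws across the surveyed papers: independent analyses avoid the bias, while non-independent ones inflate the reported correlations.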


Figure 5 (Vul et al., 2009). The histogram of the correlation values from the studies we surveyed, color-coded by whether or not the article used non-independent analyses. Correlations coded in green correspond to those that were achieved with independent analyses, avoiding the bias described in this paper. However, those in red correspond to the 54% of articles surveyed that reported conducting non-independent analyses – these correlation values are certain to be inflated. Entries in orange arise from papers whose authors chose not to respond to our survey.

Not so coincidentally, some of these same papers have been flagged (or flogged) in this very blog. The Neurocritic's very first post 2.94 yrs ago, Men are Torturers, Women are Nurturers..., complained about the overblown conclusions and misleading press coverage of a particular paper (Singer et al., 2006), as well as its methodology:
And don't get me started on their methodology -- a priori regions of interest (ROIs) for pain-related empathy in fronto-insular cortex and anterior cingulate cortex (like the relationship between those brain regions and "pain-related empathy" are well-established!) -- and on their pink-and-blue color-coded tables!
Not necessarily the most sophisticated deconstruction of analytic techniques, but it was the first... and it did question how the regions of interest were selected, and of course how the data were interpreted and presented in the press.
SUMMARY from The Neurocritic: Ummm, it's nice they can generalize from 16 male undergrads to the evolution of sex differences that are universally valid in all societies.

As you can tell, this one really bothers me...
And what are the conclusions of Vul et al.?
To sum up, then, we are led to conclude that a disturbingly large, and quite prominent, segment of social neuroscience research is using seriously defective research methods and producing a profusion of numbers that should not be believed.
Finally, they call upon the authors to re-analyze their data and correct the scientific record.



Footnotes

1 Kanwisher was elected to the prestigious National Academy of Sciences in 2005.

2 The authors note that the problems are probably not unique to neuroimaging papers in this particular subfield, however.

References

Eisenberger NI, Lieberman MD, Williams KD. (2003). Does rejection hurt? An FMRI study of social exclusion. Science 302:290-2.

Singer T, Seymour B, O'Doherty J, Kaube H, Dolan RJ, Frith CD. (2004). Empathy for pain involves the affective but not sensory components of pain. Science 303:1157-62.

Singer T, Seymour B, O'Doherty JP, Stephan KE, Dolan RJ, Frith CD. (2006). Empathic neural responses are modulated by the perceived fairness of others. Nature 439:466-9.

Vul E, Harris C, Winkielman P, Pashler H. (2009). Voodoo Correlations in Social Neuroscience. Perspectives on Psychological Science, in press. PDF

Vul E, Kanwisher N. (in press). Begging the question: The non-independence error in fMRI data analysis. To appear in Hanson S, Bunzl M (Eds.), Foundations and Philosophy for Neuroimaging. PDF
