A recent article published in the Journal of Food Science has generated considerable buzz online in the various whisky forums, due to how it has been characterized in the popular press. Plenty of websites like Tech Times and e-Science News have picked up the story, often with inflammatory headlines (e.g., “Bourbon or rye? You can’t tell the difference”). Even mainstream media has picked up on the action, including Fox News in the US and the Daily Mail in the UK.
If you read the enthusiast commentary out there, you will find much indignation at those headline statements. But is that really what the article shows? Here is a link to the abstract of the article by Jake Lahne et al: Replication Improves Sorting-Task Results Analyzed by DISTATIS in a Consumer Study of American Bourbon and Rye Whiskeys (J Food Sci. 2016 Apr 18. doi: 10.1111/1750-3841.13301)
As you can probably tell from the article title, this study is not going to be a detailed analysis of bourbon flavour. If you peruse the abstract, you will see that this is really a scientific analysis to compare how a new statistical method for analysis of sorted study data performs against an older method. It also introduces a new variable of subject scoring replication, to see how that affects the results.
Unfortunately, some over-reaching comments have been made about this article, so I thought it would be a good idea to dissect out what conclusions you can actually draw about American bourbons and ryes from this analysis.
I have a copy of the full article, and have reviewed the methodology in some detail. I find it a generally well-described exploration of a new statistical method. But it allows you to draw almost no inferences about the ability to discriminate bourbons and ryes. The main problems boil down to the reference set of whiskies chosen, who scored them, and how.
Before getting started, I should point out that personal bias is hard to account for here. Many enthusiasts believe they have great power to detect differences between whiskies. But the history of blind sensory sorting studies tells us that we commonly overestimate our own abilities in this regard.
On the one hand, whisky enthusiasts are likely to approach any such reported study with a pre-conceived bias, looking for flaws in the design or conclusions that support their existing world view. But equally of concern, designers of such studies could similarly choose to design or analyze their results in such a way as to support a pre-existing bias on their own part (namely, that people over-estimate their ability to differentiate). The bias knife cuts both ways.
My goal here is to fairly and objectively review the design and analysis of this particular study, to see if there are any obvious sources of concern, and whether the authors’ conclusions are evidence-based and limited to the analysis findings.
How to Classify Whisky (or Anything Else)
As explained on this site, the “gold standard” for sorting sensory input into discrete groups first starts with descriptive labels assigned by expert reviewers, based on an underlying physicochemical basis, scored for an exhaustive sample collection (see my Early Flavour Classifications page for more info). This is followed by a statistically-valid cluster analysis, to group the intensity of these distinct characteristics into an appropriate number of clusters. Finally, a principal component analysis allows you to determine which dimensions of the cluster analysis are key to discriminating the core characteristics of the group, in a statistically meaningful way. For these last two points, see my Modern Whisky Map page for more info.
While the above has been done for single malt-style whiskies (described on those pages above), I am not aware of such a comprehensive analysis being done for American Bourbon/Rye whiskies. And that is certainly not what this article by Lahne and colleagues sets out to do.
The Lahne Study Design
This paper uses a “short-cut” method – a very small sample of whiskies, sorted by a very small panel (not identified for expertise), asked to simply free-sort (i.e., apply whatever characterization they want, without any descriptive features). This does not compare to the first step described above.
The reason for this is that they are really only seeking to validate a novel cluster and dimensional analysis method, and NOT provide a definitive answer to the issue of bourbon/rye classification. In other words, they are validating a process for doing the last two steps above, not the first.
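To make the free-sorting task concrete, here is a minimal sketch of how such data is commonly tabulated: each participant's piles are turned into pairwise "sorted together" counts, which then become dissimilarities. The participants, piles and single-letter sample labels are invented for illustration and are not data from the paper, and the DISTATIS analysis itself is not reproduced here.

```python
from itertools import combinations
from collections import Counter

# Hypothetical free-sort results: each participant makes any number of piles.
# These piles are invented for illustration, not taken from the study.
sorts = [
    [{"A", "B"}, {"C", "D"}],      # participant 1: two piles
    [{"A", "B", "C"}, {"D"}],      # participant 2: two piles
    [{"A"}, {"B"}, {"C", "D"}],    # participant 3: three piles
]

def co_occurrence(sorts):
    """Count how many participants placed each pair in the same pile."""
    counts = Counter()
    for piles in sorts:
        for pile in piles:
            for pair in combinations(sorted(pile), 2):
                counts[pair] += 1
    return counts

counts = co_occurrence(sorts)
n_participants = len(sorts)

# All sample labels that appear in any pile, in a stable order.
items = sorted(set().union(*(pile for piles in sorts for pile in piles)))

# Dissimilarity: share of participants who did NOT group the pair together.
distance = {
    pair: 1 - counts[pair] / n_participants
    for pair in combinations(items, 2)
}
```

Note that nothing in this tabulation knows which samples are bourbons and which are ryes. Any such grouping has to emerge from the participants' own piles, which is why the choice of samples and sorters matters so much.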
Here are the top-line reasons why you should not get too worked up about this article in terms of the ability to discriminate ryes from bourbons:
- Participants were not asked to separate bourbons from ryes, but rather to free sort into whatever number and type of groupings they felt like
- Participants did not necessarily have any experience with whisky (selected only for being “nonrejectors of whiskey by aroma”).
- Participants were drawn from a University campus environment, with a mix of students, staff and faculty. Note the mean age was 42, but the median age was 31. When combined with the standard deviation of 19 yrs, this is a real tip-off as to the spread of age and likely experience with whisky.
- Consistent with Scotch panel reviewing norms, participants only smelled the whiskies (no tasting was performed).
- Similarly, whiskies were diluted 1:1 with distilled water, to limit and mask the effects of high alcohol content (i.e., presented only at 20-25% ABV for smelling)
- A very limited number of whiskies were used – only 5 bourbons and 5 ryes – without explicit consideration of the rye content in their mashbills (I will come back to this point of whisky selection in more detail later)
Note that nothing that I have said above is intended as a criticism of the analysis itself. The above are simply statements as to the participant and task nature of the study. That said, many enthusiasts – with some justification – will reject the use of naive sorters, free sorting, and lack of tasting to separate whiskies in this study.
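To see why a mean age of 42 alongside a median of 31 is a tip-off, here is a small sketch using invented ages. They were chosen only so that the mean and median match the figures quoted in the list above; they are not the study's raw data, and the standard deviation is not matched exactly.

```python
import statistics

# Invented ages chosen only so that mean = 42 and median = 31,
# matching the summary figures quoted above. Not the study's raw data.
ages = [21, 22, 23, 25, 28, 31, 47, 61, 65, 68, 71]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
sd_age = statistics.stdev(ages)

# A mean well above the median indicates a right-skewed sample:
# many younger participants plus a smaller group of much older ones.
gap = mean_age - median_age
at_or_below_median = sum(age <= median_age for age in ages)

print(f"mean={mean_age}, median={median_age}, sd={sd_age:.1f}, gap={gap}")
```

In a sample shaped like this, more than half the panel sits at or below the median age, so a large share of sorters may have had limited experience with whisky.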
On the point of smell-only sorting, I should clarify that while it is common in many Scotch whisky panels to only nose the whisky, this is done simply to prevent reviewer fatigue and potential intoxication. While it has been argued that many (though not all) of the characteristics of Scotch whisky can be recognized by smell alone, this presumes an expert panel with extensive experience (which is not the case here). Further, there is at least anecdotal evidence to suggest that the effect of rye on American whisky flavour is not limited to scent (i.e., many find rye flavours more pronounced on tasting than nosing). As such, I find the authors’ stated claim in this article, that it is unlikely that actual tasting would have changed the grouping results, unreasonable and not exactly evidence-based.
In terms of the free sorting, the authors attempt to justify this method by stating that results from such studies “are often equivalent to more exhaustive, traditional methods” (i.e. the ones I explained in the section above, for this site and Scotch whiskies). That may be true, but my experience of whisky analysis makes me seriously doubt it (I would really need to do an independent review of the literature to verify that claim). But it is most certainly NOT true if you draw a biased small sample that is not representational of the overall dataset.
This is the basis of all inferential statistics – if you are going to draw from a population, you must try to be as representational as possible and control for obvious confounds. I will discuss this issue of the specific whisky selection in detail below, as there is good reason to doubt their selection, based on earlier scientific studies and results presented in this analysis.
Consistent with the stated goals of this paper, I find the actual statistical analysis method used to be well described and justified, and likely appropriate for further large scale studies (as they propose). However, you simply CANNOT make meaningful inferences about the ability to discriminate ryes and bourbons from a study with the sampling and sorting design used here (i.e., it is not designed to address that question). Any over-arching claims to the contrary are not supported by the evidence in the study.
The Real Issue
Now, I could stop there, and draw this commentary to a close. Indeed you may want to stop reading at this point, unless you really care about scientific study design. 🙂
The issue of bias is an important consideration among both the general enthusiast community and in the scientific community. It is worth exploring in detail, given some red flags in this particular study. Let me start with the whiskey analysis results in this paper, and then show why their conclusions about bourbon vs rye are (at best) misleading based on the sample selection.
The authors note that US law only requires (among other things) that the mashbill for bourbons be at least 51% corn, and that of ryes be at least 51% rye. They also note that producers do not commonly reveal the exact mashbill composition. As such, it is possible that the bourbons and ryes in their samples could differ by only a couple of percentage points of rye content. This would certainly be a confound.
But there is actually a lot of information available out there about the proportion of rye in many mashbills. Indeed, it is interesting that 4 of the 5 bourbons they used are considered as “low-rye” by enthusiasts. Here is the actual list of what they used (with distiller/owner identified):
- Jim Beam Black Bourbon (Clermont/Beam)
- Old Forester Straight Bourbon (Brown-Forman/Brown-Forman)
- Old Crow Straight Bourbon (Clermont/Beam)
- Elijah Craig 12yo Bourbon (Bernheim/Heaven Hill)
- Buffalo Trace Bourbon (Buffalo Trace/Sazerac)
- Rittenhouse Rye (Bernheim/Heaven Hill)
- Sazerac Rye (Buffalo Trace/Sazerac)
- Bulleit Rye (MGP/Diageo)
- Knob Creek Rye (Clermont/Beam)
- Jim Beam Rye (Clermont/Beam)
While there is no official designation of low-rye vs high-rye, I expect most of us would consider all the bourbons except for Old Forester to be particularly low-rye (i.e., all 4 are believed to be <15% rye content).
This brings up a critical point – despite a general lack of reporting by producers, you could still fairly easily choose whiskies that evenly span the continuum of known rye content, based on what is reported for available whiskies. In other words, you could assemble samples from known low-rye bourbons (<12% rye), high-rye bourbons (15% < x < 35%), sub-maximal ryes (51% ≤ x < 100%), and 100% ryes. The authors have not done this – indeed, they do not even discuss this as a possibility.
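As a rough sketch of what such a selection could look like, the snippet below bins candidate whiskies by rye content using the bands named in the paragraph above, then draws the same number from each band. The whisky names and rye percentages are hypothetical placeholders, and treating exactly 51% rye as a sub-maximal rye is my assumption.

```python
def rye_band(rye_pct):
    """Assign a mashbill rye percentage to one of the bands named above."""
    if rye_pct == 100:
        return "100% rye"
    if 51 <= rye_pct < 100:
        return "sub-maximal rye"
    if 15 < rye_pct < 35:
        return "high-rye bourbon"
    if rye_pct < 12:
        return "low-rye bourbon"
    return "unclassified"  # falls in a gap between the bands in the text

# Hypothetical candidates with made-up rye percentages (illustration only).
candidates = {
    "Whisky A": 8, "Whisky B": 10, "Whisky C": 20, "Whisky D": 30,
    "Whisky E": 51, "Whisky F": 65, "Whisky G": 95, "Whisky H": 100,
}

# Group the candidates by band.
bands = {}
for name, pct in candidates.items():
    bands.setdefault(rye_band(pct), []).append(name)

# A balanced design draws the same number of samples from each band,
# limited by the band with the fewest available candidates.
per_band = min(len(names) for names in bands.values())
selection = {band: names[:per_band] for band, names in bands.items()}
```

The point of balancing across bands is that rye content then varies systematically across the sample, instead of being left to chance in a set of ten.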
To start, let’s see what their analysis method actually produced with this particular set of whiskies. The principal component analysis (PCA) in their study found that 47% of the total variance can be explained by 3 dimensions, as follows:
- The first dimension (21% of the variance) separates 3 whiskies from the others – all 3 produced by Jim Beam (JB Black, JB Rye, and Old Crow Bourbon).
- The second dimension (14% of the variance) does not separate by rye vs bourbon (the authors claim), but best correlates to age and ABV.
- The third dimension (12% of the variance) separates Bulleit Rye from the other 7 whiskies that cluster together in the first dimension.
On the basis of these three key dimensions, the authors (seemingly) reasonably conclude that producer, age and ABV have a greater influence on self-selecting of whisky into groups than does mashbill (i.e., the traditional method of producers and enthusiasts).
So what is wrong here? The main problem is that we have potentially a huge selection bias in their choice of whiskies, based on the existing data available to these researchers.
Before I explain how they chose their whiskies, it is worth noting that Jim Beam made up 4 out of 10 whiskies sampled above (again, sorted by diluted scent alone). Is it really so surprising that naive sorters chose to group these together out of the whole set? Can we really infer from this (and the Bulleit finding) that producer is the key discriminant? Not in such a limited and biased small sample of whiskies we can’t. Again, I will come back to why this is so at the end, when I discuss their justification for the selection.
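The imbalance is easy to quantify from the list of ten whiskies given earlier. The sketch below simply counts samples per distillery and works out how many of the possible sample pairs share a distillery; it says nothing about how participants actually sorted them.

```python
from collections import Counter
from math import comb

# Distillery for each of the 10 whiskies, as listed earlier in this post.
distillery = {
    "Jim Beam Black Bourbon": "Clermont",
    "Old Forester Straight Bourbon": "Brown-Forman",
    "Old Crow Straight Bourbon": "Clermont",
    "Elijah Craig 12yo Bourbon": "Bernheim",
    "Buffalo Trace Bourbon": "Buffalo Trace",
    "Rittenhouse Rye": "Bernheim",
    "Sazerac Rye": "Buffalo Trace",
    "Bulleit Rye": "MGP",
    "Knob Creek Rye": "Clermont",
    "Jim Beam Rye": "Clermont",
}

per_distillery = Counter(distillery.values())

# Number of distinct pairs among 10 samples, and how many of those
# pairs come from the same distillery.
total_pairs = comb(len(distillery), 2)
same_pairs = sum(comb(n, 2) for n in per_distillery.values())
beam_pairs = comb(per_distillery["Clermont"], 2)
```

Of the 45 possible pairs, 8 share a distillery, and 6 of those 8 are Beam pairs. Any tendency to group house styles together will therefore show up mostly as a "Beam" effect in this particular sample.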
Another problem is their interpretation of the second dimension. The authors state that age and ABV correlate best for this dimension, but those correlations are actually very weak statistically. Note as well that there is not a big age or ABV difference between most of these whiskies to start with, and the study is hardly powered to look at these variables. Going through the results, I have to say these conclusions for the second dimension of the PCA seem very tenuous based on the actual analysis in the paper.
But here is the kicker – if you pull Buffalo Trace from the analysis, the second dimension correlates almost perfectly for bourbon vs rye (!). Buffalo Trace is an outlier in the group, clustering strongly to the ryes. Without it there, you would have a nearly perfect correlation of rye to bourbon on the second dimension of the PCA.
What this means is if they had chosen to substitute another whisky for Buffalo Trace in the (incredibly tiny) bourbon sampling, they would likely have found a completely different result. Indeed, without Buffalo Trace in the mix (i.e., looking at only the other 9 whiskies), they most certainly would have concluded that rye vs bourbon is a main discriminator.
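To illustrate how much leverage a single sample has in a set of ten, here is a sketch with invented second-dimension scores. These are not values from the paper and this is not the authors' analysis; it only shows how one bourbon sitting among the ryes can pull down the correlation between a dimension and the rye/bourbon label.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented second-dimension scores (NOT values from the paper).
# Each entry is (label, is_rye, score), with is_rye = 1 for rye, 0 for bourbon.
# The "outlier bourbon" plays the role of a bourbon that sits among the ryes.
samples = [
    ("bourbon 1", 0, -0.9), ("bourbon 2", 0, -0.8), ("bourbon 3", 0, -1.0),
    ("bourbon 4", 0, -0.7), ("outlier bourbon", 0, 1.2),
    ("rye 1", 1, 0.8), ("rye 2", 1, 0.9), ("rye 3", 1, 1.0),
    ("rye 4", 1, 0.7), ("rye 5", 1, 0.6),
]

def label_score_correlation(rows):
    """Correlation between the rye/bourbon label and the dimension score."""
    return pearson([is_rye for _, is_rye, _ in rows],
                   [score for _, _, score in rows])

r_all = label_score_correlation(samples)
r_without = label_score_correlation(
    [row for row in samples if row[0] != "outlier bourbon"]
)
print(f"all 10: r={r_all:.2f}; without outlier: r={r_without:.2f}")
```

With made-up numbers like these, dropping one sample moves the correlation from moderate to nearly perfect. That is the kind of fragility you should expect when each category is represented by only five whiskies.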
Why Did They Choose These Whiskies?
The authors’ main justification for their specific sampling of whiskies is that they were selected from ones used in a previous study to “span the space of nonvolatile constituents found in whiskies.” They cite as the sole reference a paper by the second author on this study: Collins et al, Profiling of nonvolatiles in whiskeys using ultra high pressure liquid chromatography quadrupole time-of-flight mass spectrometry (UHPLC-QTOF MS).
Now, first off, you might be thinking it is a bit odd to use a study of “nonvolatile constituents” as the characterization system to pick a subset of whiskies for a smelling-only sensory sorting study (!)
I will say that the earlier Collins et al HPLC/MS paper appears to be a well-designed and analyzed study looking at a larger number of American whiskies (63). Indeed, the analysis is even more thorough and robust than this paper. But the actual findings in that earlier paper seriously call into question the claim made here that 5 ryes and 5 bourbons are going to “span” that space.
Specifically, the Collins paper found that when removing craft whiskies, there is a difference between bourbons and ryes in terms of their nonvolatiles – but with significant overlap between the groups. So, depending on which specific whiskies you sampled for a subsequent smaller-scale study, you could produce any result you wanted (i.e., no difference, or a massive difference between bourbons and ryes – depending on which ones you picked).
Note that the Collins paper does not identify the individual whiskies, so there is no way for the reader to ascertain the selection bias this time around. But the authors had access to all this information.
Is there any reason to doubt their claim that they have chosen a reasonable “span”? Unfortunately, there is. One particularly interesting finding in the Collins paper is that while the whiskies of any given producer tend to cluster together (regardless of rye composition), there were very clear differences between producers in their PCA. In particular, there is one massive discriminator in the first dimension, where one producer was a huge outlier from all the others (who differentiate from each other to varying extents in a second dimension).
Given this unequal pattern, how exactly did Lahne et al draw a representative span of producers? If they included that one outlier producer from the earlier study, they would have heavily biased this study for the first dimension of their PCA. In particular, I wonder if that outlier was Jim Beam, since the pattern of an extreme outlier in the PCA is reproduced almost exactly here. If that outlier producer was Beam, then they have deliberately stacked the deck in this study by using a known outlier for 40% of the whiskies examined here.
But even if that is not the case, I don’t see how they could have chosen “evenly” among such divergent producers. Again, 4 of the 10 whiskies used in this study came from a single producer. That seems very surprising, given the strong variance between virtually all the producers reported in the earlier study.
There is a fundamental issue of lack of transparency here. The only way to verify their selection in this study is for the identity of the whiskies in the earlier Collins HPLC/MS study to be publicly revealed, at least for the current set of whiskies studied here. That way, we can all see exactly how they chose to assemble their smaller subset in this study, and verify its supposed representational basis.
Wrapping It Up
The key point that I made early in this commentary is that the participant and sampling design clearly prevents you from drawing any meaningful conclusions about the ability of people to discriminate rye from bourbon (i.e., that is NOT what this study was designed to test for).
But the bigger underlying problem here is the apparently non-representational basis of the whiskies they chose to study. Again, they had access to much more nonvolatile constituent information on these whiskies than they present publicly. And the reported levels of variance from their earlier work call into question the very idea that such a small set could possibly be representational here, as they claim.
Moreover, reviewing the results of this study, it is clear that the opposite finding (that is, a clear dimension of rye-to-bourbon differentiation) would have been obtained had 1-2 specific whiskies not been included. Given this, and the authors’ awareness of the distribution from earlier studies, it is critical that they provide a transparent explanation for their selection criteria, to show a clear absence of selection bias.
Moving forward for any further studies of ryes and bourbons, I would encourage these authors to move beyond their nonvolatile analysis, and consider known information on actual mashbill composition. While incomplete for all producers, there is enough information out there to reasonably assign a range of American whiskies across a continuum of actual rye content. Further, they also need to test their assertion that actual tasting would not influence the results of any sorting paradigm, given the lack of evidence for this stance in the case of rye in bourbon.