Category Archives: Statistics

Can You Tell the Difference Between Bourbon and Rye?

A recent article published in the Journal of Food Science has generated considerable buzz in the various online whisky forums, due to how it has been characterized in the popular press.  Plenty of websites like Tech Times and e-Science News have picked up the story, often with inflammatory headlines (e.g., “Bourbon or rye? You can’t tell the difference”). Even mainstream media outlets have picked up on the action, including Fox News in the US and the Daily Mail in the UK.

If you read the enthusiast commentary out there, you will find much indignation at those headline statements.  But is that really what the article shows?  Here is a link to the abstract of the article by Jake Lahne et al: Replication Improves Sorting-Task Results Analyzed by DISTATIS in a Consumer Study of American Bourbon and Rye Whiskeys (J Food Sci. 2016 Apr 18. doi: 10.1111/1750-3841.13301)

As you can probably tell from the article title, this study is not going to be a detailed analysis of bourbon flavour.  If you peruse the abstract, you will see that this is really a scientific analysis to compare how a new statistical method for analysis of sorted study data performs against an older method. It also introduces a new variable of subject scoring replication, to see how that affects the results.

Unfortunately, some over-reaching comments have been made about this article, so I thought it would be a good idea to dissect out what conclusions you can actually draw about American bourbons and ryes from this analysis.

I have a copy of the full article, and have reviewed the methodology in some detail. I find it a generally well-described exploration of a new statistical method. But it allows you to draw almost no inferences about the ability to discriminate bourbons and ryes.  The main problems boil down to the reference set of whiskies chosen, who scored them, and how.

Personal Bias

Before getting started, I should point out that personal bias is hard to account for here. Many enthusiasts believe they have great power to detect and differentiate between whiskies. But the history of blind sensory sorting studies tells us that we commonly overestimate our own abilities in this regard.

On the one hand, whisky enthusiasts are likely to approach any such reported study with a pre-conceived bias, looking for flaws in the design or conclusions that support their existing world view. But equally of concern, designers of such studies could similarly choose to design or analyze their results in such a way as to support a pre-existing bias on their own part (namely, that people over-estimate their ability to differentiate). The bias knife cuts both ways.

My goal here is to fairly and objectively review the design and analysis of this particular study, to see if there are any obvious sources of concern, and whether the authors’ conclusions are evidence-based and limited to the analysis findings.

How to Classify Whisky (or Anything Else)

As explained on this site, the “gold standard” for sorting sensory input into discrete groups first starts with descriptive labels assigned by expert reviewers, based on an underlying physiochemical basis, scored for an exhaustive sample collection (see my Early Flavour Classifications page for more info). This is followed by a statistically-valid cluster analysis, to group the intensity of these distinct characteristics into an appropriate number of clusters. Finally, a principal component analysis allows you determine which dimensions of the cluster analysis are key to discriminating the core characteristics of the group, in a statistically meaningful way. For these last two points, see my Modern Whisky Map page for more info.
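
As a rough illustration of those last two steps, here is a minimal Python sketch, using entirely made-up descriptor intensity data (the whisky and descriptor dimensions are hypothetical), showing how the proportion of variance explained per dimension falls out of a principal component analysis:

```python
import numpy as np

# Hypothetical intensity matrix: rows = whiskies, columns = expert-scored
# flavour descriptors (e.g., smoky, fruity, floral). Values are invented.
rng = np.random.default_rng(0)
X = rng.random((12, 6))  # 12 whiskies, 6 descriptors

# Centre each descriptor column, then run PCA via singular value decomposition.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Proportion of total variance explained by each principal component --
# the same kind of quantity reported per dimension in sensory studies.
explained = s**2 / np.sum(s**2)
print(explained.round(3))
```

The explained-variance vector always sums to 1 and is sorted in decreasing order; the interesting question in any real study is what each retained dimension correlates with.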

While the above has been done for single malt-style whiskies (described on those pages above), I am not aware of such a comprehensive analysis being done for American Bourbon/Rye whiskies. And that is certainly not what this article by Lahne and colleagues sets out to do.

The Lahne Study Design

This paper uses a “short-cut” method – a very small sample of whiskies, sorted by a very small panel (not identified for expertise), asked to simply free-sort (i.e., apply whatever characterization they want, without any descriptive features). This does not compare to the first step described above.

The reason for this is that they are really only seeking to validate a novel cluster and dimensional analysis method, and NOT to provide a definitive answer to the issue of bourbon/rye classification. In other words, they are validating a process for doing the last two steps above, not the first.

Here are the top-line reasons why you should not get too worked up about this article in terms of the ability to discriminate ryes from bourbons:

  • Participants were not asked to separate bourbons from ryes, but rather to free sort into whatever number and type of groupings they felt like
  • Participants did not necessarily have any experience with whisky (selected only for being “nonrejectors of whiskey by aroma”).
  • Participants were drawn from a University campus environment, with a mix of students, staff and faculty. Note the mean age was 42, but the median age was 31. When combined with the standard deviation of 19 yrs, this is a real tip-off as to the spread of age and likely experience with whisky.
  • Consistent with Scotch panel reviewing norms, participants only smelled the whiskies (no tasting was performed).
  • Similarly, whiskies were diluted 1:1 with distilled water, to limit and mask the effects of high alcohol content (i.e., presented only at 20-25% ABV for smelling)
  • A very limited number of whiskies were used – only 5 bourbons and 5 ryes – without explicit consideration of the rye content in their mashbills (I will come back to this point of whisky selection in more detail later)

Note that nothing that I have said above is intended as a criticism of the analysis itself. The above are simply statements as to the participant and task nature of the study. That said, many enthusiasts – with some justification – will reject the use of naive sorters, free sorting, and lack of tasting to separate whiskies in this study.

On the point of smell-only sorting, I should clarify that while it is common in many Scotch whisky panels to only nose the whisky, this is done simply to prevent reviewer fatigue and potential intoxication. While it has been argued that many (though not all) of the characteristics of Scotch whisky can be recognized by smell alone, this presumes an expert panel with extensive experience (which is not the case here). Further, there is at least anecdotal evidence to suggest that the effect of rye on American whisky flavour is not limited to scent (i.e., many find rye flavours more pronounced on tasting than nosing). As such, I find the authors’ stated claim in this article – that it is unlikely actual tasting would have changed the grouping results – to be unreasonable and not exactly evidence-based.

In terms of the free sorting, the authors attempt to justify this method by stating that results from such studies “are often equivalent to more exhaustive, traditional methods” (i.e. the ones I explained in the section above, for this site and Scotch whiskies). That may be true, but my experience of whisky analysis makes me seriously doubt it (I would really need to do an independent review of the literature to verify that claim). But it is most certainly NOT true if you draw a biased small sample that is not representational of the overall dataset.

This is the basis of all inferential statistics – if you are going to draw from a population, you must try to be as representational as possible and control for obvious confounds. I will discuss this issue of the specific whisky selection in detail below, as there is good reason to doubt their selection, based on earlier scientific studies and results presented in this analysis.

Consistent with the stated goals of this paper, I find the actual statistical analysis method used to be well described and justified, and likely appropriate for further large-scale studies (as they propose). However, you simply CANNOT make meaningful inferences about the ability to discriminate ryes and bourbons from a study with the sampling and sorting design used here (i.e., it is not designed to address that question). Any over-arching claims to the contrary are not supported by the evidence in the study.

The Real Issue

Now, I could stop there, and draw this commentary to a close. Indeed you may want to stop reading at this point, unless you really care about scientific study design. 🙂

The issue of bias is an important consideration among both the general enthusiast community and in the scientific community. It is worth exploring in detail, given some red flags in this particular study. Let me start with the whiskey analysis results in this paper, and then show why their conclusions about bourbon vs rye are (at best) misleading based on the sample selection.

The authors note that US law only requires (among other things) that the mashbill for bourbons be 51% corn, and that of ryes be 51% rye. They also note that producers do not commonly reveal the exact mashbill composition. As such, it is possible that the bourbons and ryes in their samples could differ by only a couple of percentage points of rye content.  This would certainly be a confound.

But there is actually a lot of information available out there about the proportion of rye in many mashbills. Indeed, it is interesting that 4 of the 5 bourbons they used are considered as “low-rye” by enthusiasts. Here is the actual list of what they used (with distiller/owner identified):

  • Jim Beam Black Bourbon (Clermont/Beam)
  • Old Forester Straight Bourbon (Brown-Forman/Brown-Forman)
  • Old Crow Straight Bourbon (Clermont/Beam)
  • Elijah Craig 12yo Bourbon (Bernheim/Heaven Hill)
  • Buffalo Trace Bourbon (Buffalo Trace/Sazerac)
  • Rittenhouse Rye (Bernheim/Heaven Hill)
  • Sazerac Rye (Buffalo Trace/Sazerac)
  • Bulleit Rye (MGP/Diageo)
  • Knob Creek Rye (Clermont/Beam)
  • Jim Beam Rye (Clermont/Beam)

While there is no official designation of low-rye vs high-rye, I expect most of us would consider all the bourbons except for Old Forester to be particularly low-rye (i.e., all 4 are believed to be <15% rye content).

This brings up a critical point – despite a general lack of reporting by producers, you could still fairly easily set out to choose whiskies that evenly span the continuum of known rye content, from what is reported for available whiskies. In other words, you could assemble samples from known low-rye bourbons (<12% rye), high-rye bourbons (15% < x < 35%), sub-maximal ryes (51% < x < 100%), and 100% ryes. The authors have not done this – indeed, they do not even discuss this as a possibility.

Summary Results

To start, let’s see what their analysis method actually produced with this particular set of whiskies. The principal component analysis (PCA) in their study found that 47% of the total variance can be explained by 3 dimensions, as follows:

  • The first dimension (21% of the variance) separates 3 whiskies from the others – all 3 produced by Jim Beam (JB Black, JB Rye, and Old Crow Bourbon).
  • The second dimension (14% of the variance) does not separate by rye vs bourbon (the authors claim), but best correlates to age and ABV.
  • The third dimension (12% of the variance) separates Bulleit Rye from the other 7 whiskies that cluster together in the first dimension.

On the basis of these three key dimensions, the authors (seemingly) reasonably conclude that producer, age and ABV have a greater influence on free sorting of whiskies into groups than does mashbill (i.e., the traditional classification basis used by producers and enthusiasts).

So what is wrong here? The main problem is that we have potentially a huge selection bias in their choice of whiskies, based on the existing data available to these researchers.

Before I explain how they chose their whiskies, it is worth noting that Jim Beam products made up 4 of the 10 whiskies sampled above (again, sorted by diluted scent alone). Is it really so surprising that naive sorters chose to group these together out of the whole set?  Can we really infer from this (and the Bulleit finding) that producer is the key discriminant?  Not from such a limited and biased small sample of whiskies we can’t. Again, I will come back to why this is so at the end, when I discuss their justification for the selection.

Another problem is their interpretation of the second dimension. The authors state that age and ABV correlate best for this dimension, but those correlations are actually very weak statistically. Note as well that there is not a big age or ABV difference between most of these whiskies to start with, and the study is hardly powered to look at these variables. Going through the results, I have to say these conclusions for the second dimension of the PCA seem very tenuous based on the actual analysis in the paper.

But here is the kicker – if you pull Buffalo Trace from the analysis, the second dimension correlates almost perfectly for bourbon vs rye (!).  Buffalo Trace is an outlier in the group, clustering strongly to the ryes. Without it there, you would have a nearly perfect correlation of rye to bourbon on the second dimension of the PCA.

What this means is if they had chosen to substitute another whisky for Buffalo Trace in the (incredibly tiny) bourbon sampling, they would likely have found a completely different result. Indeed, without Buffalo Trace in the mix (i.e., looking at only the other 9 whiskies), they most certainly would have concluded that rye vs bourbon is a main discriminator.
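
To illustrate how fragile a 10-whisky sample is to a single selection, here is a Python sketch with invented dimension-2 coordinates (NOT the paper's actual data), showing how one bourbon placed among the ryes weakens an otherwise near-perfect bourbon/rye correlation:

```python
import numpy as np

# Invented dimension-2 coordinates for 5 bourbons and 5 ryes.
# The 5th bourbon is a "Buffalo Trace"-like outlier sitting among the ryes.
coords = np.array([1.0, 0.9, 1.1, 0.8, -1.0,       # bourbons (last = outlier)
                   -0.9, -1.1, -1.0, -0.8, -1.2])  # ryes
is_rye = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

def corr(x, y):
    """Pearson correlation between a coordinate and the 0/1 rye label."""
    return np.corrcoef(x, y)[0, 1]

print(corr(coords, is_rye))              # weakened by the single outlier

mask = np.ones(10, dtype=bool)
mask[4] = False                          # leave out the outlier bourbon
print(corr(coords[mask], is_rye[mask]))  # near-perfect separation
```

With samples this small, one leave-one-out substitution is enough to flip which variable looks like the "main" discriminator.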

Why Did They Choose These Whiskies?

The authors’ main justification for their specific sampling of whiskies is that they were selected from ones used in a previous study to “span the space of nonvolatile constituents found in whiskies.” They cite as the sole reference a paper by the second author on this study: Collins et al, Profiling of nonvolatiles in whiskeys using ultra high pressure liquid chromatography quadrupole time-of-flight mass spectrometry (UHPLC-QTOF MS).

Now, first off, you might be thinking it is a bit odd to use a study of “nonvolatile constituents” as the characterization system to pick a subset of whiskies for a smelling-only sensory sorting study (!)

I will say that the earlier Collins et al HPLC/MS paper appears to be a well-designed and analyzed study looking at a larger number of American whiskies (63). Indeed, the analysis is even more thorough and robust than in this paper. But the actual findings in that earlier paper seriously call into question the claim made here that 5 ryes and 5 bourbons are going to “span” that space.

Specifically, the Collins paper found that when removing craft whiskies, there is a difference between bourbons and ryes in terms of their nonvolatiles – but with significant overlap between the groups. So, depending on which specific whiskies you sampled for a subsequent smaller-scale study, you could produce any result you wanted (i.e., no difference, or a massive difference between bourbons and ryes – depending on which ones you picked).

Note that the Collins paper does not identify the individual whiskies, so there is no way for the reader to ascertain the selection bias this time around. But the authors had access to all this information.

Is there any reason to doubt their claim that they have chosen a reasonable “span”?  Unfortunately, there is. One particularly interesting finding in the Collins paper is that while the whiskies of any given producer tend to cluster together (regardless of rye composition), there were very clear differences between producers in their PCA. In particular, there is one massive discriminator in the first dimension, where one producer was a huge outlier from all the others (who differentiate from each other to varying extents in a second dimension).

Given this unequal pattern, how exactly did Lahne et al draw a representative span of producers?  If they included that one outlier producer from the earlier study, they would have heavily biased this study for the first dimension of their PCA. In particular, I wonder if that outlier was Jim Beam, since the pattern of an extreme outlier in the PCA is reproduced almost exactly here. If that outlier producer was Beam, then they have deliberately stacked the deck in this study by using a known outlier for 40% of the whiskies examined here.

But even if that is not the case, I don’t see how they could have chosen “evenly” among such divergent producers. Again, 4 of the 10 whiskies used in this study came from a single producer. That seems very surprising, given the strong variance between virtually all the producers reported in the earlier study.

There is a fundamental issue of lack of transparency here. The only way to verify their selection in this study is for the identity of the whiskies in the earlier Collins HPLC/MS study to be publicly revealed, at least for the current set of whiskies studied here. That way, we can all see exactly how they chose to assemble their smaller subset in this study, and verify its supposed representational basis.

Wrapping It Up

The key point that I made early in this commentary is that the participant and sampling design clearly prevents you from drawing any meaningful conclusions about the ability of people to discriminate rye from bourbon (i.e., that is NOT what this study was designed to test for).

But the bigger underlying problem here is the apparently non-representational basis of the whiskies they chose to study. Again, they had access to much more nonvolatile constituent information on these whiskies than they present publicly. And the reported levels of variance from their earlier work call into question the very idea that such a small set could possibly be representational here, as they claim.

Moreover, reviewing the results of this study, it is clear that the opposite finding (that is, a clear dimension of rye-to-bourbon differentiation) would have been obtained had 1-2 specific whiskies not been included. Given this, and the authors’ awareness of the distribution from earlier studies, it is critical that they provide a transparent explanation of their selection criteria, to show a clear absence of selection bias.

Moving forward for any further studies of ryes and bourbons, I would encourage these authors to move beyond their nonvolatile analysis, and consider known information on actual mashbill composition. While incomplete for all producers, there is enough information out there as to reasonably assign a range of American whiskies across a continuum of actual rye content. Further, they also need to test their assertion that actual tasting would not influence the results of any sorting paradigm, given the lack of evidence for this stance in the case of rye in bourbon.


So You Want to Be a Whisky Reviewer …

One of the first questions that comes up when someone is considering becoming a product reviewer is whether or not to provide a score – and if so, over what sort of range?

As discussed on my Flavour Commentaries page, providing a score or rating is hardly required in a product review. I personally avoid doing this in my flashlight reviewing (in part because technology is always advancing there). But if you are interested in scoring, you might find the personal observations (and data) from integrating whisky reviewer scores on this site interesting.

Scoring Systems

Most whisky reviewers tend to provide some sort of quality ranking. As explained on my Understanding Reviewer Scoring page, at its heart scoring is simply a way to rank the relative quality of all the products a given reviewer has sampled. As long as you are only looking within the catalog of reviews of that one reviewer, it doesn’t necessarily matter what category labels they are using for their rank.

A numerical score from 1-100? Fine. Star ratings from 1 to 5, with half-stars? No problem. Six gradations of recommended levels? Sure. Kumquat widths from 2.1cm to 3.3cm in 0.25cm increments? Okay, if that floats your boat. Personally, I’d love to see someone review in base hexadecimal (“Man, this limited edition is much better than the regular OB version – I’ll have to give it an 0E”).

One problem with the diversity of scoring systems is that it may be hard to get a feel for how items compare to each other for a given reviewer – until you go through her whole catalog of reviews. Similarly, it would be hard to integrate the reviews of multiple reviewers on a given site (or across sites). This has led to some consolidated approaches for standardization. In the liquor industry, probably the most popular one is that developed by Robert Parker for scoring wines.

In this system, all wines receive a numerical whole number score between 50 and 100. The presumption is that anything below 50 is unfit for human consumption (i.e., swill). 50-59 is not recommended. 60-69 is below average. 70 to 79 is average. 80 to 89 is above average. 90 to 95 is outstanding. And 96-100 is extraordinary (and rare).
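
For illustration, those bands can be captured in a few lines of Python (a sketch of the banding described above only, not any official implementation):

```python
def parker_category(score: int) -> str:
    """Map a 50-100 Parker-style score to its descriptive band."""
    if score < 50:
        return "unfit (swill)"
    # Bands checked from highest cutoff down.
    bands = [(96, "extraordinary"), (90, "outstanding"),
             (80, "above average"), (70, "average"),
             (60, "below average"), (50, "not recommended")]
    for cutoff, label in bands:
        if score >= cutoff:
            return label

print(parker_category(92))  # "outstanding"
```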

The benefit to this system is it is fairly easy to understand and relate to. Unfortunately, it still leads to a lot of variation in interpretation by different individuals – as shown graphically for whisky reviewers on my Understanding Reviewer Scoring page.

Still, if you were starting out as a reviewer, this isn’t a bad system to work from, as it provides a recognizable structure. But fundamentally, it is no better or worse than any other scoring system. From the perspective of someone running a meta-critic integration site, I can tell you it doesn’t really matter what you choose to use as scores/labels – what really matters is your consistency in using them.

Score distributions

Consistency of scoring actually encompasses a number of things. Is the reviewer applying scores in as fair a manner as possible across categories? Would the same product get the same score if sampled on another occasion?  In other words, is the reviewer showing good internal consistency in their scoring?

Few reviewers do repeated testing of the same sample (and almost none with blinding), so it is hard to know. Whiskies are also subject to considerable batch variations (for some of the reasons discussed here), which further complicates matters if the repeated sampling is done on different batches. I recommend you check out my Review Biases and Limitations page for a discussion of some of the common pitfalls here.

But one way to address this consistency issue in the aggregate is to compare the distribution pattern of scores across reviewers. This is part of the larger correlational analyses that I did in building the Meta-Critic database.

The key points that I want to share here – as a guide for newcomers to reviewing – are:

  • whisky reviewers do not hand out scores in an evenly distributed manner
  • whisky reviewers are fairly consistent in how they deviate from a normal distribution

The above is true of all the whisky reviewers examined here, including those ostensibly using the Parker wine scoring scheme. As explained on my Understanding Reviewer Scoring page, all reviewers skew left in their distributions. This is shown graphically below in the frequency histogram of the Meta-Critic scores:


In essence, you can interpret this distribution as pretty close to what the “average” or typical reviewer in my dataset looks like.  Again, see that earlier page for some examples of actual reviewers.

Note that I choose to present the Meta-Critic score using a standard scientific notation of one significant digit to the left of the decimal. Those who remember using slide rules will be able to relate. 🙂  Just multiply everything by 10 if you want to know what it would look like on the Parker scale.

Below are the current actual descriptive characteristics of the Meta-Critic score distribution.

Mean: 8.53
Median: 8.58
Standard Deviation: 0.41
Skewness: -0.63
Minimum: 6.93
Maximum: 9.52
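
To make the “skew left” idea concrete, here is a Python sketch using simulated scores (not the actual Meta-Critic data), showing how a left-skewed distribution produces a negative skewness statistic and a mean that sits below the median:

```python
import numpy as np

# Simulated left-skewed scores: a long tail toward low values, as in the
# distribution described above. Parameters are invented for illustration.
rng = np.random.default_rng(1)
scores = 9.5 - rng.gamma(shape=2.0, scale=0.35, size=1000)

mean, median, sd = scores.mean(), np.median(scores), scores.std(ddof=1)

# Fisher-Pearson sample skewness: negative values indicate a left skew.
skew = np.mean(((scores - scores.mean()) / scores.std()) ** 3)
print(round(mean, 2), round(median, 2), round(sd, 2), round(skew, 2))
```

The mean-below-median relationship is the quick visual check: the handful of harshly scored whiskies drags the mean down more than the median.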

While the Parker scoring system provides a nice idealized normal distribution in theory (i.e., min of 50, max of 100 and an average of 75) – in practice most reviewers deviate from it considerably.  I suspect grade inflation has a lot to do with this, along with a desire to please readers/suppliers.  But whatever the reasons, it is a common observation that all whisky reviewers seem to fit the above pattern.

So if you are starting out as a reviewer, you may want to consider trying to match your scores to a similar distribution – just so that your readers will have an easier time understanding your reviews in the context of others out there. Of course, nothing is stopping you from breaking the mold and going your own way.  😉

Range of Whiskies

The other thing I see a lot is reviewers “revising” their score range over time – which can be a problem if they have a lot of old scores to “correct”.

The source of the problem seems to be a sampling bias when they start out reviewing, and have limited experience of only budget to mid-range products. As they start reviewing higher-end products, they realize they are too “squished” in their scoring to be properly proportional.  For example, if you start out giving one of the most ubiquitous (and cheap) single malts like the Glenlivet 12 a 90+ score, that doesn’t leave you much room to maneuver as you start sampling higher quality single malts.

To help new reviewers calibrate themselves, here are how some of the more common expressions typically fall within the Meta-Critic Score, broken down by general category. Note that I’m not suggesting you bias your scores by what the consensus thinks below – but I just want to give you an idea of what the general range is out there for common whiskies that you are likely to have tried.

~7.5 whiskies
Bourbon-like: Jim Beam White, Rebel Yell, Ancient Age
Rye-like: Crown Royal, Canadian Club
Scotch-blend-like: Johnnie Walker Red, Cutty Sark, Ballantine’s Finest, Famous Grouse
Single-Malt-like: (there aren’t many that score this low)

~8.0 whiskies
Bourbon-like: Jack Daniel’s Old No. 7, Jim Beam Devil’s Cut, Wild Turkey 81
Rye-like: Royal Canadian Small Batch, Gibson’s Finest 12yo, Templeton Rye
Scotch-blend-like: Chivas Regal 12yo, Jameson Irish Whiskey, Teacher’s Highland Cream, Black Grouse
Single-Malt-like: Glenfiddich 12yo, Glenlivet 12yo, Glenrothes Select Reserve, Tomatin 12yo

~8.5 whiskies
Bourbon-like: Wild Turkey 101, Basil Hayden’s, Bulleit Bourbon, Four Roses Small Batch
Rye-like: Knob Creek Small Batch Rye, Canadian Club 100% Rye, George Dickel Rye, Forty Creek Barrel Select
Scotch-blend-like: Johnnie Walker Blue, Johnnie Walker Black, Green Spot, Té Bheag
Single-Malt-like: Old Pulteney 12yo, Glenmorangie 10yo, Dalmore 12yo, Ardmore Traditional Cask

~9.0 whiskies
Bourbon-like: Russell’s Reserve, Maker’s Mark 46, Booker’s Small Batch, W.L. Weller 12yo
Rye-like: Lot 40, Masterson’s Straight Rye 10yo, Whistlepig 10yo
Scotch-blend-like: (Not much makes it up to here, maybe Ballantine’s 17yo, Powers 12yo John’s Lane)
Single-Malt-like: Aberlour A’Bunadh, Amrut Fusion, Ardbeg 10yo, Talisker 10yo

Close to ~9.5 whiskies
Bourbon-like: Various Pappy van Winkles, some BTACs, George T. Stagg
Rye-like: High West Midwinter Night’s Dram Rye (closest Canadians: Wiser’s Legacy, Gibson’s 18yo)
Scotch-blend-like: nada
Single-Malt-like: Lagavulin 16yo, Brora 30yo, Caol Ila 30yo, Redbreast 21yo

As an aside, you may notice that some whisky categories get consistently higher or lower scores than others. As a result, I suggest you try to avoid directly comparing scores across categories (e.g. bourbons vs single malts), but focus instead on internal consistency within categories. This is why the Whisky Database is sorted by default by general category (and then flavour profile, if available), before sorting by score.

Again, the above is just a way to help you calibrate yourself against the “typical” reviewer (as expressed by the Meta-Critic score).  Nothing is stopping you from going your own way.

But if anyone does decide to use kumquat widths as category labels, please drop me a line – I’d love to hear about it. 🙂


Expert vs User Reviews – Part II

Following up on my earlier discussion of the online whisky review community on Reddit (“Scotchit”), I have now added a properly-normalized set of Reddit user reviews to my Whisky Database.


As mentioned on that earlier page, I found an unusually high degree of consistency of scoring between Reddit reviewers on some whiskies, and evidence of systematic biases that differed from the independent expert reviewers. Scoring methods were also a lot more variable among the user reviews, although this can be partially corrected for by a proper normalization (as long as scoring remains consistent and at least somewhat normally-distributed for each reviewer).

My goal was to find Reddit reviewers who could potentially meet the level of the expert reviewer selection used here. As such, I started by filtering for only those reviewers who had performed a similar minimum number of reviews as the current experts in my Whisky Database (in order to ensure equivalent status for normalization). This meant excluding any Reddit reviewer with fewer than 55 reviews of the current ~400 whiskies in my database. As you can imagine, this restricted the number of potential candidates to only the most prolific Reddit reviewers: 15 in this case.

Upon examining the scores of these generally top-ranked reviewers, I identified 6 as having potential inconsistency issues in scoring. One common issue was a non-Gaussian distribution (e.g., a much longer “tail” of low scoring whiskies than high). I was able to account for this in the final analysis by slightly adjusting the normalization method at the low-end.

Of greater potential concern was inconsistent reviewing, where two products of similar flavour profile, price and typical mean expert scores were given widely divergent scores. Only a small number of reviewers showed issues here, but some examples include the Aberlour 12yo non-chill-filtered compared to the Balvenie DoubleWood 12yo, the Caol Ila 12yo compared to the Ardmore Traditional Cask, and the Glenfiddich 18yo compared to the Dalmore 12yo.  I found reviewers who placed those exact pairings at the extreme ends of their complete review catalogue (i.e., ranked among the best and worst of all whiskies reviewed by that individual).

To be clear, this is not really a problem when perusing the Reddit site. As long as you are looking at whiskies within a given flavour cluster, you are likely still getting a clear relative rank from these reviewers. It is just when trying to assemble a consistent ranking across all flavour classes of whiskies that inconsistent review scores for a given reviewer becomes a potential issue. As explained in my article discussing how the metacritic score is created, scoring is simply a way to establish a personal rank for each reviewer.

Fortunately, these instances were fairly rare, even for the reviewers in question. In most cases, it was the low-ranked whisky that was disproportionately un-favoured for some reason. If significantly discordant with the rest of the database, these could be accounted for in the normalization by excluding a small number of statistically-defined outliers (using the standards described below).

Correlation of Reddit Reviewers

Independence of review appears to be lower among the Reddit reviewers than among the expert reviewer panel used here. Reddit reviewers often reference the scores and comments of other users in their own reviews. This tends to lead to some harmonization of scoring, perpetuating dominant views on a number of whiskies. Indeed, the variance on many of the well-known (and heavily reviewed) expressions was lower for normalized Reddit reviewers than the expert reviewer panel. Also, the average of all significant correlation pairings across the 15 Reddit reviewers was higher than among the expert review panel (r=0.60 vs 0.40). Interestingly, the one Reddit reviewer who seemed the most independent from the others (r=0.37 on average to the others) correlated the closest with my existing Meta-Critic score before integration (r=0.72).

I also noticed a strong correlation in the selection of whiskies reviewed among Reddit reviewers – which was again much higher than among my independent expert reviewers. This was initially surprising, given the wide geographic distribution of Reddit users. But on further examination, I discovered that lead Reddit reviewers typically share samples with one another through trades and swaps. This can of course further reduce the independence of the individual reviews.

As a result of this analysis, I decided to combine these 15 Reddit reviewers into one properly normalized reviewer category for my Whisky Database (i.e., a single combined “Reddit Reviewer” category).  [Revised, See Update later in this review]

Normalization Method for Reddit Reviewers

For this analysis, each Reddit reviewer was individually normalized to the overall population mean and standard deviation (SD) of my current expert review panel. The normalized scores for these individual Reddit reviewers were then averaged across each individual whisky they had in common, to create the combined Reddit Reviewer category. On average, there were n=5 individual Reddit reviewers per whisky. As expected, the SD for the Reddit group of reviewers was lower on average than among my current expert panel.
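As a rough sketch, the per-reviewer normalization described above amounts to a z-score transformation re-expressed on the expert panel's scale. The panel mean/SD values and raw scores here are illustrative placeholders, not actual database figures:

```python
import numpy as np

# Assumed panel statistics -- illustrative values, not the real database figures
PANEL_MEAN = 8.5   # overall mean of the expert review panel
PANEL_SD = 0.45    # overall SD of the expert review panel

def normalize_reviewer(raw_scores):
    """Rescale one reviewer's raw scores to the panel's mean and SD."""
    raw = np.asarray(raw_scores, dtype=float)
    z = (raw - raw.mean()) / raw.std(ddof=1)   # reviewer's own z-scores
    return PANEL_MEAN + z * PANEL_SD           # re-expressed on the panel scale

# The combined Reddit score for a whisky is then the mean of the
# normalized scores from the individual reviewers who rated it.
normalized = normalize_reviewer([78, 82, 85, 90, 93])
print(normalized.round(2))
```

By construction, each normalized reviewer ends up with the panel's mean and SD, so their scores can be averaged and integrated on a common scale.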

To deal with any inconsistent scoring patterns, I used fairly stringent criteria to isolate and remove outlying scores. To be considered an outlier, the individual normalized Reddit reviewer score had to differ from the average Reddit reviewer score for that whisky by more than 2 SD units, AND had to differ from the existing Meta-critic score by more than 3 SD units. In cases where only one Reddit reviewer score was available, exclusion was based solely on the 3 SD unit criterion relative to the existing Meta-Critic mean score. This resulted, on average, in 0.8 outlier scores being removed per Reddit reviewer (i.e., less than one outlier per reviewer).
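The two-tier exclusion rule can be expressed as a small predicate. The SD unit value in the example is an assumption for illustration; the threshold multipliers follow the text:

```python
def is_outlier(score, reddit_mean, metacritic_mean, sd, n_reddit):
    """Apply the 2 SD / 3 SD outlier rule described above.

    sd is the common SD unit; score and the two means are all
    assumed to be on the same normalized scale.
    """
    beyond_meta = abs(score - metacritic_mean) > 3 * sd
    if n_reddit <= 1:
        # Lone Reddit score: only the 3 SD Meta-Critic criterion applies
        return beyond_meta
    beyond_group = abs(score - reddit_mean) > 2 * sd
    return beyond_group and beyond_meta   # both criteria must be met

# Example with an assumed SD unit of 0.45:
print(is_outlier(score=6.0, reddit_mean=8.4, metacritic_mean=8.5,
                 sd=0.45, n_reddit=4))   # clearly divergent score -> True
```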

The combined group of top Reddit Reviewers was then treated as a single reviewer category in my database. The combined Reddit score was then integrated with the other expert reviews for all whiskies in common (~200 whiskies in my database). The second pass normalization was performed in the same manner as for each individual expert reviewer described previously.

Comparison of the Reddit Reviewers to the overall Metacritic Score

Now that this category of top Reddit user reviews is properly integrated into my Whisky Database, it is interesting to see how the Reddit scores compare to those of the other experts – and whether the general trends noted earlier in Part I persist.

For the comparisons below, I am comparing the combined Reddit Reviewer scores to the revised total Meta-critic scores (which now includes this Reddit group as a single reviewer category). I will be reporting any interesting Reddit reviewer differences in terms of Standard Deviation (SD) units of that group from the overall mean.

I am happy to report that the integrated Reddit scores do not show the pattern of unusually low ranking for international whiskies, as noted previously for the broader Reddit group (i.e., these top reviewers are commensurate with the other expert reviewers here).

For the Rye category, the overall distribution of scores was not that different. There was a trend for Reddit reviewers to rank Canadian ryes lower than American ryes, but the numbers are too low to draw any significant inferences.

The Bourbon category similarly shows no consistent difference between the Reddit reviewers and the other reviewers in my database. The Jack Daniel's brand of whiskies seems somewhat less popular on Reddit, however, compared to the overall Meta-critic score (Gentleman Jack -2.2 SD, Jack Daniel's No.7 -1.1 SD, Jack Daniel's Single Barrel -0.5 SD).

The blended Scotch whisky category was scored lower overall by the Reddit reviewers compared to the expert reviewers – consistent with the earlier observation (i.e., almost all Scotch blends received a lower Reddit score than the overall Meta-critic score). Only a couple of blends stood out as being equivalently ranked by both the top Reddit reviewers and the other reviewers – the most notable being Té Bheag (pronounced CHEY-vek). Incidentally, this happens to be one of the highest ranking Scotch blends in my database. To be clear: Scotch blends get consistently lower ranks than single malts by virtually all reviewers – it’s just the absolute scoring of the normalized Reddit reviewers that is particularly lower than the others.

The single malt whiskies showed some noticeable examples of divergence between the Reddit reviewers and the overall Meta-critic scores. The clearest example of a brand that was consistently ranked lower by the Reddit group was the new Macallan “color” series (Gold -2.4 SD, Amber -1.4 SD, and Sienna -0.5 SD). To a lesser extent, a similar pattern was observed for some of the cask-finished Glenmorangie expressions (Nectar D’Or -1.1 SD, Lasanta -1.0 SD), as well as the entry-level Bowmore (12yo -1.4 SD, 15yo -1.3 SD) and Ardmore (Traditional Cask -2.1 SD) expressions.

Similarly, some brands got consistently higher scores from the top Reddit reviewers – most notably Aberlour (A’Bunadh +2.0 SD, 12yo NCF +1.8 SD, 12yo double-cask +1.6 SD, 10yo +0.4 SD). Again, to a lesser extent, other seemingly popular Reddit choices were Glenfarclas (105 NAS +1.1 SD, 17yo +0.9 SD, 12yo +0.7 SD, 10yo +0.6 SD) and Glen Garioch (Founder’s Reserve +1.3 SD, 12yo +0.8 SD, 1995 +0.4 SD).

Note that both Aberlour and Glenfarclas are generally in the heavily “winey” end of the flavour clusters, just like the relatively unpopular Macallans (and Glenmorangies). I suspect part of the issue may be the perceived value-for-money in this “winey” category. Macallan is considered especially expensive for the quality, and the new NAS “color” series are generally regarded as lower quality by most critics (and even more so by Reddit reviewers). In contrast, Aberlour remains relatively low cost for the (high) perceived quality.

In any case, those were among the most extreme examples. On most everything else, there is little obvious difference between the normalized top Reddit reviewers and the other expert panel members. Properly normalized in this fashion, they provide a useful addition to the Meta-critic database. I am happy to welcome their contribution!

UPDATE July 22, 2016:

In the year since this analysis was published, I have continued to expand my analysis of Reddit whisky reviews.  I now track over 30 Redditors, across my entire whisky database, properly normalized on a per-individual basis.

While many of the observations above remain, this larger dataset has allowed me to explore reddit reviewing in more detail.  Through correlation analyses, I have been able to refine subtypes of reviewers on the site.

Specifically, there is a core set of reviewers who show very high inter-reviewer correlations.  This group, as a whole, correlates reasonably well with the Meta-Critic score, but is really defined by how consistent they are to one another.  Many of the high-profile, prolific reviewers fall into this group. All the associations noted above apply to this group, and are strongly present (e.g., they score American whiskies consistently higher than the Meta-Critic, and Canadian whiskies consistently lower).

A second group of reviewers show relatively poor correlations to each other, the main reddit group above, and the Meta-Critic score. On closer examination however, the main reason for this discrepancy is greater individual extremes in scoring on specific whiskies or subtypes of whisky.  When properly normalized and integrated, this group demonstrates a similar whisky bias to the first group (although somewhat less pronounced, and with greater inter-reviewer variability). A number of high-profile reviewers fall into this second group.

The third group (which is the smallest of the three) is a subset of reviewers who correlate better with the Meta-Critic score than they do the two groups above.  This group appears to show similar biases to the larger catalog of expert reviewers, and not the specific cohort of reddit reviewers.

As a result of these analyses, I have expanded the contribution of reddit scores to my database by adding the average scores for each group above.  Thus, instead of having a single composite score for all of reddit on each whisky (properly normalized and fully integrated), I now track 3 separate reddit reviewer groups (each normalized and integrated for that specific group).

I believe this gives greater proportionality to the database, encompassing both the relative number of reddit reviews, and their enhanced internal consistency.


Expert vs User Reviews – Part I


As discussed on my Biases and Limitations page, I have chosen to use only established reviewers – with an extensive range of individual whiskies reviewed – when building the Whisky Database here. Please see my Reviewer Selection page for the criteria used in selecting these reviewers.

But there are a number of active online whisky communities that have member reviews, and it is worth exploring how these may relate to the properly-normalized expert panel developed here. In this commentary, I am going to explore correlations of the Whisky Database to the Reddit Scotch Whisky subgroup (“Scotchit”).

My goal here is simply to see whether or not it is worthwhile to try to incorporate this user community into my expert panel. I am personally a big fan of discussion forums, where newcomers and experts can rub shoulders and share experiences.

The Reddit Scotchit Review Archive

This Scotchit user group meets many of my established reviewer criteria, including being very active in the last few years with openly available reviews. While the main Scotchit site can be a bit daunting to navigate, you can find the full open-access review archive (with over 13,000 reviews as of July 2014, including “community reviews”) – as well as several attempts at quantitative analysis and summary of the results.

The main challenge is that individual Scotchit user reviews can vary widely in quality, experience and consistency. Scoring is also hugely variable (i.e., some members use the full range from 0-100, whereas others use the more common restricted higher range). Ideally, a proper normalization should be performed for each reviewer, but this poses considerable technical and logistical challenges for the massive review archive dataset. User Dworgi has created a user review normalizer program, but I couldn’t get it to work with the current review archive.

On that front, I should point out that while they have done an impressive job of maintaining the Scotchit review archive, it is still a community project using automated review-scraping bots. As a result, there are a certain number of errors in the database. The most significant of these are erroneous scores (likely due to the reviewer mentioning several scores in his/her review, with the automated script having trouble finding the final score). There are also structural problems with the complete dataset – for example, several hundred entries currently have missing columns or transposed columns. There are also many more cases where the same expression is listed under different titles. So if you plan to work with this archive, you will still need to do your own manual quality control checks and data curation.

Given these issues (which are generally well appreciated on the site), it is recognized that some filtering restriction of the archive is required to meaningfully interpret any summary results. One approach is to restrict to only those reviewers that meet a certain minimum number of published individual reviews (e.g., those who have done 50 or more), and to ignore community reviews. Another is to set a minimum number of user reviews required for each whisky before considering it in an analysis (e.g., 10 reviewers, as done here for an analysis by Dworgi for data up to the end of 2013). Another option is to also restrict reviews to those that meet a minimum score cut-off (e.g., neglecting reviews that score <50 out of 100). Charles Vaughn has a good interactive graphing tool using the same dataset as Dworgi, where you can dynamically adjust these cut-off values yourself on the Overview tab and see how it affects the results. This is a good tool to help you calibrate your understanding of the dataset (although it is limited in time to an early summary set).

Again, all these restrictions are done in order to try and help compensate for the wide variations in scoring, given the lack of normalization. Given my experience, I’m not sanguine about the success of these methods – as demonstrated on this site, you really need to properly normalize each reviewer’s scoring if you are to meaningfully integrate them. However, given the daunting size of the Reddit reviewer database, these simple filtering approaches are understandable – and are better than nothing. It is at least worthwhile to see if the filtering methods suggest any meaningful trends that could be followed up with a more detailed analysis.

As an aside, one potential advantage to having a very large dataset of user ratings is the possibility of using a proper Bayesian estimator. Popular in estimation/decision theory, a Bayes estimator is used to compensate for when only a small number of ratings are available on any given item in a much larger dataset (i.e., by minimizing what is known as posterior expected loss). It works nicely across extremely large datasets that have highly variable numbers of reviews (e.g., the Internet Movie Database apparently uses a Bayesian estimator). Items with a low number of reviews are given lower weighting in the analysis. Once the number of ratings reaches and surpasses a defined threshold, the confidence of the average rating surpasses the confidence of the prior knowledge, and the weighted Bayesian rating essentially becomes a regular average. Of course, any biases in the underlying dataset would still be confounding (see below), but this would definitely be something to consider if you want to mine the Scotchit database further.
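As a sketch, an IMDB-style weighted (Bayesian) rating shrinks items with few reviews toward a prior such as the archive-wide mean. All numbers below are assumptions for illustration only:

```python
def bayesian_rating(item_mean, n_ratings, prior_mean, m):
    """Weighted rating: the item's own average dominates only once
    n_ratings is large relative to the threshold m (a tunable choice)."""
    w = n_ratings / (n_ratings + m)
    return w * item_mean + (1 - w) * prior_mean

# A whisky with only 3 reviews is pulled strongly toward the prior...
print(bayesian_rating(item_mean=9.2, n_ratings=3, prior_mean=8.3, m=10))
# ...while one with 300 reviews keeps essentially its own average.
print(bayesian_rating(item_mean=9.2, n_ratings=300, prior_mean=8.3, m=10))
```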

Comparative Analysis

For this first pass analysis, I have pulled out of their current public review archive (as of July 2014) all reviews for whiskies in my dataset. This yielded >200 whiskies in common, with almost 6000 individual Scotchit reviews. A quick descriptive-statistic examination of the raw data illustrates that single malt-like and bourbon-like whiskies get generally equivalent average scores across Scotchit reviewers, but that scotch blend-like and rye-like whiskies get significantly lower scores on average. While this trend is apparent among my expert review panel as well, it is noticeably more pronounced in the Scotchit user archive. This feature is well noted on the site – i.e., many seem aware that blends (and other perceived lower quality whisky categories) are particularly devalued by the members.

At a more granular level, I note that the international malt whisky subset of my database is scored lower by the Scotchit users on average, compared to the Scottish single malts. In contrast, the expert panel used here rates these international whiskies higher than the single-malt group average. Further, I note that Canadian rye whiskies get lower scores on average in the Scotchit review archive than the American rye whiskies – whereas those two sub-categories of rye get equivalent scores among my expert panel. While relative numbers are low in these last two cases, it does suggest that the underlying biases may be different between the expert panel and Scotchit users for international products.

To explore relationships between our datasets in more detail, I have applied the same filtering method used by Scotchit users themselves when depicting their own data. For this analysis, I have used moderately stringent criteria, excluding all whiskies with <10 reviews, and any individual review score <60. This reduces the dataset to ~5400 Scotchit reviews, across ~160 whiskies in common. Again, I would prefer to use proper normalization of each of their reviewers, but the automated tool does not currently seem to be functioning.

A few observations on this restricted dataset. For one, the variation across Scotchit user reviews is much higher than across my expert review panel, on a per-review basis (even before normalization was applied to my dataset). And even after consolidating to an average score for each whisky, the variation within each reviewer group is again much higher for the Scotchit dataset. These results are not surprising given the wide variation in how scoring is applied by Scotchit users (and despite the attempt at filtering the results).

Correlation to the Meta-Critic Scores

Let’s see how a correlation of the average Scotchit score for each whisky compares to my meta-critic dataset. For this depiction, I have broken out the whiskies by category (single malt-like, scotch blend-like, bourbon-like and rye-like). Further, I have identified the flavour cluster categories for the single malt style whiskies. See my How to Read the Database page for an explanation of these terms.

Correlation to Reddit Scotchit

The overall correlation to my meta-critic score is reasonable (although lower than the average expert reviewer to the meta-critic score). You can also see that the variation increases as we move to whiskies with lower scores, as you might expect given the significant variation in how the Scotchit user base applies scores. Note that this variation is much higher than that seen in my expert reviewers (and discussed here).

Looking more closely, you can also see how the blended whiskies do indeed score relatively lower for the Scotchit users (i.e., virtually all these whiskies are below the best fit line, indicating relatively lower Scotchit scores). There are no obvious group differences among the other classes or clusters, suggesting the Scotchit users share a similar bias to my expert review panel of favouring complex whiskies over delicate ones, and heavily-smoky over lightly-smoky (in general terms).

That said, there are a number of individual whisky results that are interesting (i.e., cases where the two reviewer bases obtain significantly different results). One example is for the less heavily-peated Highland and Islay whiskies (e.g., Bowmore and Ardmore). The entry level expressions of these brands tend to get average-to-slightly-above ratings in my expert panel, compared to consistently below average ratings from the Scotchit user base. I am not clear as to the reason for this difference, but it may reflect how user reviewers tend to apply a wider scoring scale within a given class of product (i.e., while they rate heavily-peated whiskies equivalently to the expert reviews, they disproportionately rank lighter-peat whiskies lower).

There are also individual whiskies where the Scotchit rankings seem unusual. One example is the Glenmorangie Nectar D’Or. The other Glenmorangie expressions rank in similar ways between the Scotchit and meta-critic reviewers (i.e., Lasanta < 10 yr Original < Quinta Ruban < Signet). The Nectar D’Or gets an overall average score among Scotchit users (placing it in the middle of the pack), but a consistently high score among the meta-critic reviewers (i.e., equivalent to the Signet). A possible explanation for this is that the Nectar D’Or is frequently cited in some of the popular Scotchit analyses as an example of an “average” scoring whisky (going back several years now). This may thus be influencing members to consistently rank it that way, based on earlier assessments (i.e., a trend towards consistency over time).

Wrapping Up

Taken together, this analysis suggests that there may be some specific and general differences in the underlying scoring approach taken by Scotchit users and the experts used here in the meta-critic score. In particular, Scotchit users seem more critical and relatively negative toward perceived lower quality whiskies (e.g., blends) than the expert reviewers. Similarly, the users may have a relative bias in favour of UK single malts and American bourbons/ryes compared to other international jurisdictions of similar products, relative to the expert panel. Note that I am not saying the expert panel is better or worse in this regard – just that their relative systemic biases may be different (and thus difficult to integrate across users and reviewers).

There are also definite differences in how scoring is applied to whiskies, with a lot more variation among Scotchit users. However, it may be possible to correct for this with a proper normalization for each reviewer. And a proper Bayesian estimator could be used to adjust for cases where there is a low number of reviews. To date however, it seems that simpler filtering approaches have been used for most analyses of the Scotchit archive.

An underlying question to explore in more detail is the relative level of independence of reviewers. Even the expert reviewers used here are bound to be influenced to some degree (likely a variable degree) by the reviews of other experts. There are indications in the Scotchit analyses that this effect is more pronounced among the members of the user group, especially in regards to certain specific whiskies and defined clusters of whiskies. This may pose an insurmountable problem in equating expert reviews (where distinctiveness of review – within overall consistency – is highly prized) and community user reviews (where consensus may be highly prized by some, and extreme/inconsistent positions valued by others). Again, the relative value of the meta-critic analysis here is that it is drawn from samples that are as independent as possible, while striving for internal consistency (both important criteria for inferential statistics).

At the end of the day, it would not be appropriate to try and incorporate any simple summary of the Scotchit archive into the expert meta-critic score. However, there may be individual reviewers from Scotchit who have similar characteristics to the experts used here (i.e., similar relative biases and levels of independence). Indeed, there is one member that is common to both groups (the Whiskey Jug). I will continue to explore the individual reviewers in more detail, to see if there are any others that may be relevant for inclusion in the current meta-critic group.

And of course, none of this should get in your way of joining and participating in any user community you feel a connection with. They can be a great place to explore ideas with people of similar interest. 🙂

UPDATE: I have proceeded further through the dataset, and have performed proper statistical normalizations on a number of Reddit reviewers. Discussed further here.