When Reviewers Disagree

One of the more interesting questions to consider in this analysis is how to compare whiskies where reviewers disagree widely with those where everyone seems to be on the same page.

While I could just rank the combined meta-critic scores for each whisky by Flavour Cluster (or Super Cluster) and be done with it, you would be missing out on a truly important piece of data – the variation among reviewers in that summary score.

Standard Deviations Matter

To give an example as to why this is important, let us see how the variation in individual whiskies looks across all the meta-critic scores. To do this, I have plotted below the standard deviation (STDEV) for each whisky against its normalized mean meta-critic score. Each blue dot is a different whisky. I have chosen to highlight two specific whiskies for further discussion.

[Figure: Mean vs. STDEV]

The first point to notice above is that the variation across individual whiskies (blue dots) is typically higher at lower scores (although the correlation here is not great). This makes sense, since it would be hard for a really high-scoring whisky to be based on very discordant individual scores (i.e., it wouldn’t be high-scoring otherwise, as the meta-critic score is based on the average of all reviews).
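To make the two plotted quantities concrete, here is a minimal Python sketch that computes the mean and STDEV for each whisky from its individual reviewer scores. The whisky names and scores are made up purely for illustration (they are not taken from my database), and I am assuming the sample standard deviation here.

```python
from statistics import mean, stdev

# Made-up normalized scores for illustration only (not real database values).
reviews = {
    "Whisky A": [8.4, 8.5, 8.5, 8.6, 8.6, 8.7, 8.7, 8.8],  # reviewers mostly agree
    "Whisky B": [7.8, 8.0, 8.4, 8.6, 8.7, 8.9, 9.1, 9.3],  # reviewers disagree
}

def summarize(scores):
    """Return (mean, sample standard deviation, number of reviews)."""
    return mean(scores), stdev(scores), len(scores)

# Each whisky becomes one dot on the plot: x = mean, y = STDEV.
for name, scores in reviews.items():
    avg, sd, n = summarize(scores)
    print(f"{name}: mean={avg:.2f}  STDEV={sd:.2f}  n={n}")
```

Both hypothetical whiskies have the same mean (about 8.6) on the same number of reviews, yet the second has a noticeably larger STDEV. That difference is invisible if you only look at the average.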

Glenfiddich 18 vs The Glenlivet 18

But what do you make of the Glenfiddich and Glenlivet 18 year old examples pulled out above? Both whiskies share the same flavour profile (cluster E), are commonly available at the same price (currently ~$110 each at the LCBO), and received basically the same meta-critic mean score (~8.6) on the same number of reviews (n=8). But they differ considerably in how variable those reviews are: a STDEV of 0.53 vs 0.18. What does the fact that the Glenfiddich 18 has a standard deviation nearly three times as high as the Glenlivet 18 tell us?

Basically, this means that everyone agrees that the Glenlivet 18 is a slightly-above “average” whisky. Please note that close to average is not a bad thing here – it simply means that all the reviewers consistently ranked this whisky just above the middle of the pack among all the whiskies they scored (it is still quite good).

In contrast, the higher STDEV of the Glenfiddich 18 means that some reviewers loved this whisky and scored it highly, some thought it was “average”, and some gave it a relatively low score. Indeed, I have looked at the percentile rankings, and it turns out 2 reviewers ranked the Glenfiddich 18 in their 85th percentile or higher (i.e., in the top 15% of all the whiskies they reviewed), 2 reviewers ranked it in their 15th percentile or lower (i.e., bottom 15% of all whiskies they reviewed), and the rest all ranged in-between. Compare that to the Glenlivet 18, where all 8 reviewers consistently ranked it somewhere in their 40-60th percentile.
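For anyone curious how a per-reviewer percentile can be worked out, here is a rough Python sketch. The function name, the tie-handling convention, and the sample catalogue are my own assumptions for illustration, not a description of the exact calculation behind the numbers above.

```python
def percentile_rank(score, reviewer_scores):
    """Percentile (0-100) of `score` within one reviewer's own set of scores.

    Ties are counted as half below and half above (one common convention).
    """
    if not reviewer_scores:
        raise ValueError("reviewer_scores must not be empty")
    below = sum(1 for s in reviewer_scores if s < score)
    equal = sum(1 for s in reviewer_scores if s == score)
    return 100.0 * (below + 0.5 * equal) / len(reviewer_scores)

# Hypothetical catalogue: every score a single reviewer has ever given.
catalogue = [72, 75, 78, 80, 82, 84, 85, 86, 88, 90, 91, 93]
print(percentile_rank(88, catalogue))
```

Working in percentiles rather than raw scores lets you compare reviewers who use very different scales: an 88 from a generous scorer and an 88 from a harsh one can land in quite different places within their own catalogues.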

So, which of these two whiskies would you rather try? 🙂

Decisions, Decisions …

Would it surprise you to learn that a lot of other people will have a very different conclusion from yours?

It comes down to what you are looking for in a whisky. Do you want a whisky that you can be reasonably sure falls into the middle of the whole group? Or are you willing to take a gamble that you could find it to be extraordinary (like some reviewers do), for the same price? Of course, you would also be taking the gamble that you will not like it very much for that price (as some reviewers have).

More to the point, those perspectives are not limited to different individuals – you may switch between them based on circumstances. For example, consider if you were choosing between the two when ordering a single drink in a bar, or when buying a whole bottle – in both cases, not having previously tasted either one. It is conceivable that you would be more likely to give the “riskier” Glenfiddich 18 a chance on a single pour than if you were investing in a whole bottle. And even if you are a big enough gambler to prefer buying the bottle of Glenfiddich 18 “blind” for yourself, would you still be as likely to buy it as a gift for a good friend? In that circumstance, wouldn’t you be more likely to stick with the “safer”, more consistently ranked Glenlivet 18?

You see how we each weigh these options differently, depending on the context. Simply put, we each have our own internal calibration scale for risk – and different things cause us to sway more one way or another, depending on circumstance.

It is not up to me or anyone else to convince you which whisky you should try or buy. But I strongly urge you to consider ALL the relevant data summarized in my meta-critic table in making your decision. Reviewer variability (as demonstrated through STDEV) may be as important as the overall average meta-critic score in helping you make a decision.

Building the Database

A related question comes up from all of this: how many reviews are enough?

Clearly, it depends on whether the reviewers are generally concordant or discordant. If all the reviewers rate a given whisky similarly (a relatively rare phenomenon, as you can tell from the plot above), then one or two scores could be enough to give you a pretty good idea as to perceived quality. But when discordant, you need to look at a larger number of reviews – and consider both the mean score and the variance.

I have done some bootstrapping analyses on my dataset, to get a better idea of the minimum number of reviewers required in different cases. This involves randomly removing individual reviewer scores and watching how the mean meta-critic score changes. Specifically, I’ve looked to see what would cause a score to shift by 0.1 units, or what would cause a shift in or out of a given quartile for a specific flavour cluster.

I’ve found that, on average, if you only have two reviewer scores, the odds are >50% that a third review will change things at one of those levels (i.e., ±0.1 mean score, or in/out of a quartile when near the border). But by the time you reach 7 reviews, addition of an 8th review is very unlikely to have any significant impact on the mean score or inter-quartile positioning.
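The resampling idea described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name, the default number of trials, and the choice to draw reviews without replacement (a random-subsampling variant rather than a textbook bootstrap) are all hypothetical, and it covers only the ±0.1 mean-shift check, not the quartile check.

```python
import random
from statistics import mean

def shift_probability(scores, k, threshold=0.1, trials=10_000, seed=0):
    """Estimate how often adding one more review to k existing reviews
    moves the mean score by at least `threshold`."""
    if not 1 <= k < len(scores):
        raise ValueError("k must be between 1 and len(scores) - 1")
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    hits = 0
    for _ in range(trials):
        # Draw k "existing" reviews plus one "new" review, without replacement.
        drawn = rng.sample(scores, k + 1)
        before = mean(drawn[:k])  # mean of the k existing reviews
        after = mean(drawn)       # mean once the new review is added
        if abs(after - before) >= threshold:
            hits += 1
    return hits / trials

# Made-up scores for a whisky with discordant reviews (illustration only).
scores = [7.8, 8.0, 8.4, 8.6, 8.7, 8.9, 9.1, 9.3]
for k in (2, 4, 7):
    print(k, shift_probability(scores, k))
```

Running this over every whisky with enough reviews, and averaging the results for each value of k, would give a curve showing how quickly the mean settles down as reviews accumulate.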

As a result of these analyses, I’ve decided to report the meta-critic score and variance for any whisky with at least 3 reviews in my database (~350 at the time of site launch). But I urge you to treat any score with fewer than 7 reviews as preliminary, with the potential to shift as more reviews come in (especially when the variance is high). I am continually updating my database, so you can expect to see additional reviews pop up from time to time (and thus, there may be some slight movement of whiskies in the relative rankings for each flavour category over time).

Armed with this information, it’s now time to head over to my explanation of how to interpret the whisky database here.
