So You Want to Be a Whisky Reviewer …

March 5, 2016 selfbuilt

One of the first questions that comes up when someone is considering becoming a product reviewer is whether or not to provide a score – and if so, over what sort of range?

As discussed on my Flavour Commentaries page, providing a score or rating is hardly required in a product review. I personally avoid doing this in my flashlight reviewing (in part because technology is always advancing there). But if you are interested in scoring, you might find the personal observations (and data) from integrating whisky reviewer scores on this site interesting.

Scoring Systems

Most whisky reviewers tend to provide some sort of quality ranking. As explained on my Understanding Reviewer Scoring page, at its heart scoring is simply a way to rank the relative quality of all the products a given reviewer has sampled. As long as you are only looking within the catalog of reviews of that one reviewer, it doesn’t necessarily matter what category labels they are using for their rank.

A numerical score from 1-100? Fine. Star ratings from 1 to 5, with half-stars? No problem. Six gradations of recommended levels? Sure. Kumquat widths from 2.1cm to 3.3cm in 0.25cm increments? Okay, if that floats your boat. Personally, I’d love to see someone review in hexadecimal (“Man, this limited edition is much better than the regular OB version – I’ll have to give it an 0E”).

One problem with the diversity of scoring systems is that it may be hard to get a feel for how items compare to each other for a given reviewer – until you go through their whole catalog of reviews. Similarly, it would be hard to integrate the reviews of multiple reviewers on a given site (or across sites). This has led to some consolidated approaches for standardization. In the liquor industry, probably the most popular one is that developed by Robert Parker for scoring wines.

In this system, all wines receive a numerical whole number score between 50 and 100. The presumption is that anything below 50 is unfit for human consumption (i.e., swill). 50-59 is not recommended. 60-69 is below average. 70 to 79 is average. 80 to 89 is above average. 90 to 95 is outstanding. And 96-100 is extraordinary (and rare).
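For anyone who thinks better in code, the bands above can be written out as a small lookup. This is only a sketch of the ranges as described in this post; the function name and the exact label strings are my own choices for illustration, not anything official from the Parker system.

```python
def parker_band(score: int) -> str:
    """Return the label for a whole-number score, using the bands described above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 50:
        return "unfit for human consumption"
    if score <= 59:
        return "not recommended"
    if score <= 69:
        return "below average"
    if score <= 79:
        return "average"
    if score <= 89:
        return "above average"
    if score <= 95:
        return "outstanding"
    return "extraordinary"


# Print the band for a few example scores.
for example in (55, 75, 85, 92, 97):
    print(example, parker_band(example))
```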

The benefit of this system is that it is fairly easy to understand and relate to. Unfortunately, it still leads to a lot of variation in interpretation by different individuals – as shown graphically for whisky reviewers on my Understanding Reviewer Scoring page.

Still, if you were starting out as a reviewer, this isn’t a bad system to work from, as it provides a recognizable structure. But fundamentally, it is no better or worse than any other scoring system. From the perspective of someone running a meta-critic integration site, I can tell you it doesn’t really matter what you choose to use as scores/labels – what really matters is your consistency in using them.

Score distributions

Consistency of scoring actually encompasses a number of things. Is the reviewer applying scores in as fair a manner as possible across categories? Would the same product get the same score if sampled on another occasion? In other words, is the reviewer showing good internal consistency in their scoring?

Few reviewers do repeated testing of the same sample (and almost none with blinding), so it is hard to know. Whiskies are also subject to considerable batch variations (for some of the reasons discussed here), which further complicates matters if the repeated sampling is done on different batches. I recommend you check out my Review Biases and Limitations page for a discussion of some of the common pitfalls here.

But one way to address this consistency issue in the aggregate is to compare the distribution pattern of scores across reviewers. This is part of the larger correlational analyses that I did in building the Meta-Critic database.

The key points that I want to share here – as a guide for newcomers to reviewing – are:

whisky reviewers do not hand out scores in an evenly distributed manner
whisky reviewers are fairly consistent in how they deviate from a normal distribution

The above is true of all the whisky reviewers examined here, including those ostensibly using the Parker wine scoring scheme. As explained on my Understanding Reviewer Scoring page, all reviewers skew left in their distributions. This is shown graphically below in the frequency histogram of the Meta-Critic scores:

In essence, you can interpret this distribution as pretty close to what the “average” or typical reviewer in my dataset looks like. Again, see that earlier page for some examples of actual reviewers.

Note that I choose to present the Meta-Critic score using a standard scientific notation of one significant digit to the left of the decimal. Those who remember using slide rules will be able to relate. 🙂 Just multiply everything by 10 if you want to know what it would look like on the Parker scale.

Below are the current actual descriptive characteristics of the Meta-Critic score distribution.

Mean: 8.53
Median: 8.58
Standard Deviation: 0.41
Skewness: -0.63
Minimum: 6.93
Maximum: 9.52
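If you want to run the same six statistics on your own catalog of scores, a minimal Python sketch follows. The sample scores are invented purely for illustration (they are not from the Meta-Critic database), and the choice of sample standard deviation and moment-based skewness is an assumption on my part; other tools may use slightly different formulas.

```python
import statistics


def describe(scores):
    """Summarize a list of scores with the six statistics listed above."""
    mean = statistics.fmean(scores)
    pop_sd = statistics.pstdev(scores)
    # Moment-based skewness: the average cubed deviation divided by SD cubed.
    # A negative value means a longer tail toward low scores (a left skew).
    skew = sum((x - mean) ** 3 for x in scores) / (len(scores) * pop_sd ** 3)
    return {
        "mean": mean,
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),  # sample SD; an assumption
        "skewness": skew,
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical scores, invented purely for illustration.
sample = [7.1, 7.9, 8.2, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 9.1]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A mean that sits below the median, together with a negative skewness, is the same left-skewed pattern described above.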

While the Parker scoring system provides a nice idealized normal distribution in theory (i.e., min of 50, max of 100 and an average of 75) – in practice most reviewers deviate from it considerably. I suspect grade inflation has a lot to do with this, along with a desire to please readers/suppliers. But whatever the reasons, it is a common observation that all whisky reviewers seem to fit the above pattern.

So if you are starting out as a reviewer, you may want to consider trying to match your scores to a similar distribution – just so that your readers will have an easier time understanding your reviews in the context of others out there. Of course, nothing is stopping you from breaking the mold and going your own way. 😉
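One simple way to see where your existing scores would land on a comparable scale is a linear rescale to the mean and standard deviation reported above. To be clear, this is just a sketch and not how the Meta-Critic score itself is built; the function name is hypothetical, the star ratings are invented, and I am assuming a population standard deviation. A linear rescale also preserves the shape of your own distribution, so it will not reproduce the skew.

```python
import statistics


def rescale(scores, target_mean=8.53, target_sd=0.41):
    """Linearly map scores so they have the target mean and standard deviation."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)  # population SD; an assumption
    if sd == 0:
        raise ValueError("scores must not all be identical")
    # Convert each score to a z-score, then stretch and shift onto the target scale.
    return [target_mean + target_sd * (x - mean) / sd for x in scores]


# Hypothetical star ratings (1 to 5, with half-stars), invented for illustration.
stars = [2.0, 3.0, 3.5, 4.0, 4.0, 4.5, 5.0]
rescaled = rescale(stars)
for original, mapped in zip(stars, rescaled):
    print(f"{original:.1f} stars -> {mapped:.2f}")
```

The relative ranking of your whiskies is unchanged by this; only the labels move.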

Range of Whiskies

The other thing I see a lot is reviewers “revising” their score range over time – which can be a problem if they have a lot of old scores to “correct”.

The source of the problem seems to be a sampling bias: when they start out reviewing, they have experience of only budget to mid-range products. As they start reviewing higher-end products, they realize their scores are too “squished” to be properly proportional. For example, if you start out giving one of the most ubiquitous (and cheap) single malts like the Glenlivet 12 a 90+ score, that doesn’t leave you much room to maneuver as you start sampling higher quality single malts.

To help new reviewers calibrate themselves, here is how some of the more common expressions typically fall within the Meta-Critic Score, broken down by general category. Note that I’m not suggesting you bias your scores by what the consensus thinks below – but I just want to give you an idea of what the general range is out there for common whiskies that you are likely to have tried.

~7.5 whiskies
Bourbon-like: Jim Beam White, Rebel Yell, Ancient Age
Rye-like: Crown Royal, Canadian Club
Scotch-blend-like: Johnnie Walker Red, Cutty Sark, Ballantine’s Finest, Famous Grouse
Single-Malt-like: (there aren’t many that score this low)

~8.0 whiskies
Bourbon-like: Jack Daniel’s Old No. 7, Jim Beam Devil’s Cut, Wild Turkey 81
Rye-like: Royal Canadian Small Batch, Gibson’s Finest 12yo, Templeton Rye
Scotch-blend-like: Chivas Regal 12yo, Jameson Irish Whiskey, Teacher’s Highland Cream, Black Grouse
Single-Malt-like: Glenfiddich 12yo, Glenlivet 12yo, Glenrothes Select Reserve, Tomatin 12yo

~8.5 whiskies
Bourbon-like: Wild Turkey 101, Basil Hayden’s, Bulleit Bourbon, Four Roses Small Batch
Rye-like: Knob Creek Small Batch Rye, Canadian Club 100% Rye, George Dickel Rye, Forty Creek Barrel Select
Scotch-blend-like: Johnnie Walker Blue, Johnnie Walker Black, Green Spot, Té Bheag
Single-Malt-like: Old Pulteney 12yo, Glenmorangie 10yo, Dalmore 12yo, Ardmore Traditional Cask

~9.0 whiskies
Bourbon-like: Russell’s Reserve, Maker’s Mark 46, Booker’s Small Batch, W.L. Weller 12yo
Rye-like: Lot 40, Masterson’s Straight Rye 10yo, Whistlepig 10yo
Scotch-blend-like: (Not much makes it up to here, maybe Ballantine’s 17yo, Powers 12yo John’s Lane)
Single-Malt-like: Aberlour A’Bunadh, Amrut Fusion, Ardbeg 10yo, Talisker 10yo

Close to ~9.5 whiskies
Bourbon-like: Various Pappy van Winkles, some BTACs, George T. Stagg
Rye-like: High West Midwinter Night’s Dram Rye (closest Canadians: Wiser’s Legacy, Gibson’s 18yo)
Scotch-blend-like: nada
Single-Malt-like: Lagavulin 16yo, Brora 30yo, Caol Ila 30yo, Redbreast 21yo

As an aside, you may notice that some whisky categories get consistently higher or lower scores than others. As a result, I suggest you try to avoid directly comparing scores across categories (e.g. bourbons vs single malts), but focus instead on internal consistency within categories. This is why the Whisky Database is sorted by default by general category (and then flavour profile, if available), before sorting by score.
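The category-first ordering described above can be sketched in a few lines of Python. This is not the database's actual code; the tuple layout is hypothetical, the scores are just the approximate tiers from the lists above (not exact Meta-Critic values), and the flavour-profile step is left out since no profiles are given here.

```python
# (name, general category, approximate tier from the lists above)
whiskies = [
    ("Glenlivet 12yo", "Single-Malt-like", 8.0),
    ("Jim Beam White", "Bourbon-like", 7.5),
    ("Lot 40", "Rye-like", 9.0),
    ("Talisker 10yo", "Single-Malt-like", 9.0),
    ("Wild Turkey 101", "Bourbon-like", 8.5),
    ("Crown Royal", "Rye-like", 7.5),
]

# Sort by category first, then by score (highest first) within each category.
ordered = sorted(whiskies, key=lambda w: (w[1], -w[2]))

for name, category, score in ordered:
    print(f"{category:18} {score:.1f}  {name}")
```

Grouping this way keeps each comparison inside a single category, which is the point of the advice above.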

Again, the above is just a way to help you calibrate yourself against the “typical” reviewer (as expressed by the Meta-Critic score). Nothing is stopping you from going your own way.

But if anyone does decide to use kumquat widths as category labels, please drop me a line – I’d love to hear about it. 🙂

tagged with Statistics
