Expert vs User Reviews – Part II
As mentioned on that earlier page, I found an unusually high degree of consistency of scoring between Reddit reviewers on some whiskies, and evidence of systematic biases that differed from the independent expert reviewers. Scoring methods were also a lot more variable among the user reviews, although this can be partially corrected for by a proper normalization (as long as scoring remains consistent and at least somewhat normally-distributed for each reviewer).
My goal was to find Reddit reviewers who could potentially meet the level of the expert reviewer selection used here. As such, I started by filtering for only those reviewers who have performed a similar minimum number of reviews as the current experts in my Whisky Database (in order to ensure equivalent status for normalization). This meant excluding any Reddit reviewer with fewer than 55 reviews of the current ~400 whiskies in my database. As you can imagine, this restricted the number of potential candidates to only the most prolific Reddit reviewers: 15 in this case.
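The filtering step above can be sketched in a few lines of Python. This is only an illustration: the function name, the data layout (a dict mapping each reviewer to the whiskies they scored) and the sample data are assumptions, not the actual tooling behind the database. Only the threshold of 55 comes from the text.

```python
def prolific_reviewers(reviews, database, min_reviews=55):
    """Return reviewers with at least min_reviews scores for whiskies in database.

    reviews:  {reviewer: {whisky: raw_score}}  (hypothetical layout)
    database: set of whisky names currently tracked
    """
    return sorted(
        reviewer
        for reviewer, scored in reviews.items()
        if len(set(scored) & database) >= min_reviews
    )


# Toy usage with a threshold of 2 instead of 55.
toy_reviews = {
    "user_a": {"w1": 85, "w2": 90, "w9": 70},  # w9 is not in the database
    "user_b": {"w1": 80},
}
toy_database = {"w1", "w2", "w3"}
kept = prolific_reviewers(toy_reviews, toy_database, min_reviews=2)
```

Only reviews of whiskies already in the database count toward the threshold, which matches the "of the current ~400 whiskies" wording above.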
Upon examining the scores of these generally top-ranked reviewers, I identified 6 as having potential inconsistency issues in scoring. One common issue was a non-Gaussian distribution (e.g., a much longer “tail” of low scoring whiskies than high). I was able to account for this in the final analysis by slightly adjusting the normalization method at the low-end.
Of potential concern was inconsistent reviewing, where two products of similar flavour profile, price and typical mean expert scores were given widely divergent scores. Only a small number of reviewers showed issues here, but some examples include the Aberlour 12yo non-chill-filtered compared to the Balvenie DoubleWood 12yo, the Caol Ila 12yo compared to the Ardmore Traditional Cask, and the Glenfiddich 18yo compared to the Dalmore 12yo. I found reviewers who placed those exact pairings at the extreme ends of their complete review catalogue (i.e., ranked among the best and worst of all whiskies reviewed by that individual).
To be clear, this is not really a problem when perusing the Reddit site. As long as you are looking at whiskies within a given flavour cluster, you are likely still getting a clear relative rank from these reviewers. It is only when trying to assemble a consistent ranking across all flavour classes of whiskies that inconsistent review scores from a given reviewer become a potential issue. As explained in my article discussing how the metacritic score is created, scoring is simply a way to establish a personal rank for each reviewer.
Fortunately, these instances were fairly rare, even for the reviewers in question. In most cases, it was the low-ranked whisky that was disproportionately un-favoured for some reason. If significantly discordant with the rest of the database, these could be accounted for in the normalization by excluding a small number of statistically-defined outliers (using the standards described below).
Correlation of Reddit Reviewers
Independence of review appears to be lower among the Reddit reviewers than among the expert reviewer panel used here. Reddit reviewers often reference the scores and comments of other users in their own reviews. This tends to lead to some harmonization of scoring, perpetuating dominant views on a number of whiskies. Indeed, the variance on many of the well-known (and heavily reviewed) expressions was lower for normalized Reddit reviewers than for the expert reviewer panel. Also, the average of all significant correlation pairings across the 15 Reddit reviewers was higher than among the expert review panel (r=0.60 vs 0.40). Interestingly, the one Reddit reviewer who seemed the most independent from the others (r=0.37 on average to the others) correlated the closest with my existing Meta-critic score before integration (r=0.72).
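For readers who want to try this kind of comparison on their own data, here is a minimal sketch of averaging pairwise Pearson correlations across reviewers, using only the whiskies each pair has in common. The function names, the data layout and the `min_shared` cutoff are assumptions for illustration. In particular, the analysis above averaged only the statistically significant pairings, whereas this sketch substitutes a simple minimum-overlap rule.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson r of two equal-length lists; None if either list has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)


def mean_pairwise_r(scores, min_shared=3):
    """Average r over all reviewer pairs sharing at least min_shared whiskies.

    scores: {reviewer: {whisky: score}}  (hypothetical layout)
    """
    rs = []
    for a, b in combinations(scores, 2):
        shared = sorted(set(scores[a]) & set(scores[b]))
        if len(shared) < min_shared:
            continue  # too few whiskies in common (stand-in for a significance test)
        r = pearson([scores[a][w] for w in shared],
                    [scores[b][w] for w in shared])
        if r is not None:
            rs.append(r)
    return sum(rs) / len(rs) if rs else None


# Toy usage: two reviewers who rank three whiskies identically.
toy = {
    "user_a": {"w1": 80, "w2": 85, "w3": 90},
    "user_b": {"w1": 70, "w2": 80, "w3": 90},
}
toy_r = mean_pairwise_r(toy)
```

Because Pearson's r is unaffected by each reviewer's scale and offset, it can be computed on raw scores without normalizing first.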
I also noticed a strong correlation in the selection of whiskies reviewed among Reddit reviewers – which was again much higher than among my independent expert reviewers. This was initially surprising, given the wide geographic distribution of Reddit users. But on further examination, I discovered that lead Reddit reviewers typically share samples with one another through trades and swaps. This can of course further reduce the independence of the individual reviews.
As a result of this analysis, I decided to combine these 15 Reddit reviewers into one properly normalized reviewer category for my Whisky Database (i.e., a single combined “Reddit Reviewer” category). [Revised, See Update later in this review]
Normalization Method for Reddit Reviewers
For this analysis, each Reddit reviewer was individually normalized to the overall population mean and standard deviation (SD) of my current expert review panel. The normalized scores for these individual Reddit reviewers were then averaged across each individual whisky they had in common, to create the combined Reddit Reviewer category. On average, there were n=5 individual Reddit reviewers per whisky. As expected, the SD for the Reddit group of reviewers was lower on average than among my current expert panel.
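As a rough illustration of the two steps just described, the sketch below rescales each reviewer's raw scores to a target mean and SD, then averages the normalized scores per whisky. The function names, the use of the population SD, and the target values in the usage example are all assumptions. The real procedure also adjusts the low end for skewed reviewers, which is not shown here.

```python
from statistics import mean, pstdev


def normalize(raw, target_mean, target_sd):
    """Rescale one reviewer's raw scores to the target mean and SD.

    raw: {whisky: raw_score}; assumes the reviewer's scores are not all identical.
    """
    values = list(raw.values())
    m, sd = mean(values), pstdev(values)
    return {w: target_mean + target_sd * (s - m) / sd for w, s in raw.items()}


def combine(reviewers, target_mean, target_sd):
    """Average the normalized scores per whisky across all reviewers who scored it."""
    pooled = {}
    for raw in reviewers.values():
        for whisky, score in normalize(raw, target_mean, target_sd).items():
            pooled.setdefault(whisky, []).append(score)
    return {whisky: mean(vals) for whisky, vals in pooled.items()}


# Toy usage: one reviewer scores out of 100, the other out of 10.
# The target mean (8.5) and SD (0.5) are placeholders, not the real panel values.
toy_reviewers = {
    "user_a": {"w1": 60, "w2": 80, "w3": 100},
    "user_b": {"w1": 5, "w2": 7, "w3": 9},
}
combined = combine(toy_reviewers, target_mean=8.5, target_sd=0.5)
```

After rescaling, the two toy reviewers agree exactly despite their different raw scales, which is the point of normalizing before averaging.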
To deal with any inconsistent scoring patterns, I used fairly stringent criteria to isolate and remove outlying scores. To be considered an outlier, the individual normalized Reddit reviewer score had to differ from the average Reddit reviewer score for that whisky by more than 2 SD units, AND had to exceed the existing Meta-critic score by more than 3 SD units. In cases where only one Reddit reviewer score was available, exclusion was based solely on the 3 SD unit criterion from the existing Meta-Critic mean score. This resulted, on average, in 0.8 outlier scores being removed from each Reddit reviewer (i.e., fewer than one outlier per reviewer).
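The two-part outlier rule can be written as a small predicate. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the SD unit is taken to be a single value supplied by the caller, and both thresholds are applied to the absolute difference (the text does not say whether only one direction counts).

```python
def is_outlier(score, reddit_mean, metacritic, sd, n_reddit):
    """Flag a normalized Reddit score under the two-part rule described above.

    score:       the individual reviewer's normalized score for this whisky
    reddit_mean: average normalized Reddit score for this whisky
    metacritic:  existing Meta-critic score for this whisky
    sd:          size of one SD unit (assumed: the expert-panel SD)
    n_reddit:    number of Reddit reviewer scores available for this whisky
    """
    far_from_meta = abs(score - metacritic) > 3 * sd
    if n_reddit <= 1:
        # Only one Reddit score: rely on the 3 SD criterion alone.
        return far_from_meta
    far_from_reddit = abs(score - reddit_mean) > 2 * sd
    return far_from_reddit and far_from_meta


# Toy usage with a placeholder SD unit of 0.5.
flag_both = is_outlier(6.0, reddit_mean=8.5, metacritic=8.6, sd=0.5, n_reddit=5)
flag_one = is_outlier(7.4, reddit_mean=8.5, metacritic=8.6, sd=0.5, n_reddit=5)
flag_single = is_outlier(7.0, reddit_mean=7.0, metacritic=8.6, sd=0.5, n_reddit=1)
```

Requiring both conditions keeps the rule conservative: a score that disagrees with the other Reddit reviewers but still sits near the Meta-critic score is retained.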
The combined group of top Reddit Reviewers was then treated as a single reviewer category in my database. The combined Reddit score was then integrated with the other expert reviews for all whiskies in common (~200 whiskies in my database). The second pass normalization was performed in the same manner as for each individual expert reviewer described previously.
Comparison of the Reddit Reviewers to the overall Metacritic Score
Now that this category of top Reddit user reviews is properly integrated into my Whisky Database, it is interesting to see how the Reddit scores compare to the other experts, and whether the general trends noted earlier in Part I persist.
For the comparisons below, I am comparing the combined Reddit Reviewer scores to the revised total Meta-critic scores (which now include this Reddit group as a single reviewer category). I will be reporting any interesting Reddit reviewer differences in terms of Standard Deviation (SD) units of that group from the overall mean.
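To make the SD-unit figures in the following paragraphs concrete, here is one plausible way such a value could be computed. The function name and the example numbers are hypothetical, and which SD is used as the unit is an assumption, since the text does not specify it.

```python
def sd_units(reddit_score, metacritic_score, sd):
    """Signed gap between the Reddit score and the Meta-critic score, in SD units.

    Negative values mean the Reddit group scored the whisky lower.
    """
    return (reddit_score - metacritic_score) / sd


# Toy usage: a whisky scored 1.0 point lower by Reddit, with an SD unit of 0.5.
gap = sd_units(reddit_score=7.5, metacritic_score=8.5, sd=0.5)
```

Under this reading, a reported value such as -2.2 SD means the combined Reddit score sits 2.2 SD units below the overall Meta-critic score for that whisky.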
I am happy to report that the integrated Reddit scores do not show the pattern of unusually low ranking for international whiskies, as noted previously for the broader Reddit group (i.e., these top reviewers are commensurate with the other expert reviewers here).
For the Rye category, the overall distribution of scores was not that different. There was a trend for Reddit reviewers to rank Canadian ryes lower than American ryes, but the numbers are too low to draw any significant inferences.
The Bourbon category similarly shows no consistent difference between the Reddit reviewers and the other reviewers in my database. The Jack Daniel’s brand of whiskies seems somewhat less popular on Reddit, however, compared to the overall Meta-critic score (Gentleman Jack -2.2 SD, Jack Daniel’s No.7 -1.1 SD, Jack Daniel’s Single Barrel -0.5 SD).
The blended Scotch whisky category was scored lower overall by the Reddit reviewers compared to the expert reviewers – consistent with the earlier observation (i.e., almost all Scotch blends received a lower Reddit score than the overall Meta-critic score). Only a couple of blends stood out as being equivalently ranked by both the top Reddit reviewers and the other reviewers – the most notable being Té Bheag (pronounced CHEY-vek). Incidentally, this happens to be one of the highest ranking Scotch blends in my database. To be clear: Scotch blends get consistently lower ranks than single malts by virtually all reviewers – it’s just the absolute scoring of the normalized Reddit reviewers that is particularly lower than the others.
The single malt whiskies showed some noticeable examples of divergence between the Reddit reviewers and the overall Meta-critic scores. The clearest example of a brand that was consistently ranked lower by the Reddit group was the new Macallan “color” series (Gold -2.4 SD, Amber -1.4 SD, and Sienna -0.5 SD). To a lesser extent, a similar pattern was observed for some of the cask-finished Glenmorangie expressions (Nectar D’Or -1.1 SD, Lasanta -1.0 SD), and the entry-level Bowmore (12yo -1.4 SD, 15yo -1.3 SD), and Ardmore (Traditional Cask -2.1 SD) expressions.
Similarly, some brands got consistently higher scores from the top Reddit reviewers – most notably Aberlour (A’Bunadh +2.0 SD, 12yo NCF +1.8 SD, 12yo double-cask +1.6 SD, 10yo +0.4 SD). Again, to a lesser extent, other seemingly popular Reddit choices were Glenfarclas (105 NAS +1.1 SD, 17yo +0.9 SD, 12yo +0.7 SD, 10yo +0.6 SD) and Glen Garioch (Founder’s Reserve +1.3 SD, 12yo +0.8 SD, 1995 +0.4 SD).
Note that both Aberlour and Glenfarclas are generally in the heavily “winey” end of the flavour clusters, just like the relatively unpopular Macallans (and Glenmorangies). I suspect part of the issue may be the perceived value-for-money in this “winey” category. Macallan is considered especially expensive for the quality, and the new NAS “color” series are generally regarded as lower quality by most critics (and even more so by Reddit reviewers). In contrast, Aberlour remains relatively low cost for the (high) perceived quality.
In any case, those were among the most extreme examples. On most everything else, there is little obvious difference between the normalized top Reddit reviewers and the other expert panel members. Properly normalized in this fashion, they provide a useful addition to the Meta-critic database. I am happy to welcome their contribution!
UPDATE July 22, 2016:
In the year since this analysis was published, I have continued to expand my analysis of Reddit whisky reviews. I now track over 30 Redditors, across my entire whisky database, properly normalized on a per-individual basis.
While many of the observations above remain, this larger dataset has allowed me to explore reddit reviewing in more detail. Through correlation analyses, I have been able to refine subtypes of reviewers on the site.
Specifically, there is a core set of reviewers who show very high inter-reviewer correlations. This group, as a whole, correlates reasonably well with the Meta-Critic score, but is really defined by how consistent its members are with one another. Many of the high-profile, prolific reviewers fall into this group. All the associations noted above apply to this group, and are strongly present (e.g., they score American whiskies consistently higher than the Meta-Critic, and Canadian whiskies consistently lower).
A second group of reviewers show relatively poor correlations to each other, the main reddit group above, and the Meta-Critic score. On closer examination however, the main reason for this discrepancy is greater individual extremes in scoring on specific whiskies or subtypes of whisky. When properly normalized and integrated, this group demonstrates a similar whisky bias to the first group (although somewhat less pronounced, and with greater inter-reviewer variability). A number of high-profile reviewers fall into this second group.
The third group (which is the smallest of the three) is a subset of reviewers who correlate better with the Meta-Critic score than they do with the two groups above. This group appears to show similar biases to the larger catalog of expert reviewers, and not those of the specific cohort of reddit reviewers.
As a result of these analyses, I have expanded the contribution of reddit scores to my database by adding the average scores for each group above. Thus, instead of having a single composite score for all of reddit on each whisky (properly normalized and fully integrated), I now track 3 separate reddit reviewer groups (each normalized and integrated for that specific group).
I believe this gives greater proportionality to the database, encompassing both the relative number of reddit reviews, and their enhanced internal consistency.