Expert vs User Reviews – Part I

As discussed on my Biases and Limitations page, I have chosen to use only established reviewers – with an extensive range of individual whiskies reviewed – when building the Whisky Database here. Please see my Reviewer Selection page for the criteria used in selecting these reviewers.
But there are a number of active online whisky communities with member reviews, and it is worth exploring how these relate to the properly-normalized expert panel developed here. In this commentary, I am going to explore how the Whisky Database correlates with reviews from the Reddit Scotch Whisky community (“Scotchit”).
My goal here is simply to see whether it is worthwhile to try to incorporate this user community into my expert panel. I am personally a big fan of discussion forums, where newcomers and experts can rub shoulders and share experiences.
The Reddit Scotchit Review Archive
This Scotchit user group meets many of my established reviewer criteria, including being very active in the last few years with openly available reviews. While the main Scotchit site can be a bit daunting to navigate, you can find the full open-access review archive (with over 13,000 reviews as of July 2014, including “community reviews”), as well as several attempts at quantitative analysis and summaries of the results.
The main challenge is that individual Scotchit user reviews can vary widely in quality, experience and consistency. Scoring is also hugely variable (e.g., some members use the full range from 0-100, whereas others use the more common restricted upper range). Ideally, a proper normalization should be performed for each reviewer, but this poses considerable technical and logistical challenges for the massive review archive dataset. User Dworgi has created a user review normalizer program, but I couldn’t get it to work with the current review archive.
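For readers who want to experiment themselves, here is a minimal sketch of what a per-reviewer normalization could look like in Python/pandas. To be clear, this is not Dworgi’s actual tool, just an illustration of the general idea; the file and column names (scotchit_archive.csv, reviewer, whisky, score) are placeholders for however you export the archive, and the minimum review count is purely illustrative.

```python
import pandas as pd

# Hypothetical flat export of the archive: one row per review,
# with placeholder 'reviewer', 'whisky' and 'score' columns
reviews = pd.read_csv("scotchit_archive.csv")

# Keep only reviewers with enough reviews to estimate their personal scoring spread
# (the threshold of 20 is purely illustrative)
counts = reviews.groupby("reviewer")["score"].transform("count")
reviews = reviews[counts >= 20].copy()

# Z-score each reviewer against their own mean and standard deviation, so a 75 from
# a harsh scorer and a 90 from a generous one land in comparable territory
reviews["score_z"] = reviews.groupby("reviewer")["score"].transform(
    lambda s: (s - s.mean()) / s.std()
)

# Optionally map back onto a familiar 0-100-style scale, centred on the archive-wide mean
reviews["score_norm"] = reviews["score_z"] * reviews["score"].std() + reviews["score"].mean()
```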
On that front, I should point out that while they have done an impressive job of maintaining the Scotchit review archive, it is still a community project using automated review-collecting bots. As a result, there are a certain number of errors in the database. The most significant of these are erroneous scores (likely due to the reviewer mentioning several scores in his/her review, with the automated script having trouble finding the final score). There are also structural problems with the complete dataset (for example, several hundred entries currently have missing or transposed columns), and many more cases where the same expression is listed under different titles. So if you plan to work with this archive, you will still need to do your own manual quality control checks and data curation.
Given these issues (which are generally well appreciated on the site), it is recognized that some filtering of the archive is required to meaningfully interpret any summary results. One approach is to restrict to only those reviewers who meet a certain minimum number of published individual reviews (e.g., those who have done 50 or more), and to ignore community reviews. Another is to set a minimum number of user reviews required for each whisky before considering it in an analysis (e.g., at least 10 reviews, as done here in an analysis by Dworgi of data up to the end of 2013). Another option is to also restrict reviews to those that meet a minimum score cut-off (e.g., neglecting reviews that score <50 out of 100). Charles Vaughn has a good interactive graphing tool using the same dataset as Dworgi, where you can dynamically adjust these cut-off values yourself on the Overview tab and see how it affects the results. This is a good tool to help you calibrate your understanding of the dataset (although it is limited to an earlier snapshot of the data).
Again, all these restrictions are attempts to compensate for the wide variations in scoring, given the lack of normalization. In my experience, I’m not sanguine about the success of these methods – as demonstrated on this site, you really need to properly normalize each reviewer’s scoring if you are to meaningfully integrate scores across reviewers. However, given the daunting size of the Reddit review archive, these simple filtering approaches are understandable – and are better than nothing. It is at least worthwhile to see if the filtering methods suggest any meaningful trends that could be followed up with a more detailed analysis.
As an aside, one potential advantage to having a very large dataset of user ratings is the possibility of using a proper Bayesian estimator. Popular in estimation/decision theory, a Bayes estimator (which minimizes the posterior expected loss) can compensate for items that have only a small number of ratings within a much larger dataset. It works nicely across extremely large datasets that have highly variable numbers of reviews (e.g., the Internet Movie Database apparently uses a Bayesian estimator). Items with a low number of reviews are given lower weighting in the analysis. Once the number of ratings reaches and surpasses a defined threshold, the confidence in the average rating surpasses the confidence in the prior knowledge, and the weighted Bayesian rating essentially becomes a regular average. Of course, any biases in the underlying dataset would still be confounding (see below), but this would definitely be something to consider if you want to mine the Scotchit database further.
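To make that concrete, here is a minimal sketch of the IMDb-style weighted rating (one simple and commonly cited form of this kind of Bayesian shrinkage toward a prior mean), using the same hypothetical archive layout as in the earlier sketch. The prior weight m = 10 is purely an illustrative choice, not a value taken from any Scotchit analysis.

```python
import pandas as pd

reviews = pd.read_csv("scotchit_archive.csv")  # hypothetical 'reviewer', 'whisky', 'score' layout

# Per-whisky review counts (v) and raw mean scores (R)
summary = reviews.groupby("whisky")["score"].agg(n="count", raw_mean="mean")

C = reviews["score"].mean()  # prior: the mean score across the entire archive
m = 10                       # prior weight: roughly the review count where the data overtakes the prior

# Weighted (Bayesian) rating: (v*R + m*C) / (v + m)
# Whiskies with few reviews are pulled toward the archive mean; well-reviewed
# whiskies converge to their own raw average
summary["bayes_score"] = (summary["n"] * summary["raw_mean"] + m * C) / (summary["n"] + m)
```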
Comparative Analysis
For this first-pass analysis, I have pulled from their current public review archive (as of July 2014) all reviews for whiskies in my dataset. This yielded >200 whiskies in common, with almost 6000 individual Scotchit reviews. A quick descriptive statistical examination of the raw data shows that single malt-like and bourbon-like whiskies get generally equivalent average scores across Scotchit reviewers, but that scotch blend-like and rye-like whiskies get significantly lower scores on average. While this trend is apparent among my expert review panel as well, it is noticeably more pronounced in the Scotchit user archive. This feature is well noted on the site – i.e., many seem aware that blends (and other perceived lower-quality whisky categories) are particularly devalued by the members.
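For those who want to reproduce this kind of first-pass comparison, the grouping itself is straightforward in pandas. The file and column names below (whisky_database.csv, category, metacritic_score) are hypothetical stand-ins for my own database export, which is not distributed in this exact form.

```python
import pandas as pd

reviews = pd.read_csv("scotchit_archive.csv")    # hypothetical 'reviewer', 'whisky', 'score' layout
metacritic = pd.read_csv("whisky_database.csv")  # hypothetical 'whisky', 'category', 'metacritic_score' layout

# Raw per-whisky Scotchit averages, joined against the meta-critic categories
scotchit_raw = reviews.groupby("whisky")["score"].mean().rename("scotchit_raw_avg").reset_index()
merged = metacritic.merge(scotchit_raw, on="whisky", how="inner")

# Average score (and whisky count) by broad category, for each reviewer group on its own scale
print(merged.groupby("category")[["scotchit_raw_avg", "metacritic_score"]].agg(["mean", "count"]))
```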
At a more granular level, I note that the international malt whisky subset of my database is scored lower by the Scotchit users on average, compared to the Scottish single malts. In contrast, the expert panel used here rates these international whiskies higher than the single-malt group average. Further, I note that Canadian rye whiskies get lower scores on average in the Scotchit review archive than the American rye whiskies – whereas those two sub-categories of rye get equivalent scores among my expert panel. While the relative numbers are low in these last two cases, this does suggest that the underlying biases for international products may differ between the expert panel and Scotchit users.
To explore relationships between our datasets in more detail, I have applied the same filtering method used by Scotchit users themselves when depicting their own data. For this analysis, I have used moderately stringent criteria, excluding all whiskies with <10 reviews, and any individual review score <60. This reduces the dataset to ~5400 Scotchit reviews, across ~160 whiskies in common. Again, I would prefer to apply a proper normalization for each of their reviewers, but the automated tool does not currently seem to be functioning.
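The filtering itself is only a few lines of pandas, continuing the hypothetical archive layout from the earlier sketches:

```python
import pandas as pd

reviews = pd.read_csv("scotchit_archive.csv")  # hypothetical 'reviewer', 'whisky', 'score' layout

# Drop individual reviews scoring below 60, then drop whiskies left with fewer than 10 reviews
filtered = reviews[reviews["score"] >= 60]
filtered = filtered[filtered.groupby("whisky")["score"].transform("count") >= 10]

# Consolidated per-whisky average Scotchit score on the filtered set
scotchit_avg = filtered.groupby("whisky")["score"].mean().rename("scotchit_avg").reset_index()
```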
A few observations on this restricted dataset. For one, the variation across Scotchit user reviews is much higher than across my expert review panel, on a per-review basis (even before normalization was applied to my dataset). And even after consolidating to an average score for each whisky, the variation within each reviewer group remains much higher for the Scotchit dataset. These results are not surprising, given the wide variation in how scoring is applied by Scotchit users (despite the attempt at filtering the results).
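If you want to check the Scotchit side of this for yourself, the spread at both levels falls directly out of the filtered tables in the sketch above (the expert panel side would be computed analogously on my own database, which is not shown here):

```python
# Spread of individual review scores, and of the consolidated per-whisky averages,
# carrying forward 'filtered' and 'scotchit_avg' from the sketch above
per_review_sd = filtered["score"].std()
per_whisky_sd = scotchit_avg["scotchit_avg"].std()
print(f"SD per review: {per_review_sd:.1f}; SD per consolidated whisky average: {per_whisky_sd:.1f}")
```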
Correlation to the Meta-Critic Scores
Let’s see how the average Scotchit score for each whisky correlates with my meta-critic dataset. For this depiction, I have broken out the whiskies by category (single malt-like, scotch blend-like, bourbon-like and rye-like). Further, I have identified the flavour cluster categories for the single malt-style whiskies. See my How to Read the Database page for an explanation of these terms.
The overall correlation to my meta-critic score is reasonable (although lower than the average expert reviewer’s correlation to the meta-critic score). You can also see that the variation increases as we move to whiskies with lower scores, as you might expect given the significant variation in how the Scotchit user base applies scores. Note that this variation is much higher than that seen among my expert reviewers (and discussed here).
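For completeness, the headline correlation is easy to compute by carrying forward the hypothetical metacritic table and filtered scotchit_avg averages from the sketches above (again, these names are purely illustrative):

```python
# Carrying forward 'metacritic' and 'scotchit_avg' from the earlier sketches
common = metacritic.merge(scotchit_avg, on="whisky", how="inner")

# Pearson correlation between the filtered Scotchit averages and the meta-critic scores
r = common["scotchit_avg"].corr(common["metacritic_score"])
print(f"Pearson r = {r:.2f} across {len(common)} whiskies in common")
```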
Looking more closely, you can also see how the blended whiskies do indeed score relatively lower for the Scotchit users (i.e., virtually all of these whiskies fall below the best-fit line, indicating relatively lower Scotchit scores). There are no obvious group differences among the other classes or clusters, suggesting the Scotchit users share a similar bias with my expert review panel, favouring complex whiskies over delicate ones, and heavily-smoky over lightly-smoky (in general terms).
That said, there are a number of individual whisky results that are interesting (i.e., cases where the two reviewer bases obtain significantly different results). One example is the less heavily-peated Highland and Islay whiskies (e.g., Bowmore and Ardmore). The entry-level expressions of these brands tend to get average-to-slightly-above ratings from my expert panel, compared to consistently below-average ratings from the Scotchit user base. I am not clear on the reason for this difference, but it may reflect how user reviewers tend to apply a wider scoring scale within a given class of product (i.e., while they rate heavily-peated whiskies equivalently to the expert reviews, they disproportionately rank lighter-peated whiskies lower).
There are also individual whiskies where the Scotchit rankings seem unusual. One example is the Glenmorangie Nectar D’Or. The other Glenmorangie expressions rank in similar ways between the Scotchit and meta-critic reviewers (i.e., Lasanta < 10 yr Original < Quinta Ruban < Signet). The Nectar D’Or gets an overall average score among Scotchit users (placing it in the middle of the pack), but a consistently high score among the meta-critic reviewers (i.e., equivalent to the Signet). A possible explanation for this is that the Nectar D’Or is frequently cited in some of the popular Scotchit analyses as an example of an “average” scoring whisky (going back several years now). This may thus be influencing members to consistently rank it that way, based on earlier assessments (i.e., a trend towards consistency over time).
Wrapping Up
Taken together, this analysis suggests that there may be some specific and general differences in the underlying scoring approaches taken by Scotchit users and the experts used here in the meta-critic score. In particular, Scotchit users seem more critical and relatively negative toward perceived lower-quality whiskies (e.g., blends) than the expert reviewers. Similarly, the users may have a relative bias in favour of UK single malts and American bourbons/ryes compared to similar products from other international jurisdictions, relative to the expert panel. Note that I am not saying the expert panel is better or worse in this regard – just that the relative systematic biases may be different (and thus difficult to integrate across users and reviewers).
There are also definite differences in how scoring is applied to whiskies, with a lot more variation among Scotchit users. However, it may be possible to correct for this with a proper normalization for each reviewer. And a proper Bayesian estimator could be used to adjust for cases where there is a low number of reviews. To date however, it seems that simpler filtering approaches have been used for most analyses of the Scotchit archive.
An underlying question to explore in more detail is the relative level of independence of reviewers. Even the expert reviewers used here are bound to be influenced to some degree (likely a variable degree) by the reviews of other experts. There are indications in the Scotchit analyses that this effect is more pronounced among the members of the user group, especially in regards to certain specific whiskies and defined clusters of whiskies. This may pose an insurmountable problem in equating expert reviews (where distinctiveness of review – within overall consistency – is highly prized) and community user reviews (where consensus may be highly prized by some, and extreme/inconsistent positions valued by others). Again, the relative value of the meta-critic analysis here is that it is drawn from samples that are as independent as possible, while striving for internal consistency (both important criteria for inferential statistics).
At the end of the day, it would not be appropriate to try and incorporate any simple summary of the Scotchit archive into the expert meta-critic score. However, there may be individual reviewers from Scotchit who have similar characteristics to the experts used here (i.e., similar relative biases and levels of independence). Indeed, there is one member who is common to both groups (the Whiskey Jug). I will continue to explore the individual reviewers in more detail, to see if there are any others who may be relevant for inclusion in the current meta-critic group.
And of course, none of this should get in the way of your joining and participating in any user community you feel a connection with. They can be a great place to explore ideas with people of similar interests. 🙂
UPDATE: I have since proceeded further through the dataset, and performed proper statistical normalizations on a number of Reddit reviewers. This is discussed further here.