{D} 118: Four stars? Kill me.
more than you ever wanted to know about ranking, scoring, and starring things
// I worked on a product this year that numerically scored the risk in a business process (think: air quality index, but for logistics)
// Scores are common and useful! (Uber, Yelp, etc) A score is easy to grok, standardizes communication, and enhances rule-based workflows for the system
// But complex scores and ranges can hurt more than they help: a score won't be useful unless the user understands the score's inputs, the significance of the output, and the score's rate of change
// Anyway, I researched and wrote so much about scores during the gig that I thought I'd vomit all of that info here. I hope, but make no promises, that what's below is in some semblance of order
Humans love to score things
Rotten Tomatoes for movies, Goodreads for books, Yelp for businesses.
We have driver scores for Uber, which are a good indicator of how conscientious your driver will be, and rider scores for Uber, which are a good indicator of whether your mama raised you right.
Scores are fun. They’re easy to read. They’re wildly reductive, and that’s why we love them. They simplify complexity, helping you make a quick decision according to a number, or a star, or a tomato.
And they’re everywhere, in consumer apps and beyond.
Walk Scores for your neighborhood, AQI for your air, credit scores for the likelihood of you missing your car payment, and SAT scores for determining your aptitude for filling bubble forms for endless hours.
Over the years I’ve worked on several projects where the Big Idea has been to introduce a score.
One I didn’t work on, but which I watched being born with some bemusement (many years ago now), was The Feast—a site from NBC Local Media that, for the brief moment it existed, scraped restaurant reviews and other ratings from around the web, then listed all restaurants with a “Feast Score” from 0 to 100.
NBC built the site and created the algorithm and then realized, thanks to their lawyers, that they weren’t legally allowed to scrape the data they needed to inform that algorithm.
So not a feast for page views, but certainly a feast for billable hours.
What is a score?
In consumer apps, a score is a numerical rating that represents user preference, reputation, or content quality. Think: five stars.
Scores typically serve as a quantifiable metric for comparison, gamification, or content ranking.
More simply, they translate vibes, converting the qualitative to the quantitative.
This is true whether you’re scoring something obviously subjective (movies, books, restaurants) or something that has the pretension of being the lord’s objective truth (e.g., judging risk in a supply chain). Scores are enumerated sentiments. Feelings, stacked and racked.
Vibes → Numbers
Feelings → Facts
Well not quite facts, but close enough for government work.
There are three common scoring methods
Which method a site uses depends on the behavior it’s trying to encourage
There are three common scoring methods used in consumer apps: Binary, Five Stars, and Ten Stars.
Each method encourages different behavior, and each has different pros and cons for the user and the system.
Generally speaking, a simpler method encourages more interactivity but will provide less accurate recommendations. A complex method carries more cognitive load, but will result in more accurate scores (both individually and in the aggregate).
A binary system (e.g., upvotes or downvotes) is stupid easy to use, but the resulting recommendations won’t benefit from granularity. This is why, for example, Netflix asks you to upvote or downvote content—that’s the system casual viewers are most likely to use.
A five-star system requires a little more judgment from the user, and the resulting recommendations benefit from granularity. These scores are more common when the user is more invested in their experience, and the system requires more input to make better recommendations: see, for example, Airbnb.
A ten-star system asks the user to rate content in half-star increments, and the resulting recommendations are granular and nuanced. This is why, for example, Letterboxd uses ten stars: arguments can be made (and are made) about highly specific criteria the users care about.
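If you wanted to model that tradeoff in code, a minimal sketch might look like the following (TypeScript, with type names of my own invention, not any real app’s API):

```typescript
// The three scoring methods, modeled as types. Illustrative only.
type BinaryRating = "up" | "down";       // Netflix-style: lowest friction
type FiveStarRating = 1 | 2 | 3 | 4 | 5; // Airbnb-style: moderate granularity
type TenStarRating =                     // Letterboxd-style: half-star steps
  0.5 | 1 | 1.5 | 2 | 2.5 | 3 | 3.5 | 4 | 4.5 | 5;

// The tradeoff in one line: fewer choices means more participation;
// more choices means more signal per rating.
```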
Scores power recommendation engines
Scores help users and systems make decisions.
In most consumer apps, a scoring mechanism is used to power a recommendation engine. User ranks a thing, recommendation engine provides more things like that thing (or not).
A brief and incomplete but also useful accounting of a score’s utility, in consumer systems:
Scores foster user engagement. Click button, see stars!
Scores influence content discovery. People who gave X five stars also liked Y.
Scores build community—because, depending on the domain, users tend to trust peer reviews more than critics.
For a user, a score creates comparative value, helping them to make informed decisions. This book is better than that book, according to <waves hands> people who’ve read the book.
For a system, a score facilitates standardization and rule-based workflows. This book is better than that book, according to <points to database> users who use this system. These scores are aggregates, which compels users to trust them more, and which allows e.g. Goodreads to order books in lists and suggest similar books, or e.g. Uber to charge more for highly rated drivers during surge pricing.
IOW, scores inform system rules and drive usage.
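To make that concrete, here’s a toy sketch of how explicit scores can feed both an aggregate and a “people also liked” rule. Every name here is hypothetical, and real recommendation engines are far more involved:

```typescript
// A toy recommendation rule built on explicit scores. Illustrative only.
interface Rating {
  userId: string;
  itemId: string;
  stars: number; // 1 to 5
}

// The aggregate: the number users see next to an item.
function averageStars(ratings: Rating[], itemId: string): number {
  const relevant = ratings.filter((r) => r.itemId === itemId);
  if (relevant.length === 0) return 0;
  return relevant.reduce((sum, r) => sum + r.stars, 0) / relevant.length;
}

// The rule: "people who gave X four-plus stars also liked Y."
function alsoLiked(ratings: Rating[], itemId: string): string[] {
  const fans = new Set(
    ratings
      .filter((r) => r.itemId === itemId && r.stars >= 4)
      .map((r) => r.userId),
  );
  const counts = new Map<string, number>();
  for (const r of ratings) {
    if (fans.has(r.userId) && r.itemId !== itemId && r.stars >= 4) {
      counts.set(r.itemId, (counts.get(r.itemId) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```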
But this is also why it’s terribly f’ing confusing when, for example, Amazon promotes sponsored listings that also have five-star scores. The sponsored placement undermines the trustworthiness of the score. Is the 4.5 star rating for product X, with a thousand user reviews, real? Does product X have those thousand user reviews because of its innate popularity, or because it’s been promoted? Amazon has guidelines for these sorts of things (I guess, somewhere maybe), but to the casual user it’s not exactly clear.
Anyway, this reveals a few lessons. Whether the user trusts the score depends on:
the visibility of the inputs (who is doing the scoring, what factors determine the score, and how are they weighted)
the clarity of the output (the score within a bounded range)
the transparency of how the score changes
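One way to honor all three in code is to make the score carry its own explanation. A minimal sketch, with field names invented for illustration:

```typescript
// A score that can explain itself: visible inputs, bounded output.
interface Factor {
  name: string;   // who or what is doing the scoring
  weight: number; // weights should sum to 1
  value: number;  // normalized 0 to 1
}

interface ExplainedScore {
  score: number;       // the bounded output, 0 to 100
  breakdown: Factor[]; // the visible inputs, so changes are traceable
}

function explainScore(factors: Factor[]): ExplainedScore {
  const raw = factors.reduce((sum, f) => sum + f.weight * f.value, 0);
  return { score: Math.round(raw * 100), breakdown: factors };
}
```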
Explicit vs. implicit scores
The internet has gradually shifted from explicit scoring (user control) to implicit scoring (algorithmic control).
A score can be explicit—as in visible to users, like star ratings.
That would be Yelp, Goodreads, Netflix, etc., where there’s a direct feedback loop between how people score content and what content is then revealed.
A score can be implicit, too — as in hidden algorithmic values used for content distribution or matchmaking.
That would be TikTok, Facebook, Instagram, etc., where there’s an indirect feedback loop between scoring a piece of content (e.g., with a like) and what content is then revealed.
Explicit scores were popular in the Web 2.0 era, when consumer apps began to use JavaScript to record a score asynchronously, without forcing a full page reload.
Implicit scores became more common in the algorithm era, when consumer apps began to internally rank content to increase engagement. That transition probably began with the introduction of Facebook’s News Feed in 2006, and really took off with TikTok in 2017. Who knows why we see what we see, anymore. But hoo boy we do see a lot of it.
For users, a scoring system must have an intuitive range
The simpler the range, the more intuitive the score.
Scores can’t exist without a range.
The boundedness of the range, and not the score itself, is what makes comparison intuitive.
Everybody, for example, understands a binary range (upvote/downvote) and a range from 1 to 5. Low cognitive load, sufficient granularity, a balanced positive/negative ratio with a defined midpoint: Easy peasy. That’s why they’re used for consumer apps.
Yelp, Uber, Goodreads, the Apple App Store, etc., they all use one to five stars. You just don’t need anything more complicated to rate Candy Crush.
The use of stars (or tomatoes, or clovers, or whatever) makes the score more approachable. Western culture has been scoring things with stars ever since John Murray’s travel guides began ranking Viennese hotels and sausages in 1836. Michelin later adopted Murray’s star (*) system, then trademarked their own three-rosette ratings for restaurants and destinations:
*** worth a journey
** worth a detour
* interesting
This gave rise to the five-point scoring systems we’re familiar with today. For some reason restaurant critics have traditionally used a four-star system, which is such an unhelpful range I give it no stars at all.
Ranges from 1 to 10: also intuitive, and useful when more granularity is required. IMDb, Pitchfork, Metacritic user reviews: all ten-point scoring systems, all attempts to signify authority in a domain. A more granular range implies a more complex (or pretentious!) value system. Can you imagine Pitchfork scoring albums by a five-point emoji system? The layered harmonies of Fall Out Boy cry out for numerical precision!
Even a range that runs 0 to 100, bless our base ten roots, can be easy to grok. These are most often published in percentages, for what I assume are obvious reasons. Rotten Tomatoes does this. Metacritic critic reviews, too.
The American school system, on the other hand, does an odd thing where it provides a numerical score of 0 to 100, often neglects to append a percent sign, and then layers on top of that a five-point system that uses letters (which is really a 13pt system when you consider all the modifiers).
Grading systems are institutionally-assigned rankings, so not directly comparable for what we’re focusing on today, but still: F minus for that nonsense.
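For the curious, that letter layer reduces to a mapping like the one below. Grade cutoffs vary by institution, so treat these thresholds as illustrative only:

```typescript
// One common (but far from universal) mapping of 0 to 100 onto letter grades.
function letterGrade(score: number): string {
  if (score < 60) return "F"; // no F+ or F-, hence 13 grades rather than 15
  if (score >= 97) return "A+";
  const letters: Record<number, string> = { 9: "A", 8: "B", 7: "C", 6: "D" };
  const letter = letters[Math.floor(score / 10)];
  const ones = score % 10;
  if (ones <= 2) return `${letter}-`;
  if (ones >= 7) return `${letter}+`;
  return letter;
}
```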
Anyway, there are so many scores and stars scattered across the internet, and users are so accustomed to scoring things, that a kind of intuitive shorthand has developed:
This shorthand is supported by apps like Airbnb and Uber, which are famous for penalizing anything four stars and below. Those systems teach users and providers a simple lesson: it’s better not to be scored at all than to receive a score of four.
I mean, I can’t be the only one who’s received a request like this:
Today there are various types of scoring regimes
If you were obsessed, however temporarily, with scoring systems and given to making 2x2s in your spare time, you might come up with something like this:
User Rates / System Does Not: The user scores content, the system merely aggregates those scores. You rate a book, Goodreads does not. You rate a restaurant, Yelp does not, etc.
User Doesn’t Rate / System Doesn’t Rate: There is no scoring system. Blogs, articles, a dearth of interactivity!
User Doesn’t Rate / System Rates: The system provides a score that the user then needs to trust. This is the old publishing model, where e.g. a critic at Pitchfork scores an album, or the Michelin illuminati score a restaurant, or Rick Steves scores a budget-friendly hotel.
User Rates / System Rates: Both the user and the system score content. This is e.g. Amazon or Google Maps. The user “stars” content, but content is also surfaced by promotions, advertisements, and other inscrutable and frankly wildly obnoxious methods.
To build a good scoring system, you have to understand the intersection of user and system needs
If you were to create a scoring system, ideally you would want to consider the intersection of user and system needs.
Why does a user want or need to rate content? What will the system do with those ratings?
Here, for example, is a look at what a user’s needs might be across a variety of platforms:
You might want to address all of a user’s needs, but by choosing just one or two you can create a more focused system.
Take Netflix. Yeah, a user may want to express their opinion on shows and movies they’ve watched, and sure, scoring content I guess can be fun. It’s always nice to see your opinion reflected back at you. But the user is more likely to view scoring as a way to improve the system’s recommendations. What the user wants is better content.
Now consider Netflix’s system needs:
The system has a lot of needs. But two of those needs in particular—provide a personalized experience and sort content for users—stand out. If you get those two right, then you will have succeeded in engaging users more deeply and, post-facto, justifying the content experience.
There’s nothing simple about simple
Don’t confuse a simple score with a simple product
The funny thing is you don’t need a complex interface to achieve a complex result. Some five-point systems are actually hundred-point systems in disguise.
Take Letterboxd.
If I asked you how scores work on Letterboxd, there’s a fair chance you’d say “one to five stars”.
That’s true! But on Letterboxd you score movies in half-point increments. What we call a five-star score actually includes ten points of differentiation. And, when Letterboxd aggregates scores from all users, the resulting scores live on a 100pt range (0.0, 0.1, 0.2 … 4.8, 4.9, 5.0).
From a system’s perspective, that’s the best of both worlds: the five-star UX element is accessible for the user, but it gives the system a finer-grained scale for aggregates and meta-scores.
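Here’s a minimal sketch of that dynamic: half-star input, decimal output. Hypothetical code, not Letterboxd’s actual implementation:

```typescript
// Simple input for the user, finer-grained output for the system.
type HalfStar = 0.5 | 1 | 1.5 | 2 | 2.5 | 3 | 3.5 | 4 | 4.5 | 5;

// Users submit half-star ratings; the aggregate is reported to one
// decimal place, so the public score moves in 0.1 steps (3.6, 3.7, ...).
function aggregateScore(ratings: HalfStar[]): number {
  if (ratings.length === 0) return 0;
  const mean = ratings.reduce((sum, r) => sum + r, 0) / ratings.length;
  return Math.round(mean * 10) / 10;
}
```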
Complex scoring systems rarely help the user
Any range over 100 is an affront to god.
I once worked on a software product that measured risk. More precisely, the product measured the risk in a chain of business events by providing the user with a score that indicated the likelihood of the chain failing. The user of this product didn’t score anything; rather, the product told the user the level of risk within a range of 0 to 1,000. (An example which falls into the Publishers & Aggregators quadrant in our 2x2 above.)
If you agree that scoring systems are designed to reduce complexity, then you’ll agree that a 1,000pt score fails at that task. Why use 1,000 points, a range that is easily reducible to 100, or even 10, or perhaps even five?
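Reducible how? A minimal sketch, with thresholds invented purely for illustration (this is not my client’s actual model):

```typescript
// Collapsing a 0 to 1,000 risk score into ranges a human can act on.
// Assumes a higher score means a higher likelihood of failure.

function toFiveStars(risk: number): number {
  // Linear rescale: 0 to 1,000 becomes 1 to 5, rounded to the nearest star.
  return Math.max(1, Math.round((risk / 1000) * 5));
}

function toTrafficLight(risk: number): "green" | "yellow" | "red" {
  if (risk >= 700) return "red";    // act now
  if (risk >= 400) return "yellow"; // worth watching
  return "green";                   // carry on
}
```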
In the product’s defense, it was measuring a complex set of factors: varied products across different suppliers and thousands of offices across the globe. A quick google will reveal a variety of products that do similar things: mapping and scoring climate risks, operational risks, logistics impacts, all of that. Each product may have a different specialty (e.g., ESG compliance), but the touted benefits are roughly the same: see what’s happening and how it will affect your business so you can take actions to mitigate risk.
This is complex shit! Behind the scenes there are hundreds (if not thousands) of weighted factors that are being aggregated. But just because a process is complex doesn’t mean the score must be, too.
In fact, a complex score, instead of reducing the complexity, simply changes what the complexity looks like and transfers the complexity to the user.
That complexity then makes it more necessary to explain how the score works. What does a score of 857 mean, vs. a score of 735? Inquiring users want to know.
And consider that user’s needs. The user needs to trust the score. They need to trust the score because they use the score to make decisions. Those decisions have financial impact (not just on the business, but on the user’s career). You’re asking the user to replace their known, bespoke process with an unknown, generalized one. To be successful—to really benefit from using the product—the user must trust the number that’s being spit out.
That’s a lot of cognitive load. The job of product design and UX is to decrease that load. You can make that job easier by talking to users in a language they already know, with a scoring system that’s easier to grok.
All unwieldy ranges must die
If you think a range is bounded by seemingly arbitrary numbers then you're probably not the intended end user
This is getting a little far afield from our origin point, but since we mentioned the importance of ranges earlier I am bound by OCD to include it. So consider your credit score.
Your credit is scored, in the FICO model, on a range of 300 (wut) to 850 (wut), with 300 being financially leprous and 850 being Capitalist of Highest Esteem. This is hard to grok!
There are some scores, I am told, that have odd ranges to accommodate physical processes that don’t translate well to the human preference for round numbers. The Beaufort scale, for example, which runs from 0 to 12, relates wind speed to observed conditions on sea and land. It’s more important to observe the effect of wind precisely than to have a convenient ten-point range.
You could make a similar argument for FICO scores—i.e., that a 550-point range that begins at 300 correlates well to a broad spectrum of financial conditions. But this is only a good explanation of why the range is useful to the system’s original, intended users: e.g., financial analysts, department stores, and banks. For you, the consumer, the broad range makes it harder to understand the score in toto. I don’t know exactly why Fair, Isaac and Company uses the 300 to 850 range, but some cursory research suggests the genesis of that odd range has something to do with incorporating a variety of historical models used in non-consumer contexts, the desire to differentiate FICO from other models, and the desire not to instill the existential dread that comes with a score of zero.
Ultimately it’s unfair to compare the way you rate Candy Crush to the way banks rate your solvency, but I’m doing just that to make a point: FICO scores are more system friendly than they are user friendly. That is, they’re designed with the needs of the system in mind (the analysts and banks deciding whether to give you a loan) more than those of the consumer. Which means that consumers—who, to be fair, were never expected to have daily exposure to a FICO score—need a good deal of education to understand what a credit score is and how it changes. If Fair, Isaac and Company were developing the FICO score for today’s consumer (not for a bank), would they create the score again on a 300 to 850 range? <Shakes Magic 8 Ball> Signs point to no.
The Scholastic Aptitude Test, or SAT, is like this, too.
Your performance on the test is scored on a range of 200 to 800 for math, and 200 to 800 for verbal, for a total possible combined score of 1600. But what does an SAT score of, say, 1259 mean for a university’s likelihood of admitting you to their undergraduate program? For the system’s intended user, the admissions officer, the scores make sense. But for the score recipient, the value of the score in getting you into college is difficult (at first) to understand—not least because there are other non-SAT variables at play, and in that chasm you can fit an entire market’s worth of SAT prep tutors, online courses, celebrity cheating rings, etc.
In fact, I would argue that the best thing you could do for the economy is to create a scoring system that is as wildly important as it is wildly inscrutable, and thereby inspire the creation of secondary markets just to explain what in the name of Jehoshaphat the score means.
Why does any of this matter?
Scoring systems must serve the system and the user.
Because when you create a scoring system for your product, you need to balance the system’s needs with the needs of the user.
Had NBC’s The Feast been allowed to live, I’d bet it would have faltered by the design of its scoring: 100pts is too granular and complex a system for scoring restaurants. What is the difference between an 87pt restaurant and an 88pt restaurant? How is that more intuitive than a five-star score? It feels like the system wanted to be more authoritative by providing more detail (“let’s make the score bigger!”). And there is a business case there, something like we’re the most authoritative restaurant index in all the land, advertise (or buy our data) now. But the amount of detail was unnecessary for the user to make a decision.
Perhaps the model could have kept its 100pt criteria under the hood, and that information could have been communicated on the site, but the public-facing score should have been five stars, or dinner plates, or ortolans, or whatever.
This is also true for business software, and it’s the advice I gave to my client with the 1,000pt scoring range. The calculations behind the scenes are complex, but the guidance to the user should be a paragon of simplicity. One to five stars, friends. Or even green yellow red. Just make it easy for me to make a decision. That’s all I, or anybody, wants.
Because the more complex a score, the more explanation needed. And if you’re using a 1,000pt score, you’re going to need a content strategy to explain the inputs, the output, and how scores can change.
If the user can’t explain the score, they can’t have confidence in their recommendation. The software won’t work.
Delightful is a 100% organic, free-range, desktop-to-inbox newsletter devoted to helping you nurture creativity in your daily life and work. Links every Tuesday (free), resources on Thursdays (occasionally, paid). Your host is Steve Bryant, friendly neighborhood content strategist, who is for hire (linkedin).
A few ways I can help you:
1. Hire me as a content advisor to create powerful content ideas, manage your content engine, or optimize your workflows. Book a call.
2. Work with me 1:1 to audit your existing content and/or create an audience profile and competitor map. Book a call.
3. Take my 1-day workshop for creating effective content for you and your team. Book a call or buy the self-serve version.
{ 🔒 archive }
Content frameworks for advertising agencies
4 frameworks for agencies shifting to a content model
My content strategy toolkit
14 tools for organizing, measuring, and creating content
My value proposition template
A Canva whiteboard for designing a value proposition that targets a customer segment with your unique value
My concept diagram template
A Figjam for diagramming the complex relationships between concepts
Product Content Strategy 101
For anybody who’s creating a product that requires editorial content
The Bento Box Method for developing topical content
A cute and useful way to structure your content topics
Marketing strategy for agencies 101
The four different content marketing strategies for agencies, written for the exec team at a previous agency
Positioning strategy for agencies 101
The four different ways to compete, written for the exec team at a previous agency
Hiring scorecards and why you should use them
Stop using voodoo hiring methods, friend
Thanks for reading. Be seeing you.
"Hope is like a road in the country; there was never a road, but when many people walk on it, the road comes into existence.” —Lu Xun