Someone passed over a link to Yelp today and pointed out their restaurant review highlighting feature, so I poked around for a bit to figure out what they’re doing. They aggregate reviews for several types of retail stores and restaurants, but here I was only interested in how they pull out phrases from the reviews and highlight them. For example, if you go to the review page for Jewel Bako you’ll see three reviews, each with a different phrase highlighted:
The phrases are “omakase”, “flown in daily”, and “large roll tasting menu”. All good foodie descriptions. If you click on a phrase, the site pulls up the reviews that contain it. For example, three of 35 reviews contained the phrase “flown in daily”.
So how did they do this? You can go through the reviews, count up the unique phrases of n words (N-grams), and take those that appear more frequently than one would expect, the so-called statistically improbable phrases. There are a lot of N-grams though, and with only 30 or 40 reviews you can wind up with noise. The N-grams that Yelp is pulling up, however, are almost always food-related. If you look up reviews for the Hatsuhana Sushi Restaurant, you’ll notice that three of 44 reviews mentioned “Box of Dreams”, one of the items on their menu. That’s a statistically unlikely phrase; in fact, it only shows up in these three reviews in a search I conducted for New York City restaurants. And yet, it is not highlighted. So I suspect they have a list of food-related terms, and filter recurring N-grams so that they only consider those that have at least one food-related term. It also makes the problem simpler, since there are fewer N-grams to keep track of. Also supporting the conjecture that they are filtering based on food-related terms is that they’ve only rolled this out to restaurants. If it were more generic, just counting up N-grams and calculating their statistical significance, they would have rolled it out to other areas, like shops and hairstyling salons.
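To make the conjecture concrete, here is a minimal Python sketch of that filtering step. It counts how many reviews each N-gram appears in and keeps only the recurring ones that contain at least one food-related term. The FOOD_TERMS set, the function names, and the thresholds are all my own placeholders; I have no idea what Yelp's actual list or cutoffs look like.

```python
import re
from collections import Counter

# Hypothetical food vocabulary, for illustration only.
FOOD_TERMS = {"omakase", "sushi", "roll", "tasting", "menu"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower().replace("’", "'"))


def ngrams(tokens, max_n):
    """Yield every N-gram of length 1..max_n as a tuple of tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def candidate_phrases(reviews, food_terms=FOOD_TERMS, max_n=4, min_reviews=2):
    """Map each recurring, food-related N-gram to the number of reviews containing it."""
    review_counts = Counter()
    for review in reviews:
        # set() so a phrase repeated within one review is counted once
        review_counts.update(set(ngrams(tokenize(review), max_n)))
    return {
        gram: count
        for gram, count in review_counts.items()
        if count >= min_reviews and any(tok in food_terms for tok in gram)
    }
```

Note that this only handles the "recurring and food-related" part. Whether a surviving phrase is actually improbable still has to be judged against some background statistics.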
You’ll also notice that the N-grams can be fairly large, as with “best sushi places I’ve been”. The “too” got missed because two reviewers spelled it “to”, and one spelled it “too”. Humans, always making mistakes, and machines are so inflexible. They’re also looking at single words (unigrams). So they have some sense of the statistical likelihood of each N-gram, probably by inspecting their own body of reviews. Google made their N-gram statistics publicly available, so they could use those. However, restaurant reviews probably have their own idiomatic expressions, and you probably want to know what phrases distinguish a restaurant from all the other restaurants, and not from the rest of the written world. For example, “great atmosphere, but noisy” is far more likely in the world of restaurant reviews than in English text in general.
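One simple way to express "distinguishes this restaurant from all the other restaurants" is to compare the share of a restaurant's reviews that contain a phrase against the share of all restaurant reviews that contain it. The sketch below uses a smoothed log-ratio for that. It is just one plausible scoring function that I picked for illustration, not something I know Yelp to use, and the function name and smoothing constants are my own.

```python
import math


def distinctiveness(local_count, local_total, background_count, background_total):
    """Log-ratio of a phrase's review rate at one restaurant versus the
    background corpus of all restaurant reviews.

    Add-one smoothing keeps the ratio finite when a count is zero.
    Positive scores mean the phrase is over-represented at this restaurant.
    """
    local_rate = (local_count + 1) / (local_total + 2)
    background_rate = (background_count + 1) / (background_total + 2)
    return math.log(local_rate / background_rate)
```

With a score like this, a phrase that shows up in a few of a restaurant's reviews and almost nowhere else scores high, while “great atmosphere, but noisy” scores near zero if it is about as common everywhere. Scoring against general English text instead would flag it for nearly every restaurant.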
Also, referring back to the Hatsuhana review page, it is interesting what words are used to search when you click or roll over a highlighted section. It’s a subset of the terms in the phrase. Furthermore, the URL it displays on mouse-over is not where it takes you when you click on the highlight. The “best sushi places I’ve been” highlight has a URL of http://www.yelp.com/biz/hatsuhana-sushi-restaurant-new-york?q=places+been , which shows you reviews that have “places” and “been”, but in no particular proximity or order. When you click on it, however, it pulls up the ones with “places I’ve been”, i.e. proximate and in the same order.
They probably winnowed down “best sushi places I’ve been” to “places+been” by maintaining a list of kill words for retrieval. Sushi is not a good search term: if it’s a sushi restaurant, nearly every review is likely to have the term, otherwise not. “Best” and “places” might tend to pull up interesting phrases. “I’ve”, not so much. So in addition to a set of food-related terms used to filter N-grams, they also filter the terms used for review retrieval. When you click on “best sushi places I’ve been”, it scans the reviews and applies the same filter used for the highlighted phrase, so that if “best sushi places I’ve been” appears in the document, it is filtered down to “places been”, and the “I’ve” is highlighted because it got caught in the middle. Whether the filter used for review retrieval uses the same food-related term list as the N-gram filter, I don’t know, but it seems plausible. You might, however, only want “places” and “been” to be in the review retrieval filter, and not the N-gram filter, so I think there is some utility in separating them. However, that “sushi” is filtered out, and “places” and “been” are not, indicates that the list was either manually compiled, or has been manually edited, to make highlight retrieval more useful.
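Here is a rough Python sketch of that retrieval step as I imagine it: strip the kill words from the highlighted phrase, then look for the remaining terms in a review, in order and close together, and highlight the whole span including whatever got caught in the middle. The KILL_WORDS set is reverse-engineered from this single example, and the names and the max_gap parameter are my own inventions.

```python
import re

# Hypothetical kill list, chosen only so that the example above works out.
KILL_WORDS = {"best", "sushi", "i've"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower().replace("’", "'"))


def query_terms(phrase, kill_words=KILL_WORDS):
    """Drop the kill words from a highlighted phrase, keeping term order."""
    return [tok for tok in tokenize(phrase) if tok not in kill_words]


def find_span(review, terms, max_gap=2):
    """Return (start, end) token indices of the first place where the terms
    appear in order with at most max_gap other tokens between neighbours,
    or None if there is no such place.

    Greedy: it always takes the nearest occurrence of the next term,
    which is good enough for a sketch.
    """
    if not terms:
        return None
    tokens = tokenize(review)
    for start, tok in enumerate(tokens):
        if tok != terms[0]:
            continue
        pos = start
        for term in terms[1:]:
            window = tokens[pos + 1:pos + 2 + max_gap]
            if term not in window:
                break
            pos = pos + 1 + window.index(term)
        else:
            return start, pos
    return None
```

Run on a review containing “best sushi places I’ve been”, the query terms come out as “places” and “been”, and the matched span covers “places I’ve been”, which is why the “I’ve” ends up highlighted. A review that merely contains both words somewhere, or in the opposite order, does not match.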
So in a nutshell, build a list of food-related terms, and for each restaurant, pull out the statistically unlikely phrases that contain those terms. Use the same or a similar filter to match the phrases to the reviews when a user clicks on the highlight. Not too complicated, but it makes for a great feature.
How could you improve this? Well, you might try using a part-of-speech tagger to pull out noun phrases, and look for statistically unlikely ones. That’s a fairly standard task, though a bit more expensive computationally, and it takes some additional work. I’m not sure the quality would be meaningfully improved. The Yelp site is pretty good as it is right now, so I think they made some good engineering choices and kept it simple.


Posted by Ken Ellis