
Predicting Yelp Stars from Reviews with scikit-learn and Python

In this post, we’ll look at reviews from the Yelp Dataset Challenge. We’ll train a machine learning system to predict the star-rating of a review based only on its text. For example, if the text says “Everything was great! Best stay ever!!” we would expect a 5-star rating. If the text says “Worst stay of my life. Avoid at all costs”, we would expect a 1-star rating. Instead of writing a series of rules to work out whether some text is positive or negative, we can train a machine learning classifier to “learn” the difference between positive and negative reviews by giving it labelled examples.

This post follows closely from the previous one: Analyzing 4 Million Yelp Reviews with Python on AWS. You’re strongly encouraged to go through that one first. In particular, we will not be showing how to set up an EC2 Spot instance with adequate memory and processing power to handle this large dataset, but the same setup was used to run the analysis for this post.

Introduction and Overview

This post will show how to implement and evaluate a supervised machine-learning system that predicts the star rating of Yelp reviews from their text. Specifically, this post will explain how to use the popular Python library scikit-learn to:

convert text data into TF-IDF vectors
split the data into a training and test set
classify the text data using a Linear SVM
evaluate our classifier using precision, recall and a confusion matrix

In addition, this post will explain the terms TF-IDF, SVM, precision, recall, and confusion matrix.

In order to follow along, you should have at least basic Python knowledge. As the dataset we’re working with is relatively large, you’ll need a machine with at least 32GB of RAM, and preferably more. The previous post demonstrated how to set up an EC2 Spot instance for data processing, as well as how to produce visualisations of the same dataset. You’ll also need to install scikit-learn on the machine you’re using.

Loading and Balancing the Data

To load the data from disk into memory, run the following code. You’ll need to have downloaded the Yelp dataset and untarred it in order to read the Yelp reviews JSON file.

import json

# read the data from disk and split into lines
# we use .strip() to remove the final (empty) line
with open("yelp_academic_dataset_review.json") as f:
    reviews = f.read().strip().split("\n")

# each line of the file is a separate JSON object
reviews = [json.loads(review) for review in reviews]

# we're interested in the text of each review
# and the stars rating, so we load these into
# separate lists
texts = [review['text'] for review in reviews]
stars = [review['stars'] for review in reviews]

Even on a fast machine, this code could take a couple of minutes to run.

We now have two arrays of data: the text of each review and the respective star-rating. Our task is to train a system that can predict the star-rating from looking at only the review text. This is a difficult task since different people have different standards, and as a result, two different people may write a similar review with different star ratings. For example, user Bob might write “Had an OK time. Nothing to complain about” and award 4 stars, while user Tom could write the same review and award 5 stars. This makes it difficult for our system to accurately predict the rating from the text alone.

Another complication is that our dataset is unbalanced: we have far more 5-star reviews than 2-star reviews. Because most machine learning classifiers are built on probabilistic models, we’ll get less biased predictions if we train the system on balanced data. This means that ideally we should have the same number of examples of each star rating.

In machine learning, it’s common to separate our data into features and labels. In our case, the review texts (the input data) will be converted into features and the star ratings (what we are trying to predict) are the labels. You’ll often see these two categories referred to as X and Y respectively. Adding the following function to a cell will allow us to balance the dataset by undersampling the over-represented classes.

from collections import Counter

def balance_classes(xs, ys):
    """Undersample xs, ys to balance classes."""
    freqs = Counter(ys)

    # the least common class is the maximum number we want for all classes
    max_allowable = freqs.most_common()[-1][1]
    num_added = {clss: 0 for clss in freqs.keys()}
    new_ys = []
    new_xs = []
    for i, y in enumerate(ys):
        if num_added[y] < max_allowable:
            new_ys.append(y)
            new_xs.append(xs[i])
            num_added[y] += 1
    return new_xs, new_ys

Now we can create a balanced dataset of reviews and stars by running the following code (remember that now our texts are x and the stars are y).

print(Counter(stars))
balanced_x, balanced_y = balance_classes(texts, stars)
print(Counter(balanced_y))

>>> Counter({5: 1704200, 4: 1032654, 1: 540377, 3: 517369, 2: 358550})
>>> Counter({1: 358550, 2: 358550, 3: 358550, 4: 358550, 5: 358550})

You can see above that in the original distribution, we had 358,550 2-star reviews and 1.7 million 5-star reviews. After balancing, we have 358,550 of each class of review. We’re now ready to prepare our data for classification.

Vectorizing our Text Data

Computers deal with numbers much better than they do with text, so we need a meaningful way to convert all the text data into matrices of numbers. A straightforward (and oft-used) method for doing this is to count how often words appear in a piece of text and represent each text with an array of word frequencies. Therefore the short text “the dog jumps over the dog” could be represented by the following array:

[2, 0, 0, 0, ..., 1, 0, 0, 0, ..., 2, 0, 0, 0, ..., 1, 0, 0, 0, ...]

The array would be quite large, containing one element for every possible word. We would store a lookup table separately, recording that (for example) the 0th element of each array represents the word “dog”. Because the word dog occurs twice in our text, we have a 2 in this position. Most of the words do not appear in our text, so most elements would contain 0. We also have a 1 to represent jumps, another 1 for over and another 2 for the.
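As a quick sketch (not part of the original post), scikit-learn’s CountVectorizer builds exactly this kind of word-count representation; the toy sentence below is the one from the example above.

from sklearn.feature_extraction.text import CountVectorizer

# a minimal sketch: word-count vectors for a single toy sentence
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(["the dog jumps over the dog"])

# column indices are assigned alphabetically: dog=0, jumps=1, over=2, the=3
print(vectorizer.vocabulary_)
print(counts.toarray())  # [[2 1 1 2]]

In practice the vocabulary is built from all of the reviews at once, so the real vectors contain one column per distinct word in the whole dataset.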

A slightly more sophisticated approach is to use Term Frequency-Inverse Document Frequency (TF-IDF) vectors. This approach comes from the idea that common words, such as the, aren’t very important, while less common words, such as Namibia, carry more information. TF-IDF therefore down-weights the count of each word in each text by the number of texts in which that word appears. If a word occurs in nearly all of the texts, we deem it to be less significant. If it appears in only a few texts, we regard it as more important.
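A minimal sketch of how these weights could be computed with scikit-learn’s TfidfVectorizer follows; the three toy reviews are invented purely for illustration and are not part of the Yelp data.

from sklearn.feature_extraction.text import TfidfVectorizer

# a minimal sketch: TF-IDF weights for three invented toy reviews
toy_reviews = [
    "the food was great",
    "the service was great",
    "the hotel was awful",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(toy_reviews)

# words that appear in every review ("the", "was") receive lower weights
# than words that appear in only one review ("awful", "hotel", "food")
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))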

The last thing that you need to know about text representation is the concept of n-grams. Words often mean very different things when we combine them in different ways. We expect our learning algorithm to learn that a review containing the word bad is likely to be negative, while one containing the word great is likely to be positive. However, reviews containing phrases such as “… and then they gave us a full refund. Not bad!” or “The food was not great” will trip up our system if it only considers words individually.

When we break a text into n-grams, we consider several adjacent words grouped together as a single token. “The food was not great” would be represented using bi-grams as (the food, food was, was not, not great), and this would allow our system to learn that not great carries a very different meaning from great on its own.
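As a small sketch of how this could be wired up: ngram_range is a standard TfidfVectorizer parameter, though the exact settings used later in the post are an assumption here.

from sklearn.feature_extraction.text import TfidfVectorizer

# a minimal sketch: include single words and bi-grams as features
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(["The food was not great"])

# the vocabulary now contains bi-grams such as "not great" and "was not"
# alongside the individual words
print(vectorizer.get_feature_names_out())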
