Yelp provides two main ways for users to review businesses: written reviews and star ratings. Traditionally, businesses have focused on their star rating to assess whether users like their service. But reviews contain a wealth of critical data that businesses can take advantage of. In this paper, we explore how reviews can be used to predict the rating of a business.
Recommender systems have come a long way in modeling ratings for purposes such as predicting the future rating of a product or business, identifying the customer segment most interested in a product, and measuring a product's or business's success. Interestingly, however, very little work has been done on analyzing the reviews that users provide. These reviews should not be ignored, since they are a rich source of information for businesses.
In this paper, we use these reviews to predict the rating of a business. We have focused exclusively on restaurants for this research. Reviews tend to be biased by each user's notion of what the rating of a restaurant should be, and they vary widely in length, content, and style. We try to remove this bias by predicting the rating purely from the content and style of the reviews.
2 Related Work
There has been some previous work on extracting information from user-written reviews, beginning when Yelp started its Dataset Challenge a few years ago.
One line of work focused on identifying subtopics in reviews, beyond food quality, that are important to users [1]. It used online LDA, a generative probabilistic model for collections of discrete data such as text corpora.
Another interesting work in this area personalized ratings based on topics extracted across different users' reviews [2]. This was done using a modified, semantic-driven LDA.
The third work, which is closest to this paper, predicted ratings using sentiment analysis [3]. However, its scope was limited to a single user and roughly 1,000 reviews, so it did not offer the holistic approach covered in this paper.
3 Data Collection
The data for the project was provided by Yelp itself as part of the Yelp Dataset Challenge, which is conducted to provide opportunities to explore a real-world dataset.
The full dataset contains millions of records, but we have focused on the restaurant subset for the purpose of the project.
4 Procedure Outline
The objective of this paper is to train a classifier that can predict the rating of a restaurant from reviews written by users. This section outlines that process. Data preparation and feature selection are outlined in Section 5, which explores how the data is brought into a form that can be used to create the models; it also discusses how the data is divided into development, cross-validation, and test sets. In Section 6, exploratory data analysis is performed on the development set, and feature selection is performed there as well. Section 7 presents a baseline performance using Naive Bayes, SVM, and Logistic Regression with default settings on the cross-validation set. Parametric optimization is performed in Section 8, including a comparison of baseline and optimized performance. Finally, the optimized model is trained on the cross-validation set and used to classify instances in the test set. The results are presented in Section 9.
5 Data Preparation
We divided the data into three sets: a development set, used for data exploration; a cross-validation set; and a test set, to be used after optimization.
We took close to 20,000 records of Yelp's restaurant data. The development set has close to 4,000 records, the cross-validation set close to 14,000, and the test set 2,000.
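The three-way split described above can be sketched as follows. This is a minimal illustration, assuming a shuffled random split with scikit-learn's `train_test_split` (the paper does not state how the split was performed); the record array is a stand-in for the real review records.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder indices standing in for the ~20,000 Yelp restaurant records.
records = np.arange(20000)

# First carve out the 2,000-record test set, then split the remainder into
# a 4,000-record development set and a 14,000-record cross-validation set.
rest, test = train_test_split(records, test_size=2000, random_state=42)
dev, cv = train_test_split(rest, train_size=4000, random_state=42)

print(len(dev), len(cv), len(test))  # 4000 14000 2000
```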
Yelp provides five entity types: business, review, user, check-in, and tip. We have focused on the business and review entities for the project.
The business entity contains attributes such as type, business_id, name, full_address, city, state, latitude, longitude, stars, review_count, categories, open, and hours.
The review entity contains attributes such as type, business_id, user_id, stars, text, date, and votes.
We identified a list of restaurants from the business entity and then collected all the reviews for those restaurants from the review entity.
We also converted the numeric star columns into nominal ones by mapping the values 1-5 to their nominal equivalents (one, two, three, four, and five).
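The restaurant/review join can be sketched as below. This is a hypothetical, self-contained example: the two in-memory JSON lines stand in for the Yelp business and review dumps (which ship one JSON object per line), and the "Restaurants" category filter is the assumed selection criterion.

```python
import json

# Tiny in-memory samples standing in for the Yelp business and review files.
business_lines = [
    '{"business_id": "b1", "categories": ["Restaurants", "Pizza"]}',
    '{"business_id": "b2", "categories": ["Nightlife"]}',
]
review_lines = [
    '{"business_id": "b1", "stars": 4, "text": "Great crust."}',
    '{"business_id": "b2", "stars": 2, "text": "Too loud."}',
]

# Identify the businesses categorized as restaurants ...
restaurant_ids = {
    b["business_id"]
    for b in map(json.loads, business_lines)
    if "Restaurants" in (b.get("categories") or [])
}

# ... then collect every review written for one of those restaurants.
restaurant_reviews = [
    r for r in map(json.loads, review_lines)
    if r["business_id"] in restaurant_ids
]
print([r["business_id"] for r in restaurant_reviews])  # ['b1']
```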
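The numeric-to-nominal conversion amounts to a simple lookup; a minimal sketch (the function name `to_nominal` is ours, not from the paper) for the integer review stars:

```python
# Map the 1-5 numeric star values to their nominal equivalents.
STAR_NAMES = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}

def to_nominal(stars: int) -> str:
    """Convert an integer star rating (1-5) into its nominal label."""
    return STAR_NAMES[stars]

print(to_nominal(4))  # four
```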
The final attribute list for the project was:
2. stars: overall average stars
3. review_stars: stars for the particular review
We focus on predicting the nominal_review_stars attribute with the model we build.
6 Baseline Performance
We performed a baseline analysis using modified LightSide settings: a rare threshold of 25 and feature selection limited to the top 1,000 features.
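The two LightSide settings above have rough equivalents in scikit-learn, which we use here only as an illustrative sketch (the paper used LightSide, not scikit-learn): the rare threshold corresponds to `CountVectorizer(min_df=25)`, and feature selection to keeping the top-ranked features with `SelectKBest`. The toy corpus below is hypothetical, and because its vocabulary is tiny we pass `k="all"` where the paper used 1,000 features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny hypothetical corpus standing in for the review texts and their
# nominal star labels.
texts = ["great food", "great service", "bad food", "bad service"] * 30
labels = ["five", "five", "one", "one"] * 30

pipeline = Pipeline([
    # Rare threshold: drop terms appearing in fewer than 25 documents,
    # mirroring LightSide's rare threshold of 25.
    ("counts", CountVectorizer(min_df=25)),
    # Keep the highest-ranked features by chi-squared score; with the
    # real data this would be SelectKBest(chi2, k=1000).
    ("select", SelectKBest(chi2, k="all")),
    # One of the baseline classifiers with default settings.
    ("clf", MultinomialNB()),
])
pipeline.fit(texts, labels)
print(pipeline.score(texts, labels))  # 1.0 on this separable toy corpus
```

Swapping `MultinomialNB` for `LinearSVC` or `LogisticRegression` in the same pipeline gives the other two baseline classifiers.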
References
[1] J. Huang, S. Rogers, and E. Joo, "Improving Restaurants by Extracting Subtopics from Yelp Reviews," 2013.
[2] J. Linshi, "Personalizing Yelp Star Ratings: A Semantic Topic Modeling Approach."
[3] C. Li and J. Zhang, "Prediction of Yelp Review Star Rating using Sentiment Analysis."