Yelp is a platform where consumers discover, connect with, and transact with local businesses. It lets them post reviews, book reservations and appointments, and make purchases. Other people's opinions often shape our own decisions, and in a commercial setting reviews help merchants understand their customers' preferences. Online review platforms and merchants can mine customer reviews to make personalized recommendations, which in turn benefit other customers. This project applies several text mining methods to uncover latent information in reviews, analyze their sentiment, and build a recommendation system. We used the Yelp review dataset to analyze review sentiment and to find patterns that support personalized customer recommendations.
We implemented k-means clustering to group business locations, used natural language processing techniques to clean the data, and vectorized the review text with word counts and TF-IDF. We visualized word frequencies and used Word2Vec to find words that are associated in context. For sentiment analysis, we searched for the best hyperparameters with k-fold cross-validation and compared logistic regression against a support vector machine to pick the most suitable classifier. Finally, we extracted the main components of each business with non-negative matrix factorization and retrieved similar businesses with k-means clustering.
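To make the TF-IDF step concrete, here is a minimal sketch that uses only the Python standard library. It is not the code from the notebooks: the `tokenize` and `tf_idf` helpers, the smoothed idf formula, and the three sample reviews are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, keep purely alphabetic tokens.
    # A fuller cleaning step would also strip punctuation and stop words.
    return [tok for tok in text.lower().split() if tok.isalpha()]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = raw term count * (ln((1 + N) / (1 + df)) + 1),
    where N is the number of documents and df the document frequency.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample reviews, not taken from the Yelp data.
reviews = [
    "great food and great service",
    "terrible service",
    "great pizza",
]
vectors = tf_idf(reviews)
```

In the second review, `terrible` gets a higher weight than `service`, because `service` also appears in another review and is therefore less distinctive.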
The program is separated into 3 stages. Note that at the end of the first stage we randomly select a fixed-size subset of the data, so each run will produce a different result.
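If repeatable runs are wanted, one common option is to seed the random number generator before sampling. The sketch below only illustrates that idea: `sample_reviews`, the seed value, and the placeholder review list are hypothetical and do not come from the notebooks.

```python
import random


def sample_reviews(reviews, k, seed=None):
    """Return k reviews chosen without replacement.

    With seed=None the selection differs between runs; passing an
    integer seed makes the same call return the same selection.
    """
    rng = random.Random(seed)
    return rng.sample(reviews, k)


# Placeholder data standing in for the filtered reviews.
reviews = [f"review_{i}" for i in range(1000)]

first = sample_reviews(reviews, 5, seed=42)
second = sample_reviews(reviews, 5, seed=42)
```

Here `first` and `second` contain the same five reviews, while calls without a seed would generally differ.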
- Jupyter Notebook
- Jupyter Notebook is required to run the code.
- Yelp Datasets
- The Yelp datasets exceed GitHub's maximum upload file size, so they are not included here. To run the code, download the datasets from Yelp.
- This program only needs business.json and review.json.
- After downloading the datasets, put the files in the Datasets folder.
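
As a quick check that the files are in place, the sketch below reads a review file record by record. It assumes the files are newline-delimited JSON (one object per line) and that review records carry `business_id`, `stars`, and `text` fields. The `iter_json_lines` helper is hypothetical, and the block writes a tiny stand-in file so it runs without the real download.

```python
import json
import os
import tempfile


def iter_json_lines(path, limit=None):
    """Yield one dict per non-empty line, stopping after `limit` records."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if limit is not None and count >= limit:
                break
            yield json.loads(line)
            count += 1


# Tiny stand-in file so the sketch runs without the real datasets.
sample_path = os.path.join(tempfile.mkdtemp(), "review.json")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"business_id": "b1", "stars": 5, "text": "Great food"}) + "\n")
    f.write(json.dumps({"business_id": "b2", "stars": 1, "text": "Slow service"}) + "\n")

records = list(iter_json_lines(sample_path))
```

For the real data, pass `os.path.join("Datasets", "review.json")` instead of `sample_path`, and use `limit` to inspect only the first few records of the large file.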
Run the notebooks in the following order.
- First Stage
  - Data_Preprocessing.ipynb
  - Filter_Reviews.ipynb
- Second Stage
  - Process_Reviews.ipynb
  - Analysis_Sentiment.ipynb
- Third Stage
  - NMF_Model.ipynb
  - Clustering_Topics.ipynb
- Group members
- Yelp Datasets