This projects aims to examine the influence that specific business categories have on the consumer rating of businesses. We created a model to predict the relative impact that specific business categories, as listed on Yelp, have on the consumer ratings on Yelp of these businesses.
In today's digital landscape, consumer ratings and online reviews are of importance in shaping expectations and driving purchase decisions. As people increasingly rely on online reviews, understanding the business categories that influence positive and negative consumer ratings has become vital for businesses navigating the competitive digital marketplace. By identifying these categories, businesses can refine their offerings to better align with consumer preferences and improve their online reputation. This research leverages a comprehensive open dataset from Yelp, which serves as a major platform where consumers can share experiences and evaluate businesses. With its strong presence across various recreational industries (including restaurants, retail, and services). Yelp has established itself as a trusted source for review-based feedback.
The Yelp dataset provides a wealth of information on over 150,000 businesses across the United States, capturing details such as business name, location, category, attributes, and consumer ratings, used as variables in this research. By analyzing this extensive dataset, we aim to extract valuable insights into which business categories drive consumer ratings. These insights can empower business managers to adopt a data-driven approach, going beyond anecdotal evidence and moving toward strategic, evidence-based decision-making. This shift aligns with the growing trend of data-driven marketing, where decisions are rooted in robust analysis rather than intuition, further underscoring the importance of data analytics in today’s business environment.
Several academic studies support the notion that data-driven decision-making can significantly enhance business performance. For instance, McAfee et al. (2012) highlight how data-driven approaches provide businesses with a competitive advantage by facilitating improved decision-making processes. Similarly, Akter and Wamba (2016) emphasize that data analytics can enhance operational efficiency and customer satisfaction in various service industries. Furthermore, Luca (2016) discusses the critical role of online reviews in consumer decision-making and how they can influence a business's financial performance. These studies underscore the need for a nuanced understanding of consumer ratings, and our research aims to contribute to this understanding by focusing on business categories as primary drivers of consumer feedback.
Which business categories drive higher consumer ratings on Yelp in the United States?
This study aims to identify which business categories drive higher consumer ratings on Yelp in the United States. By examining these categories, the research provides critical insights into consumer expectations, allowing businesses to prioritize specific attributes that resonate with their target audience. For example, restaurants might focus on service and ambiance, while fitness centers may emphasize convenience and accessibility.
Furthermore, this study supports a data-driven approach to business strategy, leveraging Yelp reviews as a valuable source of information. In an increasingly digital market, understanding category-specific drivers of satisfaction enables businesses to strategically enhance their offerings and improve their online reputations.
The data incorporated in this research is an open dataset provided by Yelp. This dataset includes various business variables, such as names, opening hours, addresses, star reviews, and attributes, of more than 150,000 recreational businesses in the United States of America. These businesses have a extremely broad range of categories, from massage saloons to casinos, and from cheese tastings to tattoo shops. This widespread of types of business allows us to analyse the influence of specific business attributes on consumer ratings on a very broad level. This means that our findings are implementable by a broad range of businesses and industries.
The table below shows the variables within the raw dataset including a brief description of each of the variables. The variables are in order of how the original dataset is built:
Variable | Description |
---|---|
business_id | The unique Yelp id for each registered business |
hours | The opening hours of the business, using a 24hr clock |
postal_code | The postal code of the business |
is_open | A dummy variable that indicates if the business is permanently closed, 0 implies permanently closed and 1 implies open for business |
accommodates | The maximum capacity of the listing |
address | The address of the business |
categories | A list of Yelp business categories the business is classified as |
latitude | The geographical latitude of the business |
city | The city the business is located in |
longitude | The geographical longitude of the business |
state | The American stat the business is located in |
review_count | The number of Yelp reviews the business has |
name | The name of the business |
stars | The business rating in stars, ranges between 0-5 and is rounded to half-stars |
attributes | A list of Yelp business attributes and whether the business has this attributes or not |
If there is a need to download the Yelp dataset used for this research, the dataset can be downloaded here (yelp_academic_dataset_business.csv)
We are going to analyse the influence of specific business categories on consumer ratings on Yelp. The dependent variable will be the consumer rating of business measured in stars. Firstly, we will recode the variable 'category' into separate dummy variables for each category, indicating whether each business pertains to that category or not (1 if pertains, 0 if not). Since these independent variables are non-metric, we will perform an ANOVA analysis. We will design our ANOVA model by regressing the dependent variable 'stars' on all these previously created category dummy variables to find the relative impact of each category on the consumer ratings.
The findings of this research have several important implications for marketing strategies. First of all it could enhance customer experience by identifying key drivers of high ratings. Businesses could focus on working towards being part of certain categories that are found to have a high effect on consumer ratings, which will allow for more targeted improvements rather than broad inefficient changes. The insights of this research could also strengthen a company's branding and communication strategy. If specific categories are found to be significantly important for higher consumer ratings, businesses can highlight being part of these categories in their promotional material to attract more consumers.
├── data_converter
├── src
│ ├── Analysis
│ ├── DataExploration
│ └── DataPreparation
├── .gitignore
├── LICENSE
├── README.md
└── makefile
Please follow the installation guides on Tilburg Science Hub. To run this project, the following software is needed:
- R and RStudio (installation guide)
- Make (installation guide)
Additionally, the following software needs to be downloaded:
- Pandoc (installation guide)
To run this project, the following R packages are mandatory:
broom
data.table
dplyr
effectsize
emmeans
flextable
ggplot2
googledrive
grid
gridExtra
knitr
readr
rmarkdown
tidyr
The packages can be installed using the install.packages
command, or by running the Install_Pacakages scripts in the subfolders DataExploration, DataPreperation and Analysis.
To run the code, follow these instructions:
- Fork this repository
- Open your command line / terminal and run the following code:
git clone https://github.com/{your username}/consumer-rating-insights-Yelp.git
- Set your working directory to consumer-rating-insigts-Yelp by typing the following command:
cd consumer-rating-insights-Yelp
- Run the makefile to run all source code by typing the following command:
make
- The code will run and will generate all output. New folders will be generated called 'data' (containing all the raw and cleaned/prepared data) and 'gen' (containing all generated output). The 'gen' folder will contain the actual output of the project, including the results of the analysis and data exploration visualizations.
It is also possible to run the makefiles of the the underlying folders DataPreparation and DataExploration separately. If only in need of interesting visualization produces by the data exploration, for example, solely run the data exploration makefile. To do so:
- Fork this repository
- Open your command line / terminal and run the following code:
git clone https://github.com/{your username}/consumer-rating-insights-Yelp.git
1. Set the working directory to the folder of which the makefile should be run (in this example DataExploration):
cd src/DataExploration
2. Run the makefile to run all source code by typing the following command:
make
3. The code will run and will generate all output of the data exploration, stored in the folders 'data' and 'gen' in the main directory. Not that, in order to run the analysis makefile seperately, the data preperation makefile should be run first!
This repository is produced by Team 4 of the course Data Preparation & Workflow Management taught by Hannes Datta at Tilburg University. This course is part of the Master's program Marketing Analytics. The team members and authors of this repository:
Author | Contact |
---|---|
Claudia van Hoof | [email protected] |
Beste Özyürekoğlu | [email protected] |
Lucía Ramos Domínguez | [email protected] |
Ashley Saarloos | [email protected] |
- Akter, S., & Wamba, S. F. (2016). Big data analytics in E-commerce: A systematic review and agenda for future research. Electronic Markets, 26(2), 173-194.
- McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. J., & Barton, D. (2012). Big data. The management revolution. Harvard Business Review, 90(10), 60-68.
- Luca, M. (2016). Reviews, reputation, and revenue: The case of Yelp.com. Harvard Business School NOM Unit Working Paper, (12-016).