Authors: Pete Davis, Andrew Han, Steve Sohn, Matthew Streichler, Dian Zhao
The main goal of this project was to analyze customer browsing patterns on Expedia to predict which hotel cluster they will book from. We first went over the problem statement and the motivation behind the project. Next, we explored the data and explained what types of pre-processing methods we went through before applying the dataset to the model. Based on the cleaned dataset, we explained how and why we applied different machine learning methods. Finally, we elaborated on our results and insights.
This is the link that includes the code.
Before the pandemic, many people used to travel around the world. Given the unfortunate current circumstances, global travel is no longer much of an option, but we can still hope and dream of future travel. Our group recalled that when planning trips, many of us spent a lot of time looking for accommodations under different budgets and conditions. A trip is only as good as your research and planning, and it is important for companies like Expedia to effectively contribute to the user’s research. Therefore, because planning is so tedious and time-consuming, we wanted to work on a problem that optimizes the user’s search results.
We structured our project mostly based on the previous Kaggle Competition by Expedia. Expedia is one of the most notable travel agency companies. Users can navigate through different hotel and flight options based on search criteria on Expedia’s homepage. However, most searches do not result in booking a hotel. Is there a solution that we can provide that increases the percentage of interactions that lead to hotel bookings? We wanted to see if we can use the context of the user’s browsing data to provide better recommendations to customers interacting with Expedia’s site. So, by contextualizing the data that was provided, our team tried to come up with a solution to optimize the search results on Expedia’s site, listing hotels that a consumer is more likely to book.
Why is this problem important?
We are expecting both traveling businesses and customers to benefit from this project. On the business side, different hotel businesses will be able to understand the customer’s preferences on which marketing factor they might have to focus on to increase profit. For example, by looking at the booking tendency of different customers, hotels might be able to come up with a better marketing idea to attract more customers. Consumers, on the other hand, would be able to save time on finding different kinds of hotels they want. Just as suggesting the right movies would be important for YouTube or Netflix, being able to make relevant suggestions to visitors would be important for a platform similar to Expedia. Gaining an insight on booking preferences would enable these platforms to be more valuable for both hotels and visitors.
We have encountered different blog posts with the same topic that we’re covering and decided to set one of the blog posts by Wesley Klock (link) as a benchmark for output comparison.
Approach and Rationale
Our goal is to predict the hotel cluster for a user event, based on their search preferences and other attributes related to the user’s activity. This is a classification problem since we have to predict the likelihood that users will stay at different hotel clusters based on the attributes that were provided. There are 100 different clusters based on hotels that have similar characteristics such as historical price, customer star ratings, locations (whether they are close to the city center or landmarks), etc. The competition provided an immense amount of data that we utilized to run different models and to determine which model produces the largest accuracy score.
Our approach to this competition was to explore and pre-process the data that was provided. It is necessary to spend time understanding the data by looking at the booking tendency. Then, we applied different predictive models. From the number of models applied, we determined which produced the best recommendation by looking at the test accuracy score.
We collected the datasets from the past Kaggle Competition “Expedia Hotel Recommendations” (link).
Three datasets were provided: train.csv, test.csv, and destinations.csv. The training data is from 2013 and 2014 and includes both click events and booking events. The test data is from 2015 and includes only booking events. With the test data held out as a later period, the competition asked participants to fit a model on the training set and predict which hotel cluster each test booking belongs to. The destinations dataset contains latent description variables of search regions. It is an unlabeled dataset that combines destination characteristics derived from user reviews. We utilized this dataset as an additional source of features while we pre-processed our data.
In the middle of exploring the data, we found out that the data types are all integers or dates. Therefore, we were not able to find the actual meaning of the integers on different categorical variables. However, it is unnecessary to find the actual meanings of what each value represented as those variables are used as a feature that contributes to predicting the likelihood.
Data Pre-Processing & Exploration
After reading in all three of the datasets provided by Kaggle, we performed data pre-processing only on the training dataset, for three reasons: i) the training dataset included hotel cluster values (the target variable), which allowed for supervised learning and evaluation, whereas the test dataset did not include them; ii) the training dataset was large enough (37.5M+ observations) to accommodate a further train/test split; iii) a leakage issue had been discovered in the test dataset, which meant it added minimal value to predictive models, as predictions for approximately 30% of the data were being overestimated.
After deciding to perform pre-processing steps on the training dataset, our group took a deeper dive into the data to understand what we were working with. Out of the 24 features in the training dataset, three contained NaN values in some observations: ‘orig_destination_distance’, the physical distance between a hotel and a customer at the time of their search; ‘srch_ci’, the search check-in date specified by the user; and ‘srch_co’, the search check-out date specified by the user. Based on the Kaggle competition website, a null value for ‘orig_destination_distance’ meant that the distance could not be calculated. The other two features, ‘srch_ci’ and ‘srch_co’, had datetime data types, and there is no sensible way to impute a missing check-in or check-out date. There were approximately 600K observations containing NaN values, which was only around 1.5% of the total data. We did not want NaN values negatively affecting our analysis, so we decided that dropping all rows with NaN values was the best option, as we still retained a large portion of the original training data.
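The NaN handling described above can be sketched in pandas. This is a minimal illustration on a made-up four-row frame, not our original code; only the three column names come from the dataset, and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are invented for illustration.
train = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "orig_destination_distance": [12.5, np.nan, 830.1, 45.0],
    "srch_ci": ["2014-08-01", "2014-09-10", None, "2014-12-24"],
    "srch_co": ["2014-08-04", "2014-09-12", None, "2014-12-27"],
})

# Share of missing values per column, to see how much data a drop would cost.
missing_share = train.isna().mean()

# Drop every row that has a NaN in any of the three affected columns.
cols_with_nan = ["orig_destination_distance", "srch_ci", "srch_co"]
clean = train.dropna(subset=cols_with_nan).reset_index(drop=True)

print(missing_share)
print(f"kept {len(clean)} of {len(train)} rows")
```

Checking the missing share first makes it easy to confirm that dropping rows only removes a small fraction of the data before committing to it.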
The last pre-processing step was performed on our destinations dataset. The destinations dataset contained 150 features, 149 of which were latent descriptions of user search regions. We understood that these features had the potential to add value to our models. However, we were unable to attribute the values of these latent variables to specific, interpretable characteristics, which made them a good candidate for dimensionality reduction through PCA. The purpose of performing PCA was to reduce the dimensions of the destinations dataset while retaining as much information about the original data as possible. In our code, pictured below, we created a loop to determine the optimal number of principal components. The code also output Scree Plots, which visually represent how much of the variance is explained by each additional principal component. As shown below, three principal components seemed to be the optimal number for our analysis: the marginal variance explained by every additional principal component is minimal, as the percentage values level out from four principal components onward.
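A loop of this kind might look like the sketch below, using scikit-learn's PCA. The data here is synthetic (149 columns generated from three hidden factors plus noise) so that the example runs on its own; it is an assumption standing in for the real latent columns, and the variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the 149 latent columns:
# three strong hidden factors plus a little noise.
hidden = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 149))
latent = hidden @ loadings + 0.1 * rng.normal(size=(500, 149))

# Fit PCA with an increasing number of components and record the
# cumulative share of variance explained, as a scree plot would show.
explained = {}
for n in range(1, 7):
    pca = PCA(n_components=n).fit(latent)
    explained[n] = pca.explained_variance_ratio_.sum()
    print(f"{n} components: {explained[n]:.3f} of variance explained")

# Keep the first three components as the reduced representation.
reduced = PCA(n_components=3).fit_transform(latent)
print(reduced.shape)
```

On data like this, the explained variance levels off after the third component, which is the pattern one looks for on a scree plot.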
After calculating the three principal components, we then merged the latent variables with a random sample of 10K user IDs from the training dataset. The random sample equated to approximately 220K rows. The two datasets were merged based on similar IDs of the destination where the hotel search was performed (‘srch_destination_id’).
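The sampling and merge can be sketched as follows. The frames are toy stand-ins, the column names ‘pc1’ to ‘pc3’ are hypothetical, and the choice of a left join is an assumption made for this illustration. The point is that users are sampled rather than rows, so every record of a sampled user is kept.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy training data: many rows per user, each with a destination id.
train = pd.DataFrame({
    "user_id": rng.integers(0, 50, size=400),
    "srch_destination_id": rng.integers(0, 20, size=400),
})

# Toy reduced destinations table: one row per destination id with
# three principal components (hypothetical column names).
dest_pca = pd.DataFrame({
    "srch_destination_id": np.arange(20),
    "pc1": rng.normal(size=20),
    "pc2": rng.normal(size=20),
    "pc3": rng.normal(size=20),
})

# Sample user ids (not rows) and keep all rows of the sampled users.
sampled_users = rng.choice(train["user_id"].unique(), size=10, replace=False)
sample = train[train["user_id"].isin(sampled_users)]

# Attach the principal components via the shared destination id.
merged = sample.merge(dest_pca, on="srch_destination_id", how="left")
print(merged.head())
```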
Exploratory Data Analysis
The group then went on to understand the data a bit more through data exploration. After all the data pre-processing steps outlined above, our dataset consisted of 29 features: three binary variables; five continuous variables, or numbers representing quantities; eighteen categorical number variables, or integers representing qualitative values; and three latent PCA variables.
We then ran simple functions on the entire data frame to visually understand each feature. As it is shown below, we used the .describe() method to view different metrics per feature such as minimum, maximum, and range.
Additionally, we output histograms per feature to gain a better idea of what distribution each feature follows. Pictured below, it is apparent that the majority of features don’t follow a specific distribution pattern, as the values vary irregularly. All three latent PCA variables, however, have a somewhat normal distribution, with the second principal component being right-skewed.
From this exploratory analysis, we were then able to determine which models to run on our final dataset.
In order to predict the hotel cluster based on the browsing data, we first needed to determine which predictive modeling techniques can be used to train the data. In this case, we need to look at our insights from the exploratory analysis to assist in choosing different techniques.
Looking at the 29 features of our processed dataset, over half of the features were categorical number variables. The remaining features were a mix of continuous floats, binary, and PCA latent variables. In addition, our target variable was a categorical number variable taking one of 100 values, each representing a different hotel cluster.
These categorical number variables can represent hundreds of categories each. For example, ‘user_location_country’ ranges from 0–215 representing 216 different countries. When dealing with categorical variables, it is a common practice to one-hot-encode into categorical columns with binary representation. In this case, to one-hot-encode ‘user_location_country’ would mean adding 215 columns to our dataset, which is not practical. As a result, this dataset poses a number of issues that make it particularly challenging for modeling purposes.
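To make the column growth concrete, here is a small sketch with pandas `get_dummies` on invented values. Each distinct category becomes its own binary column, so a feature with hundreds of categories adds hundreds of columns.

```python
import pandas as pd

# Invented rows; only the column names come from the dataset.
df = pd.DataFrame({
    "user_location_country": [0, 66, 66, 215, 3],
    "is_booking": [0, 1, 0, 0, 1],
})

# One-hot encode the categorical column: one binary column per distinct value.
encoded = pd.get_dummies(df, columns=["user_location_country"])

print(df.shape, "->", encoded.shape)
print(list(encoded.columns))
```

With only four distinct countries in this toy frame the table already grows from two columns to five, and the growth scales with the number of categories.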
The mix of variable types made it difficult to perform feature selection and dimensionality reduction, so we chose models that could handle binary variables, categorical variables, and continuous variables.
We chose the following modeling techniques for our analysis:
· Logistic Regression
· K-Nearest Neighbors (KNN)
· Decision Tree
· Random Forest
Train and Test Split
Before performing any of the predictive modeling techniques, we set our target variable and split our data into a training set (75%) and a testing set (25%).
We then ran a default model to get a baseline accuracy score, which served as a threshold to surpass when experimenting with different modeling techniques.
We decided our default model would be one that always predicts the most frequent cluster from the training set. To do this, we found the most frequent cluster in the training set, created prediction arrays the same length as y_train and y_test consisting only of that cluster, and computed the accuracy score on both the training set and the testing set.
The most frequent hotel cluster in the training set was hotel cluster 91. When predicting this cluster, the default model had a training accuracy of 0.0364 and a test accuracy of 0.0370. Now, when implementing other models, we can ensure that they are adding some sort of value to our prediction accuracy. If other modeling techniques do not get test accuracies above 0.0370, we would know that we are not being very productive with our analysis.
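The split and the default model can be sketched as below. The labels are randomly generated and cluster 91 is made the most frequent on purpose, so the numbers printed will not match the accuracies reported above; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and labels with 100 classes.
X = rng.normal(size=(4000, 5))
y = rng.integers(0, 100, size=4000)
y[:400] = 91  # force cluster 91 to be the most frequent, for illustration only

# 75% training set, 25% testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Default model: always predict the most frequent cluster in the training set.
most_frequent = np.bincount(y_train).argmax()
train_pred = np.full_like(y_train, most_frequent)
test_pred = np.full_like(y_test, most_frequent)

baseline_train_acc = accuracy_score(y_train, train_pred)
baseline_test_acc = accuracy_score(y_test, test_pred)
print(most_frequent, baseline_train_acc, baseline_test_acc)
```

Any real model should beat `baseline_test_acc`; otherwise it is doing no better than guessing the most common class.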
Even though we suspected it would yield a lower test accuracy than the other models, we chose logistic regression as an alternative baseline for comparison, since the default model that predicts the most frequent class yielded such a low accuracy.
We chose only the numerical or binary data from our dataset for this analysis to help with simplicity. There were 10 original features, and we added one additional feature by calculating how long the customers wanted to stay from the start and ending dates of the stay.
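The length-of-stay feature can be derived from the two date columns as in this short sketch. The dates are invented; the column name ‘len_stay’ matches the one used later in this post.

```python
import pandas as pd

# Invented check-in and check-out dates.
df = pd.DataFrame({
    "srch_ci": ["2014-08-01", "2014-12-30"],
    "srch_co": ["2014-08-04", "2015-01-02"],
})

# Parse the strings as dates, then take the difference in whole days.
check_in = pd.to_datetime(df["srch_ci"])
check_out = pd.to_datetime(df["srch_co"])
df["len_stay"] = (check_out - check_in).dt.days

print(df)
```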
We could have added more variables by converting the categorical variables into dummy variables, but chose not to pursue this to keep the analysis simple. We then tried three different feature sets: first all eleven features, then only the three PCA features, and finally the three PCA features plus ‘distance’ and ‘is_booking’, for a total of five features.
The Y variable also had to be one-hot encoded. Since there were 100 different classes for Y, our one-hot encoding yielded 100 columns.
Printing out the counts of the values verified the most often occurring clusters.
Then we ran the logistic regression using the multiclass option, and we chose the one-vs-rest method with the expectation that it would run faster than one-vs-one method.
Using all of the eleven variables yielded the highest test accuracy at 0.0976 but using only the five variables yielded a similar accuracy of 0.0958.
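A one-vs-rest logistic regression can be sketched with scikit-learn as follows. This is not our original code: it uses synthetic data with 11 features and 5 classes, wraps `LogisticRegression` in `OneVsRestClassifier`, and passes integer class labels directly, which scikit-learn accepts without one-hot encoding the target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic multiclass data standing in for the eleven numeric features.
X, y = make_classification(
    n_samples=1500, n_features=11, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One-vs-rest: one binary logistic regression is fitted per class.
model = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(X_train, y_train)

test_acc = model.score(X_test, y_test)
print(f"binary models fitted: {len(model[-1].estimators_)}")
print(f"test accuracy: {test_acc:.4f}")
```

One-vs-rest fits one binary model per class, whereas one-vs-one fits one per pair of classes, which for 100 classes would mean 4,950 models.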
KNN is a straightforward model. The main parameter to tune is the value of K. Using the down-sampled training set, we tried values of K ranging from 1 to 10. We took the K value that gave the highest accuracy score on the test set as the optimum estimator.
When K = 1, the highest accuracy of 0.3040 was achieved.
Double-checking with cross-validation yielded the same result.
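The search over K can be sketched as below. Note one deliberate difference from the procedure above: this sketch picks K by 5-fold cross-validation on the training set and only then scores the chosen model on the test set, which keeps the test set out of the tuning loop. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic multiclass data for illustration.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Mean 5-fold cross-validated accuracy for each K from 1 to 10.
cv_scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X_train, y_train, cv=5).mean()

# Pick the K with the best cross-validated accuracy, then evaluate once.
best_k = max(cv_scores, key=cv_scores.get)
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_acc = final_knn.score(X_test, y_test)
print(best_k, round(test_acc, 4))
```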
However, K=1 has the potential of overfitting. The model has a low bias but high variance, meaning the model may not be as accurate when applied to other datasets. One reason K=1 performed best could be that our test set actually comes from the original training set, so the train and test sets are similar in many ways. Another potential explanation is that in the dataset, one user could have multiple records. Since the same user likely exhibits similar browsing patterns, each data point is not truly independent from another.
Given our mix of variable types, Decision Tree is a modeling technique that provides an effective way to perform feature selection. There are two common criteria for splits on a decision tree: gini and entropy. We decided to test both criteria, testing at maximum tree depths of 2, 5, 10, 15, 17, and 20 for each.
The decision tree that used entropy as the criterion and had a depth of 20 had the highest test accuracy score of 0.3362, much higher than that of our default model.
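The grid over criterion and depth can be sketched like this, again on synthetic data rather than the Expedia sample. Each combination is fitted once and both training and test accuracy are recorded.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic multiclass data for illustration.
X, y = make_classification(
    n_samples=800, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=2,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2
)

# Try both split criteria at each maximum depth.
results = {}
for criterion, depth in product(["gini", "entropy"], [2, 5, 10, 15, 17, 20]):
    tree = DecisionTreeClassifier(
        criterion=criterion, max_depth=depth, random_state=0
    ).fit(X_train, y_train)
    results[(criterion, depth)] = (
        tree.score(X_train, y_train),  # training accuracy
        tree.score(X_test, y_test),    # test accuracy
    )

# Select the combination with the highest test accuracy.
best = max(results, key=lambda key: results[key][1])
print(best, results[best])
```

Comparing the training and test columns side by side shows where deeper trees stop generalizing and start memorizing the training set.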
One thing we considered was whether or not the specific browsing instance resulted in a hotel booking. Some browsing instances result in a booking (‘is_booking’ = 1), and some do not (‘is_booking’ = 0). This made us question whether interactions that resulted in booking a hotel might have some inherent difference to interactions that did not result in booking. We thought maybe if we ran our decision tree model on a subset of data that only included interactions that resulted in booking, we might get a better accuracy score when predicting the hotel cluster on the test set. So, we tested this hypothesis.
Results for subset (is_booking = 1):
It turns out that only including instances that resulted in a booking actually worsened our prediction. Our leading model remained on top, with an accuracy score of 0.3362, found using the entire data set, entropy as the criterion, and a tree depth of 20.
Random Forest, an ensemble method for classification, builds on Decision Trees but adds randomization. Each tree is trained on a bootstrap sample of the observations and considers a random subset of features at each split, which counteracts a single Decision Tree’s habit of overfitting the training dataset. The predictions of the individual trees are then combined, by majority vote or by averaging predicted class probabilities, to give the forest’s overall prediction. Random Forest is an ensemble method because it combines many Decision Tree models, each taking a different look at the same problem.
One of the features included in our final dataset was whether the user actually booked a hotel or only clicked through the website (‘is_booking’). This variable was binary, with a value of 1 meaning that the user booked a hotel and a value of 0 meaning the user clicked through the website without booking any hotel. Because of this, we believed that we might get more accurate results if we ran our model only on observations where the user booked a hotel, rather than on all the observations. For both the full dataset and the bookings-only subset, we ran Random Forest models with a maximum tree depth of 2, 5, 10, 15, 17, or 20, as well as with no depth limit. As maximum depth increased, the training accuracy continued to increase, with the highest training accuracy coming from the models with no depth limit. Test accuracy also increased as maximum depth increased, with the exception of the models with no depth limit. This was most likely due to both of those models overfitting the training set, as they returned a training accuracy of nearly 100%. Excluding the overfitted models, Random Forest with a maximum depth of 20 returned the highest test accuracy for both the bookings-only subset and the full set of observations. The bookings-only subset returned a test accuracy of 0.1913, while the full set, which also includes observations that did not result in a booking, returned a test accuracy of 0.3244.
Our initial hypothesis that the model would return better results with only observations that resulted in a booking was incorrect. When run on all observations, Random Forest returned a test accuracy approximately 1.7 times that of the model run only on bookings.
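The depth sweep for Random Forest can be sketched as follows. The data is synthetic and the number of trees (50) is an assumption chosen to keep the example fast, so the accuracies will not match those reported above. `max_depth=None` corresponds to the run with no depth limit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic multiclass data for illustration.
X, y = make_classification(
    n_samples=800, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=3,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=3
)

# Sweep the maximum tree depth; None means the trees grow without a limit.
scores = {}
for depth in [2, 5, 10, 15, 17, 20, None]:
    forest = RandomForestClassifier(
        n_estimators=50, max_depth=depth, random_state=0
    ).fit(X_train, y_train)
    scores[depth] = (
        forest.score(X_train, y_train),  # training accuracy
        forest.score(X_test, y_test),    # test accuracy
    )

for depth, (train_acc, test_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A training accuracy close to 1.0 paired with a flat or falling test accuracy is the signature of overfitting described above.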
During our research we found mentions of XGBoost to be highly popular in the competition, so we also decided to include this method. We ran XGBoost in the hopes that it would give us a higher test accuracy than other methods.
We tried using two different sets of features, one with all of the columns that we had, and the other with only the numerical or binary columns.
We then searched over a narrow range of hyperparameters. Because of limited computing power, we chose not to go beyond a maximum tree depth of 20. Since our best results came from a maximum depth of 20, it would have helped to test deeper trees to verify whether they improve accuracy further or start to overfit.
The best test accuracy we found from the XGBoost with all features was 0.3372, when learning rate = 0.5 and max depth = 20.
With 11 features the highest accuracy rate was 0.3023, when learning rate = 0.3 and max depth = 20.
When we looked at the feature importance for the model with the highest test accuracy, we found that distance (‘orig_destination_distance’) was the most important feature, that is, the one most often used to split the trees, followed by the PCA features and length of stay (‘len_stay’).
For our overall observations, our best test accuracy was 0.34, which was achieved by XGBoost (0.3372) and Decision Tree (0.3362). Compared to the benchmark, our accuracy was higher because we used a different subset of the data. We did not see much difference between KNN, Decision Tree, Random Forest, and XGBoost, whereas our Benchmark saw a larger improvement with XGBoost. This smaller difference most likely came from our limited tuning of XGBoost.
Conclusion & Insights
The original training dataset included over 37 and a half million rows of data with nearly 170 columns of features. Thus, we first had to down-sample the dataset and reduce its dimensionality using Principal Component Analysis to bring the size of the data to a manageable level before training any predictive models.
A variety of supervised models were trained throughout this project, with PCA serving as an unsupervised pre-processing step. Test-set accuracy was the metric used to evaluate and compare the models and find the one with the most promising result.
Decision Trees and XGBoost were our best models. Even though there is still room for improvement, we still think our predictive model has important real life applications. The model can help Expedia to utilize past user activities to give personalized hotel search results, which could greatly augment user experience and the search quality. Expedia can also use it for advertising, pushing hotel recommendations to targeted users that are most likely to book them.
If given more time, we would have tried to use other models, such as neural networks and other ensemble methods to see if we can increase the prediction accuracy further. We would also want to test the model on data coming from more recent periods, since our training set only covered user data from 2013 to 2014.
Overall, this was a great learning experience to conclude a semester of learning different predictive models. Thank you for reading!
For this assignment, we are planning to use the dataset available from the Kaggle competition “Expedia Hotel Recommendations” (https://www.kaggle.com/c/expedia-hotel-recommendations).
In the dataset provided, we are able to include key factors that are needed to construct an efficient predictive model in determining hotel recommendations for users.
Other resources that we are able to reference throughout our assignment are listed below. They include other competitors’ write-ups and other documents that talk about Expedia’s hotel recommendation system.
Kaggle Competition: Expedia Hotel Recommendations by Gourav G. Shenoy, Mangirish A. Wagle, Anwar Shaikh https://arxiv.org/ftp/arxiv/papers/1703/1703.02915.pdf
Expedia Hotel Recommendations by Genki Oji, Wesley Klock, Andriy Dashko, Emil Häglund https://medium.com/@wesleyklock/expedia-hotel-recommendations-ea6a9d5fbaa7
Predicting Expedia Hotel Cluster Groupings with User Search Queries http://cs229.stanford.edu/proj2016spr/report/065.pdf
Expedia Recommender System http://kaushal-desai.us/expedia-recommender-system/
Predicting Hotel Bookings on Expedia https://towardsdatascience.com/predicting-hotel-bookings-on-expedia-d93b0c7e1411
How to Get Into the Top 15 of a Kaggle Competition Using Python