I worked in a group with four other students*; we were tasked with locating a dataset of our choosing and performing cleaning, EDA, and machine learning steps. We chose an Airbnb dataset because of its varied and interesting features, its sizable number of observations, and the relevance of the subject matter. Each row of the "Listings" dataframe represents a listing on Airbnb, containing information about the property itself as well as statistics about the host.
Our primary objectives were:
- Analyzing neighborhood popularity for superhosts and non-superhosts, based on traffic, types of rooms (Private Rooms/Shared Rooms/Entire Apartment), and price
- Drawing a comparison between the pre- and post-COVID listing prices
- Predicting the type of host (superhosts/non-superhosts) and understanding what attributes contribute to the classification of a superhost
- Predicting the price for Airbnb listings in New York City and understanding what features contribute to profitable business opportunities for Airbnb
Our data came from the four files listed below, sourced from InsideAirbnb and last updated in October 2020.
- Listings (Detailed listings data for New York City.)
- Reviews (Detailed review data for listings in New York City.)
- Calendar (Detailed calendar data for listings in New York City.)
- Neighborhoods (Neighborhoods list for geo filter. Sourced from city or open source GIS files.)
*Yulong Gong, Ruchika Venkateswaran, Yi-shuan Wang, Yangyang Zhou
In the visualization below, we examined the number of hosts on Airbnb over time. To do this, we first created a host dataframe in which each row represents a unique host ID (the original Listings dataframe contains many duplicate host IDs, since one host may have many listings). We used Pandas to calculate a cumulative sum of hosts based on the date each host joined Airbnb, and aggregated this value by year to build the graph. The graph was created with Plotly and exported as an HTML file for display on this website.
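The de-duplicate, group-by-year, and cumulative-sum steps can be sketched as below. The toy data and the column names `host_id` and `host_since` are assumptions standing in for the InsideAirbnb Listings file.

```python
import pandas as pd

# Toy stand-in for the Listings data; "host_id" and "host_since"
# mirror the InsideAirbnb column names but the values are invented.
listings = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3, 4],
    "host_since": pd.to_datetime([
        "2012-05-01", "2012-05-01", "2014-03-15",
        "2014-08-20", "2014-08-20", "2019-01-10",
    ]),
})

# One row per unique host, since one host may have many listings.
hosts = listings.drop_duplicates(subset="host_id")

# Count hosts by join year, then take a running total over time.
per_year = hosts.groupby(hosts["host_since"].dt.year).size()
cumulative = per_year.cumsum()
print(cumulative)
```

The `cumulative` series is what a Plotly line chart would plot: year on the x-axis, total hosts to date on the y-axis.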
In the visualization below, we sorted our hosts based on those that have the largest number of listings. We then used the location data provided in the dataframe to plot the locations of the properties on the map.
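A minimal sketch of that ranking step follows, on invented data; the `host_id`, `latitude`, and `longitude` column names are assumptions based on the InsideAirbnb schema.

```python
import pandas as pd

# Hypothetical slice of the Listings data with location columns.
listings = pd.DataFrame({
    "host_id":   [7, 7, 7, 8, 9, 9],
    "latitude":  [40.71, 40.72, 40.73, 40.68, 40.75, 40.76],
    "longitude": [-74.00, -73.99, -73.98, -73.95, -73.97, -73.96],
})

# Rank hosts by their number of listings and keep the top two.
top_hosts = listings["host_id"].value_counts().nlargest(2).index
top_listings = listings[listings["host_id"].isin(top_hosts)]
print(top_listings)

# These coordinates could then be plotted on a map, e.g. with
# plotly.express.scatter_mapbox(top_listings,
#                               lat="latitude", lon="longitude").
```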
We used the visualization below in our cleaning process. We first split our hosts into Superhosts and non-Superhosts. To become a Superhost, a host must meet a variety of requirements listed by Airbnb, including hosting a certain number of guests in the last year and maintaining good ratings. This visualization reveals a number of price outliers; a few listings are priced at $10,000/night. After following the links to these properties, we determined that these listings were not legitimate. Later in our notebook, we filter the data by price to remove listings that do not appear to be legitimate.
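That price filter amounts to a one-line boolean mask. The $500 cutoff below is illustrative only, not the exact threshold from our notebook.

```python
import pandas as pd

# Toy price column; the real data has one price per listing.
listings = pd.DataFrame({
    "id":    [1, 2, 3, 4],
    "price": [95.0, 150.0, 10000.0, 10000.0],
})

# Keep only listings whose nightly price is below an illustrative
# cutoff, dropping the likely-illegitimate $10,000/night rows.
cleaned = listings[listings["price"] < 500]
print(cleaned)
```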
First, we constructed a classification model to predict whether a host is a Superhost (a binary target). We ran a Random Forest Classifier to determine our most important features. Unsurprisingly, features related to reviews ranked highest. This makes sense: Airbnb makes it clear that the number and quality of reviews make a host more likely to become a Superhost (a status Airbnb assigns manually). In fact, the most important feature was a host's total number of reviews in the last twelve months. We then trained a KNN model with three neighbors on only these top features, a choice we settled on after some trial and error with other models and other numbers of neighbors; it produced the confusion matrix below. Since our target variable was imbalanced (there are many more non-Superhosts than Superhosts), we focused on the sensitivity score, which was about 62.96%, alongside an overall accuracy of 85.05%.
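The two-step workflow (rank features with a Random Forest, then fit KNN on the top-ranked features) can be sketched on synthetic data as below. The data, feature count, and scores here are placeholders, not our project's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, recall_score, accuracy_score

# Synthetic stand-in for the host data: an imbalanced binary target
# (class 1 ~ "Superhost" is the minority, as in the real data).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Step 1: rank features by Random Forest importance, keep the top 3.
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
top = np.argsort(rf.feature_importances_)[::-1][:3]

# Step 2: fit KNN (k=3) on only the top-ranked features.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train[:, top], y_train)
pred = knn.predict(X_test[:, top])

print(confusion_matrix(y_test, pred))
print("sensitivity:", recall_score(y_test, pred))  # recall on class 1
print("accuracy:", accuracy_score(y_test, pred))
```

Sensitivity here is recall on the positive (minority) class, which is why it is the more informative metric for an imbalanced target.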
There are a lot of improvements that could be made to the model to achieve a better accuracy score, including more feature engineering, and trying different machine learning models. We did complete some feature engineering, including aggregating some statistics for each host across all of their listings, but there is more that could be done in this area.
In our second machine learning model, we chose the price of an Airbnb listing as the dependent (target) variable. A major focus of this model was finding an effective way to deal with our outliers (listings with abnormally high prices). One of my team members used a logarithmic scale for price; by making the scale non-linear, she reduced the effect of the outliers, leading to better results. She obtained an r-squared of 53.27% in sample and 53.73% out of sample.
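The log-scale idea can be sketched as below: regressing on log(price) rather than raw price so extreme values exert less leverage. The data is synthetic with multiplicative noise, mimicking a skewed price distribution; the resulting scores are not our project's numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic skewed "prices": exponentiating a linear signal plus
# noise produces the heavy right tail typical of listing prices.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))
price = np.exp(1.0 + X @ np.array([1.5, -0.5, 2.0])
               + rng.normal(0, 0.3, 500))

X_train, X_test, y_train, y_test = train_test_split(
    X, price, random_state=0)

# Fit on log(price) so outliers contribute far less to the loss.
model = LinearRegression().fit(X_train, np.log(y_train))
r2_in = model.score(X_train, np.log(y_train))
r2_out = r2_score(np.log(y_test), model.predict(X_test))
print("in sample:", r2_in, "out of sample:", r2_out)
```

Predictions on the original scale are recovered with `np.exp(model.predict(X_test))`.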
As with the previous model, there is a lot more that could be done to improve this model in the future, including more feature engineering, and experimentation with other machine learning models.