airbnb – Analysis of Variance

Abstract

Airbnb Business Problem:

Where will a new guest book their first travel experience?

Instead of waking to overlooked “Do not disturb” signs, Airbnb travellers find themselves rising with the birds in a whimsical treehouse, having their morning coffee on the deck of a houseboat, or cooking a shared regional breakfast with their hosts.

New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the average time to first booking, and better forecast demand.

In this recruiting competition, Airbnb challenges you to predict in which country a new user will make his or her first booking. Kagglers who impress with their answer (and an explanation of how they got there) will be considered for an interview for the opportunity to join Airbnb’s Data Science and Analytics Team.

Introduction

Airbnb posed the above-mentioned business problem.

Through the use of ANOVA we aim to see if the country of destination is affected by the other variables in the data set.

What is our hypothesis?

H0: The other variables have no effect on country of destination.

H1: At least one of the other variables has an effect on country of destination.

Data Source:

Data Set Information:

The data set comes from https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings.

It is a real data set that Airbnb placed on Kaggle as part of a recruitment competition.

Airbnb posed the question “Where will a new guest book their first travel experience?”

The dataset contains 16 Variables and 213451 observations.

Data

Visualise, explore and clean data

Bring in data – Run summary and graph data: observations from outcome listed below

Original dataset contains 16 Variables and 213451 observations.

Final dataset contains 17 variables and 66707 observations

Tools used: R and Excel.
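As a quick sanity check, the row counts quoted in this write-up can be chained together. This is only a sketch of the arithmetic: the intermediate figures (124543 blank bookings, 67059 rows after the age filter, 352 blank first_affiliate_tracked rows) are taken from the comments in the R code in the reference section, and the variable names are my own.

```r
# Row counts quoted in this write-up (no data file needed)
original_rows      <- 213451  # observations in the original dataset
blank_bookings     <- 124543  # rows with a blank date_first_booking
after_booking_step <- original_rows - blank_bookings

after_age_step     <- 67059   # rows left after keeping ages 18 to 80
blank_affiliate    <- 352     # rows with a blank first_affiliate_tracked
final_rows         <- after_age_step - blank_affiliate

print(after_booking_step)                            # 88908
print(final_rows)                                    # 66707
print(round(100 * blank_bookings / original_rows))   # 58 (% of accounts with no booking)
```

The numbers line up: removing the blank bookings leaves 88908 rows, and removing the blank affiliate rows after the age filter gives the final 66707.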

| Variable | Missing values | Comments before cleaning | Comments after cleaning |
|---|---|---|---|
| id | No | Customer ID; not really needed | |
| date_account_created | No | Same date as timestamp_first_active | |
| timestamp_first_active | No | Same date as account created. Could split into date and time columns, or remove as not needed? | Didn’t use this column, but time could be interesting. |
| date_first_booking | Yes: 124543 | Remove all blanks, as no booking made. QUESTION: is this the date the booking was made or the date of travel? | Cleaned; all blanks gone |
| gender | No | male/female/other/unknown. See if gender split might be relevant. What to do with unknown? (unknown is the biggest) | Unknown reduced, female/male similar. Unknown kept; too big to lose. |
| age | Yes: 87990 | Ranges from age 1 to 2014. Clean data to 18-80? | Cleaned; slightly skewed |
| signup_method | No | Google/Facebook/Basic. Basic is biggest | Basic is biggest |
| signup_flow | No | | |
| language | No | en is biggest | en is biggest |
| affiliate_channel | No | Direct is biggest | Direct is biggest |
| affiliate_provider | No | Direct is biggest | Direct is biggest |
| first_affiliate_tracked | Yes: 6065 | Might all go when date_first_booking is cleaned | Still 352 missing |
| signup_app | No | Web is biggest | Web is biggest |
| first_device_type | No | Mac Desktop and Windows Desktop are biggest | More Mac than Windows after cleaning |
| first_browser | No | Mixed outcome | Still mixed |
| country_destination | No | Categorical. NDF is biggest; will go when date_first_booking is cleaned. US is next biggest | NDF gone. US still biggest at 71%; next closest is other at 11% |
| country_code | No | | New column created for ANOVA |

Quick observations:

The US is by far the most popular destination, with en as the most popular language.

Most business appears to come direct to Airbnb, through the web.

A substantial share of users are on Apple products.

[Figure: airbnb_top_3]

Reminder of our hypothesis:

H0: The other variables have no effect on country of destination.

H1: At least one of the other variables has an effect on country of destination.

Outcome of ANOVA:

[Figure: airbnb_anova]

The stars that R prints beside each variable show how significant that variable's effect on country of destination is: *** means p < 0.001, ** means p < 0.01 and * means p < 0.05. The more stars, the stronger the evidence against the null hypothesis.

Because several variables are significant at these levels, we reject the null hypothesis and accept the alternative: some variables do have an effect on country of destination.

The ANOVA output shows no significant effect of language or signup_method on country of destination. first_browser has some significance (*), first_affiliate_tracked and signup_app have more (**), and the most significant (***) are age, gender, signup_flow, affiliate_channel, affiliate_provider and first_device_type.

Looking deeper, we would expect the signup and affiliate variables to affect country of destination, as both types of variable guide the consumer to the Airbnb website in some way. For example, an affiliate may be steering consumers towards a particular destination.
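For anyone who wants to read the significance levels off programmatically rather than from the printed stars, here is a minimal sketch. It uses simulated data, not the Airbnb file, and p_to_stars is a hypothetical helper of my own that applies R's usual cut points.

```r
# Hypothetical helper: map p-values to R's usual significance codes
p_to_stars <- function(p) {
  ifelse(is.na(p), "",
  ifelse(p < 0.001, "***",
  ifelse(p < 0.01,  "**",
  ifelse(p < 0.05,  "*",
  ifelse(p < 0.1,   ".", "")))))
}

# Simulated data for illustration only (NOT the Airbnb data)
set.seed(42)
n <- 300
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = n / 3)),
  noise = factor(sample(c("x", "y"), n, replace = TRUE))
)
# The response depends strongly on `group` and not on `noise`
toy$y <- rnorm(n, mean = c(a = 0, b = 1, c = 2)[as.character(toy$group)])

fit <- aov(y ~ group + noise, data = toy)
tab <- summary(fit)[[1]]     # the ANOVA table as a data frame
p   <- tab[["Pr(>F)"]]       # one p-value per term; NA for the Residuals row

result <- data.frame(term    = trimws(rownames(tab)),
                     p_value = p,
                     stars   = p_to_stars(p))
print(result)
```

The same extraction should work on a fitted model such as model_1 in the reference code, since summary() of an aov fit returns the table in the same shape.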

Parametric or Non-Parametric?

I think it is fair to say that the data in this dataset are not normally distributed. Some variables are close to normal, e.g. age, which we cleaned to give a better distribution, but others definitely are not, e.g. US as a destination and ‘en’ as language. Any further analysis should therefore use non-parametric tests.

There is a definite skew towards travel in the US, which accounts for 71% of bookings.

Language is heavily skewed towards ‘en’, which accounts for 97% of bookings.
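To give an idea of what such non-parametric follow-up tests could look like, here is a rough sketch on simulated data. It is not something I ran on the Airbnb file. It assumes a Kruskal-Wallis test for a numeric variable (age) across destinations and a chi-square test of independence for two categorical variables; both choices are my suggestion, and the toy proportions are invented.

```r
# Simulated data for illustration only (NOT the Airbnb data)
set.seed(1)
n <- 600
toy <- data.frame(
  country_destination = factor(sample(c("US", "FR", "other"), n,
                                      replace = TRUE, prob = c(0.7, 0.1, 0.2))),
  language = factor(sample(c("en", "fr", "zh"), n,
                           replace = TRUE, prob = c(0.90, 0.05, 0.05))),
  age = round(runif(n, min = 18, max = 80))
)

# Kruskal-Wallis: does the distribution of age differ between destinations?
kw <- kruskal.test(age ~ country_destination, data = toy)
print(kw)

# Chi-square test of independence: is language associated with destination?
# simulate.p.value = TRUE avoids relying on large expected counts in every cell
chi <- chisq.test(table(toy$language, toy$country_destination),
                  simulate.p.value = TRUE, B = 2000)
print(chi)
```

Neither test assumes normality, and neither needs the destination to be turned into a number first.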

Just out of curiosity I ran an ANOVA on the dataset excluding US as a destination.

The result was quite interesting:

[Figure: airbnb_anova_no_us]

Here is a comparison:

| Variable name | Dataset with US | Dataset without US |
|---|---|---|
| age | *** | *** |
| language | | *** |
| gender | *** | *** |
| signup_method | | |
| signup_flow | *** | *** |
| affiliate_channel | *** | ** |
| affiliate_provider | *** | ** |
| first_affiliate_tracked | ** | |
| signup_app | ** | *** |
| first_device_type | *** | *** |
| first_browser | * | * |

Note how language has now become a very significant factor, once US as a destination has been removed. The affiliate variables without the US have reduced in significance also.

To summarise, there are a number of issues that I think Airbnb needs to address going forward:

Data quality – There are definitely some issues with the way Airbnb captures its data. These areas include:

    • Gender – categories need to be defined; it is not clear whether ‘unknown’ is computer generated.
    • Age – lower and upper limits should be enforced.
    • first_affiliate_tracked – allows null values.
    • date_first_booking – the name suggests the date the booking was made, but could it be the date of travel?

No bookings – 124543 accounts (58%) were created without generating a booking. Why?

  • There is an opportunity here to incentivise a booking or to capture further data. A quick click box asking why the customer is leaving the site (e.g. just browsing, booking elsewhere, will book at a later date) would give some insight.

Country of destination – There appears to be a difference in approach between travel to the US and travel to the other destinations (as per the ANOVA). This needs to be addressed or monitored to see whether it is a cultural issue, i.e. are most bookings made by US citizens travelling within the US? Or is the US an established model while Airbnb is only growing its business elsewhere? Understanding this will determine whether or not the US should be dealt with as a separate model from the rest.

Language – ‘en’ is definitely an influencing factor, with 97% of bookings having ‘en’ as language. The next most popular languages are ‘zh’, ‘fr’ and ‘es’, which together make up 54% of the remaining 3% of bookings.

This could be investigated to see whether other factors influence the bookings made in these languages.

Gender – An interesting observation: the gender split does show some differing trends, e.g. women book more than men for US, FR, GB and IT.

This data is useful and can be used when designing the website.
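A gender split by destination like the one described above can be produced with a simple cross-tabulation. The sketch below uses simulated data with made-up proportions, so the output says nothing about the real split; only the column names mirror the dataset.

```r
# Simulated data for illustration only (NOT the Airbnb data)
set.seed(7)
n <- 1000
toy <- data.frame(
  gender = factor(sample(c("female", "male", "unknown", "other"), n,
                         replace = TRUE, prob = c(0.40, 0.38, 0.20, 0.02))),
  country_destination = factor(sample(c("US", "FR", "GB", "IT"), n,
                                      replace = TRUE, prob = c(0.70, 0.12, 0.10, 0.08)))
)

# Counts of bookings by gender (rows) and destination (columns)
counts <- table(toy$gender, toy$country_destination)

# Share of each gender within a destination: every column sums to 1
shares <- prop.table(counts, margin = 2)
print(round(shares, 2))
```

Comparing the female and male rows column by column shows where one group books more than the other.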

I would recommend that Airbnb capture the following data:

  • country of residence
  • country of departure
  • date of travel
  • duration of stay
  • whether the trip is one destination or multiple destinations

This data would give more insight into consumer behaviour and would be a good indicator of country of destination. It could answer questions like: Are consumers using Airbnb to travel within their own countries? How far ahead are consumers booking? Do consumers use Airbnb for overnight stays or holidays? Are consumers using Airbnb when they are on a long trip travelling the world?

Conclusion

To conclude, we haven’t quite answered the question “Where will a new guest book their first travel experience?”, but there is a pretty good chance it will be the US.

However, we have gained some insight to the dataset and were able to make recommendations going forward to help answer this question.

REFERENCE:

R code used, with comments:

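# Read in data file
# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0), which the plot() calls below rely on
airbnbData <- read.csv("train_users_2.csv", header = TRUE, stringsAsFactors = TRUE)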

# Get a feel for the data
head(airbnbData)
str(airbnbData)

# Look for NA’s and outliers, anything odd
summary(airbnbData)
# OUTPUT – Question marks???
# date_first_booking: 124543 blank fields, so no booking made, take out
# gender: 4 categories, unknown, female, male and other – gender m/f split quite not sure if I want to take out unknown?
# age: NA’s: 87990, outliers min age 1, max age 2014 – adjust to reasonable 18-80?
# first_affiliate_tracked: 6065 blank fields. Not sure this variable relevant?

# Visualise Data
plot(airbnbData$country_destination) # ‘NDF’ most – as above, these represent where no booking exists
plot(airbnbData$gender) # ‘unknown’ by far greatest
plot(airbnbData$age) # visual all over place as expected
hist(airbnbData$age) # data needs cleaning
plot(airbnbData$signup_method) # ‘basic’ most popular by far
plot(airbnbData$language) # ‘en’ most popular by far
plot(airbnbData$affiliate_channel) # ‘direct’ most popular by far
plot(airbnbData$affiliate_provider) # ‘direct’ most popular by far, followed by Google
plot(airbnbData$first_affiliate_tracked) # check this after blanks taken out
plot(airbnbData$signup_app) # ‘web’ most popular by far
plot(airbnbData$first_device_type) # ‘Mac Desktop’ and ‘Windows Desktop’ largest
plot(airbnbData$first_browser) # mixed
# create new dataset – clean to have: only bookings and deal with age
airbnbNew <- airbnbData

# could use these to take out unneeded variables if file too big!
# take out id
#airbnbNew[1] <- NULL
#head(airbnbNew)

# take out timestamp
#airbnbNew[2] <- NULL
#head(airbnbNew)

# remove all rows with no date_first_booking – should result in 88908 obs (ref: Excel Clean Data)
airbnbNew <- subset(airbnbData, date_first_booking != "")
summary(airbnbNew)

# making sure all NDF are gone
plot(airbnbNew$country_destination)

# give age an appropriate range – should result in 67059 obs (ref: Excel Clean Data)
airbnbNew <- subset(airbnbNew, age >= 18 & age <= 80)

summary(airbnbNew)
# OUTPUT – compare to first summary
# date_first_booking: blank fields gone now
# gender: 4 categories, unknown, female, male and other.
# age: NA’s: all gone, min 18, max 80, mean 36, as expected.
# first_affiliate_tracked: only 352 blank fields now (originally 6065 blank fields). Take out

# Take out blank first_affiliate_tracked
airbnbNew <- subset(airbnbNew, first_affiliate_tracked != "")
summary(airbnbNew)

plot(airbnbNew$gender) # unknown much smaller, female most, close behind male
hist(airbnbNew$age) #this data is slightly skewed. (Excel : age 18 to 60 is less skewed, more normal)
plot(airbnbNew$signup_method) # still ‘basic’ most popular by far
plot(airbnbNew$language) # still ‘en’ most popular by far
plot(airbnbNew$affiliate_channel) # still ‘direct’ most popular by far
plot(airbnbNew$affiliate_provider) # still ‘direct’ most popular by far, followed by Google
plot(airbnbNew$first_affiliate_tracked) # use the cleaned dataset, like the other plots in this group
plot(airbnbNew$signup_app) # still ‘web’ most popular by far
plot(airbnbNew$first_device_type) # still ‘Mac Desktop’ and ‘Windows Desktop’ largest, bigger gap now.
plot(airbnbNew$first_browser) # still mixed

# Create new dataset and Create a new column with country_code – for ANOVA
airbnbClean <- airbnbNew

airbnbClean$country_code[airbnbClean$country_destination == 'US'] <- 1
airbnbClean$country_code[airbnbClean$country_destination == 'GB'] <- 2
airbnbClean$country_code[airbnbClean$country_destination == 'ES'] <- 3
airbnbClean$country_code[airbnbClean$country_destination == 'FR'] <- 4
airbnbClean$country_code[airbnbClean$country_destination == 'IT'] <- 5
airbnbClean$country_code[airbnbClean$country_destination == 'AU'] <- 6
airbnbClean$country_code[airbnbClean$country_destination == 'CA'] <- 7
airbnbClean$country_code[airbnbClean$country_destination == 'DE'] <- 8
airbnbClean$country_code[airbnbClean$country_destination == 'NL'] <- 9
airbnbClean$country_code[airbnbClean$country_destination == 'PT'] <- 10
airbnbClean$country_code[airbnbClean$country_destination == 'other'] <- 11

# Create ANOVA model
model_1 <- aov(country_code~age+language+gender+signup_method+signup_flow+affiliate_channel+affiliate_provider+first_affiliate_tracked+signup_app+first_device_type+first_browser,data = airbnbClean)
summary(model_1)
# OUTPUT
# language, signup_method – no significance, take out

model_2 <- aov(country_code~age+gender+signup_flow+affiliate_channel+affiliate_provider+first_affiliate_tracked+signup_app+first_device_type+first_browser,data = airbnbClean)
summary(model_2)

# Curious??? Feel that US is having effect on these models
# Take out US to see the difference
no_us_data <- airbnbClean
no_us_data <- subset(no_us_data, country_destination != "US")
summary(no_us_data)

model_no_us <- aov(country_code~age+language+gender+signup_method+signup_flow+affiliate_channel+affiliate_provider+first_affiliate_tracked+signup_app+first_device_type+first_browser,data = no_us_data)
summary(model_no_us)
# OUTPUT – change in significance levels.
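
As an aside, the eleven country_code assignments above could be collapsed into a single lookup. This is just a sketch of an alternative, using the same codes as the original; country_lookup and the sample dest vector are hypothetical names for illustration.

```r
# Hypothetical lookup table using the same codes as the assignments above
country_lookup <- c(US = 1, GB = 2, ES = 3, FR = 4, IT = 5, AU = 6,
                    CA = 7, DE = 8, NL = 9, PT = 10, other = 11)

# A few example destinations (illustrative values only)
dest <- c("US", "FR", "other", "PT")

# as.character() makes the lookup work for factors as well as strings
codes <- unname(country_lookup[as.character(dest)])
print(codes)  # 1 4 11 10
```

On the real data frame the equivalent line would be airbnbClean$country_code <- unname(country_lookup[as.character(airbnbClean$country_destination)]).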