Starbucks Capstone Challenge

Dec 17, 2020

Introduction

When we talk about coffee shops, Starbucks is sure to come to mind. Global companies like it run many kinds of offers for their customers in order to increase sales and profit. Offer and advertisement types and criteria now change very quickly with technology, and companies compete with each other to do this best. In this project, we will create a model that helps Starbucks predict which customers will benefit from an offer. It can also provide an estimate before an offer is launched, and a correction afterwards to increase the benefit.

The data sets for this project are provided by Starbucks & Udacity in three files:

portfolio.json — containing offer ids and metadata about each offer (duration, type, etc.)
profile.json — demographic data for each customer
transcript.json — records for transactions, offers received, offers viewed, and offers complete

Here is the schema and explanation of each variable in the files:

portfolio.json

id (string) — offer id
offer_type (string) — type of offer ie BOGO, discount, informational
difficulty (int) — minimum required spend to complete an offer
reward (int) — reward given for completing an offer
duration (int) — time for offer to be open, in days
channels (list of strings)

profile.json

age (int) — age of the customer
became_member_on (int) — date when customer created an app account
gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
id (str) — customer id
income (float) — customer’s income

transcript.json

event (str) — record description (ie transaction, offer received, offer viewed, etc.)
person (str) — customer id
time (int) — time in hours since start of test. The data begins at time t=0
value — (dict of strings) — either an offer id or transaction amount depending on the record

portfolio data :

profile data:

transcript data:
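The three files can be loaded with pandas. The post does not show the loading code, so this is a minimal sketch that assumes each file stores one JSON record per line; the helper name `load_json_lines` and the two sample records are made up so the example runs without the real files.

```python
import io

import pandas as pd


def load_json_lines(source):
    """Read a JSON-lines source (one record per line) into a DataFrame."""
    return pd.read_json(source, orient="records", lines=True)


# Made-up stand-in for portfolio.json (assumption: one JSON object per line).
sample = io.StringIO(
    '{"id": "offer_a", "offer_type": "bogo", "difficulty": 10, '
    '"reward": 10, "duration": 7, "channels": ["email", "mobile"]}\n'
    '{"id": "offer_b", "offer_type": "discount", "difficulty": 20, '
    '"reward": 5, "duration": 10, "channels": ["web", "email"]}\n'
)

portfolio_sample = load_json_lines(sample)
print(portfolio_sample)
```

With the real data, the same helper would be called with a file path such as `"portfolio.json"` instead of the in-memory sample.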

The problem / Metrics

The problem I chose to solve is to build a model that predicts whether a customer will respond to an offer. My strategy has four steps. First, I merge the customer profile and transaction data. Second, I build a model, which gives me a baseline for evaluating other models. Third, I compare the performance of the DecisionTreeClassifier and LinearRegression models. Fourth, I tune the model to get the highest accuracy. I chose accuracy as the metric because it measures how well a model predicts whether an offer is successful: it is the number of correct predictions divided by the total number of predictions.
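To make the metric concrete, here is a small sketch of accuracy computed by hand and with scikit-learn's `accuracy_score`. The labels and predictions are made up for illustration and are not taken from the project data.

```python
from sklearn.metrics import accuracy_score

# Made-up example: 1 means the offer was completed, 0 means it was not.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

# Accuracy = correct predictions / total predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)
```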

Data cleaning

portfolio:

1-Rename the id column to offer_id

2-Change duration from days to hours
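The two portfolio steps above can be sketched as follows. The function name `clean_portfolio` and the sample rows are hypothetical; only the column names come from the schema.

```python
import pandas as pd


def clean_portfolio(portfolio: pd.DataFrame) -> pd.DataFrame:
    """Rename id to offer_id and convert duration from days to hours."""
    cleaned = portfolio.rename(columns={"id": "offer_id"})
    cleaned["duration"] = cleaned["duration"] * 24
    return cleaned


# Made-up rows that follow the portfolio schema.
portfolio = pd.DataFrame(
    {
        "id": ["offer_a", "offer_b"],
        "offer_type": ["bogo", "discount"],
        "difficulty": [10, 20],
        "reward": [10, 5],
        "duration": [7, 10],  # days
    }
)

portfolio_clean = clean_portfolio(portfolio)
print(portfolio_clean)
```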

The output will look like this:

profile:

1-Remove customers with N/A income data

2-Change the name of the ‘id’ column to ‘user_id’

3-Convert the became_member_on column to a date format

4-Drop rows with no gender, income, or age data

5-Add start_year, start_month, and start_day columns

6-Convert gender values to numeric 0 and 1
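A rough sketch of the profile steps above. Several details are assumptions because the post does not state them: that became_member_on is an integer in YYYYMMDD form, that M maps to 0 and F to 1, and that any other gender value becomes missing after the mapping. The function name `clean_profile` and the sample rows are made up.

```python
import pandas as pd


def clean_profile(profile: pd.DataFrame) -> pd.DataFrame:
    """Apply the profile cleaning steps to a copy of the input."""
    # Drop rows that lack gender, income, or age.
    cleaned = profile.dropna(subset=["gender", "income", "age"]).copy()
    cleaned = cleaned.rename(columns={"id": "user_id"})

    # Assumption: the integer date is encoded as YYYYMMDD.
    cleaned["became_member_on"] = pd.to_datetime(
        cleaned["became_member_on"].astype(str), format="%Y%m%d"
    )
    cleaned["start_year"] = cleaned["became_member_on"].dt.year
    cleaned["start_month"] = cleaned["became_member_on"].dt.month
    cleaned["start_day"] = cleaned["became_member_on"].dt.day

    # Assumption: M -> 0, F -> 1; any other value becomes NaN.
    cleaned["gender"] = cleaned["gender"].map({"M": 0, "F": 1})
    return cleaned


# Made-up rows that follow the profile schema.
profile = pd.DataFrame(
    {
        "id": ["u1", "u2", "u3"],
        "gender": ["F", None, "M"],
        "age": [55, 40, 33],
        "income": [72000.0, None, 51000.0],
        "became_member_on": [20170715, 20180212, 20160403],
    }
)

profile_clean = clean_profile(profile)
print(profile_clean)
```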

The output will look like this:

transcript:

1-Rename the person column to user_id

2-Create separate columns for amount, reward, and offer_id from the value column

3-Add a new column to transcript

4-Change the time column from hours to days
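Steps 1, 2, and 4 above could look roughly like this. Step 3 is left out because the post does not name the new column. The dictionary keys inside value (`"amount"`, `"reward"`, `"offer id"`, `"offer_id"`) are assumptions for this sketch, as are the function name `clean_transcript` and the sample rows.

```python
import pandas as pd


def clean_transcript(transcript: pd.DataFrame) -> pd.DataFrame:
    """Rename person, unpack the value dict, and convert hours to days."""
    cleaned = transcript.rename(columns={"person": "user_id"})
    values = cleaned["value"]

    # Assumed keys; a missing key yields a missing value in the new column.
    cleaned["amount"] = values.apply(lambda v: v.get("amount"))
    cleaned["reward"] = values.apply(lambda v: v.get("reward"))
    cleaned["offer_id"] = values.apply(
        lambda v: v.get("offer id", v.get("offer_id"))
    )

    cleaned["time"] = cleaned["time"] / 24  # hours -> days
    return cleaned.drop(columns=["value"])


# Made-up rows that follow the transcript schema.
transcript = pd.DataFrame(
    {
        "person": ["u1", "u1", "u1"],
        "event": ["offer received", "transaction", "offer completed"],
        "time": [0, 12, 24],
        "value": [
            {"offer id": "offer_a"},
            {"amount": 12.5},
            {"offer_id": "offer_a", "reward": 10},
        ],
    }
)

transcript_clean = clean_transcript(transcript)
print(transcript_clean)
```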

The output will look like this:

Explore Data:

Number of customers by gender:

The number of male customers is greater than the number of female and other customers.

Distribution of customer by year:

Most customers joined Starbucks in 2017.

Check the average age by gender:

On average, female customers are older than male customers.

Check the average income by gender:

On average, female customers have a higher income than male customers.
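The gender comparisons above come down to a count and a grouped mean. A minimal sketch on made-up rows (the numbers below are illustrative, not the project's results):

```python
import pandas as pd

# Made-up profile rows; gender is kept as a label for readability.
profile = pd.DataFrame(
    {
        "gender": ["M", "M", "F", "O", "M", "F"],
        "age": [40, 50, 60, 45, 30, 70],
        "income": [50000, 60000, 80000, 65000, 55000, 90000],
    }
)

# Number of customers per gender.
gender_counts = profile["gender"].value_counts()

# Average age and income per gender.
avg_by_gender = profile.groupby("gender")[["age", "income"]].mean()

print(gender_counts)
print(avg_by_gender)
```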

Offer event:

There are 4 event types: offer received, offer viewed, transaction, and offer completed.

Offer type:

We have three types of offers. BOGO and discount are used more than informational.

Summary statistics for the portfolio data: the average difficulty is 7.7, the average duration is 156 (in hours, after the conversion above), and the average reward is 4.2.

Summary statistics for the profile data: the average age is 62, the number of users is 1700, and the average income is 65404.

Combining the data sets into a final clean data set

I merge the two cleaned data sets, profile and transaction, into one data frame.
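A merge like the one described above can be sketched with `DataFrame.merge`. The join key `user_id` follows the renamed columns from the cleaning steps; the inner join and the sample rows are assumptions, since the post does not show the merge call.

```python
import pandas as pd

# Made-up cleaned transcript rows.
transcript_clean = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2", "u9"],
        "event": ["offer received", "offer completed", "transaction", "transaction"],
        "time": [0.0, 1.0, 0.5, 2.0],
    }
)

# Made-up cleaned profile rows.
profile_clean = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3"],
        "age": [55, 33, 47],
        "gender": [1, 0, 1],
        "income": [72000.0, 51000.0, 64000.0],
    }
)

# Assumption: an inner join, so events without a cleaned profile are dropped.
merged = transcript_clean.merge(profile_clean, on="user_id", how="inner")
print(merged)
```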

The output of the clean, combined data looks like this

Modeling

I build a model that predicts whether a customer will respond to an offer, using only the transcript records that have an offer id. I use accuracy to measure how well the model predicts whether an offer is successful: it is the number of correct predictions divided by the total number of predictions. That is why I chose accuracy as the metric.

Features:

1-Time (normalized)

2-Amount (normalized)

3-Reward (normalized)

4-Age (normalized)

5-Gender (normalized)

6-Income (normalized)

The Target is:

Offer Completed.

The models that I used are DecisionTreeClassifier and LinearRegression.

Split data into train and test:

I split the data into train and test sets and normalized it.
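A minimal sketch of the split and normalization step. The post does not say which scaler or split ratio was used, so `MinMaxScaler`, a 25% test size, and the random data below are assumptions. The scaler is fitted on the training set only and then applied to the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up data with the six feature names listed in the post.
features = ["time", "amount", "reward", "age", "gender", "income"]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 6)), columns=features)
y = (X["amount"] > 0.5).astype(int)  # made-up binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training data only, then reuse it on the test data.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```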

After that, I built a machine learning pipeline for the DecisionTreeClassifier prediction to simplify the code.

DecisionTreeClassifier prediction accuracy: 0.67
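A pipeline of the kind described above might look like the sketch below. The step names, the choice of `MinMaxScaler`, and the synthetic data from `make_classification` are assumptions, so the printed accuracies will not match the figure reported in this post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the six-feature project data.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scaling and the classifier are chained so both are fitted in one call.
tree_pipeline = Pipeline(
    [
        ("scale", MinMaxScaler()),
        ("tree", DecisionTreeClassifier(random_state=42)),
    ]
)
tree_pipeline.fit(X_train, y_train)

# For a classifier, score() returns accuracy.
train_accuracy = tree_pipeline.score(X_train, y_train)
test_accuracy = tree_pipeline.score(X_test, y_test)
print(train_accuracy, test_accuracy)
```

Comparing the training and test accuracy in this way is how a gap between the two becomes visible.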

LinearRegression

Linear regression attempts to model the relationship between variables by fitting a linear equation to observed data. I used the LinearRegression model because it gave a good prediction score: I got an 83% accuracy rate.
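One detail worth keeping in mind when reproducing this: in scikit-learn, `LinearRegression.score` returns the R² coefficient, not classification accuracy. The post does not say how the 83% was computed. The sketch below shows one possible way to get an accuracy from a linear regression, assuming predictions are thresholded at 0.5; the threshold and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the six-feature project data.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)

# score() on a regressor is R^2, which is not the same as accuracy.
r2 = reg.score(X_test, y_test)

# Assumption: turn continuous outputs into 0/1 labels with a 0.5 threshold.
predicted_labels = (reg.predict(X_test) >= 0.5).astype(int)
accuracy = accuracy_score(y_test, predicted_labels)

print(r2, accuracy)
```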

The results show that the DecisionTreeClassifier model has a training accuracy of 100% and a test accuracy of 67%. A gap this large suggests that the DecisionTreeClassifier model overfit the training data. To avoid overfitting, I will choose LinearRegression, since it got better and more consistent results: 83% on the training set and 83.3% on the test set. I also think LinearRegression is a better fit here because the target has only a few binomial outcomes.

I used GridSearchCV to improve the DecisionTreeClassifier model and got 71%, an increase of 4.9%. The best params were {'max_depth': 13, 'min_samples_split': 140}. I think it doesn't need further improvement. LinearRegression, at 83%, is still better than the tuned DecisionTreeClassifier, so I will choose LinearRegression.

GridSearchCV lets you combine an estimator with a grid search preamble to tune hyperparameters. The method picks the optimal parameter from the grid search and uses it with the estimator selected by the user. GridSearchCV inherits the methods from the classifier, so yes, you can use the .score, .predict, etc.. methods directly through the GridSearchCV interface.

https://datascience.stackexchange.com/questions/21877/how-to-use-the-output-of-gridsearch#:~:text=GridSearchCV%20lets%20you%20combine%20an,yes%2C%20you%20can%20use%20the%20.
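A hedged sketch of tuning a decision tree with GridSearchCV. The candidate values in the grid, the 5-fold cross-validation, and the synthetic data are assumptions; the post only reports the best parameters it found, so the result here will differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the six-feature project data.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Assumed grid; only the two parameter names come from the post.
param_grid = {
    "max_depth": [3, 7, 13],
    "min_samples_split": [2, 40, 140],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# After fitting, the search object exposes predict() and score() directly.
best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
test_predictions = search.predict(X_test)

print(best_params, test_accuracy)
```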

Conclusion

The goal of this project is to predict how customers will interact with the offers that Starbucks sends. First, I took the provided data and cleaned it. Then I made the changes needed to analyze the data before exploring it. Finally, I used two models: with DecisionTreeClassifier I got 67% accuracy, and with LinearRegression I got an 83% accuracy rate. I used GridSearchCV to improve the DecisionTreeClassifier model and got 71%, an increase of 4.9%, and I think it doesn't need further improvement. As a result of modeling the data, we found that female customers are more responsive to the offers, and Starbucks should adjust its offers based on this result. Moreover, Starbucks can rerun the model after each offer to measure the real benefit and aim its offers at the right audience.

Challenge

The challenging part was deciding which problem to solve and how to improve the model. I really enjoyed working on this project, and I will build many more models to help my company.

Improvements

I think I have reached a good point. To make the results better, I would try to improve the data gathering and fix null values. I would also try to get more data about customers and products. I think having more attributes would improve the model further.

GitHub

https://github.com/Leenaalshaibani/Starbucks_Capstone_notebook_final

Written by Lina Alshaibani