Starbucks Challenge… Accepted!

Project Description: Mining Starbucks customer data — predicting offer success

Introduction and motivation

This project aims to answer a set of questions based on the provided datasets from Starbucks: transactions, customer profiles and offer types.
The main question we will ask, and around which the whole project revolves, is:

What is the likelihood that a customer will respond to a certain offer?

Other questions to be answered are:

About the offers:

About the customers:

About the transactions:

The motivation is to improve targeting of offers to Starbucks’ customers to increase revenue.

We will follow the CRISP-DM data science process standard for accomplishing the data analysis at hand.

Some details about the data:

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. Not all users receive the same offer, and that is the challenge to solve with this data set.

Note from the author:

You can find the technical details and implementation in the notebook on my GitHub, in the folder Starbucks_Capstone_Challenge. This blog post will not go into details about wrangling, cleansing, etc. It gives a high-level overview of the main decisions made, but if you want a deeper view, then visit my GitHub :)

Business Understanding

The motivation is to improve targeting of offers to Starbucks customers to increase revenue.
Our goal therefore is to find a relationship between customers and Starbucks offers based on purchasing patterns from the customers.
Thus, we need to understand the customers included in the datasets, identify groups within them, and assign the best matching offers to these groups.
Therefore, the main question we should answer is:

What is the likelihood that a customer will respond to a certain offer?

Having a model that predicts a customer’s behavior will accomplish the goal of this project.

During the data exploration, other questions related to customers and offers will be formulated as our understanding of the data increases.

Data Understanding

The goal of data understanding is to have an overview of what is in the datasets and already filter the data we need to answer the main question.
After we filter the needed data, we will wrangle and clean it to make modelling possible. After wrangling and cleaning, we will explore the data further to extract additional questions we could answer based on its new form.

We first need to define a set of metrics to be able to assess whether an offer suits a particular customer (assessing whether we answered the question correctly with our model).


We have a classification problem (customer-offer) and data to train a model. Thus, we will use supervised learning models and evaluate them with:

i. Accuracy (number of correct predictions divided by the total number of predictions),
ii. F-score with beta=0.5: (1+beta²)*Precision*Recall/(beta²*Precision + Recall). The F-score combines precision (True_Positive / (True_Positive + False_Positive)) and recall (True_Positive / (True_Positive + False_Negative)); beta=0.5 weights precision more heavily than recall.

The data seems balanced, but nonetheless, the F-score might come in handy to choose between the top models in case their accuracies are similar.
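To make the two metrics concrete, here is a minimal sketch that computes accuracy and the F-score with beta=0.5 by hand, using the standard F-beta definition. The labels in `y_true` and `y_pred` are made up for illustration and do not come from the Starbucks data.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def fbeta(y_true, y_pred, beta=0.5):
    """F-beta score: beta < 1 weights precision more heavily than recall."""
    precision, recall = precision_recall(y_true, y_pred)
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denominator


# Made-up labels: 1 = offer was successful, 0 = offer was not successful.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]

acc = accuracy(y_true, y_pred)
f_half = fbeta(y_true, y_pred, beta=0.5)
print(f"accuracy={acc:.3f}, F0.5={f_half:.3f}")
```

In practice you would probably rely on `accuracy_score` and `fbeta_score` from scikit-learn rather than hand-written helpers; the sketch only spells out the arithmetic.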

Definitions from Medium blogpost. For a better understanding of precision and recall, Wikipedia does a great job!


Offer types: portfolio.json


Customer demographics: profile.json


Transactions: transcript.json


Overall observations

Since we need all data points to answer our question (there is no attribute that we could delete without further knowledge), and since everything has the potential to be correlated, we will merge the 3 datasets into one after wrangling and cleaning.

Data Preparation



Common to all three datasets: for better readability, we will rename the attributes so that the units of each column are clear and so that we can link the datasets as we already surmised.


Undesirable value detection:




We will change the column names to add the units and proper names.


Undesirable value detection:




We will rename the columns so they can be combined later on. We have to think about how to collapse the records from a user into one row, so we can join the 3 datasets. This will require feature engineering. This last sentence is related to the ‘unsure’ of points 2 and 4.

In essence our final dataset should have pairs of customers and offers, together with a score for how well that offer did with the customer. The score can be binary, whether it worked or not. This score must be distilled from this transcript dataset.


Undesirable value detection:


We would like to combine all datasets and feed the result to a model. The transcript dataset contains information about successful and unsuccessful offers, about the purchases of the customers and about the rewards they have received.
We have two id columns that we can use as foreign keys for the primary keys in the other two datasets to combine them. However, we cannot do that yet. We first have to distill the transcript dataset to obtain the valuable information that will allow us to predict whether an offer will be accepted by an individual in the future.
So first of all, here is what the ideal dataset would look like:

Offer id |…offer properties…| customer id | …customer qualities… | success/no_success| profit | Viewed/Not viewed | Received/not_received
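As a rough illustration of how the two id columns could act as foreign keys, here is a small pandas sketch. All column names (`offer_id`, `customer_id`, `difficulty_usd`, and so on) and all values are hypothetical placeholders, not the actual columns of the notebook; `pairs` stands in for the already distilled customer-offer table.

```python
import pandas as pd

# Hypothetical, tiny stand-ins for the three cleaned datasets.
portfolio = pd.DataFrame({
    "offer_id": [0, 1],
    "offer_type": ["bogo", "discount"],
    "difficulty_usd": [10, 10],
    "reward_usd": [10, 2],
})
profile = pd.DataFrame({
    "customer_id": ["a", "b"],
    "age": [35, 62],
    "income_usd": [52000, 71000],
})
# One row per (customer, offer) pair, distilled from the transcript.
pairs = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "offer_id": [0, 1, 1],
    "success": [1, 0, 1],
    "viewed": [1, 1, 1],
})

# Left joins keep every customer-offer pair; validate= raises an error
# if an id is unexpectedly duplicated in the lookup tables.
merged = (
    pairs
    .merge(portfolio, on="offer_id", how="left", validate="many_to_one")
    .merge(profile, on="customer_id", how="left", validate="many_to_one")
)
print(merged)
```

Using left joins from the pair table means a pair is never silently dropped when a customer or offer is missing from a lookup table; it simply shows up with missing values that can be inspected afterwards.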

In order to get this dataset, we need to assess the success of the offer. For that, we need to attach to the transcript dataset information from the portfolio: offer duration, reward, difficulty and type for data exploration.

Something to note is that a customer might not spend exactly the amount of money needed to fulfill the offer. That is interesting, but unfortunately, because offers overlap, you cannot really assign a profit to an offer-customer pair aside from the obvious one of difficulty minus reward.

What we will do for the profit attribute:

Further considerations:

The algorithm to find successful offers per customer (feature engineering):
1. Group by customer
2. Loop through each customer (for)
2.1. Loop through each offer (for)
2.1.1. Bogos and discounts: distill information about success, viewed, received, effective time and profit. Get the number of received offers. Iterate sequentially over the events, ordered by time and event, and find viewed or completed events. Depending on what was seen before, offers will have been successful or not
2.1.2 Informational: idem, but there is no concept of success
2.1.3 Non-offer: find the gaps where no offer was active and use these gaps to add the profit

*Profit is measured at the end customer, of course, not considering how much it costs to sell the product.

*Transaction periods can overlap
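The loop structure above can be sketched in plain Python. This is a deliberately simplified sketch, not the notebook's implementation: it only covers BOGO and discount offers, ignores profit and the non-offer gaps, and looks only at the first time a customer receives a given offer. The event names, the field names (`customer_id`, `time_h`, `event`, `offer_id`), the rule "viewed before completed, within the offer duration" and the 168 h duration of offer 0 are all assumptions made for illustration.

```python
from collections import defaultdict


def label_offers(events, duration_h):
    """Return {(customer_id, offer_id): {"received", "viewed", "success"}}.

    Assumed rule: an offer is successful if it was viewed before it was
    completed and the completion falls within the offer's duration.
    """
    # 1. Group by customer.
    by_customer = defaultdict(list)
    for event in events:
        by_customer[event["customer_id"]].append(event)

    labels = {}
    # 2. Loop through each customer.
    for customer_id, customer_events in by_customer.items():
        ordered = sorted(customer_events, key=lambda e: e["time_h"])
        offer_ids = {e["offer_id"] for e in ordered if e.get("offer_id") is not None}
        # 2.1. Loop through each offer this customer interacted with.
        for offer_id in offer_ids:
            received_at = None
            viewed = False
            success = False
            # Walk the events in time order and remember what was seen before.
            for e in ordered:
                if e.get("offer_id") != offer_id:
                    continue  # transactions and other offers are skipped here
                if e["event"] == "offer received" and received_at is None:
                    received_at = e["time_h"]
                elif e["event"] == "offer viewed" and received_at is not None:
                    viewed = True
                elif e["event"] == "offer completed" and received_at is not None:
                    in_window = e["time_h"] <= received_at + duration_h[offer_id]
                    success = success or (viewed and in_window)
            labels[(customer_id, offer_id)] = {
                "received": received_at is not None,
                "viewed": viewed,
                "success": success,
            }
    return labels


# Made-up events; times are in hours since the start of the simulation.
events = [
    {"customer_id": "a", "time_h": 0, "event": "offer received", "offer_id": 6},
    {"customer_id": "a", "time_h": 6, "event": "offer viewed", "offer_id": 6},
    {"customer_id": "a", "time_h": 12, "event": "transaction", "offer_id": None},
    {"customer_id": "a", "time_h": 12, "event": "offer completed", "offer_id": 6},
    {"customer_id": "b", "time_h": 0, "event": "offer received", "offer_id": 6},
    {"customer_id": "b", "time_h": 30, "event": "offer completed", "offer_id": 6},
    {"customer_id": "b", "time_h": 168, "event": "offer received", "offer_id": 0},
    {"customer_id": "b", "time_h": 170, "event": "offer viewed", "offer_id": 0},
]
duration_h = {6: 240, 0: 168}  # the 168 h for offer 0 is a hypothetical value

labels = label_offers(events, duration_h)
print(labels)
```

Customer "b" completes offer 6 without ever viewing it, so under the assumed rule that completion does not count as a success: the purchase would have happened anyway. Handling repeated or overlapping instances of the same offer needs extra assumptions that this sketch leaves out.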

With all this done, we can now answer the initial questions we had, except for the main one (that will come after modelling).


About the offers

The maximum duration offers are 4 and 6
With a duration of: 240 h
The most rewarding offers are 0 and 1
With a reward of: 10 $

About the customers:

More females.

It seems that females and males have more or less the same income in this dataset. But it also seems that there are more women than men who earn above the average. (The comparisons are not perfect because there are around 3000 fewer women than men, so the sample of women is less representative.) It is worth extracting more insights, as the distributions seem different. Let us check the wasserstein_distance:

W_dist = 8824.9037

Indeed, the difference is pretty high between the distributions of men and women.
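For readers unfamiliar with the Wasserstein distance between two one-dimensional samples, here is a small pure-Python sketch of what it measures: the area between the two empirical cumulative distribution functions. The income values below are made up; on real data, SciPy's `scipy.stats.wasserstein_distance` computes the same quantity.

```python
def wasserstein_1d(xs, ys):
    """First Wasserstein distance between two 1-D samples.

    Computed as the area between the two empirical CDFs.
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        # Advance to count how many values in each sample are <= left.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        cdf_gap = abs(i / len(xs) - j / len(ys))
        total += cdf_gap * (right - left)
    return total


# Made-up incomes in dollars; the second sample is the first shifted by 5000.
incomes_a = [50000, 60000, 70000]
incomes_b = [55000, 65000, 75000]
shifted = wasserstein_1d(incomes_a, incomes_b)

# The two samples do not need to have the same size.
unequal = wasserstein_1d([0, 2], [1])
print(shifted, unequal)
```

The distance is expressed in the same units as the data (dollars here). When one distribution is just a shifted copy of the other, as in the made-up example, the distance equals the shift, which is also the difference between the two means.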

Mean salary for women is 70566.20827461942
Mean salary for men is 61741.30454620141
In this dataset, women earn more money than men on average, by: 8824.903728418009 $
Median salary for women is 66000.0
Median salary for men is 63000.0
In this dataset, however, the medians of the two genders are not so far apart: 3000.0 $
The std of the salary for women is 20981.542952480755
The std of the salary for men is 18774.535245943607
In this dataset, the stds of the two genders are not so far apart either: 2207.007706537148 $

W_dist = 4.7037

Indeed pretty similar distributions between female and male.

There are noticeable jumps every 2 years (mid-2015 and mid-2017). Perhaps they correspond to new campaigns or improvements in the app. It is also interesting that in 2018 the number of new memberships dropped for the first time, so perhaps new competitors entered the market.

About the transactions:

Most males and females prefer offer 6, which is a discount of 2 dollars after buying products worth 10 dollars, with the longest duration of 10 days, and it reaches customers through all media: mobile, social and web. The second most liked is offer 5, which is also a discount. Third place goes to offer 1 for males (bogo) and offer 8 for females (bogo). But the differences between female and male preferences are not that large.

There are not so many customers with high income. The target group is people who earn between 50k and 75k. For them, the most preferred offer ids are 5 and 6; in the table below we see that they prefer discount offers.

Senior adults are the biggest clientele, and they prefer offer ids 5 and 6 (discount type on the plot below). This leads me to think that most of them also have a medium-low income.

This is something we could have expected: each offer has the same distribution, which suggests that the time of becoming a member does not have an effect on which offers customers might prefer.

This plot must be taken with a grain of salt. Under our assumptions, bogos are not profitable, as they produce 0 profit. But they serve other purposes, and their success is not measured by profit. The most profitable offers are 7 and 2, which are informational. However, the profit of informational offers is based on the spending in a period of time during which there were other offers as well. We can clearly conclude that, among discounts, offer 6 is the most profitable one.

They have made more money without the offers


We have converted all the categorical values into dummy variables (including the offer_ids, which are a category in themselves). We have also scaled all numeric values between 0 and 1.
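As a minimal sketch of this preparation step, the snippet below one-hot encodes the categorical columns with pandas and min-max scales the numeric ones to the range 0 to 1. The column names and values are hypothetical placeholders, and the notebook may well use a scikit-learn scaler instead of the manual formula shown here.

```python
import pandas as pd

# Hypothetical merged dataset with two categorical and two numeric columns.
df = pd.DataFrame({
    "offer_id": [5, 6, 6],
    "gender": ["F", "M", "F"],
    "income_usd": [50000.0, 75000.0, 100000.0],
    "age": [30.0, 45.0, 60.0],
})
categorical = ["offer_id", "gender"]
numeric = ["income_usd", "age"]

# Dummy variables: offer_id is treated as a category, not as a number.
encoded = pd.get_dummies(df, columns=categorical, dtype=float)

# Min-max scaling: (x - min) / (max - min) maps each numeric column to [0, 1].
# A constant column would divide by zero and needs separate handling.
col_min = encoded[numeric].min()
col_max = encoded[numeric].max()
encoded[numeric] = (encoded[numeric] - col_min) / (col_max - col_min)

print(encoded)
```

When a train/test split is involved, the minimum and maximum should be computed on the training set only and then reused for the test set, so that no information leaks from the test data.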

The classifiers that we have chosen are:

For classification, we will use:

We first check which one performs better.

They all provide F-scores (beta = 0.5) and accuracy scores of around 0.6 and 0.7 respectively. Nonetheless, the model that performs best is Gradient Boosting, although only by a hair.

Here is the comparison data training on the full sample:

Training with all samples
Training time = 164.17853903770447
Testing time = 33.015408754348755
Test Accuracy = 0.6959884635553225
Test Fscore = 0.582545208095869

Training time = 0.2846059799194336
Testing time = 0.0351099967956543
Test Accuracy = 0.7047718930256948
Test Fscore = 0.5984216117772044

Training time = 0.16101694107055664
Testing time = 0.0060138702392578125
Test Accuracy = 0.7007953155042824
Test Fscore = 0.5910921218455107

Training time = 4.785567045211792
Testing time = 0.03660392761230469
Test Accuracy = 0.7081803880440483
Test Fscore = 0.6026470491641762

We picked Gradient Boosting and I wanted to optimize its parameters, so I performed a grid search:

parameters = {
    'learning_rate': [0.1, 0.5, 1],
    'max_depth': [3, 4],
    'n_estimators': [100, 125, 150],
}

Nonetheless, max_depth and n_estimators stayed at their defaults, 3 and 100 respectively, and the only difference was the learning rate, which went up to 0.5. The accuracy and F-score barely moved.
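A grid search of this kind could be set up roughly as follows with scikit-learn. This is a sketch under assumptions: the data is synthetic (`make_classification`), and the 3-fold cross-validation, the fixed `random_state` and the choice to optimize the F-score with beta=0.5 are illustrative and not confirmed settings of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared customer-offer features and labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

parameters = {
    "learning_rate": [0.1, 0.5, 1],
    "max_depth": [3, 4],
    "n_estimators": [100, 125, 150],
}

# Rank parameter combinations by the F-score with beta=0.5 (assumed choice).
scorer = make_scorer(fbeta_score, beta=0.5)
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    parameters,
    scoring=scorer,
    cv=3,
)
grid.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set.
best_model = grid.best_estimator_
test_predictions = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
test_fscore = fbeta_score(y_test, test_predictions, beta=0.5)

print(grid.best_params_)
print(f"accuracy={test_accuracy:.4f}, F0.5={test_fscore:.4f}")
```

The grid has 3 x 2 x 3 = 18 combinations, each fitted once per fold, so the cost grows quickly as values are added; that is worth keeping in mind before widening the search.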


The best model to predict if an offer will be successful is Gradient Boosting. However, 70% is not such a high accuracy, although it is better than a human guess. Grid search did not show much improvement, so further tuning should be carried out. We saw that the learning rate went from 0.1 to 0.5, while the rest of the parameters stayed the same. The next logical step would be to try a learning rate of 0.75 (as 1 was not chosen) and to change other parameters.

Unoptimized model
Accuracy score on testing data: 0.7082
F-score on testing data: 0.6026

Optimized Model
Final accuracy score on the testing data: 0.7105
Final F-score on the testing data: 0.6086
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.5, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='auto',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,

Challenges faced

There are two things to note. The first and minor one is the amount of time spent wrangling, cleaning and exploring. This work is crucial; otherwise the model will not predict anything adequately, or might not work at all. Additionally, the process should be automated as soon as possible.

The second one that I found most challenging and where I spent most of my time was in feature engineering. Distilling information from the transaction dataset was a tough problem as the offers not only overlap, but also the same offers would overlap, thus you could not differentiate one from the other within an interval. A set of assumptions had to be made, being as logical and realistic as possible, and based on them, program an algorithm to execute a logic that would work given any combination of sequences.

Feature engineering was crucial: without it, you would not know which offers were successful and which ones were not. Time spent on it is time saved in modeling. Tuning your parameters might be a waste of time if (i) your input to the model is not correct, or (ii) your feature engineering did not capture the information as precisely as possible; getting both right is what improves the accuracy.


With respect to the model, we could perform further tuning to improve the scores. Overall, the score is not too high, but of course it is better than human intuition, as there are many parameters to take into consideration.

Summary of the end-to-end solution

We followed the CRISP-DM process to answer the following question: What is the likelihood that a customer will respond to a certain offer?

We started by wrangling, cleaning and exploring the three datasets given: a portfolio with the characteristics of each offer, a profile with the demographics of the clients, and a ledger with transactions from customers. This allowed us to answer minor questions related to the data.

Because this is a classification problem, we chose accuracy and F-score as our metrics.

Once each individual dataset was pre-processed (including a feature engineering step for the ledger of transactions in order to distill when an offer was a success or not), we merged all of them in order to have all the information in one dataset, which is necessary to feed a classifier model.

We selected the label on which the model would be trained, which was the success attribute. After that, we created the features to feed the model (dropping the label) and compared the performance of different classifiers, namely: random forest, logistic regression, gradient boosting, and support vector machines. Gradient boosting performed the best, so we selected it and ran a grid search to tune the parameters for better performance. However, the scores remained at around 60% for F-score and 70% for accuracy.

Future work would consist of creating an application, a pipeline, to automate this whole process.


We have wrangled, cleansed and explored 3 datasets. We have studied their content individually. We have calculated the target label for the model, the success of an offer. Afterwards, we merged all these datasets, further explored the insights of the merged data and, finally, prepared it and fed it into a model. The model chosen was Gradient Boosting, and the accuracy and F-score were 0.7 and 0.6 respectively, which are not too high, but given the number of variables, it is a better prediction than the 1/offer_ids chance of a human guess.

I am a Ph.D in computer science interested in privacy enhancing techniques, namely differential privacy.