Starbucks Challenge… Accepted!

Gonzalo Munilla Garrido
18 min read · Oct 24, 2020

Project Description: Mining Starbucks customer data — predicting offer success

Introduction and motivation

This project aims to answer a set of questions based on the provided datasets from Starbucks: transactions, customer profiles and offer types.
The main question we will ask, and around which the whole project revolves, is:

What is the likelihood that a customer will respond to a certain offer?

Other questions to be answered are:

About the offers:

  • Which one is the longest offer duration?
  • Which one is the most rewarding offer?

About the customers:

  • What is the gender distribution?
  • How are different genders distributed with respect to income?
  • How are different genders distributed with respect to age?
  • What is the distribution of new memberships along time?

About the transactions:

  • Which offers are preferred according to gender?
  • Which offers are preferred according to income?
  • Which offers are preferred according to age?
  • Which offers are preferred according to date of becoming a member?
  • Which are the most successful offers?
  • Which are the most profitable offers?
  • Which are the most profitable among the informational offers?
  • How much money was earned in total with offers Vs. without offers?

The motivation is to improve targeting of offers to Starbucks’ customers to increase revenue.

We will follow the CRISP-DM data science process standard for accomplishing the data analysis at hand.

Some details about the data:

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. Not all users receive the same offer, and that is the challenge to solve with this data set.

Note from the author:

You can find the technical details and implementation in the notebook from my Github in the folder Starbucks_Capstone_Challenge. This blog post will not go into details about wrangling, cleansing, etc. It will give a high-level overview of the main decisions made, but if you want to get a deeper view, then visit my Git :)

Business Understanding

The motivation is to improve targeting of offers to Starbucks customers to increase revenue.
Our goal therefore is to find a relationship between customers and Starbucks offers based on purchasing patterns from the customers.
Thus, we need to understand the customers included in the datasets, identify groups within them, and assign the best matching offers to these groups.
Therefore, the main question we should answer is:

What is the likelihood that a customer will respond to a certain offer?

Having a model that predicts a customer’s behavior will accomplish the goal of this project.

During the data exploration, other questions related to customers and offers will be formulated, as our understanding of the data increases.

Data Understanding

The goal of data understanding is to have an overview of what is in the datasets and already filter the data we need to answer the main question.
After we filter the needed data, we will proceed to wrangle and clean the data to make modelling possible. After wrangling and cleaning, we will explore further the data to extract additional questions we could answer based on its new form.

We first need to define a set of metrics to be able to assess whether an offer suits a particular customer (assessing whether we answered the question correctly with our model).

Metrics

We have a classification problem (customer-offer) and data to train a model. Thus, we will use supervised learning models and the following metrics:

i. Accuracy (the number of correct predictions divided by the total number of predictions),
ii. F-score with beta=0.5, defined as (1 + beta²) * Precision * Recall / (beta² * Precision + Recall). The F-score combines precision (True_Positive / (True_Positive + False_Positive)) and recall (True_Positive / (True_Positive + False_Negative)).

The data seems balanced, but nonetheless, the F-score might come in handy to choose between top models in case accuracies are similar.
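As a minimal sketch of how these two metrics are computed (scikit-learn assumed; the label arrays here are toy stand-ins for the real predictions):

from sklearn.metrics import accuracy_score, fbeta_score

# Toy labels: 1 = offer was successful for this customer, 0 = it was not
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)        # correct predictions / all predictions
f_score = fbeta_score(y_true, y_pred, beta=0.5)  # beta < 1 weighs precision more than recall
print(f"Accuracy: {accuracy:.2f}, F-score (beta=0.5): {f_score:.2f}")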

Definitions adapted from a Medium blog post. And for a better understanding of precision and recall, Wikipedia does the job great!

Datasets

Offer types: portfolio.json

  • id (string) — offer id
  • offer_type (string) — type of offer ie BOGO, discount, informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for offer to be open, in days
  • channels (list of strings)

Comments:

  1. There are 10 types of offers for one product (as specified in the description), each characterized by 6 attributes.
  2. They are a mixture of integers (3), strings (2) and arrays of strings (1).
  3. There are no null values.
  4. The offers have an average reward of 4, a duration of 6.5 days and a difficulty of 7.7.
  5. The domain of the integer attributes is small (0–20).
  6. The median (50th percentile) is not too far from the mean, thus, the integer columns should be somewhat balanced.
  7. Cross-checking with the skewness, we see that duration is balanced, while reward and difficulty are somewhat unbalanced, though not extremely so.
  8. It is a very small dataset in terms of bytes.
  9. We clearly see the types of categories that channels and offer_type have.
  10. There are no duplicated values.

Customer demographics: profile.json

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

Comments:

  1. There are 17000 customer records (one per customer) and there are 5 attributes to characterize each.
  2. They are a mixture of numeric values (2 ints and a float) and strings (2).
  3. There are some null values in gender and income. The counts are the same, so most probably they occur in the same records. ~13% is not negligible, but we will nonetheless keep these records in the analysis and see whether this group of people who do not share their gender has a particular preference for a type of offer. It is also interesting that these records all have an age of 118, so something probably went wrong during collection.
  4. The average salary
  5. The domain of the integer attributes is reasonable, with the income column having the largest range.
  6. The median (50th percentile) is not too far from the mean, thus, the integer columns should be somewhat balanced.
  7. Cross-checking with the skewness, we see that income is balanced, while age and became_member_on are somewhat unbalanced, though not extremely so.
  8. It is a relatively small dataset in terms of bytes.
  9. There are no duplicated values.

Transactions: transcript.json

  • event (str) — record description (ie transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since start of test. The data begins at time t=0
  • value — (dict of strings) — either an offer id or transaction amount depending on the record

Comments

  1. There are 1226136 transactions recorded in the dataset.
  2. The attributes are strings, and one int.
  3. There are no null values.
  4. The domain for time is around 30 days, which is larger than the longest offer duration of 10 days. This indicates that we are also measuring purchases past the offer window.
  5. The time attribute is balanced.
  6. It is a medium-sized dataset in terms of bytes, much larger than the other datasets.
  7. There are no duplicated values.

Overall observations

Seeing that we need all attributes to answer our question (there is no attribute we could delete without further knowledge), and that everything has the potential to be correlated, we will merge the 3 datasets into one after wrangling and cleaning.

Data Preparation

Portfolio

Wrangling

  1. reward: no change
  2. Channels: create 4 new columns with binary values for the training of the model
  3. difficulty: no change
  4. duration: change to hours to use the same units as the other datasets
  5. offer_type: create 3 new columns with binary values for the training of the model
  6. id: Convert it into an increasing integer ID for easier representation later

Common to all attributes: for better readability, we will rename them so that the column names indicate their units and allow us to link the datasets as we already surmised.
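A minimal sketch of this wrangling, assuming pandas and that the raw file sits at portfolio.json; the new column names (channel_*, offer_type_*, duration_h, offer_id) are my own naming and may differ from the notebook:

import pandas as pd

portfolio = pd.read_json('portfolio.json', orient='records', lines=True)

# Binary columns for the channels (the lists become indicator columns)
channel_dummies = pd.get_dummies(portfolio['channels'].explode()).groupby(level=0).max()
channel_dummies.columns = ['channel_' + c for c in channel_dummies.columns]

# Binary columns for the offer type
type_dummies = pd.get_dummies(portfolio['offer_type'], prefix='offer_type')

portfolio = pd.concat([portfolio, channel_dummies, type_dummies], axis=1)
portfolio['duration_h'] = portfolio['duration'] * 24              # days -> hours
portfolio['offer_id'] = range(len(portfolio))                     # readable integer id
offer_id_map = dict(zip(portfolio['id'], portfolio['offer_id']))  # to relabel the transcript later
portfolio = portfolio.drop(columns=['channels', 'duration', 'id'])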

Cleansing

Undesirable value detection:

  1. Missing values: No
  2. Duplicates: No
  3. Incorrect values: No. We trust Starbucks that the offer portfolio is correct, as there is no way for us to verify it.
  4. Irrelevant: Each row is relevant because it belongs to a distinct offer we will have to match with customers. The dataset is not large; we do not need PCA to see that the channel_email column does not explain any variability (all its values are identical), so we can drop it.

Measures:

  1. Replace: No
  2. Modify: No
  3. Delete: channel_email

Profile

Wrangling

  1. age: No changes
  2. became_member_on: transform into date and time
  3. gender: Create dummy variables with M, F, O, and missing_gender. We keep the missing values because they do not seem random, as income has the same number of missing values
  4. id: transform it into an easier id to read
  5. income: No changes

We will change the column names to add the units and proper names.
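A rough sketch of these steps (pandas assumed; the names customer_id and customer_id_map are illustrative helpers, not necessarily the notebook's):

import pandas as pd

profile = pd.read_json('profile.json', orient='records', lines=True)

# Membership date as a proper datetime (e.g. 20170725 -> 2017-07-25)
profile['became_member_on'] = pd.to_datetime(profile['became_member_on'].astype(str), format='%Y%m%d')

# Dummy columns for gender, keeping the missing values as their own indicator
profile['gender'] = profile['gender'].fillna('missing_gender')
profile = pd.concat([profile, pd.get_dummies(profile['gender'], prefix='gender')], axis=1)

# Readable integer customer id, with a lookup table to relabel the transcript later
profile['customer_id'] = range(len(profile))
customer_id_map = dict(zip(profile['id'], profile['customer_id']))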

Cleansing

Undesirable value detection:

  1. Missing values: income. Gender was taken care of in the wrangling step.
  2. Duplicates: No
  3. Incorrect values: age
  4. Irrelevant:

Measures:

  1. Replace: replace income nans with the average of the column. Replace the age of 118 with the average age.
  2. Modify: no
  3. Delete: gender_dummy column (AFTER EXPLORATION OF ALL THE DATASETS COMBINED)
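A possible implementation of the replacements above (a standalone sketch that re-reads the raw file for clarity):

import pandas as pd

profile = pd.read_json('profile.json', orient='records', lines=True)

# Replace the placeholder age of 118 with the mean of the plausible ages
valid_age_mean = profile.loc[profile['age'] != 118, 'age'].mean()
profile.loc[profile['age'] == 118, 'age'] = valid_age_mean

# Replace missing incomes with the column average
profile['income'] = profile['income'].fillna(profile['income'].mean())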

Transcript

Wrangling

  1. person: replace the ids with the ones from the previous dataset so it can be connected to the customer id
  2. event: no changes. We will not use this column for prediction and it is useful to have it in this format for cleaning and wrangling and visualization.
  3. time: no changes
  4. value: make each dict key a column and each dict value the cell value (a sketch of this expansion follows right after this list). Once we do that, we transform offer ids into the easier-to-read ids defined before and leave the NaNs as they are (dealt with in cleansing). For the transaction amounts, we could replace the NaNs with 0. For the offer ids that are NaN (meaning there is only an amount), we can replace the NaN with a number higher than the last offer id, indicating that there was no offer. For reward, we set NaNs to 0; we must check whether this coincides with offer completed events.
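A minimal sketch of the value-column expansion (the relabelling of offer ids and the NaN handling happen later in cleansing; the resulting column names depend on the raw dict keys):

import pandas as pd

transcript = pd.read_json('transcript.json', orient='records', lines=True)

# Turn each dict key of 'value' into its own column; keys missing for an event become NaN
value_df = transcript['value'].apply(pd.Series)
transcript = pd.concat([transcript.drop(columns=['value']), value_df], axis=1)
print(value_df.columns.tolist())  # inspect which keys the events actually carry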

We will rename the columns so they can be combined later on. We have to think about how to collapse the records of a user into one row so we can join the 3 datasets. This will require feature engineering. This last sentence relates to the open points in items 2 and 4.

In essence our final dataset should have pairs of customers and offers, together with a score for how well that offer did with the customer. The score can be binary, whether it worked or not. This score must be distilled from this transcript dataset.

Cleansing

Undesirable value detection:

  1. Missing values: offer_id, amount, reward.
  2. Duplicates: No
  3. Incorrect values: offer ids are floats and not ints (probably because there are NaNs in the same column and somehow it affected the conversion)
  4. Irrelevant:

Measures:

  1. Replace: offer_id nans with the value 10 (one above the last offer id). amount and reward nans will be replaced with a 0
  2. Modify: offer_id into int again
  3. Delete: none

We would like to combine all datasets to feed it to a model. The transcript dataset contains information about successful and unsuccessful offers, about the purchasing of the customers and about the rewards they have retrieved.
We have two id columns, we can use them as foreign keys for the primary keys in the other two datasets to combine them. However, we cannot do that yet. We have to distill the data of the transcript data set to obtain the valuable information that will allow us to prognosticate if an offer will be accepted or not by an individual in the future.
So, first of all, this is how the ideal dataset should look:

Offer id | …offer properties… | customer id | …customer qualities… | success/no_success | profit | viewed/not_viewed | received/not_received

In order to get this dataset, we need to assess the success of each offer. For that, we need to attach to the transcript dataset information from the portfolio (offer duration, reward, difficulty and type), which also helps with data exploration; a merge sketch follows after the list below.

  • Success column: an offer is successful if a user has purchased the amount of the difficulty before the offer expires. Thus, we need the duration and difficulty. That applies to BOGOs and discounts. Informational offers are considered successful if the customer bought something during their period.
  • Profit column: we could make predictions and comparisons between groups with this column. For this we need difficulty minus reward. However, this is not the focus of the question.
  • Viewed column: we need the event column of the transcript.
  • Received column: we need the event column of the transcript.
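A sketch of that merge, continuing the earlier wrangling sketches (a transcript and a portfolio sharing a readable offer_id column are assumed):

# Attach the offer properties needed to judge success to every transcript row;
# the '_offer' suffix avoids clashing with the reward column from the events
transcript = transcript.merge(
    portfolio[['offer_id', 'offer_type', 'duration_h', 'difficulty', 'reward']],
    on='offer_id', how='left', suffixes=('', '_offer'))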

Something to note is that a customer might not spend exactly the amount of money needed to fulfill the offer. That is interesting, but unfortunately, because offers overlap, you cannot really assign a profit to an offer-customer pair aside from the obvious one of difficulty minus reward.

What we will do for the profit attribute:

  • BOGO and discount profit = difficulty minus reward (note that BOGOs will have a profit of 0)
  • informational = the amount of dollars transacted during its period. We could also subtract the rewards obtained in that period, but that would only be useful if we also accounted for the other offers completed at the same time. This is out of scope for my questions, so the word profit is not completely right for informational offers.
  • non_offer = the amount of dollars outside any offer period

Further considerations:

  • For viewed or completed to happen, at least received has had to happen.
  • There is no complete offer event outside the limit of time (offer leaves the app)
  • There is no viewed offer event after the limit of time (offer leaves the app)
  • There can be a viewed event after the complete event, in which case the offer is a failure.
  • Offers, be they the same or different, can overlap, which makes calculating which offer was successful trickier.
  • You can get the same offer in the same interval of time; you could get this combination: received offer, received offer, complete offer, view offer, complete offer, view offer. You might think that at least one of the offers was successful, as a view happens before a completion at least once. That is, in my view, wrong. We assume that events of the same offer type come sequentially, so the first time you see it, it belongs to the first offer you received; that is why, in that sequence, both offers of the same type failed. We would need an identifier that says which offer a view or completion belongs to in order to distinguish between views and completions of the same offer type.
  • For the non-offers, we count only the gaps not influenced by an offer. Even if the offer was completed, we still consider its influence.

The algorithm to find successful offers per customer (feature engineering):
1. Group by customer
2. Loop through each customer (for)
2.1. Loop through each offer (for)
2.1.1. Bogos and discounts: distill information about success, viewed, received, effective time and profit.
2.1.1.1 Get the amount of received offers
2.1.1.2 Iterate through the events ordered by time and find viewed or completed events. Depending on what was seen before, offers will have been successful or not
2.1.2 Informational: idem, but there is no concept of success
2.1.3 Non-offer: find the gaps where no offer was active and use these gaps to add the profit

*Profit is viewed from the end customer's side, of course, not considering how much it costs to sell the product.

*Transaction periods can overlap
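To make the logic more concrete, here is a deliberately simplified sketch for BOGO and discount offers only; it ignores the same-offer-overlap rule and the informational and non-offer cases discussed above, and the column names follow the earlier wrangling sketches rather than the notebook's exact code:

def offer_outcomes(customer_events, portfolio):
    """Simplified success check for the BOGO/discount offers one customer received.

    customer_events: this customer's transcript rows with columns
    'event', 'time', 'offer_id', 'amount'. portfolio: offer properties
    indexed by offer_id, with 'duration_h' and 'difficulty'.
    """
    outcomes = []
    received = customer_events[customer_events['event'] == 'offer received']
    for _, row in received.iterrows():
        offer = portfolio.loc[row['offer_id']]
        start, end = row['time'], row['time'] + offer['duration_h']
        window = customer_events[(customer_events['time'] >= start)
                                 & (customer_events['time'] <= end)]
        viewed = window[(window['event'] == 'offer viewed')
                        & (window['offer_id'] == row['offer_id'])]
        spent = window.loc[window['event'] == 'transaction', 'amount'].sum()
        # Success: the offer was viewed and at least the difficulty was spent before expiry
        outcomes.append({'offer_id': row['offer_id'],
                         'received': 1,
                         'viewed': int(len(viewed) > 0),
                         'success': int(len(viewed) > 0 and spent >= offer['difficulty'])})
    return outcomes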

With all this done, we can now answer the initial questions we had, except for the main one (that will come after modelling).

Visualization

About the offers

  • Which one is the longest offer duration?
The maximum duration offers are 4 and 6
With a duration of: 240 h
  • Which one is the most rewarding offer?
The most rewarding offers are 0 and 1
With a reward of: 10 $

About the customers:

  • What is the gender distribution?

There are more males than females.

  • How different genders are distributed with respect to income?

It seems that females and males have more or less the same income in this dataset, but it also seems that there are more women earning above the average than men. (The comparisons are not perfect because there are around 3000 fewer women than men, so the sample of women is less representative.) It is worth extracting more insights, as the distributions seem different. Let us check the Wasserstein distance:

W_dist = 8824.9037

Indeed, the difference is pretty high between the distributions of men and women.
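A sketch of how such a distance can be computed (scipy assumed; the exact preprocessing in the notebook may differ, here the null incomes are simply dropped):

import pandas as pd
from scipy.stats import wasserstein_distance

profile = pd.read_json('profile.json', orient='records', lines=True).dropna(subset=['income'])
income_f = profile.loc[profile['gender'] == 'F', 'income']
income_m = profile.loc[profile['gender'] == 'M', 'income']
print(wasserstein_distance(income_f, income_m))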

Mean salary for women is 70566.20827461942
Mean salary for men is 61741.30454620141
In this dataset, women earn more money than men in average, by: 8824.903728418009 $
Median salary for women is 66000.0
Median salary for men is 63000.0
In this dataset, however, the median is not so far from genders: 3000.0 $
The std of the salary for women is 20981.542952480755
The std of the salary for men is 18774.535245943607
In this dataset, the standard deviations also do not differ much between genders: 2207.007706537148 $
  • How different genders are distributed with respect to age?

W_dist = 4.7037

Indeed pretty similar distributions between female and male.

  • What is the distribution of new memberships along time?

There are noticeable jumps every 2 years (mid-2015 and mid-2017). Perhaps they correspond to new campaigns or improvements in the app. It is also interesting that in 2018 the number of new memberships dropped for the first time, so perhaps new competitors arrived in the market.

About the transactions:

  • Which offers are preferred according to gender?

Most males and females prefer offer 6, which is a discount of 2 dollars after buying products worth 10 dollars, with the longest duration of 10 days, and it reaches customers through all media: mobile, social and web. The second most liked is offer 5, which is also a discount. The third is offer 1 for males (a BOGO) and offer 8 for females (a BOGO). But the differences between female and male preferences are not that large.

  • Which offers are preferred according to income?

There are not so many customers with high income. The target group is people who earn between 50 and 75k. For them, the most preferred offer ids are 5 and 6, in the table below we see they prefer discount offers.

  • Which offers are preferred according to age?

Senior adults are the biggest clientele and prefer offer ids 5 and 6 (discount type in the plot below). This leads me to think that most of them also have a medium-low income.

  • Which offers are preferred according to date of becoming a member?

This is something we could have expected: each offer has the same distribution, which suggests that the time of becoming a member does not have an effect on which offers customers might prefer.

  • Which are the most successful offers?
  • Which are the most profitable offers?

This plot must be taken with a grain of salt. BOGOs are not profitable under our assumptions, as they produce 0 profit, but they serve other purposes and their success is not measured based on profit. The most profitable offers are 7 and 2, which are informational. However, informational offers' profit is based on the spending in a period of time during which there were other offers as well. Among the discounts, we can clearly conclude that offer 6 is the most profitable one.

  • Which are the most profitable offers between informational?
  • How much money was earned in total with offers Vs. without offers?

They have made more money without the offers than with them.

Modeling

We have converted all the categorical values into dummy variables (including the offer ids, which are a category of their own). We have also scaled all numeric values between 0 and 1.
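A minimal sketch of this preprocessing on a tiny stand-in dataframe (the real merged dataset has many more columns and different names):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny stand-in for the merged offer-customer dataset
df = pd.DataFrame({'offer_id': [0, 1, 2, 1],
                   'age': [25, 40, 61, 33],
                   'income': [35000.0, 72000.0, 58000.0, 91000.0],
                   'success': [1, 0, 1, 0]})

df = pd.get_dummies(df, columns=['offer_id'])                      # offer ids as dummy columns
numeric_cols = ['age', 'income']
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])  # scale to [0, 1]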

For classification, the models we have chosen are:

  • SVM
  • Random forests
  • Logistic regression
  • Gradient boosting

We first check which one performs better.
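The comparison looks roughly like the sketch below; synthetic data stands in for the real features so the snippet runs on its own, and the exact model settings are assumptions:

import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, fbeta_score

# Synthetic stand-in for the merged offer-customer features and the success label
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in [SVC(), RandomForestClassifier(), LogisticRegression(max_iter=1000),
              GradientBoostingClassifier()]:
    start = time.time()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(type(model).__name__,
          f"accuracy={accuracy_score(y_test, y_pred):.4f}",
          f"fscore={fbeta_score(y_test, y_pred, beta=0.5):.4f}",
          f"train+test time={time.time() - start:.1f}s")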

They all provide F-scores (beta = 0.5) and accuracy scores of around 0.6 and 0.7 respectively. Nonetheless, the model that performs best is gradient boosting, although only by a hair.

Here is the comparison data training on the full sample:

Training with all samples
SVC
Training time = 164.17853903770447
Testing time = 33.015408754348755
Test Accuracy = 0.6959884635553225
Test Fscore = 0.582545208095869


RandomForestClassifier
Training time = 0.2846059799194336
Testing time = 0.0351099967956543
Test Accuracy = 0.7047718930256948
Test Fscore = 0.5984216117772044


LogisticRegression
Training time = 0.16101694107055664
Testing time = 0.0060138702392578125
Test Accuracy = 0.7007953155042824
Test Fscore = 0.5910921218455107


GradientBoostingClassifier
Training time = 4.785567045211792
Testing time = 0.03660392761230469
Test Accuracy = 0.7081803880440483
Test Fscore = 0.6026470491641762

We picked Gradient Boosting and I wanted to optimize its parameters, so I performed a grid search:

parameters = {
    'learning_rate': [0.1, 0.5, 1],
    'max_depth': [3, 4],
    'n_estimators': [100, 125, 150],
}
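A sketch of how this grid search can be run with scikit-learn, reusing the parameters dict above (X_train and y_train as in the comparison sketch; the F-score-based scoring and cv=3 are assumptions):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    parameters,
                    scoring=make_scorer(fbeta_score, beta=0.5),
                    cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
best_model = grid.best_estimator_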

Nonetheless, max_depth and n_estimators stayed at their defaults, 3 and 100 respectively, and the only difference was the learning rate, which went up to 0.5. The accuracy and F-score barely moved.

Evaluation

The best model to predict whether an offer will be successful is gradient boosting. However, 70% is not such a high accuracy (better than a human guess, though). Grid search did not show much improvement, so further tuning should be carried out. We saw that the learning rate went from 0.1 to 0.5, while the rest of the parameters stayed the same. The next logical step would be to try a learning rate of 0.75 (as 1 was not chosen) and to vary other parameters.

Unoptimized model
------
Accuracy score on testing data: 0.7082
F-score on testing data: 0.6026

Optimized Model
------
Final accuracy score on the testing data: 0.7105
Final F-score on the testing data: 0.6086
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.5, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='auto',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)

Challenges faced

There are two things to note. The first, and minor, one is the amount of time spent wrangling, cleaning and exploring. It is crucial; otherwise the model will not predict anything adequately, or might not work at all. Additionally, it is a process that should be automated as soon as possible.

The second one, which I found most challenging and where I spent most of my time, was feature engineering. Distilling information from the transaction dataset was a tough problem, as offers not only overlap with each other, but the same offer can also overlap with itself, so you could not differentiate one from the other within an interval. A set of assumptions had to be made, as logical and realistic as possible, and based on them an algorithm had to be programmed whose logic would work for any combination of event sequences.

Feature engineering was crucial: without it, you would not know which offers were successful and which were not. Time spent on it is time you save in modeling; tuning your parameters might be a waste of time if (i) your input to the model is not correct, or (ii) your feature engineering did not capture the information as precisely as possible, which limits the achievable accuracy.

Outlook

With respect to the model, we could perform further tuning to improve the scores. Overall, the score is not too high but, of course, it is better than human intuition, as there are many parameters to take into consideration.

Summary of the end-to-end solution

We follow the CRISP-DM process to answer the following question: What is the likelihood that a customer will respond to a certain offer?

We started by wrangling, cleaning and exploring the three datasets given: a portfolio with the characteristics of each offer, a profile with the demographics of the customers, and a ledger of transactions from customers. This allowed us to answer minor questions related to the data.

Because this is a classification problem, we chose accuracy and F-score as our metrics.

Once each individual dataset was pre-processed (including a feature engineering step for the ledger of transactions in order to distill when an offer was a success or not), we merged all of them in order to have all the information in one dataset, which is necessary to feed a classifier model.

We selected the label on which the model would be trained, which was the success attribute. After that, we created the features to feed the model (dropping the label) and compared the performance of different classifiers, namely: random forests, logistic regression, gradient boosting, and support vector machines. Gradient boosting performed the best, so we selected it and performed a grid search to tune its parameters for better performance; however, the scores remained at around 60% for the F-score and 70% for accuracy.

The future work would consist of creating an application, a pipeline, to automate this whole process.

Conclusion

We have wrangled, cleansed and explored 3 datasets and studied their content individually. We have calculated the target label for the model: the success of an offer. We then merged all these datasets, further explored their insights and, finally, prepared and fed them into a model. The model chosen was gradient boosting, and the accuracy and F-score were 0.7 and 0.6 respectively, which are not too high, but given the number of variables it is a better prediction than a human's 1-in-number-of-offers guess.


Gonzalo Munilla Garrido

I am a Ph.D in computer science interested in privacy enhancing techniques, namely differential privacy.