Propensity Score
The propensity score helps us reduce confounding bias when randomization is not possible.
Our Business Objective
We are a team of data scientists at a bubble milk tea company with thousands of franchises. Our marketing team launched a voucher campaign where customers who receive vouchers can redeem them at one of our outlets. The goal of the campaign is for customers with vouchers to spend more than those without. The data science team wants to find out how effective the voucher campaign is.
Why Use Propensity Score
When we start analyzing the sales dataset, we notice that there are 100 other covariates such as number of visits, salary, gender, etc. These covariates might influence customers' spending and cause confounding bias; for example, customers with a higher number of visits might spend more than those with fewer visits. In order to properly measure the effect of the voucher campaign, we need a way to condition on a large number of covariates.
This is where the propensity score comes into play. We can use propensity scoring to compress all the covariates into a single figure; in other words, propensity scoring acts like a dimensionality reduction technique.
How it works
We can use a logistic regression for propensity scoring: we regress the treatment on the covariates, and the predicted probability is the propensity score. Let's say we input all 100 covariates into the logistic regression and it outputs 0.8 for a customer. The figure 0.8 is that customer's propensity score, and it tells us the customer has an 80% chance of receiving the treatment (a voucher) given their covariates.
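As a minimal sketch (the covariate names below are placeholders for the 100 real columns; the worked example with our chosen confounders follows in the implementation section), the idea in code is:
import pandas as pd
import statsmodels.formula.api as smf
df = pd.read_csv('voucher_campaign.csv')
# Placeholder covariate names standing in for the 100 real columns.
covariates = ['num_visits', 'salary', 'gender']
# Logistic regression of the treatment (voucher) on the covariates.
ps_model = smf.logit('voucher ~ ' + ' + '.join(covariates), data=df).fit(disp=0)
# Each predicted probability is that customer's propensity score, e.g. 0.8.
propensity_scores = ps_model.predict(df)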
Imagine we have only two customers: we give customer A a voucher but not customer B. Then we calculate the propensity score for both of them and find that they have similar propensity scores.
| Customer | Voucher | Propensity Score | Spending |
|---|---|---|---|
| A | 1 | 0.8 | 1000 |
| B | 0 | 0.75 | 200 |
This means that they were almost equally likely to be selected for treatment; in other words, the two customers are very similar. We can then attribute customer A's higher spending to the voucher.
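To make this intuition concrete, here is a toy sketch (illustration only, not part of the campaign analysis) that pairs each treated customer with the untreated customer whose propensity score is closest and compares their spending:
import pandas as pd
# Toy data mirroring the table above.
toy = pd.DataFrame({
    'customer': ['A', 'B'],
    'voucher': [1, 0],
    'propensity_score': [0.80, 0.75],
    'spending': [1000, 200],
})
treated = toy[toy['voucher'] == 1]
control = toy[toy['voucher'] == 0]
# For each treated customer, find the control with the closest propensity score
# and use the spending difference as a rough estimate of the voucher effect.
for _, row in treated.iterrows():
    match = control.iloc[(control['propensity_score'] - row['propensity_score']).abs().argmin()]
    print(row['customer'], 'vs', match['customer'], ':', row['spending'] - match['spending'])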
Code Implementation
Back to our business problem. We start by creating a baseline model that fits a linear regression of the outcome on the treatment alone. In this baseline model we do not handle the confounding bias.
import pandas as pd
import statsmodels.formula.api as smf
df = pd.read_csv('voucher_campaign.csv')
# Regress spending on the voucher indicator only (no confounder adjustment).
baseline_model = smf.ols('spending ~ C(voucher)', data=df).fit()
baseline_model.summary().tables[1]
The summary shows a coefficient of 754 on the voucher indicator, which is our naive estimate of the Average Treatment Effect (ATE) before adjusting for any confounders.
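With only the treatment in the regression, this coefficient is just the raw difference in average spending between the two groups, which we can check directly (assuming voucher is coded 0/1):
# The coefficient on voucher equals the difference in group means.
group_means = df.groupby('voucher')['spending'].mean()
print(group_means[1] - group_means[0])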
Next, we remove the confounding bias. Our marketing manager tells us that, based on his experience, only the covariates number of visits and salary are likely to introduce confounding bias. We use a logistic regression model to compute the propensity score from these two covariates.
# Propensity model: probability of receiving a voucher given the two confounders.
propensity_model = smf.logit("voucher ~ num_visits + salary", data=df).fit(disp=0)
# Attach each customer's predicted propensity score to the dataset.
data_ps = df.assign(propensity_score = propensity_model.predict(df))
data_ps.head()
The call to data_ps.head() shows the first five rows of the new dataset, now including the propensity_score column.
Once we have the propensity score, we can remove the confounding bias by fitting a linear regression of the outcome on both the treatment and the propensity score.
# Adjust for confounding by including the propensity score alongside the treatment.
model = smf.ols('spending ~ C(voucher) + propensity_score', data=data_ps).fit()
model.summary().tables[1]
We see that our ATE estimate drops slightly to 660. This is because we have removed the confounding effects of number of visits and salary. We can now report to the marketing team that customers with vouchers spend more than those without.
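If we prefer to pull the point estimate out of the fitted model rather than read it off the summary table, we can access the coefficient directly; the parameter name below assumes voucher is coded 0/1, so patsy labels the treatment term C(voucher)[T.1].
# Extract the estimated treatment effect (coefficient on the voucher indicator).
ate_estimate = model.params['C(voucher)[T.1]']
print(round(ate_estimate, 2))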