Confounding Bias
We are a team of data scientists at an ecommerce coffee company. Our company's marketing team launched an email campaign, sent to 1000 customers, to promote the new Brazilian coffee. Our job is to measure how effective the campaign is. Let's start the task.
Treatment & Outcome
We want to run an experiment to find the effectiveness of the email campaign. We have one group of customers that received the email (the treatment group) and another group that did not (the control group). We want to find out whether the customers bought the Brazilian coffee during their next visit to our ecommerce store (converted). In this experiment, the email is the treatment and conversion is the outcome.
Association is not Causation
Customer | Email | Conversion |
---|---|---|
A | Yes | 1 |
B | Yes | 1 |
C | Yes | 1 |
D | Yes | 0 |
E | No | 0 |
F | No | 1 |
G | No | 0 |
H | No | 0 |
This is our experiment result. Now we want to find out whether the treatment is effective. A simple way to do this is to compare the average conversion rate of the customers that received the email, \(E[Conversion|Email = Yes]\), with that of the customers that did not, \(E[Conversion|Email = No]\).
\(E[Conversion|Email = Yes] = \frac{1 + 1 + 1 + 0}{4} = 0.75 \)
\(E[Conversion|Email = No] = \frac{0 + 0 + 1 + 0}{4} = 0.25 \)
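The same comparison can be sketched in a few lines of Python. This is only an illustration: the rows are copied from the table above, reading its second column as whether the customer received the email and its third as conversion, and the helper name `conversion_rate` is made up for this example.

```python
# Rows from the table above: (customer, received_email, converted)
results = [
    ("A", True, 1), ("B", True, 1), ("C", True, 1), ("D", True, 0),
    ("E", False, 0), ("F", False, 1), ("G", False, 0), ("H", False, 0),
]

def conversion_rate(rows, received_email):
    """Average conversion among customers with the given email status."""
    outcomes = [converted for _, email, converted in rows if email == received_email]
    return sum(outcomes) / len(outcomes)

rate_email = conversion_rate(results, True)      # E[Conversion | Email = Yes]
rate_no_email = conversion_rate(results, False)  # E[Conversion | Email = No]

print(rate_email, rate_no_email, rate_email - rate_no_email)  # 0.75 0.25 0.5
```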
It looks like the customers that received the email have a higher conversion rate. Can we jump to the conclusion that our email marketing campaign is a huge success?
Not so fast.
Let’s say we are skeptical of the result and decide to dig further and look at other information about our customers.
Customer | Email | Conversion | Number of visits |
---|---|---|---|
A | Yes | 1 | 13 |
B | Yes | 1 | 11 |
C | Yes | 1 | 12 |
D | Yes | 0 | 10 |
E | No | 0 | 2 |
F | No | 1 | 1 |
G | No | 0 | 3 |
H | No | 0 | 1 |
It seems that our marketing team sent the emails to our loyal, coffee-loving customers. How can we be sure that it is the email that makes them want to purchase? Perhaps these customers love coffee so much that they would buy our new Brazilian coffee regardless of the email campaign. Hence we cannot be sure that it is the email campaign that caused the increase in conversion rate.
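One quick way to see the imbalance is to compare the average number of visits in each group. The sketch below copies the email status and visit counts from the table above; the variable names are chosen for this example only.

```python
from statistics import mean

# (customer, received_email, number_of_visits) from the table above
visits = [
    ("A", True, 13), ("B", True, 11), ("C", True, 12), ("D", True, 10),
    ("E", False, 2), ("F", False, 1), ("G", False, 3), ("H", False, 1),
]

# Average number of visits in the treatment group and in the control group
avg_visits_email = mean(v for _, email, v in visits if email)
avg_visits_no_email = mean(v for _, email, v in visits if not email)

print(avg_visits_email, avg_visits_no_email)  # 11.5 1.75
```

The customers that received the email visit the store far more often on average, so the two groups differ in more than just the email.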
Confounding bias
The number of visits influences both who receives the email and the conversion rate, so it interferes with our measurement of the email campaign's effect. We call the number of visits a confounder.
Let's represent what we did earlier mathematically.
"We select all customers that get email and get the average conversion"
\(E[Conversion | Email = Yes] = E[Y_1 | T = 1]\)
"We select all customers that do not get email and get the average conversion"
\(E[Conversion | Email = No] = E[Y_0 | T = 0]\)
The difference between these two averages, which is the association we measured earlier, is
\(E[Y_1 | T = 1] - E[Y_0 | T = 0]\)
We add and subtract a new term \(E[Y_0|T=1]\), which means we take the customers that did get the email and travel back in time. This time we do not send them the email, so their conversion is \(Y_0\).
\(E[Y_1 | T = 1] - E[Y_0 | T = 0] + E[Y_0|T=1] - E[Y_0|T=1] \)
Rearranging the terms
\(\underbrace{E[Y_1 | T = 1] - E[Y_0|T=1]}_{A} + \underbrace{E[Y_0|T=1] - E[Y_0 | T = 0]}_{B} \)
Part A of the equation means we take the customers that got the email \((T=1)\) and measure their conversion rate \((Y_1)\), then we travel back in time, do not send them the email, and measure their conversion rate again \((Y_0)\). This part of the equation tells us the treatment effect on the treatment group, the average treatment effect on the treated (ATT).
\(E[Y_1 | T = 1] - E[Y_0|T=1] = E[Y_1 - Y_0 | T = 1] = ATT \)
Part B of the equation means we take the customers that received the email \((T=1)\), travel back in time, and this time do not send the email, then measure their conversion \((Y_0)\), which gives \(E[Y_0 | T = 1]\). Next we take the customers that did not get the email \((T=0)\) and find their conversion rate \((Y_0)\), which gives \(E[Y_0 | T = 0]\). The difference between these two tells us how much the conversion of the two groups would differ if none of the customers received the email. This part of the equation is the bias.
\(E[Y_0 | T = 1] - E[Y_0|T=0]\)
Our final equation looks like this
\(E[Y_1 | T = 1] - E[Y_0 | T = 0] = E[Y_1 - Y_0 | T = 1] + E[Y_0 | T = 1] - E[Y_0|T=0]\)
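To make the decomposition concrete, here is a small numeric check. It is a sketch under a strong assumption: we pretend we can see both potential outcomes for every customer, which is never possible with real data. The observed outcomes match the first table, while the counterfactual values are invented purely for illustration.

```python
# (customer, T, Y0, Y1): conversion without the email (Y0) and with it (Y1).
# HYPOTHETICAL: in real data only one of Y0 / Y1 is observed per customer.
# Observed values (Y1 for T=1, Y0 for T=0) follow the first table;
# the counterfactual values are made up for this example.
customers = [
    ("A", 1, 1, 1), ("B", 1, 1, 1), ("C", 1, 0, 1), ("D", 1, 0, 0),
    ("E", 0, 0, 0), ("F", 0, 1, 1), ("G", 0, 0, 1), ("H", 0, 0, 0),
]

def mean(values):
    return sum(values) / len(values)

treated = [c for c in customers if c[1] == 1]
control = [c for c in customers if c[1] == 0]

# E[Y_1 | T=1] - E[Y_0 | T=0]: what we can actually measure
observed_diff = mean([y1 for _, _, _, y1 in treated]) - mean([y0 for _, _, y0, _ in control])
# E[Y_1 - Y_0 | T=1]: treatment effect on the treated
att = mean([y1 - y0 for _, _, y0, y1 in treated])
# E[Y_0 | T=1] - E[Y_0 | T=0]: how the groups differ without any email
bias = mean([y0 for _, _, y0, _ in treated]) - mean([y0 for _, _, y0, _ in control])

print(observed_diff, att, bias)  # 0.5 0.25 0.25
assert observed_diff == att + bias
```

Under these invented counterfactuals, half of the observed difference of 0.5 is bias rather than the effect of the email.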
Identify Treatment Effect
CASE 1: The 2 groups have different numbers of loyal and normal customers.
Group 1 (T=1): 900 loyal customers and 100 normal customers
Group 2 (T=0): 200 loyal customers and 800 normal customers
If we take the customers in group 1, travel back in time, and this time do not send them the email, we can find their average conversion \(E[Y_0|T = 1]\). If we then compare it with that of group 2, \(E[Y_0|T = 0]\), we will see that the average conversions are not the same even though neither group got the email. This is because group 1 consists mostly of loyal customers who love coffee and are keener to try the new Brazilian coffee than normal customers, regardless of the email.
\(E[Y_0|T = 0] \neq E[Y_0|T = 1]\)
CASE 2: The 2 groups have the same numbers of loyal and normal customers.
Group 1 (T=1): 600 loyal customers and 400 normal customers
Group 2 (T=0): 600 loyal customers and 400 normal customers
In this case, both groups have the same number of loyal and normal customers (the distributions of the 2 groups are similar). So if we take group 1, travel back in time, and do not send them the email, their average conversion \(E[Y_0|T = 1]\) will be similar to that of group 2, \(E[Y_0|T = 0]\).
\(E[Y_0|T = 0] = E[Y_0|T = 1]\)
In other words, if the customers that got the email are similar to the customers that did not, then we can be confident that the increase in conversion is due to the email campaign (the treatment) and not to other factors such as customer loyalty. Only then can we safely conclude that association is causation.
If the treatment and control groups are similar, the bias term is zero and we can conclude that the change in outcome is due to the treatment effect.
\(E[Y | T = 1] - E[Y | T = 0] = E[Y_1 - Y_0 | T = 1] = ATT\)
Removing bias
We can remove the bias by running the experiment again, but this time, instead of sending emails only to loyal customers, we select the recipients at random. In other words, we make sure the treatment assignment mechanism is randomized.
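A minimal sketch of random assignment, using only Python's standard library, might look like the following. The function name `assign_treatment`, the seed, and the eight customer IDs are assumptions for this example, not a description of how the marketing team actually picks recipients.

```python
import random

def assign_treatment(customer_ids, n_treated, seed=None):
    """Randomly pick n_treated customers to receive the email.

    Returns a dict mapping customer id -> True (email) / False (no email).
    """
    rng = random.Random(seed)  # seeded generator so the draw is reproducible
    treated = set(rng.sample(customer_ids, n_treated))
    return {cid: cid in treated for cid in customer_ids}

customers = list("ABCDEFGH")  # hypothetical customer IDs
assignment = assign_treatment(customers, n_treated=4, seed=42)

for cid, gets_email in assignment.items():
    print(cid, "Yes" if gets_email else "No")
```

Because the draw ignores everything we know about the customers, loyal and normal customers are equally likely to land in either group.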
Our new experiment results look like this:
Customer | Email | Conversion | Number of visits |
---|---|---|---|
A | Yes | 1 | 13 |
B | Yes | 0 | 11 |
C | Yes | 1 | 4 |
D | Yes | 1 | 2 |
E | No | 0 | 2 |
F | No | 0 | 11 |
G | No | 1 | 2 |
H | No | 1 | 13 |
To check that the treatment and control groups are truly comparable, we can do a sanity check on other variables such as gender and age.
Customer | Email | Conversion | Number of visits | Gender | Age |
---|---|---|---|---|---|
A | Yes | 1 | 13 | Male | 33 |
B | Yes | 0 | 11 | Male | 31 |
C | Yes | 1 | 4 | Female | 35 |
D | Yes | 1 | 2 | Female | 29 |
E | No | 0 | 2 | Male | 33 |
F | No | 0 | 11 | Male | 35 |
G | No | 1 | 2 | Female | 31 |
H | No | 1 | 13 | Female | 32 |
Both the control and treatment groups look balanced, so now we can calculate the average conversion for each group.
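The sanity check can be sketched as a simple comparison of group averages. The rows below are copied from the table above; the helper name `summarize` is made up for this example, and with only eight customers the check is illustrative rather than a formal test of balance.

```python
from statistics import mean

# (customer, received_email, visits, gender, age) from the table above
rows = [
    ("A", True, 13, "Male", 33), ("B", True, 11, "Male", 31),
    ("C", True, 4, "Female", 35), ("D", True, 2, "Female", 29),
    ("E", False, 2, "Male", 33), ("F", False, 11, "Male", 35),
    ("G", False, 2, "Female", 31), ("H", False, 13, "Female", 32),
]

def summarize(rows, received_email):
    """Average visits, average age, and share of female customers in one group."""
    group = [r for r in rows if r[1] == received_email]
    return {
        "avg_visits": mean(r[2] for r in group),
        "avg_age": mean(r[4] for r in group),
        "share_female": sum(r[3] == "Female" for r in group) / len(group),
    }

treatment_summary = summarize(rows, True)
control_summary = summarize(rows, False)

print(treatment_summary)  # avg_visits 7.5, avg_age 32, share_female 0.5
print(control_summary)    # avg_visits 7, avg_age 32.75, share_female 0.5
```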
\(E[Conversion|Email = Yes] = \frac{1 + 0 + 1 + 1}{4} = 0.75 \)
\(E[Conversion|Email = No] = \frac{0 + 0 + 1 + 1}{4} = 0.5 \)
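The treatment group has a higher conversion rate than the control group (0.75 versus 0.5). Because the two groups are now comparable, we can attribute the difference of 0.25 to the email and conclude that the email marketing campaign is effective.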