Bayesian Ab Testing

  32 min read

Introduction

A/B Testing

A/B Testing is used to compare several items and select the best option based on a quantitative metric.

Some applications include:

  • Testing a new medicine to see if it works. We test if it works by having a control group vs treatment group.
  • Designing several options for a website to see which one brings the most conversion rate.
  • Comparing the ratio of failures of hardware from two vendors.

Frequentist vs Bayesian approach

Frequentist approach:

  • Frequentist assume there is a true value for a given population parameter.
  • Inference exists to estimate it to a given level of confidence: \(\hat{\theta} = \text{argmax}_\theta P(X \| \theta)\).

Bayesian approach:

  • Bayesian assume the given population parameter is a random variable.
  • Inference exits to calculate its posterior probability distribution from the data we have collected: \(P(\theta \| X)\).

Sampling methods:

  • Importance sampling
  • Markov chain Monte Carlo (MCMC), typically used to approximate large multi-dimensional integrals

Limits of Frequentist Statistics

General limitations

  • Terms and definitions are hard to understand for non-technical - and even many technical - people.
  • The experiment needs to run for the appropriate sample size before we can draw conclusions without running the risk of false positives.
  • Failing to reject the null hypothesis is NOT the same as accepting it. Maybe the sample size was just to small or the variance very large.

Estimating a population parameter

We use the CLT to estimate the value of a population paramater to a given level of confidence. The variance of the sampling distribution of the sample statistic is a Gaussian Distribution of variance $\sigma^2 / N$, where $\sigma^2$ is the variance of the population paramater.

Drawbacks:

  • if $\sigma$ is large, you need large samples to get an accurate view
  • we usually don’t know $\sigma$, but we can use the sample variance $\hat{\sigma}$ instead (or use the t-distribution for more precise CI)

Limits of p-value

The example below shows the evolution of the p-value of Chi-Square tests comparing two ads over time. We see that the p-value can vary dramatically from one experiment to the next:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

class Clickthrough:
    
    def __init__(self, ctr_a, ctr_b, n_visits):
        '''
        init the true ctr of two ad campaigns
        and the empty df that will store the observed ctr
        and the p-value of the chi-square test after each visit
        '''
        self.ctr_a = ctr_a
        self.ctr_b = ctr_b
        self.visits = 0
        self.visits_a = np.array([0, 0])
        self.visits_b = np.array([0, 0])
        self.pval = np.zeros(n_visits)

    def visit(self):
        '''
        simulates a visit that have a self.true_ctr likelihood
        of doing a ctr
        '''
        # random visit result
        self.visits += 1
        click_a = (np.random.random() < self.ctr_a)
        click_b = (np.random.random() < self.ctr_b)
        
        # save results
        self.visits_a += [click_a, 1 - click_a]
        self.visits_b += [click_b, 1 - click_b]
        
        # calculate p-value of chi-square test
        self.calculate_pval()
      
    def calculate_pval(self):
        '''
        calculate the p-value of a chi-square test checking if ad_A and ad_B have 
        the same ctr
        '''
        if self.visits > 10:
            _, pval, _, _ = chi2_contingency(np.array([self.visits_a, self.visits_b]), correction=False)
            self.pval[self.visits - 1] = pval
        
    def display_results(self):
        '''
        display the respective clicks of ad_A and ad_B in a nice way
        '''
        df = pd.DataFrame(
            np.array([self.visits_a, self.visits_b]), 
            index=['ad_A', 'ad_B'],
            columns=['Click', 'No Click']
        )
        return df
    
    def plot_pvalues(self):
        plt.plot(self.pval)
        plt.plot(np.ones(self.pval.shape[0]) * 0.05)
        axes = plt.gca()
        axes.set_ylim([0,1])
        plt.show()
n_visits = 20000
ctr = Clickthrough(0.1, 0.11, 20000)
for visits in range(20000):
    ctr.visit()
    
ctr.plot_pvalues()

png

n_visits = 20000
ctr = Clickthrough(0.1, 0.11, 20000)
for visits in range(20000):
    ctr.visit()
    
ctr.plot_pvalues()

png


Probability & Bayes Theorem

Definitions

Probability is the likelihood of a given event occuring for each experiment of a given sample space.

Draw $r$ times from a population of $n$ (with or without repetition):

  • Permutations: order matters
  • Combination: order does not matter


Relationship between events:

  • Marginal probability: P(A)
  • Conditional probability P(B|A): probability of B given A
  • Joint probability P(A ∩ B): Intersection of two events A and B
  • Union of two events P(A ∪ B): A or B
  • Complement of an event P(A’): not A

Note:

  • rule of substraction: P(A) = 1 - P(A’)
  • rule of addition: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
  • rule of multiplication: P(A ∩ B) = P(A) P(B|A)
  • rule of multiplication: P(A ∩ B ∩ C) = P(A) P(B|A) P(C|AB)


Events are:

  • Independant when the outcome of one has no effect on others
  • Dependant when the outcome of one has some effect on others (example: draw without replacement)

Note for independant events:

  • P(B|A) = P(B) so P(A ∩ B) = P(A) P(B)

Bayes Theorem

We know that P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B). It means that (Bayes Theorem):

  • P(A|B) = P(B|A) $\times$ P(A) / P(B)

Where:

  • P(B) = sum [ P(B|A) $\times$ P(A) ] for all A (or integral for continuous data)

Notes:

  • probabilities equal to one when considering the space of random variables. The sum of P(B|A) will be equal to one for each given value of A.
  • P(B) does not depend on A, so $\text{argmax}_A P(A|B) = \text{argmax}_A P(B|A) P(A)$

Curse of Dimensionality

Joint probabilities increase exponentially with the number of variables:

  • requires more computational power
  • requires more data to get accurate estimates
  • leads to exponentially small probabilities (computers having finite precision lead to underflow issues)
  • log probability can be used to mitigate this issue

Bayes Theorem Examples

Diagnostic Tools

Bayes Theorem determines the probability of a parameter given a certain event. It is used extensively in fields with diagnostic tools that are only effective to a certain degree.

  • a company learns that 0.2% of their products are defective.
  • they buy a diagnotic tool that is correct 99% of the time.
  • if a part is diagnosed as defective, what is the probability that is is really defective?

Let’s use the following notations:

  • OK when not defective, NOK when defective
  • DI when diagnosed as not defective, NDI when diagnosed as defective

Bayes Theorem states that:

  • P(NOK|NDI) = P(NDI|NOK) P(NOK) / P(NDI)

Where:

  • P(NDI|NOK) = 0.99 - true positive
  • P(NOK) = 0.002
  • P(NDI) = P(NDI|NOK) x P(NOK) + P(NDI|OK) x P(OK) = true positive + false positive = 0.99 x 0.002 + 0.01 x 0.998

It means that:

  • P(NOK|NDI) = 16.5%

If we run a second test that is also positive on the same product, we can update the probability of it being really defective using the result of the first test:

  • P(NOK) = 16.5%
  • P(NOK|NDI) = 95.1%

Coin tosses

We want to toss a fair coin 200 times. After 20 tosses, we have 15H and 5T. What is the expected end result?

  • it is not 100H and 100T, because we have already tossed 20 times.
  • it has become 15 + (200 - 20) / 2 = 105H and 5 + (200 - 20) / 2 = 95T.

Monty Hall Problem

A show has three doors to choose from, and behind one of them is a car. After you choose one door, the host reveals another one and asks you if you want to switch. Should you do it?

Let’s imagine you choose door N°1 and the host opens door at door you did not pick, for instance N°2. You want to calculate the probabilities of P(D1 | H2) and P(D3 | H2), where P(Dx) is the probability that the car is behind door N°x and H2 the door that the host has chosen.

  • P(Dx | H2) = P(H2 | Dx) * P(Dx) / P(H2)
  • P(H2) = P(H2 | D1) * P(D1) + P(H2 | D2) * P(D2) + P(H2 | D3) * P(D3)

Let’s look at P(H2 | Dx):

  • P(H2 | D1) = 0.5 - the host could have chosen either door N°2 or door N°3
  • P(H2 | D2) = 0 - the host would not reveal the car
  • P(H2 | D3) = 1 - the host would always avoid revealing the door that the contestant chose if possible

A simple calculation gives P(D1 | H2) = 1/3 and P(D3 | H2) = 2/3: you should switch doors.


Bayesian Statistics

The bayesian approach treats $\theta$ as a random variable.

  • $P(\theta | X) = P(X | \theta) \times P(\theta) / P(X)$
  • $P(X)$ does not depend on $\theta$. It’s called the normalizing constant $Z$.
  • $P(\theta)$ is called the prior: the old belief about $\theta$
  • $P(X | \theta)$ is called the likelihood: how likely is the data we have collected given the current $\theta$
  • $P(\theta | X)$ is called the posterior: the new belief about $\theta$ after seeing the data.

Note: $P(X)$ is the integral over $P(X | \theta) P(\theta) d\theta$ and is impossible to solve without sampling methods like MCMC.

Conjugate Priors

If we choose specific distributions for $P(X | \theta)$ and $P(\theta)$, called conjugate priors, we can make the posterior $P(\theta | X)$ follow the same type of distribution as the prior $P(\theta)$.

For instance, if $P(X | \theta)$ is Bernoulli, then $\theta$ follows a distribution going from 0 to 1. Its PDF can be modelled by $P(\theta) = Beta(a, b)$. Combining both Bernoulli likelihood function and $\theta$ distribution, we can show that $P(\theta | X)$ follows a Beta distribution:

  • $P(\theta | X) = Beta(a’, b’)$
  • $a’ = a + \text{count of successes}$, $b’ = b + \text{count of failures}$.

We can choose the original values $a = b = 1$: this amounts to chooses a non-informative prior $P(\theta) = Beta(1, 1) = \text{Uniform(0, 1)}$. As the sample size increases, the influence of prior becomes negligible.

Note: $E(Beta(a, b) = \frac{a}{a+b}$; its variance decreases as a and b get bigger.

# examples of beta pdf
import numpy as np
from matplotlib import pyplot as plt 
from scipy.stats import beta

x = np.linspace(0, 1, 100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12,4))

ax1.set_title("same a / a + b, different a + b")
ax1.plot(x, beta.pdf(x, 2, 4), label='2, 4')
ax1.plot(x, beta.pdf(x, 10, 20), label='10, 20')
ax1.plot(x, beta.pdf(x, 50, 100), label='50, 100')
ax1.legend(loc='upper right')

ax2.set_title("differents means and variances")
ax2.plot(x, beta.pdf(x, 6, 12), label='low mean, high variance')
ax2.plot(x, beta.pdf(x, 100, 50), label='high mean, low variance')
ax2.legend(loc='upper left')
<matplotlib.legend.Legend at 0x220b28a8d08>

png

Note: We can compare the probability of two independent events $\mu_1$ and $\mu_2$.

For $\mu_1 > \mu_2$, it is the area under the joint PDF $p(\mu_1 \cap \mu_2) = p(\mu_1) \times p(\mu_2)$ where $\mu_1 > \mu_2$.

This lets us stop the experiment when the expected value of the loss function $max(\mu_2 - \mu_1, 0)$ is smaller than a given threshold.


A/B Testing

Explore vs Exploit

The Multi-Armed Bandits states that you are in a casino playing slots. Three machines have three different pay rates: 30%, 20%, and 10% respectively.

Frequentist Statistics would:

  • perform a Chi-Square test after $N$ trials
  • calculate the p-value
  • identify if one of them is significantly better than the others

But if you were to test it, you would:

  • try each machine a few times
  • collect data
  • adapt your strategy to your conclusions

A typical strategy would be to:

  • explore at random to collect more data,
  • or exploit the one that seems to give the best results

Epsilon-Greedy Algorithm

The goal is to adapt the A/B testing over time: show the ad performing better more often.

  • explore with probability epsilon: show random ad
  • exploit otherwise: show best ad

This algorithm has some limitations:

  • it still shows underperforming ads
  • the probability of showing the best option is $\epsilon \times 1/ads + (1 - \epsilon) = 1 - \epsilon \times (1 - 1/ads)$
  • in other words, the loss is proportional to $N$

UCB1 - Upper Confidence Bound

We use another way of building confidence intervals: Chernoff-Hoeffding bounds.

  • CLT: $P(\mu \in \left[\hat{\mu} \pm F(N)\right]) = 0.95$
  • Chernoff-Hoeffding: $P(\mu \in \left[\hat{\mu} \pm \epsilon\right]) > F(N, \epsilon)$

We can use this confidence interval to bound the true pay rates of each arm of the problem. We can maximize the likelihood of winning by using the upper bound of each mean $\mu_j + \epsilon_j$:

\[\epsilon_j = \sqrt{\frac{2 \ln(N)}{N_j}}\]

The UCB1 calculates the upper value of each mean, then shows the ad with the highest number:

  • explore ads that have not been shown much (low $N_j$ compared to $N$)
  • exploit ads with high mean

Notes:

  • $\epsilon_j$ converges to zero when $N$ becomes very large
  • the loss is proportional to $\ln(N)$

Thompson Sampling

Assuming the likelihood is Bernoulli and the prior is Beta, we have shown that the posterior is also Beta.

We can draw random numbers from each Beta distribution: as the sample size increases, their mean will get closer to the actual parameter value and their variance will decrease. It means that:

  • the more confident we are, the closer from the actual parameter value the random number will be
  • the less confident we are, the more random this random number will be, ie. it will be more exploratory

Thompson sampling draws a random number from all current Beta distributions, then show the ad that drew the highest number.

  • explore ads with high variance
  • exploit ads with high mean and low variance


Remember the conjugate priors idea for Bernoulli distributions: a’ and b’ count the number of successes and failures that happened since we last updated the prior:

  • $P(\theta) = Beta(a, b)$
  • $P(\theta | X) = Beta(a’, b’)$
  • $a’ = a + \text{count of successes}$, $b’ = b + \text{count of failures}$.

We update a’ and b’ at each step, which means that we update a and b at every step: we leverage our posterior to udpate the new prior $P(\theta_{t+1})$.

  • $P(\theta_t) = Beta(a_t, b_t)$
  • $P(\theta_t | X_t) = Beta(a’_t, b’_t)$ where $a’_t = a_t + isSuccess$, $b’_t = b_t + isFailure$
  • $P(\theta_{t+1}) = Beta(a’_t, b’_t)$

Implementation

import numpy as np
import random
import matplotlib.pyplot as plt

class MAB:
    '''
    Multi-Armed-Bandit problem
    '''
    
    def __init__(self, N=5000):
        '''
        Initialize class with:
        + fixed ads + theoritical CTR
        + index of best ad based on CTR
        + array to track ads served and CTR for each ad
        '''
        # ads
        self.ads = np.array([['A', 0.75],['B', 0.55],['C', 0.65]])
        self.ads_count = self.ads.shape[0]
        self.best_ad_idx = self.ads[:, 1].astype(float).argmax()

        # number of trials in the experiment
        self.N = N
        
        # ads served during the experiment
        (self.ads_served, self.clicks) = self.initialize_experiment(inplace=False)
        
        # snapshots of thompson betas - plotting
        self.x = np.linspace(0, 1, 100)
        self.thompson_betas = {}
        
        
    
    def initialize_experiment(self, inplace=True):
        '''
        resets self.ads_served to zero.
        inplace for more clarity in the init phase
        + ads_served[0]: total visits
        + ads_served[1]: total clicks
        + clicks: captures if click or no click at each visit
        '''
        # ads_served[0]: total visits
        # ads_served[1]: total clicks
        
        # we initialize the visits at one to handle the ucb1 algorithm
        ads_served = np.zeros((self.ads_count, 2))
        ads_served[:, 0] = 1
        
        clicks = np.zeros(self.N)
        if inplace:
            self.ads_served = ads_served
            self.clicks = clicks
        else:
            return (ads_served, clicks)
        
        
    def new_ad_served(self, n, selected_ad):
        '''
        adds a new count for the [selected_ad]
        plus a new result that follows its theoritical CTR
        '''
        self.ads_served[selected_ad][0] += 1
        
        is_click = np.random.random() < self.ads[selected_ad, 1].astype(float)
        self.clicks[n] = is_click
        if is_click:
            self.ads_served[selected_ad][1] += 1
        
        
    def select_ad_epsilon_greedy(self, eps, **kwargs):
        '''
        shows either a random ad 
        or the best existing one
        '''
        if np.random.random() < eps:
            selected_ad = random.randrange(self.ads_count) # explore: show random ad
        else:
            mus = self.ads_served[:, 1] / self.ads_served[:, 0]
            selected_ad = mus.argmax() # exploit: show best ad
        
        return selected_ad

    
    def select_ad_ucb1(self, n, **kwargs):
        '''
        shows the ad with the highest upper bound for its mean value
        '''
        # n+1 because we have initialized all values first
        mus = self.ads_served[:, 1] / self.ads_served[:, 0]     # observed CTRs
        ucb = np.sqrt(2 * np.log(n+1) / self.ads_served[:, 0])  # UCB values
        decision_values = mus + ucb
            
        selected_ad = decision_values.argmax()
        return selected_ad
    
    
    def select_ad_thompson(self, n):
        '''
        shows the ad with the highest number 
        drawn from its beta distribution at random;
        snapshots beta distributions at regualr intervals
        '''
        # random numbers
        rnd_draws = np.empty(self.ads_count)

        # calculate a & b parameters
        beta_a = 1 + self.ads_served[:, 1]
        beta_b = 1 + (self.ads_served[:, 0] - self.ads_served[:, 1])

        # draw a random number from each distribution
        for ad in range(self.ads_count):
            rnd_draws[ad] = np.random.beta(beta_a[ad], beta_b[ad])
            
        selected_ad = rnd_draws.argmax()
        
        # saves beta distribution
        if n in [10, 25, 50, 200, 500, 1000]:
            self.snapshot_thompson_betas(n, beta_a, beta_b)

        return selected_ad
            
            
    def select_ad(self, method, n, **kwargs):
        '''
        select the ad to display based on the desired method
        '''
        if method == 'epsilon_greedy':
            selected_ad = self.select_ad_epsilon_greedy(**kwargs)
        elif method == 'ucb1':
            selected_ad = self.select_ad_ucb1(n)
        elif method == 'thompson':
            selected_ad = self.select_ad_thompson(n)
            
        return selected_ad
        
        
    def run_experiment(self, method, verbose=True, **kwargs):
        '''
        run the experiment using the selected method
        using the select_ad API
        '''
        # resets self.ads_served
        self.initialize_experiment()
        
        for n in range(self.N):
            selected_ad = self.select_ad(method, n, **kwargs)
            self.new_ad_served(n, selected_ad)
            
        if verbose:
            print(self.ads_served.astype(int))
            print('ads {} served: {:.0f}'.format(
                self.ads[self.best_ad_idx, 0],           # best ad name
                self.ads_served[self.best_ad_idx, 0],    # actual count for best ad
            ))
            

    def snapshot_thompson_betas(self, n, beta_a, beta_b):
        '''
        snapshot the current beta distributions of our ads
        (thompson sampling)
        '''
        ads_betas = np.empty((100, self.ads_count))
        
        for ad in range(self.ads_count):
            ads_betas[:, ad] = beta.pdf(self.x, beta_a[ad], beta_b[ad])

        self.thompson_betas[n] = ads_betas
                    
            
    def plot_thompson_snapshots(self):
        '''
        show beta distributions
        '''
        
        # init plot fig
        fig = plt.figure(figsize=(16,8))
        plot_shape = [2, 3]
        idx = 1
        
        # add ax for each snapshot
        for (obs_visits, obs_betas) in self.thompson_betas.items():
            ax1 = fig.add_subplot(plot_shape[0], plot_shape[1], idx)
            ax1.plot(self.x, obs_betas)
            ax1.set_title(obs_visits)
            idx += 1
                
        plt.show()
        
        
    def plot_ctr(self):
        '''
        show the ctr at each trial of the experiment
        '''
        cumulative_average_ctr = np.cumsum(self.clicks) / (np.arange(self.N) + 1)
        
        plt.plot(cumulative_average_ctr)
        for ad in range(self.ads_count):
            plt.plot(np.ones(self.N) * self.ads[ad, 1].astype(float))
        plt.gca().set_ylim((0,1))
        plt.xscale('log')
        plt.show()
mab = MAB(N = 30001)
mab.run_experiment('epsilon_greedy', eps=0.1)
mab.plot_ctr()
[[26790 20093]
 [ 1126   642]
 [ 2088  1395]]
ads A served: 26790

png

mab.run_experiment('ucb1')
mab.plot_ctr()
[[28488 21348]
 [  434   242]
 [ 1082   690]]
ads A served: 28488

png

mab.run_experiment('thompson')
mab.plot_ctr()
[[29731 22313]
 [   78    44]
 [  195   123]]
ads A served: 29731

png

mab.plot_thompson_snapshots()

png


Appendix

Examples of Beta Distribution

We want to measure the clickthrough rate of our website. The more data we have, the more confident we are of its value. In other words, the confidence interval narrows when the sample size increases.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

class Clickthrough:
    
    def __init__(self, true_ctr):
        '''
        init the ctr class true ctr
        and observations (total and actual ctr)
        '''
        self.true_ctr = true_ctr
        self.obs_visits = 0
        self.obs_clicks = 0
        self.obs_ctr = 0
        
        self.x = np.linspace(0, 1, 200)
        self.betas = {}
        
    def beta(self):
        '''
        saves the beta distribution of shape parameters
        self.obs_clicks and self.obs_visits
        '''
        y = beta.pdf(self.x, self.obs_clicks, self.obs_visits)
        self.betas[self.obs_visits] = [self.obs_ctr, y]

    def visit(self):
        '''
        simulates a visit that have a self.true_ctr likelihood
        of doing a ctr
        '''
        # random visit result
        self.obs_visits += 1
        visit_result = (np.random.random() < self.true_ctr)
        if visit_result:
            self.obs_clicks += 1

        # current observed ctr
        self.obs_ctr = self.obs_clicks / self.obs_visits
        
        # saves beta distribution
        if self.obs_visits in [0, 5, 10, 25, 50, 200, 500]:
            self.beta()
        
    def plot(self):
        '''
        show beta distributions
        '''

        # init plot fig
        fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10,5))
        for (obs_visits, [obs_ctr, obs_beta]) in self.betas.items():
            ax.plot(self.x, obs_beta, label = '{} visits - obs_ctr: {:.2}'.format(obs_visits, obs_ctr))
            ax.legend()

        plt.show()
ctr = Clickthrough(0.3) 
for visits in range(501):
    ctr.visit()
    
ctr.plot()

png

CLT Confidence Interval vs Beta Distribution

Let’s consider a Bernoulli experiment of known probability of success.

At each step of the following experiment, we can display:

  • the sampling distribution of the sample probability (gaussian of known standard error)
  • the posterior beta distribution of the population probability based on the prior observations
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import beta, norm

# parameters
true_ctr = 0.5          # we assume a fair coin
a, b = 1, 1             # beta priors
T = 501                 # number of coin tosses
results = np.empty(T)   # results of each coin toss

# init plot fig
x = np.linspace(0, 1, 200)
plot_indices = (10, 20, 30, 50, 100, 500)
fig = plt.figure(figsize=(16,8))
plot_shape = [2, 3]
idx = 1

    
# trials
for i in range(T):
    
    # coin toss result
    result = 1 if np.random.random() < true_ctr else 0
    results[i] = result

    # update a and b
    a += result
    b += 1 - result

    if i in plot_indices:
        
        # add plot
        ax1 = fig.add_subplot(plot_shape[0], plot_shape[1], idx)
        
        # maximum likelihood estimate of ctr
        n = i + 1 # number of samples collected so far
        p = results[:i].mean()
        std = np.sqrt(p*(1-p)/n)

        # gaussian
        g = norm.pdf(x, loc=p, scale=std)
        ax1.plot(x, g, label='Gaussian Approximation')

        # beta
        posterior = beta.pdf(x, a=a, b=b)
        ax1.plot(x, posterior, label='Beta Posterior')
        ax1.set_title("N = %s" % n)
        
        idx +=1
        
        # get legend from the first subplot
        h, l = ax1.get_legend_handles_labels()

# apply the values to the graph legend
fig.legend(h, l, loc='lower center', ncol=3)

# show
plt.show()

png

CTR Example

#chi-square example
import pandas as pd
import scipy.stats as stats 

csv_path = r'..\courses material\Bayesian Machine Learning\advertisement_clicks.csv'
ctr = pd.read_csv(csv_path, dtype={'advertisement_id': str, 'action': int})

# build df
df = ctr.groupby('advertisement_id').agg({'action': ['count', 'sum']})
df.columns = ['Total', 'Click']
df['No Click'] = df['Total'] - df['Click']
df = df.drop(columns = ['Total'])

display(df)

# we reject the null hypothesis of no impact: ad B is better than A
chi2, p, _, _ = stats.chi2_contingency(df)
print('Chi-Square test: {:.3} - p-value: {:.3}'.format(chi2, p)) 
Click No Click
advertisement_id
A 304 696
B 372 628
Chi-Square test: 10.0 - p-value: 0.00154