Multiple Choice Models

Lucas S. Macoris (FGV-EAESP)

Tech-setup

Tech-setup

All coding steps will be done using Python. If you need help on setting up your machine, please refer to this link for help

  • Before you start, make sure to import and load all the necessary packages:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
from plotnine import *
from great_tables import GT, md
from mizani.formatters import percent_format
from statsmodels.discrete.discrete_model import MNLogit
from statsmodels.tools.tools import add_constant

Recap on Logit Models

  • From previous classes, the idea behind using \(\Lambda(X)\) lies on the latent variable approach: think about an unobserved component, \(Y^\star\), which is a continuous variable, such as how much a consumer values a product.

  • Although we do not observe \(Y^\star\), we do observe consumers’ decisions of buying or not buying the product, depending on a given threshold:

\[ Y = \begin{cases} 1, & \text{if } Y^* > 0 \\ 0, & \text{if } Y^* \leq 0 \end{cases} \]

  • Therefore, we can see that the probability of buying depends on a latent variable, which is not observed by the econometrician:

\[ P(Y = 1 | X) = P(Y^* > 0 | X) = P(X \beta + \varepsilon > 0 | X) \]

Logistic Regression (Logit) Models

  • We saw that Logit models are just an example of a case where the transformation function is:

\[ f(Y^*) = \Lambda(Y^*) = \frac{\exp(X\beta)}{1 + \exp(X\beta)} \]

  • Whenever we’re estimating a logit model, our transformation function, \(\Lambda(\cdot)\), is actually estimating \(\log[p/(1 - p)]\):

\[ \text{logit}(p) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k \]

  • The term \(\frac{p}{1 - p}\) is called odds-ratio, and is simply the ratio of the probability of success over the probability of failure

Hands-On Exercise

Description

  • You were hired as a consultant to work on a research project that aims to understand how individuals travel around Spain. A survey was applied to more than 100 customers seeking to understand their consumption patterns when it comes to travelling around the country using buses, cars, trains, or airplanes. You have been prompted with the following task:
  1. How customers decide which transportation method to use?
  2. What would have been the effect of decreasing transportation costs in terms of market-shares for each transportation?
  • You can download the transportation-dataset.csv data using the Download button below.

Note: the groceries dataset has been adapted to fit the purpose of the lecture. You can find the original dataset here

\(\rightarrow\) This example has been taken from (Train 2009)

From Discrete to Multiple Choice

  • In practice, decisions are, in general, more complicated than a simply binary choice:

    1. Consumers choose between a product given a finite set of choices;
    2. Passengers choose which transportation method is the preferred one for their trips;
    3. Individuals choose which payment method is the best to accommodate their budget constraints.
  • In more practical cases, the set of choices \(k\) is, in general, greater than \(2\). In this lecture, we will see that we can accommodate the estimation method used for Logit for cases where \(k > 2\).

  • This method is generally called multinomial logit, and like logit, is just an example of generalized linear models, a broader class of models that generalize the multiple linear regression model

Bridging Binary to Multi-Choice Models (continued)

  • Recall that, in a binary setting, our probability measure using a logit was:

\[ P(Y) = \frac{\exp(X\beta)}{1 + \exp(X\beta)} \] which we will call by \(p\)

  • Let’s write the probabilities of each scenario \(y=1\) and \(y=0\) for a given customer \(i\) as:

\[ \begin{cases} P(y_i = 1 \mid x_i) = p_{i,1} \\ P(y_i = 0 \mid x_i) = p_{i,0} \end{cases} \]

  • We can use the intuition on these probabilities to arrive at a general formulation.

Bridging Binary to Multi-Choice Models (continued)

  • Suppose that we consider \(y = 0\) (in our previous example, non-churned customers) as the baseline reference. Then, the logit regression for this customer \(i\) is simply the estimation of the log odds-ratio:

\[ \log \left( \frac{p_{i,1}}{1 - p_{i,1}} \right) = \log \left( \frac{p_{i,1}}{p_{i,0}} \right) = X\beta \]

  • In words, we are estimating the log ratio of probabilities, taking into account that the baseline category is \(y = 0\)

  • Relating this to the churn example, we are modeling the increase in the probability of churning (relative to non-churning), based on observations \(X\).

  • What if we have more than two categories?

Bridging Binary to Multi-Choice Models (continued)

  • Suppose now that we have \(k > 2\). For example, instead of thinking about a churned customer, we can observe its actual bank choice: \(\{Santander, KutxaBank, BankInter\}\)

  • Our set has now three potential choices. We can write the probability of customer \(i\) choosing bank \(k = z\) as:

\[ \begin{cases} P(y_i = 1 | x_i) = p_{i,1} \\ P(y_i = 2 | x_i) = p_{i,2}, \text{ such that} \quad \sum_{k=1}^{3} p_{i,k} = 1 \\ P(y_i = 3 | x_i) = p_{i,3} \end{cases} \]

Multinomial Logit

  • Let’s say that we want to fit a model to understand customer’s preferences around bank services providers. Formally, we have information on covariates \(X\). Then, we want to fit a model such that we can recover the probability of choosing alternative \(k\) based on \(X\).

  • For that, fix \(k = 1\) (in our case, Santander) to be the baseline category. Then:

\[ \log\bigg(\frac{p_{i,2}}{p_{i,1}}\bigg) = \alpha_2 + \beta_{2,1} + \dots + \beta_{2,j} \]

\[ \log\bigg(\frac{p_{i,3}}{p_{i,1}}\bigg) = \alpha_3 + \beta_{3,1} + \dots + \beta_{3,j} \]

  • In words, for each bank \(k\) other than \(k = 1\) (Santander), we will have a separate equation that measures the log odds-ratio of preferring bank \(k\) over Santander

  • Therefore, if the response variable has \(k\) possible categories, there will be \(k - 1\) equations.

About the dataset

  • We will use a different dataset to understand multi-choice models.

  • This data comes from Greene (2003) and consists of a survey about the preferred travel transportation method (air, bus, car, or train for a sample of individuals.

  • We observe the following characteristics:

    1. Characteristics of the choice vehicle cost, waiting time, travel time
    2. Characteristics of the individuals: income, family size
    3. The actual choice made by each individual: \(\{air, bus, train, car\}\)
  • We want to understand how both offer characteristics as well as individual characteristics drive the decision to use a specific transportation method

Estimating a Multinomial Logit

  • A Multinomial Logit model is estimated by maximizing the likelihood of observing the decisions for each consumer \(i\)

  • As before, the estimation of the parameters of this model by maximum likelihood proceeds by maximization of the multinomial likelihood with the probabilities viewed as functions of the parameters

  • Differently from the binary case, we’ll now have one equation for each \(k \neq 1\). Note that it really makes no difference which category we pick as the reference cell, because we can always convert from one formulation to the other—similar case with defining a dummy variable

Estimating a Multinomial Logit

Optimization terminated successfully.
         Current function value: 0.753382
         Iterations 11
                          MNLogit Regression Results                          
==============================================================================
Dep. Variable:                   mode   No. Observations:                  210
Model:                        MNLogit   Df Residuals:                      198
Method:                           MLE   Df Model:                            9
Date:                Fri, 04 Apr 2025   Pseudo R-squ.:                  0.4424
Time:                        13:05:46   Log-Likelihood:                -158.21
converged:                       True   LL-Null:                       -283.76
Covariance Type:            nonrobust   LLR p-value:                 5.853e-49
==============================================================================
  mode=bus       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -10.6695      2.381     -4.482      0.000     -15.336      -6.003
size           0.1787      0.623      0.287      0.774      -1.042       1.399
income        -0.0317      0.023     -1.366      0.172      -0.077       0.014
travel         0.0481      0.009      5.308      0.000       0.030       0.066
------------------------------------------------------------------------------
  mode=car       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -11.5148      2.325     -4.953      0.000     -16.071      -6.958
size           1.1665      0.545      2.140      0.032       0.098       2.235
income        -0.0071      0.021     -0.343      0.732      -0.048       0.034
travel         0.0463      0.009      5.124      0.000       0.029       0.064
------------------------------------------------------------------------------
mode=train       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -9.3876      2.318     -4.049      0.000     -13.932      -4.844
size           0.9217      0.562      1.640      0.101      -0.180       2.023
income        -0.0633      0.022     -2.869      0.004      -0.107      -0.020
travel         0.0467      0.009      5.166      0.000       0.029       0.064
==============================================================================
# Read the CSV file into a Pandas DataFrame
Data = pd.read_csv('Assets/transportation-dataset.csv')

# Filter rows where choice is 'yes' and reset index
Data = Data[Data['choice'] == 'yes'].reset_index(drop=True)

# Define features and target variable
X = Data[['size', 'income', 'travel']]
y = Data['mode']

# Add a constant term
X = add_constant(X)

# Fit multinomial logistic regression model
multinomial_model = MNLogit(y, X)
multinomial_result = multinomial_model.fit()

# Display summary statistics
print(multinomial_result.summary())

Predicting probabilities

  • For each of the estimated categories \(k\) (bus, train, car), the probability that individual \(i\) has chosen it is:

\[ p_{k,i} = \frac{\exp(\alpha_k + \beta_{k,1} + \beta_{k,2} + \dots + \beta_{k,j})} {1 + \sum_{k=2}^{4} \exp(\alpha_k + \beta_{k,1} + \beta_{k,2} + \dots + \beta_{k,j})} \]

  • Recall that we’ve recovered this probability by fixing a reference category (in this case, air).
    Therefore, to recover the probabilities of individuals choosing air:

\[ p_{1,i} = 1 - \sum_{k=2}^{4} p_{k,i} \]

Predicting Probabilities (continued)

Summary Statistics
Mean Standard Deviation Median Minimum Maximum
air 27.62% 41.31% 0.36% 0.00% 99.92%
bus 14.29% 13.07% 10.92% 0.01% 48.02%
car 28.10% 24.37% 22.88% 0.05% 91.16%
train 30.00% 23.87% 28.82% 0.01% 74.96%
probabilities = multinomial_result.predict()

# Convert to tibble (DataFrame in pandas)
probabilities_df = pd.DataFrame(probabilities,columns=['air','bus','car','train'])

# Reshape DataFrame to long format
probabilities_df = probabilities_df.stack().reset_index()
probabilities_df.columns = ['obs_num', 'mode', 'value']

# Group by 'mode' and summarize statistics
summary_df = probabilities_df.groupby('mode').agg({'value': ['mean', 'std', 'median', 'min', 'max']})
summary_df.columns = ['_'.join(col).strip() for col in summary_df.columns.values]

# Display the summary DataFrame

#Table
Table= (
  GT(summary_df.reset_index())
  .cols_align('center')
  .tab_stub(rowname_col='mode')
  .fmt_percent()
  .cols_label(
    value_mean = 'Mean',
    value_std = 'Standard Deviation',
    value_median = 'Median',
    value_min = 'Minimum',
    value_max = 'Maximum'
    )
  .tab_header(title=md("**Summary Statistics**"))
  .opt_stylize(style=1,color='red')
)

Table.tab_options(table_width="100%",table_font_size="25px")

Assessing Performance

  • Because we’re using the very same maximization procedure as used in logit models, we can use the same maximum likelihood statistical tests:

    1. The Wald test
    2. The Likelihood-Ratio test
    3. The Lagrange Multiplier test
  • Also, we can tabulate the results in such a way that we’d ideally want all observations to lie at the diagonal cells of our matrix

Accuracy of the model is 64.76%
Air Bus Car Train
air 54 0 3 1
bus 1 8 5 16
car 5 5 30 19
train 0 4 15 44
# Define class labels mapping
class_mapping = {0: 'air', 1: 'bus', 2: 'car', 3: 'train'}

# Predict class labels
predicted_labels = multinomial_result.predict(X).idxmax(axis=1).map(class_mapping)
actual_labels = y

# Combine predicted labels with original data
Data['Predicted'] = predicted_labels
Data['Actual'] = actual_labels

# Create contingency table
contingency_table = pd.crosstab(index=Data['Actual'], columns=Data['Predicted']).reset_index()

# Calculate accuracy
accuracy = "{:.2%}".format((Data['Actual'] == Data['Predicted']).mean())

# Display the contingency table and accuracy
Table= (
  GT(contingency_table)
  .cols_align('center')
  .tab_stub(rowname_col='Actual')
  .cols_label(
    air = 'Air',
    bus = 'Bus',
    car = 'Car',
    train = 'Train',
    )
  .tab_header(title="Accuracy of the model is " + accuracy)
  .opt_stylize(style=1,color='red')
)

Table.tab_options(table_width="100%",table_font_size="25px")

Counterfactual exercises

  • As marketers, we’re generally interested in understanding how consumers would react as-if they were exposed to a given situation:

    1. What happens to the probability of using a plane as the size of the family grows?
    2. How many customers would upgrade their travel plans to air if they have received an increase in income?
    3. If the price of air decreases significantly, how would customers distribute among other transportation modes?
  • In what follows, we’ll analyze how these three changes affect the distribution of transportation choices.

Exercise 1: increasing size

  • What happens to the probability of using each mode as the size of the family grows?

  • In order to analyze that, we will:

    1. Make a copy of the original matrix of covariates, \(X\)
    2. Use a loop iterating different sets of customers, with size ranging from \(1\) to \(4\)
    3. Predict the probabilities for each choice and averaging them out across customers
    4. Append the results

Result: Consumers tend to switch over to cars more aggressively

Counterfactual Exercise 1 - Probabilities
Air Bus Car Train
1 29.10% 20.71% 21.31% 28.88%
2 26.34% 11.12% 30.24% 32.30%
3 23.19% 5.33% 38.63% 32.85%
4 19.58% 2.38% 46.42% 31.62%
5 15.24% 1.03% 54.15% 29.58%
# Define class labels mapping
class_mapping = {0: 'air', 1: 'bus', 2: 'car', 3: 'train'}

#Define size buckets
size_buckets = np.arange(1,6,1)

#Store the results
results=pd.DataFrame()

for i in size_buckets:

  # Create counterfactual DataFrame with current values of X except for and different sizes
  X_Temp=X.copy()
  X_Temp['size']=i

  # Predict probabilities for counterfactual data
  counterfactual_probs = pd.DataFrame({'estimate': multinomial_result.predict(X_Temp).mean(axis=0)}).T
  counterfactual_probs.columns=counterfactual_probs.columns.map(class_mapping)
  counterfactual_probs['size']=i
  counterfactual_probs=counterfactual_probs.reindex(columns=['size','air','bus','car','train'])
  
  #Append
  results=pd.concat([results,counterfactual_probs],axis=0,ignore_index=True)

# Display the contingency table and accuracy
Table= (
  GT(results)
  .cols_align('center')
  .tab_stub(rowname_col='size')
  .fmt_percent()
  .cols_label(
    size= 'Size',
    air = 'Air',
    bus = 'Bus',
    car = 'Car',
    train = 'Train',
    )
  .tab_header(title=md("**Counterfactual Exercise 1 - Probabilities**"))
  .opt_stylize(style=1,color='red')
)

Table.tab_options(table_width="100%",table_font_size="25px")

Exercise 2: increasing income

  • How many customers would upgrade their travel plans to air if they receive an increase in income?

  • In order to analyze that, we will:

    1. Make a copy of the original matrix of covariates, \(X\)
    2. Use a loop iterating different sets of customers, with income ranging from \(0\) to \(100\) by increments of \(10\)
    3. Predict the probabilities for each choice and averaging them out across customers
    4. Append the results
Counterfactual Exercise 2 - Probabilities
Air Bus Car Train
0 22.47% 11.01% 9.11% 57.41%
10 24.07% 13.01% 13.67% 49.25%
20 25.46% 14.72% 19.53% 40.29%
30 26.67% 15.82% 26.33% 31.17%
40 27.70% 16.11% 33.44% 22.75%
50 28.56% 15.57% 40.15% 15.71%
60 29.27% 14.38% 45.99% 10.35%
70 29.86% 12.79% 50.78% 6.57%
80 30.35% 11.04% 54.55% 4.06%
90 30.78% 9.31% 57.46% 2.45%
100 31.16% 7.72% 59.66% 1.46%
#Define income buckets
income_buckets = np.arange(0,110,10)

#Store the results
results=pd.DataFrame()

for i in income_buckets:

  # Create counterfactual DataFrame with current values of X except for and different sizes
  X_Temp=X.copy()
  X_Temp['income']=i

  # Predict probabilities for counterfactual data
  counterfactual_probs = pd.DataFrame({'estimate': multinomial_result.predict(X_Temp).mean(axis=0)}).T
  counterfactual_probs.columns=counterfactual_probs.columns.map(class_mapping)
  counterfactual_probs['income']=i
  counterfactual_probs=counterfactual_probs.reindex(columns=['income','air','bus','car','train'])
  
  #Append
  results=pd.concat([results,counterfactual_probs],axis=0,ignore_index=True)

# Display the contingency table and accuracy
Table= (
  GT(results)
  .cols_align('center')
  .tab_stub(rowname_col='income')
  .fmt_percent()
  .cols_label(
    income= 'Income Level',
    air = 'Air',
    bus = 'Bus',
    car = 'Car',
    train = 'Train',
    )
  .tab_header(title=md("**Counterfactual Exercise 2 - Probabilities**"))
  .opt_stylize(style=1,color='red')
)

Table.tab_options(table_width="100%",table_font_size="25px")

Exercise 3: decreasing travel

  • How many customers would upgrade their travel plans to air if travel costs decrease?

  • In order to analyze that, we will:

    1. Make a copy of the original matrix of covariates, \(X\)
    2. Create a range of different customers, with travel costs set to \(10\%-90\%\) of the original cost for each customer
    3. Predict the probabilities for each choice and averaging them out across customers
    4. Append the results
Counterfactual Exercise 3 - Probabilities
Air Bus Car Train
0% 99.97% 0.00% 0.02% 0.01%
10% 99.03% 0.03% 0.64% 0.30%
20% 90.16% 0.81% 4.80% 4.23%
30% 69.24% 3.70% 13.76% 13.29%
40% 60.21% 5.23% 17.65% 16.91%
50% 54.74% 6.39% 19.33% 19.54%
60% 48.35% 7.84% 21.23% 22.57%
70% 41.30% 9.57% 23.50% 25.63%
80% 35.30% 11.28% 25.54% 27.88%
90% 30.86% 12.84% 27.05% 29.24%
100% 27.62% 14.29% 28.10% 30.00%
#Define travel buckets
travel_buckets = np.arange(0, 1.1, 0.1)

#Store the results
results=pd.DataFrame()

for i in travel_buckets:

  # Create counterfactual DataFrame with current values of X except for and different sizes
  X_Temp=X.copy()
  X_Temp['travel']=X_Temp['travel']*i

  # Predict probabilities for counterfactual data
  counterfactual_probs = pd.DataFrame({'estimate': multinomial_result.predict(X_Temp).mean(axis=0)}).T
  counterfactual_probs.columns=counterfactual_probs.columns.map(class_mapping)
  counterfactual_probs['travel']= "{:.0%}".format(i)
  counterfactual_probs=counterfactual_probs.reindex(columns=['travel','air','bus','car','train'])
  
  #Append
  results=pd.concat([results,counterfactual_probs],axis=0,ignore_index=True)

# Display the predicted probabilities DataFrame
results=results.reindex(columns=['travel','air','bus','car','train'])


# Display the contingency table and accuracy
Table= (
  GT(results)
  .cols_align('center')
  .tab_stub(rowname_col='travel')
  .fmt_percent()
  .cols_label(
    travel= '% of Original Travel Cost',
    air = 'Air',
    bus = 'Bus',
    car = 'Car',
    train = 'Train',
    )
  .tab_header(title=md("**Counterfactual Exercise 3 - Probabilities**"))
  .opt_stylize(style=1,color='red')
)

Table.tab_options(table_width="100%",table_font_size="25px")

Limitations of the Multinomial Logit

  • Recall that we estimated our predicted probabilities by looking at:

\[ p_{k,i} = \frac{\exp(\alpha_k + \beta_{k,1} + \beta_{k,2} + \dots + \beta_{k,j})}{1 + \sum_{k=2}^{4} \exp(\alpha_k + \beta_{k,1} + \beta_{k,2} + \dots + \beta_{k,j})} \]

  • As a consequence, the probabilities of two choices are the same if they have the same characteristics
  • This creates room for unreasonable substitution patterns. One of the most important properties of the multinomial logit model is the Independence from Irrelevant Alternatives (IIA)
  • The IIA property states that for any individual, the ratio of probabilities of choosing two alternatives is independent of the availability or attributes of any other alternatives

Limitations of the Multinomial Logit (continued)

  • Consider the probability of choosing mode=bus \(k = 1\) and mode=train \(k = 2\). According to the multinomial model, the probabilities of choosing each alternative are:

\[ p_{2,i} = \frac{\exp(\alpha_2 + \beta_{2,1} + \beta_{2,2} + \dots + \beta_{2,j})}{1 + \sum_{k=2}^{4} \exp(\alpha_k + \beta_{k,1} + \beta_{k,2} + \dots + \beta_{k,j})} \]

\[ p_{3,i} = \frac{\exp(\alpha_3 + \beta_{3,1} + \beta_{3,2} + \dots + \beta_{3,j})}{1 + \sum_{k=2}^{4} \exp(\alpha_k + \beta_{k,1} + \beta_{k,2} + \dots + \beta_{k,j})} \]

  • Problem: this ratio is independent of the availability and attributes of \(k = 4\) car:

    \[ \frac{p_{2,i}}{p_{3,i}} = \frac{\exp(\alpha_2 + \beta_{2,1} + \beta_{2,2} + \dots + \beta_{2,j})}{\exp(\alpha_3 + \beta_{3,1} + \beta_{3,2} + \dots + \beta_{3,j})} \equiv \frac{\exp^{bus}}{\exp^{train}} \]

The Red-Bus/Blue-Bus Paradox

  • The IIA property limits responses to changes predicted by the multinomial logit model

  • A well-known paradox illustrating this is the red-bus/blue-bus paradox:

    1.Suppose you have two alternatives with identical properties: car and red bus

    1. If their observable attributes are exactly the same, we should expect the market share of each alternative to be the same (i.e., \(50\%\) each)
  • Now, assume a blue bus category is introduced with the same observable characteristics. You would expect the new probabilities to be:

    1. Car: \(50\%\)
    2. Red bus: \(25\%\)
    3. Blue bus: \(25\%\)
  • Problem: due to the IIA, a multinomial logit model would predict that each option has 33%, which is clearly counterintuitive, as it violates rational decision-making - remember, in terms of valued features, red and blue buses are essentially the same!

Limitations of the Multinomial Logit (continued)

  • To which extent do the implications of the IIA affect our problem? In order to see that, suppose that you have an increase of \(10\%\) in the cost of using bus

  • If that is the case, individuals who stop using buses due to cost increases are predicted to distribute themselves among the remaining modes in proportion to the initial probabilities of choosing the remaining modes

  • In other words, the ratio of probabilities across the other choices remain intact

Does this sound reasonable? Likely not! If there’s an increase in bus prices, we would expect air and bus passengers to react differently to these changes:

  1. Since bus and train are closely related, one would expect marginal bus users to move towards train

  2. As a consequence, the amount of marginal users moving to air should be much less

Rethinking our use of multinomial logit

  • How we can change this? Adding individual preference heterogeneity into the choices:

\[ V_{i,k}= \underbrace{\alpha_k+X\beta_{k}}_{\text{Average utility for choice k}} + \underbrace{\mu_{i,k}}_{\text{Customer-specific utility for k}} \]

  1. In the red bus/blue bus paradox, this would allow \(50\%\) of the customers to really prefer cars

  2. In our practical example, this would allow air passengers to stick with their choices regardless of the changes in bus fares

  • Implementations:

    1. Use of nested logit models: first, they choose whether they’ll go with high/low cost; after that, they choose the specific transportation method

    2. BLP estimation - see pyBLP for an implementation using Python

References

Train, Kenneth E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge, England: Cambridge University Press.