Using Machine Learning to Estimate Price Elasticity

During the final semester of my grad school analytics program, I was tasked with estimating price elasticity for wine. At first glance, this seemed like a simple objective. In economics, I was taught that price elasticity is the percent change in demand given the percent change in price (i.e. how price sensitive consumers are). All you have to do is look at quantity purchased at each price to determine price elasticity. Why would you even need machine learning to solve this type of problem?

To understand how flawed this simplistic methodology is, let’s consider the following question:

Is price the only variable that influences whether you purchase a product or service?

If you answered “no”, then you can imagine the potential gap between economic theory and reality. To estimate price elasticity, you likely need to consider many variables in addition to the price of the product or service. Otherwise, your elasticity model will likely suffer from poor predictive performance.

[Figure: The typical machine learning development process, with two personal modifications: a “Planning + Research” component added up front and “Exploratory Data Analysis” called out within the data prep stage.]

Planning + Research

Because I had never attempted to estimate elasticity outside of a simple financial analysis, this problem required some research. It led to one of the most stimulating conversations I have had in a while, as I described the problem to a colleague with a strong physics background. After defining price elasticity for him, we approached the problem from first principles using derivatives (i.e. measuring change) from calculus. This discussion, combined with a post from a “Quantitative Analysis” course at UW, pushed me toward multiple linear regression with log transformations. In a log-log regression, a coefficient is interpreted as the percent change in the target for a one percent change in the feature, which was exactly the desired outcome.
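To make that interpretation concrete, a simplified version of the specification (with price as the only regressor) is:

ln(Q_future) = α + β · ln(P_current) + ε

where the coefficient β = d ln(Q) / d ln(P) ≈ (% change in Q) / (% change in P), which is exactly the definition of price elasticity.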

The algorithm necessary to estimate price elasticity for wine, as well as for each desired dimension, became the following (a compact code sketch follows the list):

  1. Filter the sales data to the subset relevant to the dimension for which you are estimating elasticity (e.g. if estimating the price elasticity of red wine, keep only red wine sales)
  2. Perform a log transformation on the future sales target variable and on the current price feature
  3. Train a linear regression model to accurately predict future sales
  4. Take the coefficient of the log-transformed price feature as the price elasticity estimate
  5. Repeat steps 1-4 for each elasticity estimate
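Putting those steps together, here is a minimal sketch of the recipe in Python with pandas and scikit-learn. The “sales” dataframe and its column names are hypothetical placeholders; the sections below walk through how the real features were actually built.

import numpy as np
from sklearn.linear_model import LinearRegression

# 'sales', 'wine_type', 'current_price', and 'future_quantity_sold' are placeholders
red_wine = sales[sales['wine_type'] == 'Red']      # 1. filter to the dimension of interest
X = np.log(red_wine[['current_price']])            # 2. log-transform the current price feature
y = np.log(red_wine['future_quantity_sold'])       # 2. log-transform the future sales target

model = LinearRegression().fit(X, y)               # 3. train the regression
price_elasticity = model.coef_[0]                  # 4. the log-price coefficient is the elasticity estimate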

Data Preparation

After establishing the high-level plan, I proceeded to prep the data. This is typically the most time-consuming part of any model development process. The goal was to identify and perform the steps necessary to preprocess and transform the data prior to any analysis. While the data was readily available, the database documentation left much to be desired. The relationships between tables were not obvious, nor were the primary keys. Furthermore, the data definitions raised more questions than they answered. To illustrate some of the ambiguity, let’s imagine the mock product table below. What do you think “Sold Price” means?

Product ID | Product         | Volume | Cases | Sold Price
123        | 6-pk wine can   | 500    | 5     | 1.10
434        | 20 bottles wine | 1000   | 2     | 0.75
645        | 1 btl wine      | 100    | 10    | 4.25

Without clear documentation, I was left pondering the following questions:

Is sold price the price per unit, product, or case?

What is a standard unit across all beverages?

Does “sold price” include taxes?

Estimating elasticity with so many unknowns would not only be inaccurate, but also unethical. To fill this gap in knowledge, I needed to ask our stakeholders dozens of questions. Some examples include:

  • Why are some prices zero?
  • When there are columns that appear equivalent, which one is the most reliable?
  • Why does so much information appear to be missing in column “X”?
  • Why isn’t this “ID” column all unique values?
  • Which records can be filtered out prior to the elasticity estimation?

The process of acquiring this domain- and database-specific knowledge is an often overlooked skill for data professionals. One misstep or miscalculated assumption during data prep can lead to the wrong conclusions downstream.

PS: This is why many data science teams are becoming more embedded within business functions rather than separate departments like IT.

Exploratory Data Analysis

During this prep stage, I was also exploring the data. To build a model that is explainable, you need to understand the distribution of the data, identify patterns, and find features that could explain elasticity. Visualizations are a data professional’s best friend during this step.

Using histograms, I was able to understand consumer purchase behavior across different U.S. states, years, accounts (i.e. wineries), varietals (e.g. merlot), price bands (e.g. $10-15 per bottle), sales channels (e.g. online), etc. The slice of data I was provided was imbalanced across several dimensions: a large percentage of the consumer purchases were in California, and most of the purchases came from only one account. Furthermore, the number of accounts was inconsistent across years, since a few were added as the years advanced.
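As an illustration, most of this exploration needed nothing fancier than the plotting shortcuts built into pandas. The dataframe and column names below are placeholders for the real ones.

import matplotlib.pyplot as plt

# Purchase counts by state and the distribution of bottle prices (placeholder names)
sales_dataframe['State'].value_counts().plot(kind='bar', title='Purchases by state')
plt.show()

sales_dataframe['Price_per_bottle'].plot(kind='hist', bins=30, title='Price per bottle')
plt.show()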

Seeing a time series of purchases over the years suggested that there might be some seasonality in consumer purchase behavior. This observation heavily influenced my feature engineering decisions in the next step of the machine learning development cycle.

Perhaps the most interesting observations were the increased percentage of purchases via online channels and the dramatic fall and rise of sales during the COVID-affected years. These significant changes in purchasing behavior shaped some of our modeling decisions later on, such as excluding data prior to 2016.

Feature Engineering

The line between data prep and feature engineering can be a bit blurry. After pre-processing the data from the data tables, I worked on transforming the relevant records into “model readable” formats. For scikit-learn, a popular machine learning library in Python, this requires transforming all features into numeric values. Features such as location of purchase (e.g. California) had to be transformed into “dummy columns” with binary values (i.e. 1 for yes, 0 for no). This dummy variable creation process was applied to all categorical columns of interest and was easy to do with the get_dummies function from the pandas library. See the code example below.

import pandas as pd

# One-hot encode the categorical columns of interest into binary dummy columns
dummy_variable_columns = ['Day_of_week', 'Month', 'Status']
sales_dataframe_new = pd.get_dummies(sales_dataframe_old, columns=dummy_variable_columns)

One of the most common, and often frustrating, feature engineering steps was the modification of the date dimensions. I extracted various time slices (e.g. month, week, day, year) from the order date to enable testing for seasonality in consumer behavior across different intervals of time. The time series of sales looked so much like a sine graph that I had flashbacks to trigonometry class in high school*. This inspired me to research and create sine and cosine transformations of time to be tested as features in the model.
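For anyone curious what that looks like in practice, here is a minimal sketch of the cyclical encoding, assuming the month has already been extracted into an integer column (the “Order_Month” name is a placeholder):

import numpy as np

# Map month 1-12 onto the unit circle so December and January end up adjacent
sales_dataframe_new['month_sin'] = np.sin(2 * np.pi * sales_dataframe_new['Order_Month'] / 12)
sales_dataframe_new['month_cos'] = np.cos(2 * np.pi * sales_dataframe_new['Order_Month'] / 12)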

With sales appearing as a sine curve, I decided to test aggregations of the dataframe (i.e. the feature vector and sales) across different cycle lengths of time. This somewhat complex task was easily accomplished thanks to another pandas method, resample. It transforms the dataframe so that each column is aggregated across any slice of time you desire (e.g. every 45 days). See the code example below.

# Aggregate every column into 45-day buckets anchored on the order date;
# label='right' and closed='right' label each bucket by (and include) its end date
sales_dataframe_new_45daycycle = sales_dataframe_new.resample(
    '45D', on='Order_Date', origin='start', label='right', closed='right'
).sum()

The result of the code snippet above is that I could now see the total sales for each record across 45-day cycles rather than orders on specific dates. Instead of 5 different orders of the same red wine, I would see the total sales value, total volume, total order count, total order count in each month, etc. for each 45-day cycle. Converting all columns to numeric format was a prerequisite for this reaggregation.

With the dataframe in this new structure, I calculated the average price per volume for each record and performed a log transformation on this value and on the total quantity sold. Because we need to estimate future demand, I also had to create a column containing the future, log-transformed quantity sold. This was facilitated by the shift method in pandas. See the code example below.

# Shift the current-period log quantity up one row so each record is paired
# with the NEXT 45-day cycle's demand as its target
sales_dataframe_new_45daycycle['log_quantity_sold_future'] = sales_dataframe_new_45daycycle['log_quantity_sold_current'].shift(periods=-1)

After running this code, each record in our dataframe had the current price alongside the following period’s sales. This setup is required to train the linear regression model so that current price, a known feature, is used to explain future quantity demanded, the unknown target (you cannot observe future sales at prediction time).
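For reference, the current-period log columns used above might be constructed along these lines; the aggregate column names (“total_sales_value”, “total_volume”, “total_quantity_sold”) are hypothetical stand-ins for the real ones:

import numpy as np

# Average price per unit of volume within each 45-day cycle, then log-transform
df = sales_dataframe_new_45daycycle
df['avg_price_per_volume'] = df['total_sales_value'] / df['total_volume']
df['log_price_current'] = np.log(df['avg_price_per_volume'])
df['log_quantity_sold_current'] = np.log(df['total_quantity_sold'])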

*Any readers remember SOHCAHTOA, the fun acronym to help remember the different trig functions?

Model Training

When communicating the model training stage to a non-technical audience, it is easy to present the illusion of a complex task. In reality, training the model is often one of the easier steps in the dev cycle thanks to the mature machine learning libraries now available. The code below mirrors what I did to create a training and test set, standardize the data, and fit the linear regression model to the scaled feature vector (i.e. log-transformed price, time dimensions, etc.) and the target vector (i.e. log-transformed future demand).

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression

# Drop the last cycle, whose future-demand target is NaN after the shift above
model_data = sales_dataframe_new_45daycycle.dropna(subset=['log_quantity_sold_future'])

# Set up the train and test sets (hold out 20% of the 45-day cycles)
train_x, test_x, train_y, test_y = train_test_split(
    model_data.loc[:, model_data.columns != 'log_quantity_sold_future'].to_numpy(),
    model_data['log_quantity_sold_future'].to_numpy(),
    test_size=0.20,
    random_state=20)

# Standardize the feature vector (fit the scaler on the training set only)
scaler = preprocessing.StandardScaler().fit(train_x)
train_x_scaled = scaler.transform(train_x)
test_x_scaled = scaler.transform(test_x)

# Build and fit the linear regression model
lr = LinearRegression()
lr.fit(train_x_scaled, train_y)

The code snippet above omits the feature selection/regularization process. The feature selection process was primarily driven by excluding features that were poorly correlated with future demand.
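For illustration only, a simple version of that correlation filter could look like the sketch below; the 0.1 cutoff is arbitrary and just shows the idea.

# Absolute Pearson correlation of every column with the target (all columns are numeric at this point)
correlations = sales_dataframe_new_45daycycle.corr()['log_quantity_sold_future'].abs()

# Drop features that barely move with future demand, keeping the target itself
weak_features = correlations[correlations < 0.1].index.drop('log_quantity_sold_future', errors='ignore')
sales_dataframe_selected = sales_dataframe_new_45daycycle.drop(columns=weak_features)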

Model Evaluation

While R-squared and root mean squared error (RMSE) are the most common evaluation metrics taught in school, I have often found mean absolute error (MAE) and mean absolute percentage error (MAPE) to be more explainable evaluation metrics. The latter two provide a more intuitive measure of the distance between actual and predicted values.
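As a quick sketch, all four metrics can be computed on the held-out test set with scikit-learn (mean_absolute_percentage_error requires scikit-learn 0.24 or newer):

import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

# Predict future demand for the held-out cycles and compare against the actuals
pred_y = lr.predict(test_x_scaled)
rmse = np.sqrt(mean_squared_error(test_y, pred_y))
mae = mean_absolute_error(test_y, pred_y)
mape = mean_absolute_percentage_error(test_y, pred_y)
r2 = r2_score(test_y, pred_y)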

For one of the multiple linear regression models, here were the evaluation metrics:

Evaluation Metric | Value
RMSE              | 0.15
MAE               | 0.13
MAPE              | 1.4%
R^2               | 0.75

Using the MAPE, I was able to communicate to stakeholders that our model’s predictions were off by an average of 1.4% compared to the actual sales values. Additionally, the feature vector explained 75% of the variance in future demand. This model was a success in terms of both explainability and predictive accuracy.

Conclusion

Overall, I concluded that wine is a mostly inelastic good; however, consumer behavior varied depending on the subset of wine (i.e. different sales channels, varietals, etc.). Assuming that consumers are inelastic in all situations and will continue to be inelastic in the future would be a mistake. Additionally, some subsets of data had too small a sample size to have confidence in their respective elasticity estimates. To improve this analysis, I would add substitute product prices (e.g. beer) to the feature vector and research alternative ways to estimate elasticity. It would be interesting to repeat this process with a more elastic good (e.g. fast food) to see what the estimates reveal.

~ The Data Generalist
Data Science Career Advisor

Note: All text, data, and code were modified to obscure any sensitive information.

