Let’s say you’re going to do some business modeling for a friend who’s thinking of opening a flower shop, producing something like this at the end. If you ask your friend the following question…
Can you give me the scale and shape parameters of the lognormal distribution representing estimated customer spend?
You’re likely to be given a blank stare in response. You’re more likely to find success with this line of questioning:
What would you estimate as the range of spending for 90% of customers? Do you think customers are most likely to spend in the middle of that, or closer to the bottom?
Estimation is difficult, but if you’re doing it at all, thinking about it in terms of percentiles tends to be an intuitive approach.
Unfortunately, statistical software doesn’t work this way. Distributions are parametrized in terms of means, standard deviations, shape parameters, and so on. So, how do we go from “90\% of customers spend between $10 and $90” to a probability distribution in software?
- Find a paper titled Determining distribution parameters from quantiles
- Implement it in Python!
This post simply demonstrates some Python code examples for deriving normal and log-normal distributions from percentile estimates.
If you’re interested in understanding the maths behind this and learning about other distributions, I encourage you to check out the paper. You can also read this blog post, which links to this Python code for Gamma and Beta distributions.
First we just import some libraries and set up some plotting functions.
from scipy.stats import lognorm, norm import numpy as np import matplotlib.pyplot as plt
def show_pdf(ax, dist): """ Plot the PDF of a distribution to an axis. """ minx, maxx = dist.ppf(0.0001), dist.ppf(0.9999) x = np.linspace(minx, maxx, 1000) ax.plot(x, dist.pdf(x)) def show_percentiles(ax, dist, low, high): """ Fill the bottom 'low'-th percentiles and top 'high'-th percentiles of the distribution on an axis. """ minx, maxx = dist.ppf(0.0001), dist.ppf(0.9999) x1, x2 = dist.ppf([low, high]) xs1 = np.linspace(minx, x1, 1000) xs2 = np.linspace(x2, maxx, 1000) ax.fill_between(xs1, dist.pdf(xs1)) ax.fill_between(xs2, dist.pdf(xs2))
If we estimate that 5\% of customers spend under $10, and 95\% spend under $90, then we can write that as
Generalizing, we’re interesed in finding the scale and location of distributions parametrized as follows:
Here’s the solution for a normal distribution. Here, is the inverse Cumulative Distribution Function of the standard normal distribution. Again, check out the paper to see the derivation!
Here’s a python function which implements that:
def norm_from_percentiles(x1, p1, x2, p2): """ Return a normal distribuion X parametrized by: P(X < p1) = x1 P(X < p2) = x2 """ p1ppf = norm.ppf(p1) p2ppf = norm.ppf(p2) location = ((x1 * p2ppf) - (x2 * p1ppf)) / (p2ppf - p1ppf) scale = (x2 - x1) / (p2ppf - p1ppf) return norm(loc=location, scale=scale)
In the examples below, I’ve produced two normal distributions using percentile estimates.
show_percentiles function - which fills in the top and bottom percentiles - demonstrates that
norm_from_percentiles works as intended. Given only the percentiles, it fills in charts in at the expected
The second example demonstrates that the percentiles don’t need to be centered. You can parametrize using any two percentiles!
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4)) dist = norm_from_percentiles(10, 0.05, 90, 0.95) show_pdf(ax1, dist) show_percentiles(ax1, dist, 0.05, 0.95) ax1.set(title="5% below 10, 5% above 90") dist = norm_from_percentiles(5, 0.05, 10, 0.50) show_pdf(ax2, dist) show_percentiles(ax2, dist, 0.05, 0.50) ax2.set(title="5% below 5, 50% above 10") plt.suptitle("Normal distributions");
The paper stipulates that for a log-normal distribution, you can simply replace and with and :
and as we shall see, that seems to work! However, following Scipy’s documentation on
lognorm, we also need to make a final transformation from this parametrization to SciPy’s
A common parametrization for a lognormal random variable
Yis in terms of the
mu, and standard deviation,
sigma, of the unique normally distributed random variable
exp(X) = Y. This parametrization corresponds to setting
s = sigmaand
scale = exp(mu).
Here’s the Python code:
def lognorm_from_percentiles(x1, p1, x2, p2): """ Return a log-normal distribuion X parametrized by: P(X < p1) = x1 P(X < p2) = x2 """ x1 = np.log(x1) x2 = np.log(x2) p1ppf = norm.ppf(p1) p2ppf = norm.ppf(p2) scale = (x2 - x1) / (p2ppf - p1ppf) mean = ((x1 * p2ppf) - (x2 * p1ppf)) / (p2ppf - p1ppf) return lognorm(s=scale, scale=np.exp(mean))
The examples below use the same parameters as the normal distributions above, but look very different (they have a heavy right-skew). This highlights the importance of choosing the correct distribution.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4)) dist = lognorm_from_percentiles(10, 0.05, 90, 0.95) show_pdf(ax1, dist) show_percentiles(ax1, dist, 0.05, 0.95) ax1.set(title="5% below 10, 5% above 90") dist = lognorm_from_percentiles(5, 0.05, 10, 0.50) show_pdf(ax2, dist) show_percentiles(ax2, dist, 0.05, 0.50) ax2.set(title="5% below 5, 50% above 10") plt.suptitle("Log-Normal distributions");
Sometimes, the best you have is your best guess, and it’s much easier for people to guess things like “10% of people would pay less than X” than it is for people to directly guess the location and scale parameters of a distribution.
All of this has left me wondering: why isn’t percentile parametrization more commonly found in maths packages and statistics libraries? I suppose it’s just not a core functionality, and as we’ve seen it’s pretty easy to implement if you need, but I feel like wider adoption of percentile parametrization could go a long way to making simple statistical modeling more accessible.
One great exception is Guesstimate (of the Flower Shop example from earlier), a great tool which allows users to define and combine a wide variety of distributions using percentiles. That being said, you’re limited to 90% confidence intervals and in Guesstimate, whereas here we’ve demonstrated that you can use any percentiles you like.
The math is clearly out there, some people are clearly taking interest in this, and now this implimentation is out there too. Next time you need to model something using estimates, consider using percentiles!
( View notebook in Github )