Non-Parametric Confidence Interval with Bootstrap

Category > Statistics

Jan 04, 2019

non-parametric confidence-interval bootstrap(stats) statistics

The code snippet assumes Anaconda 5.2.0 version of Python virtual environment

Acknowledgement

I would like to acknowledge Micahel Pyrcz, Associate Professor at the University of Texas at Austin in the Petroleum and Geosystems Engineering, for developing course materials that helped me write this article.

Check out his Youtube Lecture on Bootstrap, and Boostrap Excel numerical demo on his Github repo to help yourself better understand the statistical theories and concepts.

Bootstrap is a non-parametric statistical technique to resample from known samples to estimate uncertainty in summary statistics. When there are small, limited number of samples, it gives a more accurate forecast model than directly obtaining a forecast model from the limited sample pool (assuming that the sample set of data is reasonable representation of the population). It is non-parametric because it does not require any prior knowledge of the distribution (shape, mean, standard devation, etc..).

Advantages of Bootstrap

One great thing about Bootstrapping is that it is distribution-free. You do not need to know distribution shape, mean, standard devation, skewness, kurtosis, etc... All you need is just a set of sample data that is representative of a population. The fact that Bootstrapping does not depend on a type of distribution leads to another great advantage - It can calculate uncertainty in any confidence interval of any kind of distribution.

For example, the analytical solution to calculate a confidence interval in any statistics of a distribution is as follows:

CI of mean = stats of interest $\pm$ $($distribution score $\times$ Standard Error $)$

There are three problems with analytically solving for confidence interval of a statistic.

First, the variable in the equation, distribution score, depends on the type of the distribution. If you do not know the distribution shape of your population, it is very difficult to calculate the confidence interval of a statistic.

Second, not all statistics have a formula to calculate its Standard Error. For example, there exists an equation to calculate the standard error of a mean:

Standard Error = $\sigma_{sample} \ \mathbin{/} \ \sqrt{N}$

But there is no equation to calculate the standard error of a median. If you want to obtain confidence intervals for other statistics (ex: skewness, kurtosis, IQR, etc...), it will be very difficult to do so, simply because there are no equations for them.

Third, some statistics have analytical solutions for its standard error calculation, but it is so convoluted that Bootstrapping is simpler. A classic example is obtaining a CI for the correlation coefficient given a sample from a bivariate normal distribution.

Bootstrapping calculates confidence intervals for summary statistics numerically, not analytically, and this is why it can calculate ANY summary stats for ANY distribution.

Methodology

One goal of inferential statistics is to determine the value of a parameter of an entire population. It is typically too expensive or even impossible to measure this directly. So we use statistical sampling. We sample a population, measure a statistic of this sample, and then use this statistic to say something about the corresponding parameter of the population.

Bootstrapping is a type of resampling method to save time and money taking measurements. From a sample pool of size N, it picks a random value N times with replacement, and create M number of new Bootstrapped-sample pools. The term with replacement here means that you put back the sample you drew to the original sample pool after adding it to a new Bootstrapped-sample pool. Think of it this way: you randomly choose a file from a folder in your PC, and you copy and paste the randomly-chosen file into a new folder. You do not cut and paste the file, but you copy and paste the file into a new folder. You will have M number of folders (M is an arbitrary number of your choice), each containing N number of files.

Bootstrapping resamples the original sample pool to generate multiple smaller population of the true population. Each Bootstrap simulation is done by selecting a random value from the sample pool. For example, lets assume that you have the following sample pool of integers:

Sample Integers = [12, 433, 533, 14, 65, 42, 64]

From the sample pool of size N=7, you choose a random value N=7 times, and create a new sample pool of size N=7. In Bootstrap, each newly created sample pool is called a realization. You generate many of these realizations, and use them to calculate uncertainties in summary stats.

Realization 1 = [12, 533, 533, 533, 12, 14, 42]

Realization 2 = [65, 14, 14, 65, 433, 64, 14]

Realization 3 = [433, 64, 533, 14, 14, 64, 12]

Realization 4 = [14, 65, 65, 433, 533, 42, 12]

Notice the duplicate data in the realizations (Ex: 533, 533, 533). Duplicates in realizations exist because each data in realization is randomly chosen from the original sample pool with replacement.

Warning!

It is extremly important that the N size for each Bootstrap realization matches the N size of the original sample pool.

We use Bootstrap to numerically estimate the confidence interval (CI). It's an alternative tool to analytically solve for CI. Observing how CI is analytically calculated may help one to understand why the value of N is important. Let's take the CI of a mean for example. Recall that the CI of a mean represents how far a sample mean can deviate from the true population mean.

In case of a Gaussian, or Gaussian-like distribution (ex: student-t), the equation to analytically solve for confidence interval of a mean is as follows:

CI of mean = sample mean $\pm$ $($z-score $\times$ Standard Error $)$

Standard Error = $\sigma_{sample} \ \mathbin{/} \ \sqrt{N}$

where $N$ is the number of measured samples. If you increase the number of samples, the standard error of a mean decreases. This logically makes sense, because the more samples you have, the more accurate the estimation of the true population mean becomes.

The size of each Bootstrap realization, N, works the similar way, except that the random sample in each realization is not from the true population, but from a measured sample pool. Increasing the N-value will falsely make you to calculate smaller confidence interval.

It can be observed that the CI obtained by using a wrong N-value for Bootstrap generates narrower CI. As a result, the CI of the sample mean does not cover the true population mean, returning a misleading estimation.

In summary, Bootstrapping is used for three reasons:

Bootstrap can obtain confidence interval in any statistics.
Bootstrap does not assume anything about a distribution.
Bootstrap helps when there are too few number of samples.

Imports

In [2]:

import pandas as pd
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
%matplotlib notebook

1.A. Confidence Intervals in Summary Stats: US Male Height - Gaussian Distribution¶

Bootstrap simulation can be run to obtain confidence intervals in various population parameters: mean, stdev, variance, min, or max. In this example, we will work with the height distribution of the US Male population, which tends to be Gaussian. However, the fact that the distribution Gaussian is totally unrelated to Bootstrap simulation, because it does not assume anything about the distribution.

Bootstrapping can give us confidence intervals in any summary statistics like the following:

By 95% chance, the following statistics will fall within the range of:

Mean : 75.2 ~ 86.2, with 80.0 being the average
Standard Deviation : 2.3 ~ 3.4 with 2.9 being the average
Min : 54.3 ~ 57.2, with 55.2 being the average
Max : 77.8 ~ 82.4, with 79.8 being the average
Skew : -0.053 ~ 0.323, with 0.023 being the average

1.A.0. Bootstrap Scripts¶

Bootstrap Simulator

In [3]:

def bootstrap_simulation(sample_data, num_realizations):
    n = sample_data.shape[0]
    boot = []
    for i in range(num_realizations):
        real = np.random.choice(sample_data.values.flatten(), size=n)
        boot.append(real)
        
    columns = ['Real ' + str(i + 1) for i in range(num_realizations)]
    
    return pd.DataFrame(boot, index=columns).T

Summary Statistics Calculator

In [4]:

def calc_sum_stats(boot_df):
    sum_stats = boot_df.describe().T[['mean', 'std', 'min', 'max']]
    sum_stats['median'] = boot_df.median()
    sum_stats['skew'] = boot_df.skew()
    sum_stats['kurtosis'] = boot_df.kurtosis()
    sum_stats['IQR'] = boot_df.quantile(0.75) - boot_df.quantile(0.25)
    return sum_stats.T

Visualization Script

In [5]:

def visualize_distribution(dataframe, ax_):
    dataframe = dataframe.apply(lambda x: x.sort_values().values)
    
    for col, label in zip(dataframe, dataframe.columns):
        fit = scipy.stats.norm.pdf(dataframe[col], np.mean(dataframe[col]), np.std(dataframe[col]))
        ax_.plot(dataframe[col], fit)
    ax_.set_ylabel('Probability')

Generate Confidence Intervals

In [6]:

def calc_bounds(conf_level):
    
    assert (conf_level < 1), "Confidence level must be smaller than 1"
    
    margin = (1 - conf_level) / 2
    upper = conf_level + margin
    lower = margin
    return margin, upper, lower

def calc_confidence_interval(df_sum_stats, conf_level): 
    
    margin, upper, lower = calc_bounds(conf_level)
    
    conf_int_df = df_sum_stats.T.describe(percentiles=[lower, 0.5, upper]).iloc[4:7, :].T
    conf_int_df.columns = ['P' + str(round(lower * 100, 1)), 'P50', 'P' + str(round(upper * 100, 1))]
    return conf_int_df 

def print_confidence_interval(conf_df, conf_level):
    print('By {}% chance, the following statistics will fall within the range of:\n'.format(round(conf_level * 100, 1)))
    
    margin, upper, lower = calc_bounds(conf_level)
    
    upper_str = 'P' + str(round(upper * 100, 1))
    lower_str = 'P' + str(round(lower * 100, 1))
    
    for stat in conf_df.T.columns:
        lower_bound = round(conf_df[lower_str].T[stat], 1)
        upper_bound = round(conf_df[upper_str].T[stat], 1)

        mean = round(conf_df['P50'].T[stat], 1)
        print("{0:<10}: {1:>10}  ~ {2:>10} , AVG = {3:>5}".format(stat, lower_bound, upper_bound, mean))

1.A.1 Sample Data Description¶

100 samples of US male height data is provided in my Github Repo - sample_data/US_Male_Height.csv. Summary statistics of the sample data can be calculated. Your goal is to calculate the confidence intervals for the summary stats.

In [7]:

# height data

height_data = pd.read_csv('sample_data/US_Male_Height.csv')
height_data.index = ['Male ' + str(i + 1) for i in range(height_data.shape[0])]
height_data.round(1).T

Out[7]:

	Male 1	Male 2	Male 3	Male 4	Male 5	Male 6	Male 7	Male 8	Male 9	Male 10	...	Male 91	Male 92	Male 93	Male 94	Male 95	Male 96	Male 97	Male 98	Male 99	Male 100
Height (in)	70.8	72.8	72.5	67.3	72.7	73.6	65.0	67.1	70.8	70.6	...	71.7	66.4	72.9	74.5	73.5	70.5	73.1	63.6	68.7	73.0

1 rows × 100 columns

In [8]:

height_summary_stats = calc_sum_stats(height_data)
height_summary_stats

Out[8]:

	Height (in)
mean	69.881971
std	3.169548
min	63.143732
max	77.762886
median	69.894434
skew	-0.059779
kurtosis	-0.700743
IQR	5.154145

Visualization

In [9]:

fig, ax = plt.subplots(figsize=(8, 4))
ax.set_xlabel('Height (inches)');
fig.suptitle('Original Sample Data Distribution: Gaussian Distribution')

visualize_distribution(height_data, ax);

Based on the distribution plot of the original sample data, we can observe that the distribution indeed looks Gaussian. However, the fact that it looks like Gaussian does not matter at all when Bootstrapping, because Bootstrapping does not assume anything about the distribution.

1.A.2 Resampling From the Sample Data¶

Each Bootstrap resampling (realization) can be done in one-line with numpy.random.choice(). Each realization is an array of size N, where N is the length of the original sample data. There are M number of realizations, where M is an arbitrary number of your choice.

Results

In [10]:

M = 100                                                 # number of realizations - arbitrary
bootstrap_data = bootstrap_simulation(height_data, M)
bootstrap_data.round(1).head(10)

Out[10]:

	Real 1	Real 2	Real 3	Real 4	Real 5	Real 6	Real 7	Real 8	Real 9	Real 10	...	Real 91	Real 92	Real 93	Real 94	Real 95	Real 96	Real 97	Real 98	Real 99	Real 100
0	72.7	67.9	65.0	69.2	70.4	64.9	70.3	66.3	67.6	72.2	...	74.1	66.3	73.0	68.5	65.3	72.5	72.7	69.2	66.4	72.3
1	68.4	70.5	65.3	64.5	71.8	69.2	70.5	71.6	65.3	67.9	...	72.5	68.6	66.1	71.7	67.3	74.1	67.9	71.3	72.9	65.0
2	71.3	74.1	72.8	72.9	68.3	67.9	73.1	65.0	73.6	72.8	...	69.5	72.5	72.5	73.3	69.2	74.1	73.0	65.5	67.9	63.1
3	72.5	73.0	71.4	68.5	73.3	70.5	70.5	70.6	68.5	69.0	...	64.5	69.2	66.0	69.5	72.5	70.3	67.9	68.3	73.6	73.5
4	67.2	73.5	73.6	67.2	64.5	72.9	72.8	66.4	69.2	66.8	...	66.3	73.6	71.3	73.1	71.6	72.2	64.9	69.0	71.7	70.2
5	69.1	66.0	65.5	69.1	71.7	70.6	66.0	73.0	72.2	69.9	...	73.0	66.3	69.0	67.9	69.4	69.9	69.5	68.7	72.4	67.3
6	72.2	68.5	72.9	63.1	73.6	73.1	70.8	75.5	69.9	70.6	...	68.4	65.0	68.5	68.8	67.2	72.2	65.5	70.6	72.2	66.8
7	73.1	72.5	69.0	72.5	71.6	68.7	73.5	66.2	71.6	74.4	...	73.6	68.5	72.2	73.1	72.3	72.1	66.8	77.8	72.8	69.5
8	67.3	69.5	74.5	66.8	69.1	65.0	69.9	70.6	65.0	73.1	...	67.1	72.8	70.5	70.8	73.6	72.5	71.8	67.9	67.2	70.6
9	66.3	72.9	65.5	72.5	72.4	70.6	73.1	69.5	67.2	68.7	...	71.6	69.2	65.0	71.3	71.4	69.1	73.3	70.2	66.1	67.6

10 rows × 100 columns

In [11]:

boot_sum_stats = calc_sum_stats(bootstrap_data)
boot_sum_stats.round(1)

Out[11]:

	Real 1	Real 2	Real 3	Real 4	Real 5	Real 6	Real 7	Real 8	Real 9	Real 10	...	Real 91	Real 92	Real 93	Real 94	Real 95	Real 96	Real 97	Real 98	Real 99	Real 100
mean	69.4	70.0	69.9	70.1	70.0	69.5	70.2	70.2	69.7	70.5	...	69.7	69.9	69.7	69.9	70.1	70.0	70.0	69.5	70.1	69.4
std	3.1	3.1	3.2	3.3	3.3	3.1	3.1	3.2	2.9	3.0	...	3.1	3.3	2.8	3.0	3.2	2.8	3.2	3.2	3.2	3.3
min	63.1	63.6	64.1	63.1	63.1	63.1	63.1	63.1	63.6	63.1	...	63.1	63.1	63.1	63.6	63.6	64.5	64.1	63.1	63.1	63.1
max	75.5	77.8	76.1	77.8	76.1	76.1	76.1	77.8	77.8	77.8	...	75.5	77.8	77.8	75.5	77.8	76.1	77.8	77.8	77.8	76.1
median	69.1	70.3	70.5	70.4	70.9	69.2	70.5	70.6	69.5	70.7	...	69.9	69.5	69.5	69.9	69.9	69.9	70.4	69.2	70.6	69.1
skew	-0.1	-0.0	-0.2	-0.0	-0.4	-0.0	-0.3	-0.3	0.2	-0.2	...	-0.4	-0.0	-0.0	-0.3	-0.1	0.1	0.2	-0.1	-0.2	-0.0
kurtosis	-1.1	-0.8	-1.1	-0.4	-0.9	-0.7	-0.9	-0.6	-0.4	-0.5	...	-0.9	-0.7	-0.3	-0.9	-0.8	-0.9	-0.6	-0.7	-0.8	-1.0
IQR	5.4	5.2	5.9	4.2	5.2	4.9	5.5	5.0	4.9	4.2	...	5.2	5.4	4.9	4.7	4.8	4.4	5.1	4.8	5.3	5.7

8 rows × 100 columns

Visualize

In [13]:

fig, ax = plt.subplots(figsize=(8, 4))
ax.set_xlabel('Height (inches)');
fig.suptitle('Distribution of Bootstrap-Simulated Data: Gaussian')
visualize_distribution(bootstrap_data, ax);

Each line in the plot represents one Bootstrap realization. There are 100 realizations, each having 100 random samples.

1.A.3 Uncertainty Models in Summary Statistics with Blox Plots¶

In [16]:

f = plt.figure()
plt.suptitle('Uncertainty Models for Various Statistics: US Male Height - Gaussian')

gs = gridspec.GridSpec(2, 4)
ax1 = plt.subplot(gs[0, 0:4])
ax2 = plt.subplot(gs[1, 0])
ax3 = plt.subplot(gs[1, 1])
ax4 = plt.subplot(gs[1, 2])
ax5 = plt.subplot(gs[1, 3])

boot_sum_stats.T[['mean', 'min', 'max', 'median']].boxplot(ax=ax1)
boot_sum_stats.T[['std']].boxplot(ax=ax2)
boot_sum_stats.T[['IQR']].boxplot(ax=ax3)
boot_sum_stats.T[['skew']].boxplot(ax=ax4)
boot_sum_stats.T[['kurtosis']].boxplot(ax=ax5)

ax5.set_ylim([-3, 3]);

1.A.4 Confidence Interval in Summary Statistics¶

Confidence intervals of summary statistics usually have a confidence level of 90%, 95%, or 99%. In this case, we will choose 95% confidence level.

In [17]:

confidence_level = 0.95

conf_int = calc_confidence_interval(boot_sum_stats, confidence_level)
conf_int.round(1)

Out[17]:

	P2.5	P50	P97.5
mean	69.4	69.9	70.5
std	2.8	3.2	3.4
min	63.1	63.1	64.5
max	75.4	77.8	77.8
median	69.1	70.0	71.1
skew	-0.5	-0.1	0.3
kurtosis	-1.1	-0.7	-0.1
IQR	4.0	4.9	5.8

In [18]:

print_confidence_interval(conf_int, confidence_level)

By 95.0% chance, the following statistics will fall within the range of:

mean      :       69.4  ~       70.5 , AVG =  69.9
std       :        2.8  ~        3.4 , AVG =   3.2
min       :       63.1  ~       64.5 , AVG =  63.1
max       :       75.4  ~       77.8 , AVG =  77.8
median    :       69.1  ~       71.1 , AVG =  70.0
skew      :       -0.5  ~        0.3 , AVG =  -0.1
kurtosis  :       -1.1  ~       -0.1 , AVG =  -0.7
IQR       :        4.0  ~        5.8 , AVG =   4.9

1.B. Confidence Intervals in Summary Stats: Rock Permeability - Lognormal Distribution¶

It was previously stated that Bootstrapping does not assume anything about the distribution. Is that really true? The previous example of the US Male Height distribution was a Gaussian distribution. But what if the distribution of our interest is not Gaussian?

In this example, rock pearmeability, which has a lognormal distribution, will be used to show that Bootstrap does not depend on the type of the distribution.

1.B.0. Bootstrap Scripts¶

The sample scripts used for US Male Height example will be used for Bootstrap simulation. Same scripts can be used for both Gaussian and lognormal distribution because Bootstrapping does not assume anything about the distribution.

1.B.1. Sample Data Description¶

105 samples of permeability data is provided in Github Repo - sample_data/PoroPermSampleData.xlsx. Permeability data is taken at many times at different depth of a wellbore.

Summary statistics of the sample data can be calculated. Your goal is to calculate the confidence intervals for the summary stats.

In [19]:

# permeability data
perm_depth_data = pd.read_excel('sample_data/PoroPermSampleData.xlsx', sheet_name='Sheet1')[['Depth', 'Permeability (mD)']]
perm_data = perm_depth_data['Permeability (mD)'].to_frame()

# visualize
fig = plt.figure()
ax = plt.axes()

ax.plot(perm_depth_data['Permeability (mD)'], perm_depth_data['Depth']);
ax.invert_yaxis()
ax.set_title('Permeability Along A Wellbore')
ax.set_xlabel('Permeability (mD)')
ax.set_ylabel('Depth (ft)');

In [20]:

perm_summary_stats = calc_sum_stats(perm_data)
perm_summary_stats

Out[20]:

	Permeability (mD)
mean	161.008972
std	80.900128
min	43.534147
max	573.461883
median	144.329837
skew	1.625086
kurtosis	5.498080
IQR	102.580432

Visualization

In [21]:

fig, ax = plt.subplots(figsize=(8, 4))
ax.set_xlabel('Permeability (mD)')
fig.suptitle('Original Sample Data Distribution: Lognormal Distribution')
visualize_distribution(perm_data, ax);

Based on the distribution of the original sample data, we can observe that the distribution looks lognormal. The uncertainty in summary statistics can be calculated using Bootstrap the same way it was done for the US Male Height (Gaussian) distribution, because Bootstrap does not depend on the shape of the distribution.

Warning!

Outlier removal on rock permeability cannot be done directly, as this is a lognormal distribution. Recall that the typical outlier removal method assumes the distribution to be Gaussian. If you want to detect outliers for non-Gaussian distributions, you have to first transform the distribution into Gaussian.

1.B.2 Resampling From the Sample Data¶

Results

In [22]:

M = 100                      # number of realizations - arbitrary
boot_perm_data = bootstrap_simulation(perm_data, M)
boot_perm_data.round(1).head(10)

Out[22]:

	Real 1	Real 2	Real 3	Real 4	Real 5	Real 6	Real 7	Real 8	Real 9	Real 10	...	Real 91	Real 92	Real 93	Real 94	Real 95	Real 96	Real 97	Real 98	Real 99	Real 100
0	61.9	258.6	138.8	61.9	285.3	156.1	179.7	125.9	58.8	89.6	...	151.0	227.4	59.1	573.5	258.6	56.3	240.6	89.6	132.7	182.6
1	170.5	61.0	143.7	214.1	264.2	244.1	144.7	160.5	83.9	258.6	...	58.8	279.8	244.1	92.1	213.7	160.5	240.6	146.0	141.8	138.8
2	143.7	86.8	117.6	92.7	83.9	104.0	187.9	138.8	162.3	132.1	...	258.4	380.1	89.6	89.6	123.3	77.5	102.7	193.1	133.2	234.5
3	166.9	151.0	240.6	265.5	183.1	65.6	59.1	305.1	103.5	131.6	...	214.9	128.9	210.8	108.6	193.1	125.9	77.5	151.5	112.3	58.8
4	161.0	146.0	89.6	84.9	129.1	43.5	170.5	97.7	190.9	197.8	...	56.3	85.0	53.4	79.4	58.8	92.7	102.7	190.9	126.3	161.0
5	104.0	132.1	129.1	144.4	184.8	263.5	151.0	170.5	162.9	311.6	...	61.0	156.1	170.5	264.2	244.1	85.0	112.3	117.6	224.4	265.5
6	305.1	77.5	213.7	84.9	240.6	58.8	224.4	234.5	128.9	193.1	...	66.9	138.8	240.6	66.9	166.9	84.7	305.1	80.4	53.4	264.2
7	286.8	171.4	92.1	84.9	116.9	245.7	141.8	135.8	206.6	116.9	...	162.3	244.1	187.9	151.0	84.9	85.0	573.5	170.5	83.3	117.6
8	142.7	187.9	131.6	117.6	244.1	214.1	182.6	134.7	132.7	132.7	...	123.3	104.0	65.6	86.8	84.9	193.1	56.3	136.9	156.1	311.6
9	58.8	132.1	380.1	136.9	65.6	244.1	134.7	77.8	321.1	79.4	...	128.9	83.9	182.6	132.0	117.6	234.5	227.4	187.9	80.4	138.8

10 rows × 100 columns

In [23]:

boot_perm_sum_stats = calc_sum_stats(boot_perm_data)
boot_perm_sum_stats.round(1)

Out[23]:

	Real 1	Real 2	Real 3	Real 4	Real 5	Real 6	Real 7	Real 8	Real 9	Real 10	...	Real 91	Real 92	Real 93	Real 94	Real 95	Real 96	Real 97	Real 98	Real 99	Real 100
mean	159.5	152.6	146.9	155.7	160.3	161.7	161.5	164.4	146.6	162.7	...	143.9	150.0	166.7	155.6	155.2	151.5	164.7	171.3	143.8	148.1
std	80.0	64.1	72.6	77.2	72.1	98.0	88.0	90.7	65.3	73.3	...	78.4	68.0	64.7	78.0	68.8	74.8	106.0	96.3	63.1	62.0
min	56.3	43.5	43.5	43.5	53.4	43.5	43.5	43.5	43.5	43.5	...	43.5	43.5	53.4	43.5	43.5	53.4	53.4	43.5	43.5	43.5
max	573.5	305.1	380.1	573.5	380.1	573.5	573.5	573.5	321.1	321.1	...	573.5	380.1	380.1	573.5	380.1	321.1	573.5	573.5	380.1	380.1
median	143.7	138.8	132.1	144.4	138.8	143.7	142.7	138.8	136.9	145.6	...	132.0	133.2	151.0	141.8	138.8	133.2	133.2	144.7	132.7	138.8
skew	1.8	0.6	1.1	1.7	0.7	1.6	2.2	2.0	0.6	0.4	...	2.1	0.8	0.6	1.8	0.7	0.6	1.9	2.2	0.9	0.8
kurtosis	6.1	-0.5	1.0	6.9	-0.3	4.3	7.7	6.4	-0.2	-0.9	...	8.1	0.5	0.3	6.7	0.2	-0.7	4.6	7.0	1.0	1.1
IQR	79.3	80.8	92.7	113.0	102.4	129.2	102.4	86.7	95.2	114.0	...	77.3	98.2	78.7	102.6	86.7	112.7	132.3	79.2	82.1	80.8

8 rows × 100 columns

Visualize

In [29]:

fig, ax = plt.subplots(figsize=(8, 4))
fig.suptitle('Distribution of Bootstrap-Simulated Data: Lognormal')
ax.set_xlabel('Permeability (mD)')
visualize_distribution(boot_perm_data, ax);

1.B.3 Uncertainty Models in Summary Statistics with Blox Plots¶

In [25]:

f = plt.figure()
plt.suptitle('Uncertainty Models for Various Statistics: Rock Permeability - Lognormal')

gs = gridspec.GridSpec(2, 4)
ax1 = plt.subplot(gs[0, 0:4])
ax2 = plt.subplot(gs[1, 0])
ax3 = plt.subplot(gs[1, 1])
ax4 = plt.subplot(gs[1, 2])
ax5 = plt.subplot(gs[1, 3])

boot_perm_sum_stats.T[['mean', 'min', 'max', 'median']].boxplot(ax=ax1)
boot_perm_sum_stats.T[['std']].boxplot(ax=ax2)
boot_perm_sum_stats.T[['IQR']].boxplot(ax=ax3)
boot_perm_sum_stats.T[['skew']].boxplot(ax=ax4)
boot_perm_sum_stats.T[['kurtosis']].boxplot(ax=ax5)

ax4.set_ylim([-3, 3])
ax5.set_ylim([-10, 10]);

Observe the positive skewness in the boxplot summary statistics. This is consistent with the left-justified lognormal distribution of the permeability plot.

1.B.4 Confidence Interval in Summary Statistics¶

Confidence intervals of summary statistics usually have a confidence level of 90%, 95%, or 99%. In this case, we will choose 90% confidence level.

In [26]:

confidence_level = 0.9

conf_int_perm = calc_confidence_interval(boot_perm_sum_stats, confidence_level)
conf_int_perm.round(1)

Out[26]:

	P5.0	P50	P95.0
mean	146.1	160.0	171.3
std	64.4	77.4	96.4
min	43.5	43.5	54.3
max	321.1	573.5	573.5
median	132.7	144.3	156.1
skew	0.4	1.4	2.2
kurtosis	-0.6	4.0	8.6
IQR	78.4	98.1	120.9

In [27]:

print_confidence_interval(conf_int_perm, confidence_level)

By 90.0% chance, the following statistics will fall within the range of:

mean      :      146.1  ~      171.3 , AVG = 160.0
std       :       64.4  ~       96.4 , AVG =  77.4
min       :       43.5  ~       54.3 , AVG =  43.5
max       :      321.1  ~      573.5 , AVG = 573.5
median    :      132.7  ~      156.1 , AVG = 144.3
skew      :        0.4  ~        2.2 , AVG =   1.4
kurtosis  :       -0.6  ~        8.6 , AVG =   4.0
IQR       :       78.4  ~      120.9 , AVG =  98.1

Statistics

Uncertainty Modeling with Monte-Carlo Simulation

How do casinos earn money? The answer is simple - the longer you play, the bigger the chance of you losing the money. Monte-Carlo simulation can construct its profit forecast model.

2019-01-03

9 min reading

Statistics

Transforming Non-Normal Distribution to Normal Distribution

Many statistical & machine learning techniques assume normality of data. What are the options you have if your data is not normally distributed? Transforming non-normal data to normal data using Box-Cox transformation is one of them.

2019-02-25

13 min reading

Statistics

Comprehensive Confidence Intervals for Python Developers

This post covers everything you need to know about confidence intervals: from the introductory conceptual explanations, to the detailed discussions about the variations of different techniques, their assumptions, strength and weekness, when to use, and when not to use.

2019-09-08

66 min reading

Machine Learning

Robust Linear Regressions In Python

In regression, managing outliers is key for accurate predictions. Techniques like OLS can be skewed by outliers. This analysis compares OLS, RANSAC, Huber, and Theil-Sen methods, showing how each deals with outliers differently, using theory and Python examples, to guide the best model choice for different outlier scenarios.

2023-11-15

10 min reading

Acknowledgement

Warning!

1.A. Confidence Intervals in Summary Stats: US Male Height - Gaussian Distribution¶

1.A.0. Bootstrap Scripts¶

1.A.1 Sample Data Description¶

1.A.2 Resampling From the Sample Data¶

1.A.3 Uncertainty Models in Summary Statistics with Blox Plots¶

1.A.4 Confidence Interval in Summary Statistics¶

1.B. Confidence Intervals in Summary Stats: Rock Permeability - Lognormal Distribution¶

1.B.0. Bootstrap Scripts¶

1.B.1. Sample Data Description¶

Warning!

1.B.2 Resampling From the Sample Data¶

1.B.3 Uncertainty Models in Summary Statistics with Blox Plots¶

1.B.4 Confidence Interval in Summary Statistics¶

Related Posts

Uncertainty Modeling with Monte-Carlo Simulation

Transforming Non-Normal Distribution to Normal Distribution

Comprehensive Confidence Intervals for Python Developers

Robust Linear Regressions In Python