Eric Kim

Data Scientist, Petroleum Engineer

Machine Learning

Robust Linear Regressions In Python

In regression, managing outliers is key to accurate predictions. Ordinary least squares (OLS) can be skewed by outliers. This post compares OLS, RANSAC, Huber, and Theil-Sen regression, using theory and Python examples to show how each method handles outliers and to guide model choice for different outlier scenarios.

2023-11-15
10 min reading
Oil and Gas

Quantifying The Impact Of Completion Delay Since Drilled On Production

In times of unfavorable commodity prices, operators may delay completion after drilling in the hope of a price recovery. The study conducted in this article shows why this may not be a financially sound idea for certain basins by quantifying the impact of DUC time on normalized EURs.

2023-03-03
26 min reading
Oil and Gas

DUC Wells and Their Economic Complications

During the COVID-19 pandemic, operators opted not to drill new wells and instead completed their existing drilled-but-uncompleted (DUC) wells to meet demand while conserving cash. This post explains the concept of DUC wells and their economic impact on the US energy industry.

2023-02-09
11 min reading
Machine Learning

Multiple Linear Regression and Visualization in Python

Data scientists love linear regression for its simplicity. Strengthen your understanding of linear regression in multi-dimensional space through 3D visualization of linear models. This post comes with detailed scikit-learn code snippets for multiple linear regression.

2019-11-18
10 min reading
Statistics

Comprehensive Confidence Intervals for Python Developers

This post covers everything you need to know about confidence intervals: from introductory conceptual explanations to detailed discussions of the different techniques, their assumptions, their strengths and weaknesses, and when to use or avoid each.

2019-09-08
66 min reading
Natural Language Processing

Optimize Computational Efficiency of Skip-Gram with Negative Sampling

When training an NLP model with Skip-Gram, a very large vocabulary imposes a high computational cost on your machine. Since the original Skip-Gram model cannot handle this cost efficiently, we use an alternative called Negative Sampling.

2019-05-26
22 min reading
Natural Language Processing

Demystifying Neural Network in Skip-Gram Language Modeling

Over the past couple of years, neural networks in Word2Vec have nearly taken over the field of NLP, thanks to their state-of-the-art performance. But how much do you understand about the algorithm behind them? This post explains how the neural network in Word2Vec works.

2019-05-06
20 min reading
Natural Language Processing

Understanding Multi-Dimensionality in Vector Space Modeling

How do word vectors in Natural Language Processing capture meaningful relationships among words? How can you quantify those relationships? Answering these questions starts with understanding the multi-dimensional nature of NLP applications.

2019-04-16
18 min reading
Statistics

Transforming Non-Normal Distribution to Normal Distribution

Many statistical and machine learning techniques assume that the data is normally distributed. What options do you have if your data is not? Transforming non-normal data toward normality with the Box-Cox transformation is one of them.

2019-02-25
13 min reading
Others

Parse PDF Files While Retaining Structure with Tabula-py

If you have ever tried to do anything with data provided to you in PDFs, you know how painful it is: it's hard to copy and paste rows of data out of PDF files. Try tabula-py to extract data into a CSV or Excel spreadsheet using a simple, easy-to-use interface.

2019-02-02
10 min reading
Others

Creating a Jupyter Notebook-Powered Data Science Blog with Pelican

Are you interested in hosting your own data science blog powered by Jupyter Notebook, like this one? Take a look at the Aegis-Jupyter theme I made with Pelican. The code that runs this blog is open source and available in my GitHub repo.

2019-01-27
7 min reading
Statistics

Non-Parametric Confidence Interval with Bootstrap

Bootstrapping is a non-parametric resampling method used in statistics and machine learning. One application is computing confidence intervals without assuming that the data follows any particular distribution, because the method is distribution-free.

2019-01-04
7 min reading
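
As a rough illustration of the bootstrap idea summarized above, here is a minimal sketch of a percentile bootstrap confidence interval using only the Python standard library. It is not taken from the post itself: the function name `bootstrap_ci`, its parameters, and the sample values are made up for illustration.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic.

    Hypothetical helper for illustration; not the post's implementation.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Read the interval off the empirical percentiles of the estimates.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up sample values, used only to exercise the function.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 6.1, 3.9, 4.4]
low, high = bootstrap_ci(sample)
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```

No normality assumption appears anywhere in the code: the interval comes entirely from resampling the observed data, which is what "distribution-free" refers to. Passing `stat=statistics.median` would give an interval for the median instead.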
Statistics

Uncertainty Modeling with Monte-Carlo Simulation

How do casinos earn money? The answer is simple: the longer you play, the greater your chance of losing money. Monte-Carlo simulation can be used to build a profit forecast model for a casino.

2019-01-03
9 min reading
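
The claim in the Monte-Carlo summary above, that longer play means a greater chance of ending up behind, can be checked with a small simulation. The sketch below is an assumption-laden illustration rather than the post's model: it assumes an even-money bet with a win probability of 18/38 (a red/black bet on an American roulette wheel), and the function names `simulate_player` and `fraction_losing` are hypothetical.

```python
import random


def simulate_player(n_bets, rng, win_prob=18 / 38, stake=1.0):
    """Net winnings of one player after n_bets even-money bets.

    win_prob=18/38 is an assumed house edge (American roulette, red/black).
    """
    return sum(stake if rng.random() < win_prob else -stake for _ in range(n_bets))


def fraction_losing(n_bets, n_players=1000, seed=0):
    """Fraction of simulated players who finish with a net loss."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    losers = sum(simulate_player(n_bets, rng) < 0 for _ in range(n_players))
    return losers / n_players


# Longer sessions should leave a larger share of players behind.
for n_bets in (10, 100, 1000):
    print(f"{n_bets:>5} bets: {fraction_losing(n_bets):.1%} of players are losing")
```

Each simulated player is one random trial; averaging over many players approximates the probability of being behind after a given number of bets, which is the basic Monte-Carlo recipe.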