Machine Learning

Multiple Linear Regression and Visualization in Python

Data scientists love linear regression for its simplicity. Strengthen your understanding of linear regression in multi-dimensional space through 3D visualization of linear models. This post comes with detailed scikit-learn code snippets for multiple linear regression.

10 min reading

Comprehensive Confidence Intervals for Python Developers

This post covers everything you need to know about confidence intervals: from the introductory conceptual explanations, to the detailed discussions about the variations of different techniques, their assumptions, strength and weekness, when to use, and when not to use.

66 min reading
Natural Language Processing

Optimize Computational Efficiency of Skip-Gram with Negative Sampling

When training your NLP model with Skip-Gram, the very large size of vocabs imposes high computational cost on your machine. Since the original Skip-Gram model is unable to handle this high cost, we use an alternative, called Negative Sampling.

22 min reading
Natural Language Processing

Demystifying Neural Network in Skip-Gram Language Modeling

The past couple of years, neural networks in Word2Vec have nearly taken over the field of NLP, thanks to their state-of-art performance. But how much do you understand about the algorithm behind it? This post will crack the secrets behind neural net in Word2Vec.

20 min reading
Natural Language Processing

Understanding Multi-Dimensionality in Vector Space Modeling

How does word vectors in Natural Language Processing capture meaningful relationships among words? How can you quantify those relationships? Addressing these questions starts from understanding the multi-dimensional nature of NLP applications.

18 min reading

Transforming Non-Normal Distribution to Normal Distribution

Many statistical & machine learning techniques assume normality of data. What are the options you have if your data is not normally distributed? Transforming non-normal data to normal data using Box-Cox transformation is one of them.

13 min reading

Spatial Simulation 1: Basics of Variograms

Tobler's first law of geography states that "everything is related to everything else, but near things are more related than distant things." Variogram shows the correlation between two spatial data points over distances.

10 min reading

Parse PDF Files While Retaining Structure with Tabula-py

If you ever tried to do anything with data provided to you in PDFs, you know how painful it is — it's hard to copy-and-paste rows of data out of PDF files. Try tabula-py to extract data into a CSV or Excel spreadsheet using a simple, easy-to-use interface.

10 min reading

Creating a Jupyter Notebook-Powered Data Science Blog with Pelican

Are you interested in hosting your own data science blog powered by Jupyter Notebook like this blog? Take a look at Aegis-Jupyter theme I made with Pelican. The set of codes that runs this blog is open-source, available on my Github Repo.

7 min reading

Non-Parametric Confidence Interval with Bootstrap

Bootstrapping is a type of non-parametric re-sampling method used for statistical & machine learning techniques. One application of bootstrapping is that it can compute confidence intervals of any distribution, because it's distribution-free.

7 min reading

Uncertainty Modeling with Monte-Carlo Simulation

How do casinos earn money? The answer is simple - the longer you play, the bigger the chance of you losing the money. Monte-Carlo simulation can construct its profit forecast model.

9 min reading