Data Insight blog

Jun 8, 2015

Spark & Python Notebooks IV: Logistic Regression & Model Selection

The third episode in our Spark series introduced the MLlib library and its Statistics and Exploratory Data Analysis capabilities. This fourth notebook shows how to use MLlib to build a Logistic Regression classifier on a large dataset. It also describes two different approaches to model selection.

Jun 1, 2015

Spark & Python Notebooks III: Statistics and EDA

Episodes one and two in our Spark series covered how to work with RDDs. The third introduces Spark’s MLlib library for machine learning, starting with its Statistics and Exploratory Data Analysis capabilities.

May 28, 2015

Spark & Python Notebooks II: key/value RDDs

Previously, we introduced the basics of working with Spark RDDs in Python. In this new notebook, we deal with data aggregations and key/value pair RDDs.

May 12, 2015

Spark & Python Notebooks I: the basics

This is a collection of IPython notebooks intended to teach the reader Spark concepts, from basic to advanced, using the Python language.