Spark & Python Notebooks IV: Logistic Regression & Model Selection
The third episode in our Spark series introduced the MLlib library and its Statistics and Exploratory Data Analysis capabilities. This fourth notebook explains how to use the library to build a classifier using Logistic Regression on a large dataset. It also describes two different approaches to model selection.
Spark & Python Notebooks III: Statistics and EDA
Episodes one and two in our Spark series showed how to work with RDDs. The third one introduces Spark’s MLlib library for machine learning, starting with its Statistics and Exploratory Data Analysis capabilities.
Spark & Python Notebooks II: key/value RDDs
Previously, we introduced the basics of working with Spark RDDs in Python. In this new notebook, we deal with data aggregations and key/value pair RDDs.
Spark & Python Notebooks I: the basics
This is a collection of IPython notebooks intended to teach the reader Spark concepts, from basic to advanced, using Python.