The fifth episode in our Spark series introduced Decision Trees with MLlib. This new notebook moves away from MLlib for a while in order to introduce Spark SQL and the concept of DataFrames, which will speed up our analysis and make it easier to communicate.
Instructions
A good way of using these notebooks is by first cloning the GitHub repo, and then starting your own IPython notebook in pySpark mode. For example, if we have a standalone Spark installation running on localhost with a maximum of 6GB per node assigned to IPython:
MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.3.1-bin-hadoop2.6/bin/pyspark
Notice that the path to the pyspark command will depend on your specific installation. So, as a requirement, you need to have Spark installed on the same machine where you are going to start the IPython notebook server.
For more Spark options see here. In general, the rule is that an option described in the form spark.executor.memory is passed as SPARK_EXECUTOR_MEMORY when calling IPython/pySpark.
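For example, a hypothetical variant of the launch command above that also sets the driver memory (assuming your Spark version reads the SPARK_DRIVER_MEMORY environment variable, which corresponds to spark.driver.memory) would look like this:
MASTER="spark://127.0.0.1:7077" SPARK_DRIVER_MEMORY="2G" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.3.1-bin-hadoop2.6/bin/pyspark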
Datasets
We will be using datasets from the KDD Cup 1999.
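As a sketch of how the data can be fetched from within a notebook (assuming Python 2, which the Spark 1.3 pySpark shell uses, and the commonly used UCI mirror of the reduced 10-percent subset):

import urllib

# Download the 10-percent KDD Cup 1999 subset to the local working directory
urllib.urlretrieve(
    "http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz",
    "kddcup.data_10_percent.gz"
)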
Notebooks
The following notebooks can be examined individually, although there is a more or less linear ‘story’ when they are followed in sequence. Since they all use the same dataset, they try to solve a related set of tasks with it.
Spark SQL and Data Frames
In this notebook we infer a schema for our network interactions dataset. Based on that schema, we use Spark SQL’s DataFrame abstraction to perform a more structured exploratory data analysis.
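As a minimal sketch of the approach (assuming sc is the SparkContext provided by the notebook, the kddcup.data_10_percent.gz file is in the working directory, and the column names chosen here are illustrative), schema inference and a first SQL query could look like this:

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

# Load the raw data and split each comma-separated line into fields
raw_data = sc.textFile("kddcup.data_10_percent.gz")
csv_data = raw_data.map(lambda line: line.split(","))

# Build Row objects so Spark SQL can infer a schema from the first few fields
row_data = csv_data.map(lambda p: Row(
    duration=int(p[0]),
    protocol_type=p[1],
    service=p[2],
    flag=p[3],
    src_bytes=int(p[4]),
    dst_bytes=int(p[5])
))

# Create a DataFrame and register it as a temporary table for SQL queries
interactions_df = sqlContext.createDataFrame(row_data)
interactions_df.registerTempTable("interactions")

# Example query: long TCP interactions that sent no data back to the source
tcp_interactions = sqlContext.sql("""
    SELECT duration, dst_bytes FROM interactions
    WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0
""")
tcp_interactions.show()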
This is an ongoing project. New notebooks will be available soon. The best way to stay up to date is to watch our GitHub repo.