Data Insight blog

Oct 2, 2015

Exploring geographical data with SparkR and ggplot2

This analysis uses SparkR to explore the 2013 American Community Survey dataset, focusing on its geographical features. We aggregate the data using the tools introduced in the SparkR documentation and in our series of notebooks, then use ggplot2’s mapping capabilities to put those aggregations into a geographical context.

Sep 28, 2015

Linear Models with SparkR 1.5: uses and present limitations

In this analysis we use SparkR’s machine learning capabilities to try to predict property value from other variables in the 2013 American Community Survey dataset. You can also check the associated Jupyter notebook. Along the way we show the current limitations of SparkR’s MLlib, and those of linear models as a predictive method, no matter how much data we have.

Sep 14, 2015

An OnLine Spectral Search ENgine using Python with Spark, Flask, and AngularJS

Our engine provides a RESTful-like API to perform on-line spectral search for proteomics spectral data. It is based on the SpectraST algorithm for spectral search and uses PRIDE Cluster spectral libraries. It also features an AngularJS web user interface.

Aug 29, 2015

A scalable on-line movie recommender using Spark and Flask

This Apache Spark tutorial guides you step by step through using the MovieLens dataset to build a movie recommendation web service, using collaborative filtering with Spark’s Alternating Least Squares implementation and Python/Flask.

Jul 1, 2015

Spark & Python Notebooks VI: SQL & Dataframes

The fifth episode in our Spark series introduced Decision Trees with MLlib. This new notebook moves away from MLlib for a while in order to introduce SparkSQL and the concept of the DataFrame, which will speed up our analysis and make it easier to communicate.

Jun 15, 2015

Spark & Python Notebooks V: Decision Trees & Model Selection

The fourth episode in our Spark series introduced Logistic Regression with MLlib. This new notebook explains how to use the library to build a classifier using Decision Trees on a large dataset. It also shows how useful trees are for understanding our data and even for performing model selection.

Jun 8, 2015

Spark & Python Notebooks IV: Logistic Regression & Model Selection

The third episode in our Spark series introduced the MLlib library and its Statistics and Exploratory Data Analysis capabilities. This fourth notebook explains how to use the library to build a classifier using Logistic Regression on a large dataset. It also describes two different approaches to model selection.

Jun 1, 2015