A look at the world wine market using Python, Pandas, and Seaborn

In this article we look at current wine market prices by region and appellation, as seen through the Wine.com website catalog. We will use Python-based libraries such as Pandas and Seaborn.

Exploring geographical data with SparkR and ggplot2

This analysis uses SparkR’s ability to process large datasets to explore the 2013 American Community Survey dataset, specifically its geographical features. To do so, we will aggregate data using the different tools introduced in the SparkR documentation and our series of notebooks, and then use ggplot2 mapping capabilities to put the different aggregations into a geographical context.

Linear Models with SparkR 1.5: uses and present limitations

In this analysis we will use SparkR machine learning capabilities in order to try to predict property value in relation to other variables in the 2013 American Community Survey dataset. You can also check the associated Jupyter notebook. By doing so we will show the current limitations of SparkR’s MLlib and also those of linear methods as a predictive method, no matter how much data we have.

A visual on tuberculosis evolution using Python and Bokeh

In this second approach to the world situation of infectious tuberculosis from 1990 to 2007, we want to make a point about how a simple visual representation of tabular data, a Bokeh heatmap in this case, can provide a lot of information that, although it is already there in the tabular data, might be more difficult to perceive.

Building data products with Python

The following is a repository containing the code for a wine reviews and recommendations web application, in different stages as git tags. The idea is that you can follow the tutorials through the tags listed below, and learn the different concepts explained in them. We will use Python technologies such as Django, Pandas, or Scikit-learn. The tutorials also include instructions on how to deploy the web application using a Koding account.

An OnLine Spectral Search ENgine using Python with Spark, Flask, and AngularJS

Our engine provides a RESTful-like API to perform on-line spectral search for proteomics spectral data. It is based on the SpectraST algorithm for spectral search and uses PRIDE Cluster spectral libraries. It also features an AngularJS web user interface.

World differences in infectious tuberculosis prevalence 1990-2007

In this first approach to the world situation regarding infectious tuberculosis we want to have a look at how different countries have been affected by the disease in the period from 1990 to 2007. By doing so we want to better understand different trends in the prevalence of this important disease. Which countries are getting better and worse? Are there more or less clear groups of countries based on how much they are affected and how their situation is changing?

A web-based Sentiment Classifier using R and Shiny

The purpose of many data science projects is to end up with a model that can be used within an organisation to solve a particular problem. If this is our case, we need to determine the right representation of that model so it can be shared in the easiest, cheapest, and most effective way. Web data products are an ideal vehicle for delivering machine learning models. The Web can be accessed almost everywhere and by multiple users. Moreover, the typical web application deployment cycle allows us to do easy updates.

A scalable on-line movie recommender using Spark and Flask

This Apache Spark tutorial will guide you step by step through using the MovieLens dataset to build a movie recommendations web service using collaborative filtering with Spark’s Alternating Least Squares implementation and Python/Flask.

Data Science Engineering, your way

Today we just made public a series of tutorials on Data Science Engineering. In them we will try to compare how different concepts in the discipline can be implemented in the two dominant ecosystems nowadays: R and Python.

Spark & Python Notebooks VI: SQL & Dataframes

The fifth episode in our Spark series introduced Decision Trees with MLlib. This new notebook moves away from MLlib for a while in order to introduce SparkSQL and the concept of the DataFrame, which will speed up our analysis and make it easier to communicate.

Spark & Python Notebooks V: Decision Trees & Model Selection

The fourth episode in our Spark series introduced Logistic Regression with MLlib. This new notebook explains how to use the library to build a classifier using Decision Trees on a large dataset. It also shows how powerful trees are for understanding our data and even for performing model selection.

Spark & Python Notebooks IV: Logistic Regression & Model Selection

The third episode in our Spark series introduced the MLlib library and its Statistics and Exploratory Data Analysis capabilities. This fourth notebook explains how to use the library to build a classifier using Logistic Regression on a large dataset. It also describes two different approaches to model selection.

Spark & Python Notebooks III: Statistics and EDA

Episodes one and two in our Spark series introduced how to work with RDDs. The third one introduces Spark’s MLlib library for machine learning, starting with its Statistics and Exploratory Data Analysis capabilities.

Spark & Python Notebooks II: key/value RDDs

Previously, we introduced the basics of working with Spark RDDs in Python. In this new notebook, we deal with data aggregations and key/value pair RDDs.
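
For readers without a Spark installation at hand, the kind of per-key aggregation this notebook deals with can be sketched in plain Python. The `reduce_by_key` helper and the sample records below are hypothetical; they only mimic the result of a Spark `reduceByKey` call on a tiny in-memory list and are not taken from the notebook.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Combine all values that share a key using a binary function.

    Hypothetical helper that mirrors the result of Spark's reduceByKey
    on a small in-memory list of (key, value) pairs.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Made-up (protocol, duration) records, for illustration only.
records = [("tcp", 10), ("udp", 3), ("tcp", 5), ("icmp", 1), ("udp", 4)]
totals = reduce_by_key(records, lambda a, b: a + b)
```

In Spark the same idea would be applied to a distributed RDD of pairs, with the combining function run per partition before results are merged.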

Spark & Python Notebooks I: the basics

This is a collection of IPython notebooks intended to train the reader on different Spark concepts, from basic to advanced, by using the Python language.

Ridge regression model selection with R

We recently used best subset selection as a way of reducing unnecessary model complexity; this time we are going to use the Ridge regression technique.
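
As a rough illustration of the technique (written in Python with NumPy rather than R, and on made-up data), ridge regression adds a penalty `lam` on the squared size of the coefficients, which leads to the closed-form estimate used below. The function name `ridge_fit` and the synthetic data are assumptions for this sketch, and the intercept is left out for brevity.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam * I) beta = X'y (no intercept)."""
    n_features = X.shape[1]
    gram = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)


# Synthetic data: only the first of three predictors drives the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

beta_ols = ridge_fit(X, y, lam=0.0)      # lam = 0 reduces to ordinary least squares
beta_ridge = ridge_fit(X, y, lam=10.0)   # a positive lam shrinks the coefficients
```

Comparing `beta_ols` with `beta_ridge` shows the shrinkage effect: the larger `lam` is, the smaller the overall size of the coefficient vector.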

Best subset model selection with R

Linear regression models are easy to fit and interpret. Moreover, they are surprisingly accurate in many real world situations where the relationship between the response and the predictors is approximately linear. However, it is often the case that not all the variables used in a multiple regression model are in fact associated with the response.
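
The idea of searching for the predictors that really matter can be sketched as follows (in Python with NumPy rather than R, on synthetic data). The `best_subset` function is a hypothetical helper: it fits ordinary least squares on every combination of a given number of columns and keeps the one with the lowest residual sum of squares.

```python
from itertools import combinations

import numpy as np


def best_subset(X, y, size):
    """Return the column indices of the `size`-predictor model with lowest RSS."""
    best_cols, best_rss = None, np.inf
    for cols in combinations(range(X.shape[1]), size):
        # Design matrix: intercept column plus the candidate predictors.
        design = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = float(np.sum((y - design @ coef) ** 2))
        if rss < best_rss:
            best_cols, best_rss = cols, rss
    return best_cols, best_rss


# Synthetic data: only columns 0 and 2 are associated with the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=100)

cols, rss = best_subset(X, y, size=2)
```

The exhaustive search grows quickly with the number of predictors, so this approach is only practical for a modest number of candidate variables.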