Exploring geographical data with SparkR and ggplot2

The present analysis will make use of SparkR’s power to analyse large datasets in order to explore the 2013 American Community Survey dataset, more concretely its geographical features. For that purpose, we will aggregate data using the different tools introduced in the SparkR documentation and our series of notebooks, and then use ggplot2 mapping capabilities to put the different aggregations into a geographical context.

Linear Models with SparkR 1.5: uses and present limitations

In this analysis we will use SparkR machine learning capabilities in order to try to predict property value in relation to other variables in the 2013 American Community Survey dataset. You can also check the associated Jupyter notebook. By doing so we will show the current limitations of SparkR’s MLlib and also those of linear methods as a predictive method, no matter how much data we have.

A web-based Sentiment Classifier using R and Shiny

The purpose of many data science projects is to end up with a model that can be used within an organisation to solve a particular problem. If this is our case, we need to determine the right representation of that model so it can be shared in the easiest, cheapest, and most effective way. Web data products are an ideal vehicle for delivering machine learning models. The Web can be accessed almost everywhere and by multiple users. Moreover, the typical web application deployment cycle allows us to do easy updates.

Data Science Engineering, your way

Today we just made public a series of tutorials on Data Science Engineering. In them we will try to compare how different concepts in the discipline can be implemented in the two dominant ecosystems nowadays: R and Python.

Ridge regression model selection with R

If recently we used best subset as a way of reducing the unnecessary model complexity, this time we are going to use the Ridge regression technique.

Best subset model selection with R

Linear regression models are easy to fit and interpret. Moreover, they are suprisingly accurate in many real world situations where the relationship between the response and the predictors is approximately linear. However, it is often the case that not all the variables used in a multiple regression model are in associated with the response.