Data Science Engineering, your way

Today we published a series of tutorials on Data Science Engineering. In them we compare how different concepts in the discipline can be implemented in today's two dominant ecosystems: R and Python.

We will do this from a neutral point of view. Our opinion is that each environment has its strengths and weaknesses, and any data scientist should know how to use both in order to be as prepared as possible for the job market or for starting a personal project.

To get a feel for what is going on around this hot topic, we refer the reader to DataCamp’s Data Science War infographic. It explores the strengths of R over Python and vice versa, and aims to provide a basic comparison between the two programming languages from a data science and statistics perspective.

Far from repeating that comparison, our series of tutorials goes hands-on into how to actually perform different data science tasks, such as working with data frames, doing aggregations, or creating statistical models in the areas of supervised and unsupervised learning.

We will use real-world datasets, and we will build some real data products. This will help us to quickly transfer what we learn here to actual data analysis situations.

This is what we have so far, so keep an eye on the GitHub repository and get involved!

And if you are interested in Big Data products, you might also find our series of tutorials on using Apache Spark and Python interesting.

Tutorials

This is a growing list of tutorials explaining concepts and applications in Python and R.

Introduction to Data Frames

An introduction to the basic data structure and how to use it in Python/Pandas and R.

Exploratory Data Analysis

About this important task in any data science engineering project.

Dimensionality Reduction and Clustering

About using Principal Component Analysis and k-means Clustering to better represent and understand our data.

Text Mining and Sentiment Classification

How to use text mining techniques to analyse the positive or non-positive sentiment of text documents using just linear methods.

Applications

These are some of the applications we have built using the concepts explained in the tutorials.

A web-based Sentiment Classifier using R and Shiny

How to build a web application where we can upload text documents to be sentiment-analysed, using the R-based framework Shiny.

Building Data Products with Python

Using a wine reviews and recommendations website as a leitmotif, this series of tutorials, with its own separate repository tagged by lessons, digs into how to use Python technologies such as Django, Pandas, or Scikit-learn, in order to build data products.

Red Wine Quality Data analysis with R

Using R and ggplot2, we perform Exploratory Data Analysis of this reference dataset about wine quality.

Information Retrieval algorithms with Python

Where we show our own implementation of a couple of Information Retrieval algorithms: vector space model, and tf-idf.
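
To give a flavour of what these two algorithms involve, here is a minimal sketch in plain Python. It is not the implementation from the tutorial: the function names (tokenize, build_tfidf, cosine, rank), the whitespace tokenizer, the length-normalised term frequency, and the log(N / df) weighting are all assumptions chosen for brevity. Each document becomes a sparse vector of tf-idf weights, and documents are ranked against a query by cosine similarity in that vector space.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_tfidf(docs):
    """Return one sparse tf-idf vector (dict) per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # One common idf variant: log(N / df). Terms present in every
    # document get weight 0 and so do not influence the ranking.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {term: (freq / len(tokens)) * idf[term] for term, freq in tf.items()}
        )
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = build_tfidf(docs)
    q_tokens = tokenize(query)
    # Query terms never seen in the collection are ignored.
    q_tf = Counter(t for t in q_tokens if t in idf)
    q_vec = {t: (f / len(q_tokens)) * idf[t] for t, f in q_tf.items()}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    # Made-up toy documents, for illustration only.
    docs = [
        "red wine from spain",
        "white wine from france",
        "python data frames",
    ]
    for score, index in rank("red wine", docs):
        print(round(score, 3), docs[index])
```

Storing vectors as dictionaries keeps them sparse, which matters once the vocabulary grows; a real system would also want better tokenization, stop-word handling, and an inverted index rather than scoring every document.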