Data Insight blog

Oct 19, 2015

A look at the world wine market using Python, Pandas, and Seaborn

In this article we look at current wine market prices by region and appellation, as listed in the Wine.com website catalog. We use Python libraries such as Pandas and Seaborn.

Sep 23, 2015

A visual on tuberculosis evolution using Python and Bokeh

In this second look at the world situation of infectious tuberculosis from 1990 to 2007, we want to show how a simple visual representation of tabular data, a Bokeh heatmap in this case, can reveal a lot of information that, although already present in the tabular data, might be more difficult to perceive.

Sep 20, 2015

Building data products with Python

The following is a repository containing the code for a wine reviews and recommendations web application, with its different stages marked as git tags. The idea is that you can follow the tutorials through the tags listed below and learn the different concepts explained in them. We will use Python technologies such as Django, Pandas, and Scikit-learn. The tutorials also include instructions on how to deploy the web application using a Koding account.

Sep 14, 2015

An OnLine Spectral Search ENgine using Python with Spark, Flask, and AngularJS

Our engine provides a RESTful-like API to perform on-line spectral search for proteomics spectral data. It is based on the SpectraST algorithm for spectral search and uses PRIDE Cluster spectral libraries. It also features an AngularJS web user interface.

Sep 3, 2015

World differences in infectious tuberculosis prevalence 1990-2007

In this first look at the world situation regarding infectious tuberculosis, we want to see how different countries were affected by the disease in the period from 1990 to 2007. By doing so we want to better understand different trends in the prevalence of this important disease. Which countries are getting better and which are getting worse? Are there more or less clear groups of countries based on how much they are affected and how their situation is changing?

Aug 29, 2015

A scalable on-line movie recommender using Spark and Flask

This Apache Spark tutorial will guide you step by step through using the MovieLens dataset to build a movie recommendations web service using collaborative filtering with Spark’s Alternating Least Squares implementation and Python/Flask.

Aug 19, 2015

Data Science Engineering, your way

We have just made public a series of tutorials on Data Science Engineering. In them we will try to compare how different concepts in the discipline can be implemented in today's two dominant ecosystems: R and Python.

Jul 1, 2015

Spark & Python Notebooks VI: SQL & Dataframes

The fifth episode in our Spark series introduced Decision Trees with MLlib. This new notebook moves away from MLlib for a while in order to introduce SparkSQL and the concept of the DataFrame, which will speed up our analysis and make it easier to communicate.

Jun 15, 2015

Spark & Python Notebooks V: Decision Trees & Model Selection

The fourth episode in our Spark series introduced Logistic Regression with MLlib. This new notebook explains how to use the library to build a classifier using Decision Trees on a large dataset. It also shows how powerful trees are for understanding our data and even for performing model selection.

Jun 8, 2015

Spark & Python Notebooks IV: Logistic Regression & Model Selection

The third episode in our Spark series introduced the MLlib library and its Statistics and Exploratory Data Analysis capabilities. This fourth notebook explains how to use the library to build a classifier using Logistic Regression on a large dataset. It also describes two different approaches to model selection.

Jun 1, 2015

Spark & Python Notebooks III: Statistics and EDA

Episodes one and two in our Spark series introduced how to work with RDDs. The third one introduces Spark’s MLlib library for machine learning, starting with its Statistics and Exploratory Data Analysis capabilities.

May 28, 2015

Spark & Python Notebooks II: key/value RDDs

Previously, we introduced the basics of working with Spark RDDs in Python. In this new notebook, we deal with data aggregations and key/value pair RDDs.

May 12, 2015

Spark & Python Notebooks I: the basics

This is a collection of IPython notebooks intended to train the reader on different Spark concepts, from basic to advanced, by using the Python language.

Dec 13, 2014

Scoring using the Vector Space Model

Previously we discussed tf-idf as a way to calculate how relevant a search term is given a set of indexed documents. For searches with multiple terms, we used the overlap score measure, which is the sum of the tf-idf values of each term in the given input. A more general and flexible way of scoring multi-term searches is to use the vector space model.
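
As a rough illustration of the idea, and not the code from the post itself, the sketch below represents each document and the query as a vector of tf-idf weights and ranks documents by cosine similarity. The function names (`tf_idf_vectors`, `cosine`, `rank`), the whitespace tokenizer, and the plain `log(N / df)` weighting are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return ({doc_id: {term: weight}}, {term: idf}) for a dict of doc_id -> text."""
    counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    n_docs = len(counts)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(n_docs / k) for term, k in df.items()}
    vectors = {
        d: {term: tf * idf[term] for term, tf in c.items()}
        for d, c in counts.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Score every document against a multi-term query, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    q_counts = Counter(query.lower().split())
    # Query terms unseen in the collection get weight 0 (an assumption).
    q_vec = {term: tf * idf.get(term, 0.0) for term, tf in q_counts.items()}
    scores = {d: cosine(q_vec, vec) for d, vec in vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Unlike the overlap score, the cosine measure normalizes by vector length, so a long document does not win simply because it contains more words.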

Nov 16, 2014

Term Frequency - Inverse Document Frequency 101

Here we present tf-idf, a basic and beautiful Information Retrieval concept. To do so, we will use Python to define a basic in-memory “search engine” that will allow us to add documents and search for them. The search results will contain the relevant documents together with their tf-idf values.
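
To give a flavour of what such an engine might look like, here is a minimal sketch. It is not the code from the post: the class name `TinySearchEngine`, its methods, the whitespace tokenizer, and the raw-count `tf * log(N / df)` formula are assumptions chosen for illustration.

```python
import math
from collections import Counter


class TinySearchEngine:
    """Hypothetical in-memory index mapping documents to term counts."""

    def __init__(self):
        self.docs = {}  # doc_id -> Counter of term frequencies

    def add(self, doc_id, text):
        """Index a document using a naive lowercase whitespace tokenizer."""
        self.docs[doc_id] = Counter(text.lower().split())

    def idf(self, term):
        """Inverse document frequency: log(N / number of docs containing term)."""
        df = sum(1 for counts in self.docs.values() if term in counts)
        return math.log(len(self.docs) / df) if df else 0.0

    def tf_idf(self, term, doc_id):
        """Raw term count in the document multiplied by the term's idf."""
        return self.docs[doc_id][term] * self.idf(term)

    def search(self, term):
        """Return (doc_id, tf-idf) pairs for matching documents, best first."""
        term = term.lower()
        hits = [
            (doc_id, self.tf_idf(term, doc_id))
            for doc_id, counts in self.docs.items()
            if term in counts
        ]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

A term that appears in every document gets an idf of zero under this formula, which is the intended behaviour: such a term does nothing to distinguish one document from another.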