A visual on tuberculosis evolution using Python and Bokeh

In this second look at the world situation of infectious tuberculosis from 1990 to 2007, we want to make a point about how a simple visual representation of tabular data, a Bokeh heat map in this case, can convey a lot of information that, although already present in the table, is much harder to perceive there.

This article is part of our Data Journalism with Python repository.

From Wikipedia, the free encyclopedia

Tuberculosis, MTB, or TB (short for tubercle bacillus), in the past also called phthisis, phthisis pulmonalis, or consumption, is a widespread, and in many cases fatal, infectious disease caused by various strains of mycobacteria, usually Mycobacterium tuberculosis. Tuberculosis typically attacks the lungs, but can also affect other parts of the body. It is spread through the air when people who have an active TB infection cough, sneeze, or otherwise transmit respiratory fluids through the air. Most infections do not have symptoms, known as latent tuberculosis. About one in ten latent infections eventually progresses to active disease which, if left untreated, kills more than 50% of those so infected.

For our visualisation we will use Bokeh, an interactive visualization library for Python that targets modern web browsers for presentation and also works well with Jupyter notebooks. In the words of its authors, Bokeh’s goal is to provide elegant, concise construction of novel graphics in the style of D3.js, but also deliver this capability with high-performance interactivity over very large or streaming datasets.

Loading data

From our previous notebook, we know how to download and get our data into a Pandas data frame.

import urllib.request

tb_existing_url_csv = 'https://docs.google.com/spreadsheets/d/1X5Jp7Q8pTs3KLJ5JBWKhncVACGsg5v4xu6badNs4C7I/pub?gid=0&output=csv'
local_tb_existing_file = 'tb_existing_100.csv'
# Download the CSV to a local file (in Python 2 this call was urllib.urlretrieve)
existing_f = urllib.request.urlretrieve(tb_existing_url_csv, local_tb_existing_file)

import pandas as pd

existing_df = pd.read_csv(local_tb_existing_file, index_col=0, thousands=',')
existing_df.index.names = ['country']
existing_df.columns.names = ['year']
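As a rough illustration of what the index_col=0 and thousands=',' arguments ask for, here is a minimal standard-library sketch that parses a tiny CSV by hand. It is not how pandas works internally. The sample text, the "Hypothetical Land" row, and the load_table helper are made up for this example; only the Afghanistan figures come from the table in this article.

```python
import csv
import io

# Made-up sample: the second data row is hypothetical and only exists to show
# values written with a thousands separator, which CSV files must quote.
SAMPLE_CSV = '''country,1990,1991,1992
Afghanistan,436,429,422
Hypothetical Land,"1,485","1,477",980
'''


def load_table(text):
    """Return (years, table) where table maps country -> {year: cases}."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    years = header[1:]  # every column after the first holds one year
    table = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        country, cells = row[0], row[1:]  # first column acts as the index
        # Drop the thousands separator before converting to an integer.
        table[country] = {
            year: int(cell.replace(",", "")) for year, cell in zip(years, cells)
        }
    return years, table


years, table = load_table(SAMPLE_CSV)
print(table["Afghanistan"]["1990"])
print(table["Hypothetical Land"]["1991"])
```

In the notebook itself pd.read_csv does all of this in one call and also gives us the labelled index and columns we rename right after loading.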

And we already know what our tabular data looks like.

existing_df.head()

year	1990	1991	1992	1993	1994	1995	1996	1997	1998	1999	2000	2001	2002	2003	2004	2005	2006	2007
country
Afghanistan	436	429	422	415	407	397	397	387	374	373	346	326	304	308	283	267	251	238
Albania	42	40	41	42	42	43	42	44	43	42	40	34	32	32	29	29	26	22
Algeria	45	44	44	43	43	42	43	44	45	46	48	49	50	51	52	53	55	56
American Samoa	42	14	4	18	17	22	0	25	12	8	8	6	5	6	9	11	9	5
Andorra	39	37	35	33	32	30	28	23	24	22	20	20	21	18	19	18	17	19

The Gapminder website presents itself as a fact-based worldview. It is a comprehensive resource for indicators on different countries and territories. For this notebook we will again use a dataset of estimated prevalence (existing cases) per 100K, which comes from the World Health Organization (WHO). We invite the reader to repeat the process with the new cases and deaths datasets and share the results. Our data covers 207 countries, with the number of cases per 100K for each year from 1990 to 2007.

A visual representation of our data using Bokeh

The first thing we need to do is import the Bokeh library as follows.

import bokeh

You will probably need to install it first; see the Bokeh installation instructions.

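These are some other imports we will use. Note that bokeh.charts belongs to the older Bokeh releases this notebook was written for; later releases removed it from the core library.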

from bokeh.charts import HeatMap, show, output_notebook, output_file
from bokeh.palettes import YlOrRd9 as palette

When working with IPython/Jupyter notebooks, we generate output by using output_notebook as follows.

output_notebook()

We can also, or in addition, write the output to an HTML page as follows.

output_file("tuberculosis_heatmap.html")

And finally we create a HeatMap object as follows, using our existing_df data frame and setting dimensions, title, and palette colours.

# Reverse the color order so dark red is highest prevalence
palette = palette[::-1]  

# Create a heatmap
hm = HeatMap(
    existing_df, 
    title="Infectious Tuberculosis Prevalence 1990-2007",
    height=3000,
    width=800, 
    palette=palette)
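To see why the palette order matters, here is a minimal sketch of one common way to turn a value into a colour: split the value range into equal-width bins, one per palette entry. This is an assumption made for illustration and not necessarily what HeatMap does internally. The hex codes are the ColorBrewer YlOrRd values as we understand them, the colour_for helper is hypothetical, and the 0 to 450 range is made up; the real range depends on the whole dataset.

```python
# Assumed copy of the nine YlOrRd colours in dark-to-light order, which is the
# order the reversal in the notebook code implies for the imported palette.
YLORRD9_DARK_TO_LIGHT = [
    "#800026", "#bd0026", "#e31a1c", "#fc4e2a", "#fd8d3c",
    "#feb24c", "#fed976", "#ffeda0", "#ffffcc",
]

# Same reversal as in the notebook: lightest first, darkest last.
light_to_dark = YLORRD9_DARK_TO_LIGHT[::-1]


def colour_for(value, low, high, palette):
    """Pick a palette entry for value using equal-width bins over [low, high]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    fraction = (value - low) / (high - low)
    index = int(fraction * len(palette))
    # Clamp so that value == high, or values outside the range, stay valid.
    index = max(0, min(index, len(palette) - 1))
    return palette[index]


# Afghanistan in 1990 and 2007, and Albania in 2007, from the table above.
# The 0-450 range is a made-up assumption for this example.
for cases in (436, 238, 22):
    print(cases, colour_for(cases, 0, 450, light_to_dark))
```

With the reversed palette the highest values land on the darkest red. Passing the palette in its imported order would flip that and make the worst situations look the mildest.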

The only thing remaining is to show the heat map with a simple call.

show(hm)

You can see the results here, or as an HTML version of our notebook on GitHub. Be aware, though, that some Bokeh features don’t work properly in GitHub previews, although they work well on an IPython/Jupyter server.

What we see

If we look at the chart by row, we are looking at a country’s evolution over time. If we look at the chart by column instead, we can see how the world situation was in a given year. Darker tones indicate a higher prevalence of the disease, while lighter ones indicate a lower prevalence. You can zoom in and out, or pan across the diagram using the controls. By hovering over a cell, you will see the actual value.

Do you notice how quickly we can see certain situations while looking at a visual representation of the data versus a tabular one? For example:

There are a number of countries that started to improve their situation, but by the end of the considered period they got worse again (e.g. Zimbabwe, Togo, Swaziland, Sierra Leone, Namibia, Djibouti).
We can see how Djibouti was the only member of our Cluster 6. It has the darkest tones in the heat map, and although it got a bit better between 1996 and 2002, it started to have more cases after that.
Countries like China, Brazil, India, Nepal, or Kiribati have greatly improved their situation. We knew that, and we can perceive this very quickly in our diagram.
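The patterns we spot by eye can also be approximated in code. The following is a minimal sketch, assuming a simple rule of our own: compare the first, lowest, and last values of a series and ignore changes smaller than a tolerance. The describe_trend helper and its 10% tolerance are hypothetical choices, and the last series is made up to show the "improved, then worsened" case; the other three rows are taken from the table at the top of this article.

```python
def describe_trend(values, tolerance=0.1):
    """Label a prevalence series from its first, lowest, and last values.

    tolerance is a fraction of the first value; smaller changes are ignored.
    """
    first, low, last = values[0], min(values), values[-1]
    margin = tolerance * first
    fell = first - low > margin       # dropped noticeably at some point
    rose_back = last - low > margin   # ended noticeably above its lowest point
    if fell and rose_back:
        return "improved, then worsened"
    if first - last > margin:
        return "improved"
    if last - first > margin:
        return "worsened"
    return "roughly stable"


series = {
    "Afghanistan": [436, 429, 422, 415, 407, 397, 397, 387, 374,
                    373, 346, 326, 304, 308, 283, 267, 251, 238],
    "Albania": [42, 40, 41, 42, 42, 43, 42, 44, 43,
                42, 40, 34, 32, 32, 29, 29, 26, 22],
    "Algeria": [45, 44, 44, 43, 43, 42, 43, 44, 45,
                46, 48, 49, 50, 51, 52, 53, 55, 56],
    # Made-up numbers, not from the dataset.
    "made-up example": [300, 250, 200, 220, 280],
}

labels = {name: describe_trend(values) for name, values in series.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

On the full data frame the same function could be applied to every country with existing_df.apply(lambda row: describe_trend(row.tolist()), axis=1). A rule this crude will mislabel noisy series, which is part of why the heat map remains the quicker way to get an overview.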

In general we can see how much better we are at reading visual encodings (colour, position) than numbers in a table. And this didn’t take much effort. With just a few lines of Python and the help of an amazing library like Bokeh, we can start creating awareness about some very serious world issues. We are all very busy nowadays, and if we can understand something in 30 seconds instead of 10 minutes, that can make a difference!