World differences in infectious tuberculosis prevalence 1990-2007

In this first approach to the world situation regarding infectious tuberculosis we want to have a look at how different countries have been affected by the disease in the period from 1990 to 2007. By doing so we want to better understand different trends in the prevalence of this important disease. Which countries are getting better and which are getting worse? Are there more or less clear groups of countries based on how much they are affected and how their situation is changing?

This should be enough for a first approach that opens the door to future work on finding the reasons for this situation and on how the different approaches to fighting the disease have been doing.

From Wikipedia, the free encyclopedia

Tuberculosis, MTB, or TB (short for tubercle bacillus), in the past also called phthisis, phthisis pulmonalis, or consumption, is a widespread, and in many cases fatal, infectious disease caused by various strains of mycobacteria, usually Mycobacterium tuberculosis. Tuberculosis typically attacks the lungs, but can also affect other parts of the body. It is spread through the air when people who have an active TB infection cough, sneeze, or otherwise transmit respiratory fluids through the air. Most infections do not have symptoms, known as latent tuberculosis. About one in ten latent infections eventually progresses to active disease which, if left untreated, kills more than 50% of those so infected.

The Gapminder website presents itself as a fact-based worldview. It is a comprehensive resource for data on indicators for different countries and territories. For this article, we will use a dataset of estimated prevalence (existing cases) per 100K, coming from the World Health Organization (WHO). We invite the reader to repeat the process with the new cases and deaths datasets and share the results. Our data contains up to 207 countries. The first few of them look as follows:

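# urlretrieve lives in urllib.request on Python 3 and in urllib on Python 2
try:
    from urllib.request import urlretrieve
except ImportError:
    from urllib import urlretrieve

tb_existing_url_csv = 'https://docs.google.com/spreadsheets/d/1X5Jp7Q8pTs3KLJ5JBWKhncVACGsg5v4xu6badNs4C7I/pub?gid=0&output=csv'
local_tb_existing_file = 'tb_existing_100.csv'
existing_f = urlretrieve(tb_existing_url_csv, local_tb_existing_file)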

import pandas as pd

existing_df = pd.read_csv(local_tb_existing_file, index_col = 0, thousands  = ',')
existing_df.index.names = ['country']
existing_df.columns.names = ['year']

existing_df.head()

year	1990	1991	1992	1993	1994	1995	1996	1997	1998	1999	2000	2001	2002	2003	2004	2005	2006	2007
country
Afghanistan	436	429	422	415	407	397	397	387	374	373	346	326	304	308	283	267	251	238
Albania	42	40	41	42	42	43	42	44	43	42	40	34	32	32	29	29	26	22
Algeria	45	44	44	43	43	42	43	44	45	46	48	49	50	51	52	53	55	56
American Samoa	42	14	4	18	17	22	0	25	12	8	8	6	5	6	9	11	9	5
Andorra	39	37	35	33	32	30	28	23	24	22	20	20	21	18	19	18	17	19

The reader can check the complete datasets from the Gapminder website.
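
The loading code above passes thousands=',' to read_csv. As a minimal sketch of why that matters, the snippet below parses a tiny hand-written excerpt twice, with and without the option. The excerpt is hypothetical: the quoted "1,485" style is an assumption about how the exported CSV writes values of 1,000 or more, while the numbers themselves are taken from the data shown in this article.

```python
import io
import pandas as pd

# Hypothetical two-country excerpt. The quoted "1,485" format is an assumption
# about the export; the values are the 1990 and 2007 figures from this article.
csv_text = '''country,1990,2007
Afghanistan,436,238
Djibouti,"1,485","1,104"
'''

# Without thousands=',' the quoted values stay as text, so the columns are not numeric.
raw = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With thousands=',' pandas strips the separators and parses integers.
parsed = pd.read_csv(io.StringIO(csv_text), index_col=0, thousands=',')

print(repr(raw.loc['Djibouti', '1990']))   # still a string
print(parsed.loc['Djibouti', '1990'])      # a number
print(parsed.dtypes)
```

If a column were silently left as text, later steps such as PCA or k-means would fail or misbehave, so checking the dtypes right after loading is a cheap safeguard.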

A visual of the world distribution of existing cases

Data tables are accurate and complete. However, they are not always the best way to process information, especially in our case, where we want to understand how each country relates to the others. The following chart is a two-dimensional representation of the same data, which originally has 18 dimensions, one for each year. It is obtained with principal component analysis (PCA). On the vertical axis we represent the first dimension of variation between countries, which in this case reflects where a country stands relative to the others in the number of cases as a static magnitude (e.g. the sum across the years). Countries at the top of the chart are those less affected by the disease.

On the horizontal axis we have encoded how the number of existing cases has changed from 1990 to 2007. Countries on the left-hand side of the chart have increased their number of cases over that period.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(existing_df)

existing_2d = pca.transform(existing_df)
existing_df_2d = pd.DataFrame(existing_2d)
existing_df_2d.index = existing_df.index
existing_df_2d.columns = ['PC1','PC2']

%matplotlib inline

ax = existing_df_2d.plot(kind='scatter', x='PC2', y='PC1', figsize=(16,8))

for i, country in enumerate(existing_df.index):
    ax.annotate(country, (existing_df_2d.iloc[i].PC2, existing_df_2d.iloc[i].PC1))


There are more countries at the very top and on the right-hand side. That is, most countries have relatively few cases and have improved their situation over time. But some facts make us want to keep exploring our data:

The distribution is not uniform. There is a high concentration of countries at the very top. This means that there is a group of countries able to deal with the disease in a way that is not accessible to many others (i.e. countries above the 0 line on the vertical axis versus those below it).
There are countries that, far from improving their situation in the 1990-2007 period as most have done, ended up with more cases in 2007 than they had in 1990.
There is one country, Djibouti, that lies far from the rest and might need closer attention.
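
To make the reading of the first axis more concrete, here is a minimal sketch that does PCA by hand with NumPy (centring plus SVD, not the scikit-learn PCA class used in this article) on only six rows copied from the data shown in this post. Because it uses six countries instead of 207, it does not reproduce the chart; it only illustrates that the first component tracks the overall level of prevalence.

```python
import numpy as np

# Six rows copied from this article (existing cases per 100K, 1990-2007).
rows = {
    'Afghanistan': [436, 429, 422, 415, 407, 397, 397, 387, 374, 373, 346, 326, 304, 308, 283, 267, 251, 238],
    'Albania': [42, 40, 41, 42, 42, 43, 42, 44, 43, 42, 40, 34, 32, 32, 29, 29, 26, 22],
    'Algeria': [45, 44, 44, 43, 43, 42, 43, 44, 45, 46, 48, 49, 50, 51, 52, 53, 55, 56],
    'American Samoa': [42, 14, 4, 18, 17, 22, 0, 25, 12, 8, 8, 6, 5, 6, 9, 11, 9, 5],
    'Andorra': [39, 37, 35, 33, 32, 30, 28, 23, 24, 22, 20, 20, 21, 18, 19, 18, 17, 19],
    'Djibouti': [1485, 1477, 1463, 1442, 1414, 1381, 720, 669, 698, 701, 761, 775, 932, 960, 1034, 1046, 1093, 1104],
}
X = np.array(list(rows.values()), dtype=float)

# PCA by hand: centre each year column, then take the SVD of the centred matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T  # column 0 holds the PC1 scores, column 1 the PC2 scores

# Share of the total variance captured by each component.
explained = S ** 2 / np.sum(S ** 2)

# The sign of a principal component is arbitrary, so compare absolute correlation.
corr_pc1_mean = abs(np.corrcoef(scores[:, 0], X.mean(axis=1))[0, 1])

print('variance explained by PC1 and PC2:', explained[:2].round(3))
print('|corr(PC1, mean prevalence)|:', round(float(corr_pc1_mean), 3))
```

Note that the sign of each component can flip between runs or libraries, which is why "top" and "left" in a PCA chart only have meaning once you check which countries end up there.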

Grouping countries based on their situation

By clustering our original dataset, we will colour the previous chart in order to identify up to six different groups. Six is the smallest number of clusters that, in our observations, separates the countries well and consistently. We can then look at the center of each cluster and try to understand different world tendencies and differences.

import numpy as np
from sklearn.cluster import KMeans

np.random.seed(1234)
kmeans = KMeans(n_clusters=6)
clusters = kmeans.fit(existing_df)
existing_df_2d['cluster'] = pd.Series(clusters.labels_, index=existing_df_2d.index)

styles = ['b','r','y','c','g','k']
existing_df_2d.plot(
    kind='scatter',
    x='PC2',y='PC1',
    # np.int was removed in NumPy 1.24; the built-in int works on every version
    c=[styles[i] for i in existing_df_2d.cluster.astype(int)],
    s=100,
    figsize=(16,8))
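
The number of clusters here was chosen by inspection. One common way to examine such a choice is to watch how the within-cluster sum of squares (inertia) drops as k grows. The sketch below is only an illustration under stated assumptions: it uses a small hand-written Lloyd's k-means with a deterministic farthest-point start (the helper names are hypothetical, and this is not scikit-learn's KMeans), and it runs on six rows copied from this article, so its numbers say nothing about the right k for the full dataset.

```python
import numpy as np

# Six rows copied from this article (existing cases per 100K, 1990-2007).
names = ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Djibouti']
X = np.array([
    [436, 429, 422, 415, 407, 397, 397, 387, 374, 373, 346, 326, 304, 308, 283, 267, 251, 238],
    [42, 40, 41, 42, 42, 43, 42, 44, 43, 42, 40, 34, 32, 32, 29, 29, 26, 22],
    [45, 44, 44, 43, 43, 42, 43, 44, 45, 46, 48, 49, 50, 51, 52, 53, 55, 56],
    [42, 14, 4, 18, 17, 22, 0, 25, 12, 8, 8, 6, 5, 6, 9, 11, 9, 5],
    [39, 37, 35, 33, 32, 30, 28, 23, 24, 22, 20, 20, 21, 18, 19, 18, 17, 19],
    [1485, 1477, 1463, 1442, 1414, 1381, 720, 669, 698, 701, 761, 775, 932, 960, 1034, 1046, 1093, 1104],
], dtype=float)

def farthest_point_init(X, k):
    """Deterministic start: row 0, then repeatedly the row farthest from the chosen ones."""
    chosen = [0]
    while len(chosen) < k:
        dist = np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2)
        chosen.append(int(dist.min(axis=1).argmax()))
    return X[chosen].copy()

def lloyd_kmeans(X, k, n_iter=100):
    """Plain Lloyd iterations; returns labels and the within-cluster sum of squares."""
    centres = farthest_point_init(X, k)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each centre to the mean of its members (keep it if it has none).
        updated = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
                            for j in range(k)])
        if np.allclose(updated, centres):
            break
        centres = updated
    # Final assignment and inertia computed against the final centres.
    dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    inertia = float(np.sum(dist.min(axis=1) ** 2))
    return labels, inertia

inertias = {k: lloyd_kmeans(X, k)[1] for k in range(1, 5)}
labels_k2, _ = lloyd_kmeans(X, 2)

for k, value in inertias.items():
    print(k, round(value))
print(dict(zip(names, labels_k2.tolist())))
```

On the full dataset the same idea applies with scikit-learn: fit KMeans for a range of k values and compare the inertia of each fit, looking for the point where adding clusters stops paying off.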





Most clusters divide the distribution vertically (i.e. based on the absolute number of cases) but there is also a division in the horizontal axis that we will have a look at. In order to complement this view, let’s have a look at the same data as a line chart that represents how each cluster representative evolves over time.

cluster_centers_df = pd.DataFrame(
    clusters.cluster_centers_, 
    index=['World Cluster 1', 'World Cluster 2', 'World Cluster 3', 'World Cluster 4', 'World Cluster 5', 'World Cluster 6']) 
cluster_centers_df.columns = existing_df.columns

ax = cluster_centers_df.T.plot(
    figsize=(16,8),
    style=styles,
    linewidth = 3)
ax.set_ylabel("Existing cases per 100K")





Let’s analyse each of these world clusters or groups in detail.

World Cluster 1

sum(clusters.labels_ == 0)

90

There are 90 countries in the first cluster, and these are:

existing_df_2d[existing_df_2d.cluster == 0].index

Index([u'Albania', u'Algeria', u'American Samoa', u'Andorra', u'Anguilla', u'Antigua and Barbuda', u'Argentina', u'Australia', u'Austria', u'Bahamas', u'Barbados', u'Belgium', u'Belize', u'Bermuda', u'British Virgin Islands', u'Bulgaria', u'Canada', u'Cayman Islands', u'Chile', u'Colombia', u'Cook Islands', u'Costa Rica', u'Cuba', u'Cyprus', u'Czech Republic', u'Denmark', u'Dominica', u'Egypt', u'Estonia', u'Fiji', u'Finland', u'France', u'French Polynesia', u'Germany', u'Greece', u'Grenada', u'Hungary', u'Iceland', u'Iran', u'Ireland', u'Israel', u'Italy', u'Jamaica', u'Japan', u'Jordan', u'Kuwait', u'Lebanon', u'Libyan Arab Jamahiriya', u'Luxembourg', u'Malta', u'Mauritius', u'Mexico', u'Monaco', u'Montserrat', u'Netherlands', u'Netherlands Antilles', u'New Caledonia', u'New Zealand', u'Norway', u'Oman', u'Panama', u'Poland', u'Portugal', u'Puerto Rico', u'Saint Kitts and Nevis', u'Saint Lucia', u'Saint Vincent and the Grenadines', u'Samoa', u'San Marino', u'Saudi Arabia', u'Singapore', u'Slovakia', u'Slovenia', u'Spain', u'Sweden', u'Switzerland', u'Syrian Arab Republic', u'Macedonia, FYR', u'Tonga', u'Trinidad and Tobago', u'Tunisia', u'Turkey', u'Turks and Caicos Islands', u'United Arab Emirates', u'United Kingdom', u'Virgin Islands (U.S.)', u'United States of America', u'Uruguay', u'Venezuela', u'West Bank and Gaza'], dtype='object')

The center of this cluster gives us an idea of what an average country in it might look like:

cluster_centers_df.loc['World Cluster 1']




year
1990    37.277778
1991    35.688889
1992    35.733333
1993    34.400000
1994    33.511111
1995    32.422222
1996    30.800000
1997    30.511111
1998    29.300000
1999    26.777778
2000    24.355556
2001    23.577778
2002    22.022222
2003    20.933333
2004    20.488889
2005    19.922222
2006    19.255556
2007    19.111111
Name: World Cluster 1, dtype: float64

Although this group of countries might be a bit too large and heterogeneous, and probably needs further refinement, it is a good grouping when comparing their situation to that of other countries in the world. In any case, it contains the countries with the fewest existing cases in our set.

World Cluster 2

sum(clusters.labels_ == 1)

15

There are just 15 countries in this small group. These are:

existing_df_2d[existing_df_2d.cluster == 1].index

Index([u'Bangladesh', u'Bhutan', u'Cambodia', u'Korea, Dem. Rep.', u'Kiribati', u'Mali', u'Mauritania', u'Namibia', u'Philippines', u'Sierra Leone', u'South Africa', u'Swaziland', u'Timor-Leste', u'Togo', u'Zambia'], dtype='object')

Apart from World Cluster 6, which is formed by a single country as we will see, this group is represented by the centroid with the highest prevalence of tuberculosis per 100K:

cluster_centers_df.loc['World Cluster 2']




year
1990    715.933333
1991    702.800000
1992    686.400000
1993    669.800000
1994    654.200000
1995    641.066667
1996    618.400000
1997    618.466667
1998    599.000000
1999    603.333333
2000    612.733333
2001    626.200000
2002    582.400000
2003    577.200000
2004    568.266667
2005    571.466667
2006    567.400000
2007    561.866667
Name: World Cluster 2, dtype: float64

These are, by any measure, the countries with the most tuberculosis cases every year. We can also see that the number of cases goes down over the period, despite a temporary rise around 1999-2001. The total decrease between 1990 and 2007 is over 20%.
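
The size of that decrease, and how steady it is, can be checked directly from the centroid values printed above. This is a small sketch in plain Python; the variable names are only illustrative.

```python
# World Cluster 2 centroid, copied from the output above (existing cases per 100K).
years = list(range(1990, 2008))
centroid = [715.933333, 702.800000, 686.400000, 669.800000, 654.200000, 641.066667,
            618.400000, 618.466667, 599.000000, 603.333333, 612.733333, 626.200000,
            582.400000, 577.200000, 568.266667, 571.466667, 567.400000, 561.866667]

# Relative change between the first and the last year, in percent.
pct_change = (centroid[-1] - centroid[0]) / centroid[0] * 100

# Years in which the centroid was higher than in the previous year.
rising_years = [year for year, prev, cur in zip(years[1:], centroid, centroid[1:]) if cur > prev]

print(round(pct_change, 1))  # -21.5
print(rising_years)          # [1997, 1999, 2000, 2001, 2005]
```

So the overall trend is downwards by roughly a fifth, but it is not a strictly monotonic decline.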

World Cluster 3

sum(clusters.labels_ == 2)

20

This is a very important group of countries. There are 20 of them:

existing_df_2d[existing_df_2d.cluster == 2].index

Index([u'Botswana', u'Burkina Faso', u'Burundi', u'Central African Republic', u'Chad', u'Congo, Rep.', u'Cote d'Ivoire', u'Congo, Dem. Rep.', u'Equatorial Guinea', u'Ethiopia', u'Guinea', u'Kenya', u'Lesotho', u'Mozambique', u'Nigeria', u'Rwanda', u'Senegal', u'Uganda', u'Tanzania', u'Zimbabwe'], dtype='object')

The average country that represents them looks like this:

cluster_centers_df.loc['World Cluster 3']




year
1990    259.85
1991    278.90
1992    287.30
1993    298.05
1994    309.00
1995    322.95
1996    335.00
1997    357.65
1998    369.65
1999    410.85
2000    422.25
2001    463.75
2002    492.45
2003    525.25
2004    523.60
2005    519.90
2006    509.80
2007    513.50
Name: World Cluster 3, dtype: float64

This is the only cluster where the number of cases has increased over the years, and by 2007 it is close to overtaking World Cluster 2 as the group with the highest prevalence. It represents a country that might be in the middle of a humanitarian crisis and is probably affected by other infectious diseases such as HIV. We can see how the number of existing cases per 100K has almost doubled in the 1990-2007 period.

World Cluster 4

sum(clusters.labels_ == 3)

51

The fourth cluster contains 51 countries.

existing_df_2d[existing_df_2d.cluster == 3].index

Index([u'Armenia', u'Azerbaijan', u'Bahrain', u'Belarus', u'Benin', u'Bosnia and Herzegovina', u'Brazil', u'Brunei Darussalam', u'Cameroon', u'Comoros', u'Croatia', u'Dominican Republic', u'Ecuador', u'El Salvador', u'Eritrea', u'Georgia', u'Guam', u'Guatemala', u'Guyana', u'Honduras', u'Iraq', u'Kazakhstan', u'Kyrgyzstan', u'Latvia', u'Lithuania', u'Malaysia', u'Maldives', u'Micronesia, Fed. Sts.', u'Morocco', u'Nauru', u'Nicaragua', u'Niue', u'Northern Mariana Islands', u'Palau', u'Paraguay', u'Qatar', u'Korea, Rep.', u'Moldova', u'Romania', u'Russian Federation', u'Seychelles', u'Sri Lanka', u'Suriname', u'Tajikistan', u'Tokelau', u'Turkmenistan', u'Ukraine', u'Uzbekistan', u'Vanuatu', u'Wallis et Futuna', u'Yemen'], dtype='object')

It is represented by its centroid:

cluster_centers_df.loc['World Cluster 4']




year
1990    130.607843
1991    133.411765
1992    125.607843
1993    127.549020
1994    124.823529
1995    127.705882
1996    121.686275
1997    130.509804
1998    125.823529
1999    124.450980
2000    110.588235
2001    106.607843
2002    121.098039
2003    103.019608
2004    101.803922
2005     97.294118
2006     96.176471
2007     91.686275
Name: World Cluster 4, dtype: float64

This cluster is pretty close to the first and largest one. It contains many American countries, some European countries, etc. Some of them are large and rich, such as Russia or Brazil. Structurally, the difference from the countries in World Cluster 1 may reside in a larger number of cases per 100K. In absolute terms, they also seem to be decreasing the number of cases slightly faster than World Cluster 1. These two reasons made them form a different group.
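
Whether one group decreases "faster" than another depends on the measure. As a small sketch, the snippet below compares the absolute and the relative drop of the two centroids, using the 1990 and 2007 values printed above. The helper name is hypothetical.

```python
# First (1990) and last (2007) centroid values, copied from the outputs above.
centroids = {
    'World Cluster 1': (37.277778, 19.111111),
    'World Cluster 4': (130.607843, 91.686275),
}

def change(first, last):
    """Return (absolute drop in cases per 100K, relative drop in percent)."""
    absolute = first - last
    relative = absolute / first * 100
    return absolute, relative

for name, (first, last) in centroids.items():
    absolute, relative = change(first, last)
    print(f'{name}: -{absolute:.1f} per 100K, -{relative:.1f}%')
```

World Cluster 4 loses more cases per 100K (about 38.9 against 18.2), while World Cluster 1 shows the larger relative drop (about 48.7% against 29.8%). A line chart of prevalence shows the absolute view.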

World Cluster 5

sum(clusters.labels_ == 4)

30

This group contains 30 countries:

existing_df_2d[existing_df_2d.cluster == 4].index

Index([u'Afghanistan', u'Angola', u'Bolivia', u'Cape Verde', u'China', u'Gabon', u'Gambia', u'Ghana', u'Guinea-Bissau', u'Haiti', u'India', u'Indonesia', u'Laos', u'Liberia', u'Madagascar', u'Malawi', u'Mongolia', u'Myanmar', u'Nepal', u'Niger', u'Pakistan', u'Papua New Guinea', u'Peru', u'Sao Tome and Principe', u'Solomon Islands', u'Somalia', u'Sudan', u'Thailand', u'Tuvalu', u'Viet Nam'], dtype='object')

With a tendency given by:

cluster_centers_df.loc['World Cluster 5']




year
1990    444.500000
1991    435.200000
1992    426.166667
1993    417.400000
1994    409.233333
1995    400.566667
1996    378.600000
1997    365.366667
1998    358.033333
1999    354.433333
2000    350.600000
2001    326.733333
2002    316.166667
2003    308.500000
2004    297.866667
2005    288.800000
2006    284.966667
2007    280.800000
Name: World Cluster 5, dtype: float64

It is a relatively large cluster. These are still countries with lots of cases, but definitely fewer than in World Cluster 2. We see countries such as India or China here, the most populous countries on Earth (if you search for it, you’ll see that China itself has reduced its cases by 85% over time), and American countries such as Peru or Bolivia. In fact, this is the cluster with the fastest decrease in the number of existing cases, as we can see in the line chart.
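
K-means assigns each country to the cluster whose centroid is nearest in Euclidean distance across the 18 yearly values. As a cross-check of the groupings listed so far, this sketch recomputes that assignment for three countries from the table at the start of the article, using the five centroids printed above (Djibouti's single-country cluster is left out). The function name is hypothetical, and the assignment is recomputed from the printed values rather than read from the fitted model.

```python
import numpy as np

# Centroids copied from the outputs above (existing cases per 100K, 1990-2007).
centroids = {
    'World Cluster 1': [37.277778, 35.688889, 35.733333, 34.400000, 33.511111, 32.422222,
                        30.800000, 30.511111, 29.300000, 26.777778, 24.355556, 23.577778,
                        22.022222, 20.933333, 20.488889, 19.922222, 19.255556, 19.111111],
    'World Cluster 2': [715.933333, 702.800000, 686.400000, 669.800000, 654.200000, 641.066667,
                        618.400000, 618.466667, 599.000000, 603.333333, 612.733333, 626.200000,
                        582.400000, 577.200000, 568.266667, 571.466667, 567.400000, 561.866667],
    'World Cluster 3': [259.85, 278.90, 287.30, 298.05, 309.00, 322.95,
                        335.00, 357.65, 369.65, 410.85, 422.25, 463.75,
                        492.45, 525.25, 523.60, 519.90, 509.80, 513.50],
    'World Cluster 4': [130.607843, 133.411765, 125.607843, 127.549020, 124.823529, 127.705882,
                        121.686275, 130.509804, 125.823529, 124.450980, 110.588235, 106.607843,
                        121.098039, 103.019608, 101.803922, 97.294118, 96.176471, 91.686275],
    'World Cluster 5': [444.500000, 435.200000, 426.166667, 417.400000, 409.233333, 400.566667,
                        378.600000, 365.366667, 358.033333, 354.433333, 350.600000, 326.733333,
                        316.166667, 308.500000, 297.866667, 288.800000, 284.966667, 280.800000],
}

# Country rows copied from the table at the start of the article.
countries = {
    'Afghanistan': [436, 429, 422, 415, 407, 397, 397, 387, 374, 373, 346, 326, 304, 308, 283, 267, 251, 238],
    'Albania': [42, 40, 41, 42, 42, 43, 42, 44, 43, 42, 40, 34, 32, 32, 29, 29, 26, 22],
    'Algeria': [45, 44, 44, 43, 43, 42, 43, 44, 45, 46, 48, 49, 50, 51, 52, 53, 55, 56],
}

cluster_names = list(centroids)
C = np.array([centroids[name] for name in cluster_names])

def nearest_cluster(row):
    """Name of the centroid closest to the given 18-year row (Euclidean distance)."""
    distances = np.linalg.norm(C - np.array(row, dtype=float), axis=1)
    return cluster_names[int(distances.argmin())]

assignment = {country: nearest_cluster(row) for country, row in countries.items()}
for country, cluster in assignment.items():
    print(country, '->', cluster)
```

The result agrees with the membership lists above: Afghanistan falls in World Cluster 5, while Albania and Algeria fall in World Cluster 1.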

World Cluster 6

sum(clusters.labels_ == 5)

1

The last group is formed by a single country, Djibouti.

existing_df.loc['Djibouti']




year
1990    1485
1991    1477
1992    1463
1993    1442
1994    1414
1995    1381
1996     720
1997     669
1998     698
1999     701
2000     761
2001     775
2002     932
2003     960
2004    1034
2005    1046
2006    1093
2007    1104
Name: Djibouti, dtype: int64

It is so badly affected by the disease that it forms a cluster of its own. You can search for ‘high prevalence tuberculosis Djibouti’ and see how this situation has been, and still is, the subject of study and action. Wikipedia also has an article about health in Djibouti:

In Djibouti, malnutrition is severe and the incidence of tuberculosis is high. Malaria is endemic. There were 3,111 reported cases of tuberculosis in 2009. The city of Djibouti’s publicly supplied water is suspect because the system is in disrepair.

Our dataset estimates existing cases of tuberculosis per 100K. We don’t know whether the number in Wikipedia refers to new or existing cases, but it seems to be in absolute terms and not per 100K. In either case, new or existing, compared with the WHO numbers for 2007 it suggests a reduction in the number of cases (e.g. the number of estimated new cases in 2007 was 6,769). The population of the country is 810,179.
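
To put those absolute counts on the same scale as our dataset, they can be converted to cases per 100K. This is only a rough sketch: it assumes the single population figure quoted above applies to both years, which is not stated in the sources, and the helper name is hypothetical.

```python
def per_100k(cases, population):
    """Convert an absolute case count into cases per 100,000 inhabitants."""
    return cases / population * 100_000

population = 810_179     # population quoted above; its reference year is not stated
reported_2009 = 3_111    # reported cases in 2009, from the Wikipedia excerpt
new_cases_2007 = 6_769   # estimated new cases in 2007, the WHO figure quoted above

rate_2009 = per_100k(reported_2009, population)
rate_2007 = per_100k(new_cases_2007, population)

print(round(rate_2009))  # 384
print(round(rate_2007))  # 835
```

For comparison, the prevalence value for Djibouti in our dataset for 2007 is 1,104 existing cases per 100K. Reported cases, estimated new cases and estimated existing cases are different indicators, so these rates are not directly comparable.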

Conclusions

During our analysis, we have seen how most countries improved their situation regarding the prevalence of infectious tuberculosis during the period we considered. However, we were also able to discover a group of countries with a high prevalence of the disease that, far from improving their situation, are increasing the number of cases.

We have just scratched the surface. There are other analyses we can carry out by using complementary datasets from the World Health Organization, such as detection rates, new cases, or deaths. We can also cross-link these datasets with others related to HIV/AIDS in order to confirm the co-occurrence of both diseases.

This analysis also opens the door to analysing the impact of tuberculosis programs. Once we are able to identify different groups of countries and their tendencies, we can follow them over time and associate changes (e.g. in prevalence or incidence) with the time when these programs started.

All these approaches should create awareness of the real situation of a disease that is causing millions of deaths in some countries, while it was about to be neglected in some others.

Your turn!

This post is part of a series about doing Data Journalism using Python. Now it is your turn to fork the repo, modify the IPython/Jupyter notebook associated with this post, and contribute your own analysis and conclusions.