The present analysis will make use of SparkR’s power to analyse large datasets in order to explore the 2013 American Community Survey dataset, more specifically its geographical features. For that purpose, we will aggregate data using the different tools introduced in the SparkR documentation and our series of notebooks, and then use ggplot2’s mapping capabilities to put the different aggregations into a geographical context.
But why use SparkR? Well, ggplot2 is not intended to be used with datasets of millions of points. We first need to process and aggregate the data in order to make it usable, and that is where SparkR comes in handy. Using regular R (e.g. dplyr) for data aggregations on really large datasets may not scale well. By opening the door to parallel and distributed data processing with R on Spark, we really start to solve our scalability problems.
Creating a SparkSQL context and loading data
In order to explore our data, we first need to load it into a SparkSQL data frame, and for that we need a SparkSQL context. The first step is to set up some environment variables and library paths as follows. Remember to replace the value assigned to SPARK_HOME with your own Spark home folder.
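A minimal sketch of that setup; the Spark home below matches the cluster used in this series (see the launch output further down), so adjust it to your installation:

```r
# Point R at the local Spark installation and make the bundled SparkR package visible
Sys.setenv(SPARK_HOME = "/home/cluster/spark-1.5.0-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
```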
Now we can load the SparkR library as follows.
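Assuming the library path was set as above:

```r
library(SparkR)
```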
Attaching package: ‘SparkR’
The following objects are masked from ‘package:stats’:
filter, na.omit
The following objects are masked from ‘package:base’:
intersect, rbind, sample, subset, summary, table, transform
And now we can initialise the Spark context as in the official documentation. In our case we are using a standalone Spark cluster with one master and seven workers. If you are running Spark in local mode, just use master='local'. Additionally, we require a Spark package from Databricks to read CSV files (more on this in the previous notebook).
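A sketch of the initialisation; the master URL is a placeholder for your own cluster, and the spark-csv version matches the one shown in the launch output below. Calling sparkR.init launches Spark via spark-submit.

```r
# Use master = 'local' instead when running Spark on a single machine
sc <- sparkR.init(master = "spark://your-spark-master:7077",
                  sparkPackages = "com.databricks:spark-csv_2.11:1.2.0")
```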
Launching java with spark-submit command /home/cluster/spark-1.5.0-bin-hadoop2.6/bin/spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 sparkr-shell /tmp/RtmpG5Cugb/backend_port5c4357ff10ae
And finally we can start the SparkSQL context as follows.
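For example:

```r
sqlContext <- sparkRSQL.init(sc)
```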
Now that we have our SparkSQL context ready, we can use it to load our CSV data into data frames. We have downloaded our 2013 American Community Survey dataset files in notebook 0, so they should be stored locally. Remember to set the right path for your data files in the first line; ours is /nfs/data/2013-acs/ss13husa.csv.
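For example (the name of the second housing file, ss13husb.csv, is an assumption following the standard ACS naming used in notebook 0):

```r
# Adjust this folder to wherever you downloaded the ACS housing files
data_path <- "/nfs/data/2013-acs"
housing_a_file <- file.path(data_path, "ss13husa.csv")
housing_b_file <- file.path(data_path, "ss13husb.csv")
```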
Now let’s read the data into a SparkSQL data frame. In addition to the sqlContext, we need to pass four parameters (put together in the sketch after this list):

- The file path.
- header='true', since our csv files have a header with the column names.
- An indication that we want the library to infer the schema.
- The source type (the Databricks package in this case).
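A sketch of the read call putting those parameters together (the data frame names are our own):

```r
housing_a_df <- read.df(sqlContext, housing_a_file,
                        source = "com.databricks.spark.csv",
                        header = "true", inferSchema = "true")
housing_b_df <- read.df(sqlContext, housing_b_file,
                        source = "com.databricks.spark.csv",
                        header = "true", inferSchema = "true")
```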
The data actually comes split in two files, an ‘a’ and a ‘b’ part (this is the case for both the housing and the population data), so we need to combine them into a single data frame.
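Since both parts share the same schema, SparkR’s rbind (which, as shown earlier, masks base::rbind) can stack them; a sketch:

```r
housing_df <- rbind(housing_a_df, housing_b_df)
```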
Let’s check that we have everything there by counting the rows and listing a few of them.
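Using the SparkR versions of nrow and head:

```r
nrow(housing_df)
head(housing_df)
```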
1476313
. | RT | SERIALNO | DIVISION | PUMA | REGION | ST | ADJHSG | ADJINC | WGTP | NP | ⋯ | wgtp71 | wgtp72 | wgtp73 | wgtp74 | wgtp75 | wgtp76 | wgtp77 | wgtp78 | wgtp79 | wgtp80 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | H | 84 | 6 | 2600 | 3 | 1 | 1000000 | 1007549 | 0 | 1 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | H | 154 | 6 | 2500 | 3 | 1 | 1000000 | 1007549 | 51 | 4 | ⋯ | 86 | 53 | 59 | 84 | 49 | 15 | 15 | 20 | 50 | 16 |
3 | H | 156 | 6 | 1700 | 3 | 1 | 1000000 | 1007549 | 449 | 1 | ⋯ | 161 | 530 | 601 | 579 | 341 | 378 | 387 | 421 | 621 | 486 |
4 | H | 160 | 6 | 2200 | 3 | 1 | 1000000 | 1007549 | 16 | 3 | ⋯ | 31 | 24 | 33 | 7 | 7 | 13 | 18 | 23 | 23 | 5 |
5 | H | 231 | 6 | 2400 | 3 | 1 | 1000000 | 1007549 | 52 | 1 | ⋯ | 21 | 18 | 37 | 49 | 103 | 38 | 49 | 51 | 46 | 47 |
6 | H | 286 | 6 | 900 | 3 | 1 | 1000000 | 1007549 | 76 | 1 | ⋯ | 128 | 25 | 68 | 66 | 80 | 26 | 66 | 164 | 88 | 24 |
Variables to explore
The main variable we will explore is ST, or state code. We will use it to aggregate the other variables described in our data dictionary; by doing so, we will be able to map each variable onto the states of the USA.
These are the variables we will aggregate and map using state codes:

- VALP, or property value.
- ACR, or property lot size.
- Cost-related variables (e.g. electricity and gas).
- YBL, or the year the property was first built.
- GRPIP, or gross rent as a percentage of household income.
Exploring property value by state using ggplot maps
To create a map of the United States with the states colored according to VALP
, we can use ggplot2’s map_data
function. This functionality requires the maps R package, which you might need to install from the R shell.
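For example (state_map is the name we will use for the resulting data frame):

```r
# install.packages("maps")  # uncomment if the maps package is not installed
library(maps)
library(ggplot2)

state_map <- map_data("state")
```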
Our map contains a series of geometric shapes defined by longitudes and latitudes, together with group and order codes and region names.
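Let’s inspect its structure:

```r
str(state_map)
```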
'data.frame': 15537 obs. of 6 variables:
$ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
$ lat : num 30.4 30.4 30.4 30.3 30.3 ...
$ group : num 1 1 1 1 1 1 1 1 1 1 ...
$ order : int 1 2 3 4 5 6 7 8 9 10 ...
$ region : chr "alabama" "alabama" "alabama" "alabama" ...
$ subregion: chr NA NA NA NA ...
The region variable contains the state name (in lower case) for each polygon of the state map. We will need to match it somehow with the state codes in our housing dataset. But first, let’s reduce the housing dataset to what we want to represent on the map.
The dataset we want to visualise is the average property value by state. We can use SparkR as follows.
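A sketch using SparkR’s grouping functions; we collect the (small) aggregated result into a local R data frame so it can be used with ggplot2 later on:

```r
housing_avg_valp <- collect(summarize(
  groupBy(housing_df, "ST"),
  AVG_VALP = avg(housing_df$VALP)
))
head(housing_avg_valp)
```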
. | ST | AVG_VALP |
---|---|---|
1 | 31 | 141933.6 |
2 | 32 | 208622 |
3 | 33 | 267999 |
4 | 34 | 372496.1 |
5 | 35 | 194513.9 |
6 | 36 | 360343.5 |
Now we need to map the ST
column to state names, so we can associate the right value to the right polygon in the state_map
map. We have provided a csv file containing this mapping. Let’s read it into a regular R data frame.
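For example (the file name and location below are just placeholders; point read.csv at wherever you stored the mapping file):

```r
states_df <- read.csv("/nfs/data/2013-acs/states.csv", stringsAsFactors = FALSE)
head(states_df)
```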
. | st | name | code |
---|---|---|---|
1 | 1 | Alabama | AL |
2 | 2 | Alaska | AK |
3 | 4 | Arizona | AZ |
4 | 5 | Arkansas | AR |
5 | 6 | California | CA |
6 | 8 | Colorado | CO |
So all we have to do is replace the state code with the state name, for example by using a factor variable.
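One possible way of doing it (tolower is needed because the region names in the map are lower case):

```r
housing_avg_valp$region <- tolower(
  factor(housing_avg_valp$ST,
         levels = states_df$st,
         labels = states_df$name)
)
```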
The previous code uses factor(vector, levels, labels)
in order to create a vector of factor values from our previous vector of codes. Remember that columns in a data frame are vectors. The levels are the possible values we can have in the source vector (the state codes) and the labels will be the values they will be mapped to (the state names). By doing so we will end up with the names we need to merge our dataset with the map we just created before. Let’s have a look at what we ended up with.
. | ST | AVG_VALP | region |
---|---|---|---|
1 | 31 | 141933.6 | nebraska |
2 | 32 | 208622 | nevada |
3 | 33 | 267999 | new hampshire |
4 | 34 | 372496.1 | new jersey |
5 | 35 | 194513.9 | new mexico |
6 | 36 | 360343.5 | new york |
And then we are ready to merge the dataset with the map as follows.
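A sketch using base R’s merge; re-sorting by the order column keeps the polygon vertices in drawing order:

```r
merged_data <- merge(state_map, housing_avg_valp, by = "region")
merged_data <- merged_data[order(merged_data$order), ]
head(merged_data)
```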
. | region | long | lat | group | order | subregion | ST | AVG_VALP |
---|---|---|---|---|---|---|---|---|
1 | alabama | -87.46201 | 30.38968 | 1 | 1 | NA | 1 | 155384 |
2 | alabama | -87.48493 | 30.37249 | 1 | 2 | NA | 1 | 155384 |
3 | alabama | -87.52503 | 30.37249 | 1 | 3 | NA | 1 | 155384 |
4 | alabama | -87.53076 | 30.33239 | 1 | 4 | NA | 1 | 155384 |
5 | alabama | -87.57087 | 30.32665 | 1 | 5 | NA | 1 | 155384 |
6 | alabama | -87.58806 | 30.32665 | 1 | 6 | NA | 1 | 155384 |
Finally, we can use ggplot2 with a geom_polygon
to plot the previous merged data.
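For instance:

```r
ggplot(merged_data, aes(x = long, y = lat, group = group, fill = AVG_VALP)) +
  geom_polygon(colour = "white") +
  ggtitle("Average property value (VALP) by state") +
  labs(x = "", y = "", fill = "AVG_VALP")
```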
Hey, that property value map seems to make sense, right? States like California (West Coast), New York, or Washington DC (North East Coast) have the highest average property value, while interior states have the lowest.
Exploring lot size
From now on, we will use the same code to generate each map; we will just change the variable whose average we aggregate.
The next variable we will explore is ACR
or lot size. It can take the following values:
- NA: not a one-family house or mobile home
- 1: House on less than one acre
- 2: House on one to less than ten acres
- 3: House on ten or more acres
We can consider this variable an ordinal one, where 1 represents the smallest lot and 3 the biggest. If we aggregate its average by state code, we can visualise which states tend to have larger properties on average. Let’s do it as follows.
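A sketch of the aggregation; the rest of the pipeline (state names, merge, geom_polygon) stays the same as for the property value map:

```r
housing_avg_acr <- collect(summarize(
  groupBy(housing_df, "ST"),
  AVG_ACR = avg(housing_df$ACR)
))
# ...then repeat the factor / merge / ggplot steps using AVG_ACR as the fill
```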
Although there isn’t an exact match, there seems to be a strong relationship between lot size and state population, and it appears to be an inverse one: more populated states tend to have smaller lots on average.
Exploring costs
There are up to four different variables reflecting the cost of a given house. In the following comparative plots, we will visualise the average value of each of them by state. Electricity and gas are monthly costs, while fuel and water are yearly ones.
We will use the function grid.arrange
from the package gridExtra
(it might need installation).
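A sketch of how the four maps could be built and arranged together. The cost column names (ELEP, GASP, FULP and WATP) are assumed from the ACS data dictionary, and the helper function simply repeats the aggregation and mapping steps shown above:

```r
library(gridExtra)  # install.packages("gridExtra") if needed

# Hypothetical helper: average a housing cost column by state and map it
state_cost_map <- function(col_name, title) {
  cost_df <- collect(summarize(groupBy(housing_df, "ST"),
                               AVG_COST = avg(housing_df[[col_name]])))
  cost_df$region <- tolower(factor(cost_df$ST,
                                   levels = states_df$st,
                                   labels = states_df$name))
  merged <- merge(state_map, cost_df, by = "region")
  merged <- merged[order(merged$order), ]
  ggplot(merged, aes(x = long, y = lat, group = group, fill = AVG_COST)) +
    geom_polygon(colour = "white") +
    ggtitle(title)
}

grid.arrange(
  state_cost_map("ELEP", "Electricity (monthly cost)"),
  state_cost_map("GASP", "Gas (monthly cost)"),
  state_cost_map("FULP", "Fuel (yearly cost)"),
  state_cost_map("WATP", "Water (yearly cost)"),
  ncol = 2
)
```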
In the previous maps we can see how the different utilities have different average costs across the United States. For example, while electricity and water costs for the average house tend to be higher in the southern states, gas (and perhaps fuel, although that one is fairly uniform) tends to be the other way around, with higher costs for the average northern house. This might be related to temperature and the use of water and air conditioning versus gas heating.
Exploring the year a property was built
The year a property was first built can take the following values:
- bb: N/A (GQ)
- 01: 1939 or earlier
- 02: 1940 to 1949
- 03: 1950 to 1959
- 04: 1960 to 1969
- 05: 1970 to 1979
- 06: 1980 to 1989
- 07: 1990 to 1999
- 08: 2000 to 2004
- 09: 2005
- 10: 2006
- 11: 2007
- 12: 2008
- 13: 2009
- 14: 2010
- 15: 2011
- 16: 2012
- 17: 2013
Without paying too much attention to the actual values, we will simply average the codes. Since they are ordered chronologically, the map will show states with older properties, on average, in darker colours.
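Again, only the aggregated column changes (this assumes YBL was read as a numeric code):

```r
housing_avg_ybl <- collect(summarize(
  groupBy(housing_df, "ST"),
  AVG_YBL = avg(housing_df$YBL)
))
```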
For example, states on the north-east coast, such as New York, have older properties on average, while states like Nevada have the most recent ones. On the west coast, California has the oldest properties. All this makes sense historically.
Exploring rent and income
The last variable we are going to explore is GRPIP
or the gross rent as a percentage of household income. This is a numeric variable, not a categorical one, so the interpretation is quite straightforward. Let’s map it.
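Once more, the same pipeline with a different column:

```r
housing_avg_grpip <- collect(summarize(
  groupBy(housing_df, "ST"),
  AVG_GRPIP = avg(housing_df$GRPIP)
))
# ...then the usual factor / merge / geom_polygon steps with AVG_GRPIP as the fill
```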
Not surprisingly, states such as California, New York, and Florida have the highest gross rent as a percentage of household income (getting close to 50%), while states such as Wyoming, North Dakota, and South Dakota have the lowest percentages. As we said, not a surprise, but it is good to have these confirmations so we know our chart is reasonably accurate. Then we can use it to learn about other states we don’t know that much about.
Conclusions
We have shown how, by using SparkR together with ggplot2’s mapping capabilities, we can visualise large datasets efficiently and work around possible scalability issues. Although the 2013 American Community Survey dataset cannot be considered dramatically large, it is big enough to be problematic for a standard R environment and traditional libraries such as ggplot2 or dplyr. The people behind SparkR have replicated a good number of the functions and abstractions we normally use in our R data processing pipelines, but backed by the distributed computation capabilities of Spark clusters.
Now it’s your turn. There are many other interesting variables to explore (e.g. taxes, the number of people living in a property, the language spoken, etc.). Go to the repo and fork the notebook so you can reproduce what we did and add your own analysis.