Construct a choropleth plot to display Covid-19 fatalities per capita by US County for the previous seven days
Note from the editors: is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.
For most of us on planet earth at this time, going through a global pandemic is an unprecedented experience. Sometimes I think about the period between the next-most-recent global pandemic of 1918–1920 and the present day. These 100 years have encompassed some of the most dramatic changes the world has seen to this point in almost any area you can think of, including data and technology. I wonder how those living in 1918 got their information about the pandemic and its spread? Today we have many information sources available and ready access to data displays, including predictive models, that describe and project the pandemic’s trajectory. As a data scientist, though, you might find that, even after reviewing what’s available, you have questions that remain unanswered, and would like to have the ability to monitor the quickly-evolving situation for yourself. Fortunately, through an intensive effort by the Johns Hopkins Center for Systems Science and Engineering (https://systems.jhu.edu/research/public-health/ncov/) you can access daily Covid-19 case counts and fatalities globally by country or for the United States specifically by state, county or territory. You can then use this information to design the data reporting and displays that you need to best navigate difficult decisions. As an example, in this report I’ll show you how to:
1. Access the Covid-19 data from Johns Hopkins on Github,
2. Perform a few simple data manipulations to transform the data, and
3. Construct a choropleth plot (https://rud.is/b/2016/03/29/easier-composite-u-s-choropleths-with-albersusa/) using the albersusa R package to display Covid-19 fatalities per capita by US County for the previous seven days.
You can start by loading the packages below. Note that if you don’t have the albersusa package installed you can do that using the install_github() function from the devtools package.
# Data on COVID-19 from Johns Hopkins Center for Systems Science and Engineering
# https://github.com/CSSEGISandData/COVID-19# Blog post to demonstrate chloroplot package
#https://rud.is/b/2016/03/29/easier-composite-u-s-choropleths-with-albersusa/rm(list = ls())
library(albersusa) # devtools::install_github("hrbrmstr/albersusa")
The data can be read using the following. Note that there are also data sets for US cases, and global cases and fatalities in the same or adjacent directories.
# read time series data on US deaths by county
The data is updated daily. For this example we want to use only the data from the most recent 8 days (because we need to subtract — you’ll see below). We are also saving the fips code (to identify US counties) and the population of each county so that we can compute per capita values.
# Collect data consisting of the last 8 days
tot_days <- ncol(cov19_us_dat)
begin_day <- tot_days - 7
fips <- cov19_us_dat$FIPS
pop <- cov19_us_dat$Population
us_recent <- cov19_us_dat[,begin_day:tot_days]
Note that the data from these files is in cumulative form, so if we want the daily counts we’ll need to subtract each day’s total from the day before it. A simple way to do that is below.
# Convert cumulative data to new values each day
us_daily <- us_recent[,-1] — us_recent[,-8]
us_fatal <- data.frame(fips, pop, us_daily)
colnames(us_fatal) <- c(‘fips’, ‘population’,colnames(us_daily))
Lastly, there are a few data processing steps:
1. Save fips codes only for counties located in one of the 50 US States or DC.
2. Deal with some cumulative counts that actually decrease from the previous day (possibly due to data clarifications or adjustments).
3. Convert the population to count per 100,000.
4. Compute deciles of the per capita rates. Choosing to group into 10 bins is a judgement call and there may be a better choice as the data change over time. Note that the first 5 deciles of counties experienced 0 deaths in the last 7 days, so there end up being only five groups.
5. When reading in fips codes, unless you can consistently store them as type ‘character’ they may convert to type ‘integer’ and lose their leading zero. The str_pad function can address this.
6. There was apparently a change in the fips code for Oglala Lakota, SD that needs to be addressed manually to avoid a large white square in the plot in the middle of South Dakota.
us_fatal <- us_fatal %>%
filter(fips %in% (1001:56045)) %>%
mutate(sum = rowSums(.[3:9]),
sum = ifelse(sum > 0, sum, 0),
pop_100_000 = population/100000,
percap = sum/pop_100_000,
percap_quint = quantcut(percap, q=10, dig.lab = 3),
fips = as.character(fips),
fips = str_pad(fips, 5, 'left','0'))
# replace fips for Oglala Lakota, SD (formerly Shannon, SD)
us_fatal$fips <- replace(us_fatal$fips, us_fatal$fips == "46102", "46113")
One option for setting up plot colors:
# Set up plot colors according to the number of levels of quintiles (many 0's)
no_colors <- length(unique(us_fatal$percap_quint))
quin_colors <- c("navy", "skyblue", "springgreen", "violet","violetred")
start_color <- 5 - no_colors + 1
plot_colors <- quin_colors[start_color:5]
And lastly, creating the figure using ggplot():
cmap <- fortify(counties_composite(), region=”fips”)gg <- ggplot()
gg <- gg + geom_cartogram(data=cmap, map=cmap,
aes(x=long, y=lat,map_id=id),show.legend = FALSE,
color=”#2b2b2b”, size=0.05, fill=NA)
gg <- gg + geom_cartogram(data=us_fatal, map=cmap,
gg <- gg + scale_fill_manual(name=”per 100,000 pop”, values = plot_colors)
gg <- gg + labs(title=’Covid-19 Fatalities per capita by US County, Last 7 days’,
subtitle=paste(‘Data source: Johns Hopkins Center for Systems Science and Engineering, ‘, Sys.Date(), sep=’’))
gg <- gg + theme_bw()
gg <- gg + xlab(NULL)
gg <- gg + ylab(NULL)
gg <- gg + theme(axis.ticks.x = element_blank())
gg <- gg + theme(axis.ticks.y = element_blank())
gg <- gg + theme(axis.text.x = element_blank())
gg <- gg + theme(axis.text.y = element_blank())
gg <- gg + theme(legend.position=c(0.85, 0.30))
gg <- gg + theme(plot.title=element_text(size=24, hjust = 0.5))
gg <- gg + theme(plot.subtitle=element_text(size=18, hjust = 0.5))
gg <- gg + theme(legend.text=element_text(size=14))
gg <- gg + theme(legend.title=element_text(size=18))png(paste(“choro.png”, sep = “”), width = 1200, height = 800, units = “px”)