Wednesday, May 20, 2020

Graphing COVID-19


There's been a lot of debate online about how various countries have fared in the face of the COVID-19 pandemic.  Often the countries chosen are cherry-picked, or the measure of deaths (i.e., totals, not rates) says more about the size of the countries' populations than their success or failure in any meaningful sense.

There are just 8 major countries with higher reported death rates than the US.  But the fact that all of them are well-functioning economies and democracies in Western Europe would tend to refute the notion that economic or political backwardness is the key to their high death rates.

In my opinion, the key disadvantages that the US and Western Europe have, relative to the Third World and islands like New Zealand, is that the former are so large and interconnected (in terms of travel).  A Western reluctance to wear masks is also a major disadvantage relative to East Asian countries.

To illustrate such comparisons, I've written an R script to download the latest daily data from the European Centre for Disease Prevention and Control (ECDC), and turn daily death rates for selected countries into a line graph, such as here (click on graph for higher resolution; the graph is followed by my R script):




# R Code to download and graph ECDC historical daily data
# David Pittelli -- woodedpaths.blogspot.com
# https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide

# Load packages************
library(ggplot2)
library(RColorBrewer)

# Download data from web************
df = read.csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv", na.strings = "", fileEncoding = "UTF-8-BOM", stringsAsFactors=FALSE)

# Subset rows and add columns ***********************

# eliminate NAs in countryterritoryCode
df = df[!is.na(df$countryterritoryCode),]

# Add deaths per 100 Million column
df$DPC = (df$deaths * 100000000) / df$popData2018

# Add 7-day moving average for deaths per 100 Million
# vectors are reversed since MA can only look upwards (or in both directions)
df$DPCMA = rev(filter(rev(df$DPC), rep(1/7,7), sides = 1))

# Add date column in format 'mm/dd/yyyy'
df$date2 = as.Date(df$dateRep, "%d/%m/%Y")

# Select Countries and dates to graph***********************
df2 = df[df$geoId == "ES" | df$geoId == "IT" | df$geoId == "UK" | df$geoId == "SE" | df$geoId == "US" | df$geoId == "DE" |  df$geoId == "KR"|  df$geoId == "NZ",]

# Just look since March 1
df2 = df2[df2$date2 > as.Date("2020-02-29"), ]


# Today's Date to Label Graph****************
today = paste(substr(date(), 5, 10), substr(date(), 21, 24), sep = ", ")

# Manual color selection****************
ggplotColours <- 360="" function="" h="c(0," n="6," o:p="">
  if ((diff(h)%%360) < 1) h[2] <- -="" 360="" h="" n="" o:p="">
    hcl(h = (seq(h[1], h[2], length = n)), c = 100, l = 65)
}

# Build plot****************
p = ggplot(data = df2, aes(x = date2, y = DPCMA, group = geoId, color = geoId)) +
     geom_line(size = 2) +
      theme_bw() +
      ggtitle(paste("Daily Death Rate from COVID-19, 7-Day smoothing,", today)) +
      theme(plot.title = element_text(size = 24)) +
      theme(axis.title.x = element_text(size = 14)) +
      theme(axis.title.y = element_text(size = 14)) +
      theme(legend.text = element_text(size = 14)) +
      xlab("Date") + ylab("Daily Deaths per 100 Million People") +
      theme(legend.title=element_blank()) +
    annotate("text", x = as.Date(18340, origin="1970-01-01"), y = 1800, label="David Pittelli -- woodedpaths.blogspot.com\nData from ecdc.europa.eu", col = "black")

# Assign colors and legend labels (or just type "p" for standard colors & 2-letter abbreviations)
p +  scale_colour_manual(breaks=c("ES", "IT", "UK", "SE", "US", "DE", "KR", "NZ"),
        labels=c("Spain", "Italy", "UK", "Sweden", "US", "Germany", "S. Korea", "New Zealand"),
        values =c(ggplotColours(7), "black"))

Friday, April 24, 2020

Mapping the World of COVID-19


A few days ago I was looking at the COVID-19 data at https://www.worldometers.info/coronavirus/

The website has a table showing data on cases, deaths, recoveries and testing for almost every country in the world.  You can sort the table by any of its columns, but I wanted to be able to see geographic patterns by looking at a map.  So, I decided to practice my R skills by writing code to webscrape the table data and produce a world choropleth map with colors coded to per capita death rates. (Click on map for higher resolution.)


Here is the R code.  I hope it is sufficiently documented to follow, and also that you can see how to modify it to make a choropleth from just about any website's table of country-level data:

# Webscrape COVID
# David Pittelli -- woodedpaths.blogspot.com
# Some code from SNHU IT 697 II
# http://selectorgadget.com/ will give CSS codes such as 'td' for table data.
# Data source: https://www.worldometers.info/coronavirus/

# Load functions************
library(rvest)
library(rworldmap)

#scrape table data************
url='https://www.worldometers.info/coronavirus/'
d=read_html(url)

#extract the table data ('td' in html) ************
scraped_td = html_nodes(d, 'td')

# convert the html data to text and remove non-country leading info************
TAB = html_text(scraped_td)
TAB = TAB[105:length(TAB)]

#build matrix of data, then convert to data frame************
M=data.frame(matrix(TAB, ncol=13,byrow=T), stringsAsFactors=FALSE)
df=M[,c("X1", "X2", "X9", "X10")]

# Clear out unneeded variables************
rm(list=setdiff(ls(), "df"))

# Rename columns ("CPM" = cases per million; "DPM" = deaths per million)************
colnames(df) = c("Country","Cases", "CPM", "DPM")

# Remove redundant rows (totals and repeats of every country) ************
lastrow = min(grep("Total:", df$Country)) - 1
df = df[1:lastrow, ]

# Fix countries with abbreviated names************
df[df$Country == "S. Korea", "Country"] = "South Korea"
df[df$Country == "UAE", "Country"] = "United Arab Emirates"
df[df$Country == "DRC", "Country"] = "Democratic Republic of the Congo"
df[df$Country == "CAR", "Country"] = "Central African Republic"
df[df$Country == "Czechia", "Country"] = "Czech Republic"
df[df$Country == "North Macedonia", "Country"] = "Macedonia"
df[df$Country == "Eswatini", "Country"] = "Swaziland"

# Add countries with missing data************
df = rbind(df, c("Turkmenistan", 0, 0, 0, 0))
df = rbind(df, c("North Korea", 0, 0, 0, 0))
df = rbind(df, c("Kazakhstan", 0, 0, 0, 0))
df = rbind(df, c("Tajikistan", 0, 0, 0, 0))
df = rbind(df, c("Antarctica", 0, 0, 0, 0))

# Remove commas from text-numbers and change from chr to numeric************
df$Cases = as.numeric(gsub(",","",df$Cases))
df$CPM = as.numeric(gsub(",","",df$CPM))
df$DPM = as.numeric(gsub(",","",df$DPM))

# Change NA values to 0
df[is.na(df$DPM), "DPM"] = 0

# Find highest DPM for country with Population > 1 million
# Only San Marino has higher DPM (as of 4/24/2020)
df$Pop = df$Cases / df$CPM # Approx. Population in millions
hmax = max(df[df$Pop > 1 & !is.na(df$Pop) & !is.na(df$DPM), "DPM"])

# Save with a file name including today's date************
today = paste(substr(date(), 5, 10), substr(date(), 21, 24))
filename = paste(today, "COVID table.csv")
write.csv(df, filename)

#join data to a map************
COVIDMap = joinCountryData2Map(df,
                               nameJoinColumn="Country",
                               joinCode="NAME",

                            verbose=TRUE)

# plot the map************
colorbreaks = c(0, 0.01, 0.1, 5, hmax/32, hmax/16, hmax/8, hmax/4, hmax/2, hmax/2 + 0.01, hmax/2 + 0.02, hmax)

mapCountryData(COVIDMap,
     nameColumnToPlot='DPM',
     catMethod = colorbreaks,
     colourPalette = "heat",
     oceanCol = "light sky blue",
     borderCol = "black",
     mapTitle = paste("COVID-19 Deaths per Million People,", today))

# Add labels, note that lat/lon is set to erroneous number to be in right place for zoomed graphic
text(x=155, y=45, label="?", col = "dark red", cex = 2)
text(x=100,y=-90, label="David Pittelli -- woodedpaths.blogspot.com", col = "black", cex = 0.8)