Monday, July 06, 2020

Graphing COVID-19

There's been a lot of debate online about how various countries have fared in the face of the COVID-19 pandemic.  Often the countries chosen are cherry-picked, or the measure of deaths (i.e., totals, not rates) says more about the size of the countries' populations than their success or failure in any meaningful sense.

There are just 6 major countries with higher reported death rates than the US.  But the fact that all of them are well-functioning economies and democracies in Western Europe would tend to refute the notion that economic or political backwardness is the key to their high death rates.

In my opinion, the key disadvantages that the US and Western Europe have, relative to less wealthy countries and isolated islands like New Zealand, is that the former are so large and interconnected (in terms of travel).  A Western, especially US, reluctance to wear masks is also a major disadvantage relative to East Asian countries.

To illustrate such comparisons, I've written an R script to download the latest daily data from the European Centre for Disease Prevention and Control (ECDC), and turn daily death rates for selected countries into a line graph, such as here (click on graph for higher resolution; the graph is followed by my R script).

Note that Spain has revised figures downward for a reputed overcount of prior deaths; rather than go back and alter prior days' numbers, they did this by logging 1,918 "negative deaths" on May 25.  Later they revised numbers upward by 1,179 on June 19.


# **********************
# R Code to download and graph ECDC historical daily data
# David Pittelli -- woodedpaths.blogspot.com
# https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide

setwd("C:/Users/YourDirectory")
rm(list=ls())

# Load functions************
library(ggplot2)
library(scales)
library(RColorBrewer)

# Manual color adjust****************
ggplotColours <- 360="" function="" h="c(0," n="6," o:p="">
  if ((diff(h)%%360) < 1) h[2] <- -="" 360="" h="" n="" o:p="">
    hcl(h = (seq(h[1], h[2], length = n)), c = 100, l = 65)

}

# Download data from web************
df = read.csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv", na.strings = "", fileEncoding = "UTF-8-BOM", stringsAsFactors=FALSE)

# convert date info in format 'mm/dd/yyyy'
df$date2 = as.Date(df$dateRep, "%d/%m/%Y")

# Subset rows and add columns ***********************
# eliminate NAs in countryterritoryCode
df = df[!is.na(df$countryterritoryCode),]

# Add deaths per 100 Million column
df$DPC = (df$deaths * 100000000) / df$popData2019

# Add cases per 100 Million column
df$CPC = (df$cases * 100000000) / df$popData2019

# Add 7-day moving averages
# vectors are reversed since MA can only look upwards (or in both directions)
df$DPCMA = rev(filter(rev(df$DPC), rep(1/7,7), sides = 1))
df$CPCMA = rev(filter(rev(df$CPC), rep(1/7,7), sides = 1))
df$deathsMA = rev(filter(rev(df$deaths), rep(1/7,7), sides = 1))

# Add day of week  and 1-7 *****************************
df$dayOfWeek = weekdays(as.Date(df$date2))
df$dayNum = setNames(0:6, c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))[weekdays(as.Date(df$date2))]

# Save with a file name including today's date***********
today = paste(substr(date(), 5, 10), substr(date(), 21, 24))
filename = paste(today, "ECDC COVID table.csv")
write.csv(df, filename)

# Just look since March 1
df = df[df$date2 > as.Date("2020-02-29"), ]

# Graph ***************************************
# Select Countries
df2 = df[df$geoId == "ES" | df$geoId == "IT" | df$geoId == "UK" | df$geoId == "SE" | df$geoId == "US" | df$geoId == "DE" |  df$geoId == "KR"|  df$geoId == "NZ",]

# Annotated plot, 7-day Moving Average, linear scale
p = ggplot(data = df2, aes(x = date2, y = DPCMA, group = geoId, color = geoId)) +
     geom_line(size = 1.2) +
      theme_bw() +
      ggtitle(paste("Daily Death Rate from COVID-19, 7-Day smoothing,", today)) +
      theme(plot.title = element_text(size = 24)) +
      theme(axis.title.x = element_text(size=14)) +
      theme(axis.title.y = element_text(size=14)) +
      theme(axis.text.x = element_text(size=14)) +
      theme(axis.text.y = element_text(size=14)) +
      theme(legend.text = element_text(size = 14)) +
      xlab("Date") + ylab("Daily Deaths per 100 Million People") +
      theme(legend.title=element_blank())

# Manual colors; black for US
p +  scale_colour_manual(breaks=c("ES", "IT", "UK", "SE", "US", "DE", "KR", "NZ"), labels=c("Spain", "Italy", "UK", "Sweden", "US", "Germany", "S. Korea", "New Zealand"),
        values =c(ggplotColours(7), "black")) +
      annotate("text", x = as.Date(18425, origin="1970-01-01"), y = 1750, label="David Pittelli -- woodedpaths.blogspot.com\nData from ecdc.europa.eu", col = "black")

Friday, April 24, 2020

Mapping the World of COVID-19


A few days ago I was looking at the COVID-19 data at https://www.worldometers.info/coronavirus/

The website has a table showing data on cases, deaths, recoveries and testing for almost every country in the world.  You can sort the table by any of its columns, but I wanted to be able to see geographic patterns by looking at a map.  So, I decided to practice my R skills by writing code to webscrape the table data and produce a world choropleth map with colors coded to per capita death rates. (Click on map for higher resolution.)


Here is the R code.  I hope it is sufficiently documented to follow, and also that you can see how to modify it to make a choropleth from just about any website's table of country-level data:


# Webscrape COVID
# David Pittelli -- woodedpaths.blogspot.com
# Some code from SNHU IT 697 II
# http://selectorgadget.com/ will give CSS codes such as 'td' for table data.
# Data source: https://www.worldometers.info/coronavirus/

#install.packages('rvest')
#install.packages'rworldmap')

# Load functions************
library(rvest)
library(rworldmap)

setwd("C:/Users/David/Documents/2020 Choropleth")
rm(list=ls())

#scrape table data************
url='https://www.worldometers.info/coronavirus/'
d=read_html(url)

#extract the table data ('td' in html) ************
scraped_td = html_nodes(d, 'td')

# convert the html data to text ************
Tab = html_text(scraped_td)

# extract table head and convert to text for column names ************
# scraped_th = html_nodes(d, 'th')
# TabHead = html_text(scraped_th)

# Remove non-country leading info.  Start with the "1" followed by the Country with most Deaths***
first = min(grep("^1$", Tab)) # The first "1" as whole string in Tab is probably where we want to start
# Use anchors. ^ is the start. $ is the end.
# The start position was 115, then went to 153 on or slightly before 2020 06 01 run.

last = min(grep("Total", Tab)) - 2

Tab2 = Tab[first:last]

#build matrix of data, then convert to data frame************
df=data.frame(matrix(Tab2, ncol=19,byrow=T), stringsAsFactors=FALSE)
df=df[, c(2:16)]

colnames(df) = c("Country", "TotalCases", "NewCases", "TotalDeaths", "NewDeaths", "TotalRecovered", "NewRecovered", "ActiveCases", "CriticalCases", "CPM", "DPM", "TotalTests", "TPM", "Population", "Region")

# Clear out unneeded variables************
# rm(list=setdiff(ls(), "df"))

# Fix countries with abbreviated names************
df[df$Country == "S. Korea", "Country"] = "South Korea"
df[df$Country == "UAE", "Country"] = "United Arab Emirates"
df[df$Country == "DRC", "Country"] = "Democratic Republic of the Congo"
df[df$Country == "CAR", "Country"] = "Central African Republic"
df[df$Country == "Czechia", "Country"] = "Czech Republic"
df[df$Country == "North Macedonia", "Country"] = "Macedonia"
df[df$Country == "Eswatini", "Country"] = "Swaziland"

# Add countries with missing data************
# df = rbind(df, c("Turkmenistan", rep(0, 9), NA, rep(0, 4))) # claims no cases, obviously untrue
# df = rbind(df, c("North Korea", rep(0, 15))) # claims no cases, obviously untrue
df = rbind(df, c("Antarctica", rep(0, 15)))
df = rbind(df, c("Kosovo", rep(0, 9), 66/1.845, 0, 0, 1845000, 'Europe')) # deaths / millions 7/4/20
# source: https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Kosovo#Cases
# df = rbind(df, c("Kazakhstan", rep(0, 15))) # they got data by July
# df = rbind(df, c("Tajikistan", rep(0, 15))) # they got data by July

# Remove commas from text-numbers and change from chr to numeric************
for (i in 2:14) {
df[ , i] = as.numeric(gsub(",","",df[ , i]))
}

# Change NA values to 0
df[is.na(df$DPM), "DPM"] = 0
df[is.na(df$CPM), "CPM"] = 0

# Find highest DPM for country with Population > 1 million
# Only San Marino has higher DPM (as of 4/24/2020)
hmax = max(df[df$Population > 1000000 & !is.na(df$Population) & !is.na(df$DPM), "DPM"])

# Save with a file name including today's date************
today = paste(substr(date(), 5, 10), substr(date(), 21, 24))
filename = paste(today, "COVID table.csv")
write.csv(df, filename)

#join data to a map************
COVIDMap = joinCountryData2Map(df,
                               nameJoinColumn="Country",
                               joinCode="NAME",
                            verbose=TRUE)

# plot the map************
colorbreaks = c(0, 0.01, 0.1, 5, hmax/32, hmax/16, hmax/8, hmax/4, hmax/2, hmax/2 + 0.01, hmax/2 + 0.02, hmax)

mapCountryData(COVIDMap,
     nameColumnToPlot='DPM',
     catMethod = colorbreaks,
     colourPalette = "heat",
     oceanCol = "light sky blue",
     missingCountryCol = "dimgrey",
     borderCol = "black",
     mapTitle = paste("COVID-19 Deaths per Million People,", today))

# Add labels, note that lat/lon is set to erroneous number to be in right place for zoomed graphic
text(x=155, y=45, label="?", col = "dark red", cex = 2)
text(x=100,y=-90, label="David Pittelli -- woodedpaths.blogspot.com", col = "black", cex = 0.8)