Friday, April 24, 2020

Mapping the World of COVID-19


A few days ago I was looking at the COVID-19 data at https://www.worldometers.info/coronavirus/

The website has a table showing data on cases, deaths, recoveries and testing for almost every country in the world.  You can sort the table by any of its columns, but I wanted to be able to see geographic patterns by looking at a map.  So, I decided to practice my R skills by writing code to webscrape the table data and produce a world choropleth map with colors coded to per capita death rates. (Click on map for higher resolution.)


Here is the R code.  I hope it is sufficiently documented to follow, and also that you can see how to modify it to make a choropleth from just about any website's table of country-level data:

# Webscrape COVID
# David Pittelli -- woodedpaths.blogspot.com
# Some code from SNHU IT 697 II
# http://selectorgadget.com/ will give CSS codes such as 'td' for table data.
# Data source: https://www.worldometers.info/coronavirus/

# Load functions************
library(rvest)
library(rworldmap)

#scrape table data************
url='https://www.worldometers.info/coronavirus/'
d=read_html(url)

#extract the table data ('td' in html) ************
scraped_td = html_nodes(d, 'td')

# convert the html data to text and remove non-country leading info************
TAB = html_text(scraped_td)
TAB = TAB[105:length(TAB)]

#build matrix of data, then convert to data frame************
M=data.frame(matrix(TAB, ncol=13,byrow=T), stringsAsFactors=FALSE)
df=M[,c("X1", "X2", "X9", "X10")]

# Clear out unneeded variables************
rm(list=setdiff(ls(), "df"))

# Rename columns ("CPM" = cases per million; "DPM" = deaths per million)************
colnames(df) = c("Country","Cases", "CPM", "DPM")

# Remove redundant rows (totals and repeats of every country) ************
lastrow = min(grep("Total:", df$Country)) - 1
df = df[1:lastrow, ]

# Fix countries with abbreviated names************
df[df$Country == "S. Korea", "Country"] = "South Korea"
df[df$Country == "UAE", "Country"] = "United Arab Emirates"
df[df$Country == "DRC", "Country"] = "Democratic Republic of the Congo"
df[df$Country == "CAR", "Country"] = "Central African Republic"
df[df$Country == "Czechia", "Country"] = "Czech Republic"
df[df$Country == "North Macedonia", "Country"] = "Macedonia"
df[df$Country == "Eswatini", "Country"] = "Swaziland"

# Add countries with missing data************
df = rbind(df, c("Turkmenistan", 0, 0, 0, 0))
df = rbind(df, c("North Korea", 0, 0, 0, 0))
df = rbind(df, c("Kazakhstan", 0, 0, 0, 0))
df = rbind(df, c("Tajikistan", 0, 0, 0, 0))
df = rbind(df, c("Antarctica", 0, 0, 0, 0))

# Remove commas from text-numbers and change from chr to numeric************
df$Cases = as.numeric(gsub(",","",df$Cases))
df$CPM = as.numeric(gsub(",","",df$CPM))
df$DPM = as.numeric(gsub(",","",df$DPM))

# Change NA values to 0
df[is.na(df$DPM), "DPM"] = 0

# Find highest DPM for country with Population > 1 million
# Only San Marino has higher DPM (as of 4/24/2020)
df$Pop = df$Cases / df$CPM # Approx. Population in millions
hmax = max(df[df$Pop > 1 & !is.na(df$Pop) & !is.na(df$DPM), "DPM"])

# Save with a file name including today's date************
today = paste(substr(date(), 5, 10), substr(date(), 21, 24))
filename = paste(today, "COVID table.csv")
write.csv(df, filename)

#join data to a map************
COVIDMap = joinCountryData2Map(df,
                               nameJoinColumn="Country",
                               joinCode="NAME",

                            verbose=TRUE)

# plot the map************
colorbreaks = c(0, 0.01, 0.1, 5, hmax/32, hmax/16, hmax/8, hmax/4, hmax/2, hmax/2 + 0.01, hmax/2 + 0.02, hmax)

mapCountryData(COVIDMap,
     nameColumnToPlot='DPM',
     catMethod = colorbreaks,
     colourPalette = "heat",
     oceanCol = "light sky blue",
     borderCol = "black",
     mapTitle = paste("COVID-19 Deaths per Million People,", today))

# Add labels, note that lat/lon is set to erroneous number to be in right place for zoomed graphic
text(x=155, y=45, label="?", col = "dark red", cex = 2)
text(x=100,y=-90, label="David Pittelli -- woodedpaths.blogspot.com", col = "black", cex = 0.8)

No comments: