Thursday, December 17, 2020

Some States Are Better Than Others

Someone on Twitter: "Tell me the best state without telling me Texas."

Several people answered that they don't know, but they did know that the worst state is Florida.

There are things to like and dislike about every state.  But perhaps the most objective and broadest measure of how good or appealing a place is, is based on the people who have voted with their feet, leaving places with little appeal or opportunity, to migrate to places which are more appealing and provide much opportunity.  By that measure, Florida is clearly the best state over the last decade – the only state with very strong positive international and domestic migration for 2010-2019, which is why it is top right (blue circle) in the scatter graph below. (DC is not far behind, and Texas is doing fairly well by both measures.)

The distance from the sloping red line is each state's total (international and domestic) migration.  The upper right states are the most popular, the lower left the least popular. (Click for higher resolution.)


Domestic migration is, necessarily, evenly balanced.  When you include state population (the size of the dots), the graph is evenly balanced on the vertical blue zero line.

The states with the worst domestic outmigration (negative migration) are: Alaska, New York, Illinois, Connecticut, New Jersey.  In general, people have been leaving the Midwest and Northeast for the West and South.  The one Midwestern (here, red) state attracting significant numbers is North Dakota, mostly due to an oil boom.  Only two Northeastern states have seen even slightly positive migration: Maine and New Hampshire.

While the Northwest and Mountain West saw gains, Alaska and Hawaii saw substantial outmigration.  California's outmigration was smaller as a percentage of population, but the largest in the country in absolute terms.


For international migration (height above the blue horizontal line), the US regions are more evenly distributed. West Virginia has the weakest international migration; there are not a lot of West Virginia jobs attracting people from overseas.  Massachusetts has attracted a lot of foreign workers, even as domestic outmigration has been considerable.  Boston and its suburbs have been performing very well economically, but this has also made real estate there among the most expensive in the country, pricing a lot of people out.


For 2019-2020, components of population change won’t be available until February, 2021.  In the face of the coronavirus pandemic, the raw percentage changes in population (map and link below) show that most of these trends continued.

Monday, July 06, 2020

Graphing COVID-19

There's been a lot of debate online about how various countries have fared in the face of the COVID-19 pandemic.  Often the countries chosen are cherry-picked, or the measure of deaths (i.e., totals, not rates) says more about the size of the countries' populations than their success or failure in any meaningful sense.

There are just 6 major countries with higher reported death rates than the US.  But the fact that all of them are well-functioning economies and democracies in Western Europe would tend to refute the notion that economic or political backwardness is the key to their high death rates.

In my opinion, the key disadvantages that the US and Western Europe have, relative to less wealthy countries and isolated islands like New Zealand, is that the former are so large and interconnected (in terms of travel).  A Western, especially US, reluctance to wear masks is also a major disadvantage relative to East Asian countries.

To illustrate such comparisons, I've written an R script to download the latest daily data from the European Centre for Disease Prevention and Control (ECDC), and turn daily death rates for selected countries into a line graph, such as here (click on graph for higher resolution; the graph is followed by my R script).

Note that Spain has revised figures downward for a reputed overcount of prior deaths; rather than go back and alter prior days' numbers, they did this by logging 1,918 "negative deaths" on May 25.  Later they revised numbers upward by 1,179 on June 19.

# **********************
# R Code to download and graph ECDC historical daily data
# David Pittelli --


# Load functions************

# Manual color adjust****************
ggplotColours <- 360="" function="" h="c(0," n="6," o:p="">
  if ((diff(h)%%360) < 1) h[2] <- -="" 360="" h="" n="" o:p="">
    hcl(h = (seq(h[1], h[2], length = n)), c = 100, l = 65)


# Download data from web************
df = read.csv("", na.strings = "", fileEncoding = "UTF-8-BOM", stringsAsFactors=FALSE)

# convert date info in format 'mm/dd/yyyy'
df$date2 = as.Date(df$dateRep, "%d/%m/%Y")

# Subset rows and add columns ***********************
# eliminate NAs in countryterritoryCode
df = df[!$countryterritoryCode),]

# Add deaths per 100 Million column
df$DPC = (df$deaths * 100000000) / df$popData2019

# Add cases per 100 Million column
df$CPC = (df$cases * 100000000) / df$popData2019

# Add 7-day moving averages
# vectors are reversed since MA can only look upwards (or in both directions)
df$DPCMA = rev(filter(rev(df$DPC), rep(1/7,7), sides = 1))
df$CPCMA = rev(filter(rev(df$CPC), rep(1/7,7), sides = 1))
df$deathsMA = rev(filter(rev(df$deaths), rep(1/7,7), sides = 1))

# Add day of week  and 1-7 *****************************
df$dayOfWeek = weekdays(as.Date(df$date2))
df$dayNum = setNames(0:6, c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))[weekdays(as.Date(df$date2))]

# Save with a file name including today's date***********
today = paste(substr(date(), 5, 10), substr(date(), 21, 24))
filename = paste(today, "ECDC COVID table.csv")
write.csv(df, filename)

# Just look since March 1
df = df[df$date2 > as.Date("2020-02-29"), ]

# Graph ***************************************
# Select Countries
df2 = df[df$geoId == "ES" | df$geoId == "IT" | df$geoId == "UK" | df$geoId == "SE" | df$geoId == "US" | df$geoId == "DE" |  df$geoId == "KR"|  df$geoId == "NZ",]

# Annotated plot, 7-day Moving Average, linear scale
p = ggplot(data = df2, aes(x = date2, y = DPCMA, group = geoId, color = geoId)) +
     geom_line(size = 1.2) +
      theme_bw() +
      ggtitle(paste("Daily Death Rate from COVID-19, 7-Day smoothing,", today)) +
      theme(plot.title = element_text(size = 24)) +
      theme(axis.title.x = element_text(size=14)) +
      theme(axis.title.y = element_text(size=14)) +
      theme(axis.text.x = element_text(size=14)) +
      theme(axis.text.y = element_text(size=14)) +
      theme(legend.text = element_text(size = 14)) +
      xlab("Date") + ylab("Daily Deaths per 100 Million People") +

# Manual colors; black for US
p +  scale_colour_manual(breaks=c("ES", "IT", "UK", "SE", "US", "DE", "KR", "NZ"), labels=c("Spain", "Italy", "UK", "Sweden", "US", "Germany", "S. Korea", "New Zealand"),
        values =c(ggplotColours(7), "black")) +
      annotate("text", x = as.Date(18425, origin="1970-01-01"), y = 1750, label="David Pittelli --\nData from", col = "black")

Friday, April 24, 2020

Mapping the World of COVID-19

A few days ago I was looking at the COVID-19 data at

The website has a table showing data on cases, deaths, recoveries and testing for almost every country in the world.  You can sort the table by any of its columns, but I wanted to be able to see geographic patterns by looking at a map.  So, I decided to practice my R skills by writing code to webscrape the table data and produce a world choropleth map with colors coded to per capita death rates. (Click on map for higher resolution.)

Here is the R code.  I hope it is sufficiently documented to follow, and also that you can see how to modify it to make a choropleth from just about any website's table of country-level data:

# Webscrape COVID
# David Pittelli --
# Some code from SNHU IT 697 II
# will give CSS codes such as 'td' for table data.
# Data source:


# Load functions************

setwd("C:/Users/David/Documents/2020 Choropleth")

#scrape table data************

#extract the table data ('td' in html) ************
scraped_td = html_nodes(d, 'td')

# convert the html data to text ************
Tab = html_text(scraped_td)

# extract table head and convert to text for column names ************
# scraped_th = html_nodes(d, 'th')
# TabHead = html_text(scraped_th)

# Remove non-country leading info.  Start with the "1" followed by the Country with most Deaths***
first = min(grep("^1$", Tab)) # The first "1" as whole string in Tab is probably where we want to start
# Use anchors. ^ is the start. $ is the end.
# The start position was 115, then went to 153 on or slightly before 2020 06 01 run.

last = min(grep("Total", Tab)) - 2

Tab2 = Tab[first:last]

#build matrix of data, then convert to data frame************
df=data.frame(matrix(Tab2, ncol=19,byrow=T), stringsAsFactors=FALSE)
df=df[, c(2:16)]

colnames(df) = c("Country", "TotalCases", "NewCases", "TotalDeaths", "NewDeaths", "TotalRecovered", "NewRecovered", "ActiveCases", "CriticalCases", "CPM", "DPM", "TotalTests", "TPM", "Population", "Region")

# Clear out unneeded variables************
# rm(list=setdiff(ls(), "df"))

# Fix countries with abbreviated names************
df[df$Country == "S. Korea", "Country"] = "South Korea"
df[df$Country == "UAE", "Country"] = "United Arab Emirates"
df[df$Country == "DRC", "Country"] = "Democratic Republic of the Congo"
df[df$Country == "CAR", "Country"] = "Central African Republic"
df[df$Country == "Czechia", "Country"] = "Czech Republic"
df[df$Country == "North Macedonia", "Country"] = "Macedonia"
df[df$Country == "Eswatini", "Country"] = "Swaziland"

# Add countries with missing data************
# df = rbind(df, c("Turkmenistan", rep(0, 9), NA, rep(0, 4))) # claims no cases, obviously untrue
# df = rbind(df, c("North Korea", rep(0, 15))) # claims no cases, obviously untrue
df = rbind(df, c("Antarctica", rep(0, 15)))
df = rbind(df, c("Kosovo", rep(0, 9), 66/1.845, 0, 0, 1845000, 'Europe')) # deaths / millions 7/4/20
# source:
# df = rbind(df, c("Kazakhstan", rep(0, 15))) # they got data by July
# df = rbind(df, c("Tajikistan", rep(0, 15))) # they got data by July

# Remove commas from text-numbers and change from chr to numeric************
for (i in 2:14) {
df[ , i] = as.numeric(gsub(",","",df[ , i]))

# Change NA values to 0
df[$DPM), "DPM"] = 0
df[$CPM), "CPM"] = 0

# Find highest DPM for country with Population > 1 million
# Only San Marino has higher DPM (as of 4/24/2020)
hmax = max(df[df$Population > 1000000 & !$Population) & !$DPM), "DPM"])

# Save with a file name including today's date************
today = paste(substr(date(), 5, 10), substr(date(), 21, 24))
filename = paste(today, "COVID table.csv")
write.csv(df, filename)

#join data to a map************
COVIDMap = joinCountryData2Map(df,

# plot the map************
colorbreaks = c(0, 0.01, 0.1, 5, hmax/32, hmax/16, hmax/8, hmax/4, hmax/2, hmax/2 + 0.01, hmax/2 + 0.02, hmax)

     catMethod = colorbreaks,
     colourPalette = "heat",
     oceanCol = "light sky blue",
     missingCountryCol = "dimgrey",
     borderCol = "black",
     mapTitle = paste("COVID-19 Deaths per Million People,", today))

# Add labels, note that lat/lon is set to erroneous number to be in right place for zoomed graphic
text(x=155, y=45, label="?", col = "dark red", cex = 2)
text(x=100,y=-90, label="David Pittelli --", col = "black", cex = 0.8)