Data Source
UPDATE (04-02-2020): The JHU dataset changed its name and format. The USA and global data are now separated into different files, and the US data adopted a new format. Some amateur data scientists are in charge of this project.
The data source used for this analysis is the 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository built by the Center for Systems Science and Engineering, Johns Hopkins University (GitHub – CSSEGISandData/COVID-19: Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE). The following field descriptions apply to the global dataset.
Field description:
- Province/State: China – province name; US/Canada/Australia – city name, state/province name; Others – name of the event (e.g., the “Diamond Princess” cruise ship); all other countries – blank.
- Country/Region: country/region name conforming to WHO (will be updated).
- Last Update: MM/DD/YYYY HH:mm (24 hour format, in UTC).
- Confirmed: the number of confirmed cases. For Hubei Province: from Feb 13 (GMT+8), both clinically diagnosed and lab-confirmed cases are reported. For lab-confirmed cases only (before Feb 17), please refer to who_covid_19_situation_reports. For Italy, the diagnostic standard may have changed as of Feb 27 in order to “slow the growth of new case numbers.” (Source)
- Deaths: the number of deaths.
- Recovered: the number of recovered cases.
Time series summary (csse_covid_19_time_series)
This folder contains daily time series summary tables for confirmed cases, deaths, and recoveries. All data are taken from the daily case reports.
Field description:
- Province/State: same as above.
- Country/Region: same as above.
- Lat and Long: coordinate references for the user.
- Date fields: M/DD/YYYY (UTC); the same data as in the daily MM-DD-YYYY.csv files.
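Note that when these CSV files are read into R with read.csv(), the date column headers are sanitized by make.names(): a header such as 1/23/20 becomes X1.23.20. This is why the parsing code further down strips the leading character before calling lubridate::mdy(). A minimal illustration (assuming read.csv's default check.names = TRUE):

library(lubridate)

# read.csv() sanitizes headers with make.names(): "1/23/20" -> "X1.23.20"
make.names("1/23/20")
#> [1] "X1.23.20"

# Recover a Date from such a header; this is what substr(., 2, 8) + mdy() do later
mdy(substr("X1.23.20", 2, 8))
#> [1] "2020-01-23"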
R Packages Needed
# "pander" added to the list: panderOptions() below comes from the pander package
list.of.packages <- c("ggplot2", "lubridate", "tidyverse", "magrittr", "dplyr", "maps", "ggmap",
                      "mapdata", "viridis", "ggforce", "kableExtra", "gganimate", "ggthemes", "pander")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
eval(parse(text = paste("library(", list.of.packages, ")")))  # load the needed packages
options(digits = 3)
panderOptions('round', 3)
panderOptions('keep.trailing.zeros', TRUE)
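As an aside, the eval(parse()) idiom above works but is hard to read. A more idiomatic equivalent (an alternative sketch, not the original code) attaches each package by name:

# character.only = TRUE lets library() accept a package name as a string
invisible(lapply(list.of.packages, library, character.only = TRUE))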
Loading Data (UPDATED)
The data preparation code is shamelessly taken from here.
# As of 03-26-2020, the dataset changed its format; the following code is revised accordingly.

# source data files
filenames <- c('time_series_covid19_confirmed_global.csv',
               'time_series_covid19_deaths_global.csv',
               'time_series_covid19_recovered_global.csv',
               'time_series_covid19_confirmed_US.csv',
               'time_series_covid19_deaths_US.csv')
url.path <- 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series'

## download files to local
if (!dir.exists('./data')) dir.create('./data')  # make sure the target folder exists
download <- function(filename) {
  url <- file.path(url.path, filename)
  dest <- file.path('./data', filename)
  download.file(url, dest)
}
bin <- lapply(filenames, download)

# load data into R
data.confirmed.us <- read.csv('./data/time_series_covid19_confirmed_US.csv')
data.deaths.us <- read.csv('./data/time_series_covid19_deaths_US.csv')
data.deaths.world <- read.csv('./data/time_series_covid19_deaths_global.csv')
data.recovered.world <- read.csv('./data/time_series_covid19_recovered_global.csv')
data.confirmed.world <- read.csv('./data/time_series_covid19_confirmed_global.csv')
data.confirmed.world_orig <- data.confirmed.world
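One optional refinement (a sketch, not part of the original post): skip files that already exist locally, so re-running the analysis does not re-download everything. download_cached is a hypothetical helper name:

# Only fetch a file if it is not already cached under ./data
download_cached <- function(filename) {
  dest <- file.path('./data', filename)
  if (!file.exists(dest)) {
    download.file(file.path(url.path, filename), dest)
  }
  dest
}
# bin <- lapply(filenames, download_cached)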
Each row of the US dataset corresponds to a county (the Admin2 field), giving `r dim(data.confirmed.us)[1]` rows and `r dim(data.confirmed.us)[2]` columns in total. The first eleven columns are metadata (UID, ISO and FIPS codes, county and state names, coordinates); each remaining column corresponds to a single day. Here we draw a random sample of 10 rows and have a look at their first 10 columns.
data_usa <- data.confirmed.us  # %>% filter(Country.Region=='US')
# data.confirmed.orig <- data.confirmed
data.confirmed.us[, 1:10] %>% sample_n(10)
# %>% kable("latex", booktabs=T, caption="Raw Data (Confirmed, First 10 Columns only)") %>%
#     kable_styling(font_size=6, latex_options = c("striped", "hold_position", "repeat_header"))
UID | iso2 | iso3 | code3 | FIPS | Admin2 | Province_State | Country_Region | Lat | Long_
84022059 | US | USA | 840 | 22059 | LaSalle | Louisiana | US | 31.67884782 | -92.15907765
84016019 | US | USA | 840 | 16019 | Bonneville | Idaho | US | 43.38713372 | -111.6161537
84032003 | US | USA | 840 | 32003 | Clark | Nevada | US | 36.21458855 | -115.0130241
84036117 | US | USA | 840 | 36117 | Wayne | New York | US | 43.15494365 | -77.02976528
84026025 | US | USA | 840 | 26025 | Calhoun | Michigan | US | 42.24633834 | -85.00493569
84037173 | US | USA | 840 | 37173 | Swain | North Carolina | US | 35.48665845 | -83.48748932
84048065 | US | USA | 840 | 48065 | Carson | Texas | US | 35.40365929 | -101.3542669
84049003 | US | USA | 840 | 49003 | Box Elder | Utah | US | 41.52106798 | -113.0832816
84054089 | US | USA | 840 | 54089 | Summers | West Virginia | US | 37.65390597 | -80.86009693
84080040 | US | USA | 840 | 80040 | Out of OK | Oklahoma | US | 0 | 0
n.col <- ncol(data.confirmed.us)

## get dates from the column names (the date columns follow the metadata columns)
dates <- names(data.confirmed.us)[13:n.col] %>% substr(2, 8) %>% mdy()
range(dates)
max.date <- max(dates)
min.date <- min(dates)
[1] "2020-01-23" "2020-04-02"
It shows that the data was last updated on 2020-04-02.
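Since the repository is updated daily, a quick staleness check can guard against plotting old data. This is a convenience sketch (the two-day threshold is an arbitrary assumption, not part of the original analysis):

# Warn if the newest date in the data lags today's date by more than two days
if (as.numeric(Sys.Date() - max.date) > 2) {
  warning("CSSE time series looks stale; consider re-running the download step.")
}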
USA State-by-state Map
n.col <- ncol(data_usa)
# Create breaks for the color scale
# mybreaks <- c(2, 10, 40, 100)

states <- map_data("state")

dt1 <- data_usa %>%
  select(-c(UID, iso2, iso3, code3, FIPS, Admin2, Country_Region, Combined_Key)) %>%
  # filter(!str_detect(Province.State, "\\,")) %>%   # if you want to include city data
  # mutate(Province.State = toupper(str_sub(Province.State, -2, -1))) %>%
  mutate_all(~replace(., is.na(.), 0)) %>%
  filter(Province_State != "C.") %>%
  gather(Date, Confirmed, -Province_State, -Lat, -Long_) %>%
  # convert the Date column to the right format
  mutate(Date = substr(Date, 2, 8)) %>%
  mutate(Date = mdy(Date)) %>%
  # filter(Date < "2020-03-25") %>%
  filter(Confirmed > 0) %>%                    # drop zero counts from the plot
  filter(Long_ > -130 & Long_ < 0 & Lat > 20)  # focus on mainland USA; also drops placeholder
                                               # rows (e.g., "Out of OK") with Lat/Long of 0

ggplot() +
  geom_polygon(data = states, aes(x = long, y = lat, group = group), color = "white", alpha = 0.1) +
  geom_point(data = dt1, aes(x = Long_, y = Lat, size = Confirmed, color = Confirmed, alpha = Confirmed)) +
  scale_size(limits = c(1, 1000), range = c(.1, 10), guide_legend(title = "")) +
  scale_alpha_continuous(range = c(0.1, 0.9), guide = FALSE) +  # turn off the alpha legend
  scale_color_viridis(limits = c(1, 1000)) +
  theme_void() +
  coord_map() +  # coord_map() supersedes the earlier redundant coord_fixed(1.3)
  labs(title = paste0('COVID-19 Distribution in the USA (updated through ', max.date, ')')) +
  guides(fill = FALSE) +  # leave off the fill legend
  theme(
    text = element_text(color = "#22211d"),
    plot.title = element_text(size = 16, hjust = 0.5, color = "#4e4d47",
                              margin = margin(b = 0.1, t = 0.4, l = 2, unit = "cm"))
  )
Animation by Day
The gganimate (or gapminder) package would make this a much easier job, but the dataset has city/county rows with duplicate dates, so the following manual rendering method is used instead; a gganimate sketch appears after the code block below.
However, the city/county-level data has since stopped updating; see Cities and counties statics going dark · Issue #1068 · CSSEGISandData/COVID-19 · GitHub:
Starting 3/10/20, they are only reporting the US by state. Province/State is the smallest subdivision they go down to, and only for some countries (dependencies of some nations and the provinces/states of the US, China, Canada, and Australia, I believe).
library(gapminder)
library(animation)
library(patchwork)
library(gridExtra)   # needed for grid.arrange() below

n.col <- ncol(data_usa)
# Create breaks for the color scale
mybreaks <- c(2, 10, 40, 100)

# Get the US state polygons
states <- map_data("state")

dt1 <- data_usa %>%
  select(-c(UID, iso2, iso3, code3, FIPS, Admin2, Country_Region, Combined_Key)) %>%
  mutate_all(~replace(., is.na(.), 0)) %>%
  filter(Province_State != "C.") %>%
  gather(Date, Confirmed, -Province_State, -Lat, -Long_) %>%
  # convert the Date column to the right format
  mutate(Date = substr(Date, 2, 8)) %>%
  mutate(Date = mdy(Date)) %>%
  # filter(Date > "2020-02-01") %>%
  filter(Confirmed > 0) %>%                    # drop zero counts
  filter(Long_ > -130 & Long_ < 0 & Lat > 20)  # remove outliers in long and lat

usa <- map_data("usa")

ani.options(convert = shQuote('C:/Program Files (x86)/ImageMagick-6.8.1-Q16/convert.exe'))

saveGIF({
  start_date <- min.date
  while (start_date <= max.date) {
    dt <- dt1 %>% filter(Date <= start_date)

    # map of cumulative confirmed cases up to start_date
    p1 <- ggplot() +
      geom_polygon(data = states, aes(x = long, y = lat, group = group), color = "white", alpha = 0.2) +
      geom_point(data = dt, aes(x = Long_, y = Lat, size = Confirmed, color = Confirmed, alpha = Confirmed)) +
      scale_size(limits = c(1, 500), range = c(.5, 10), guide_legend(title = "")) +
      scale_alpha_continuous(range = c(0.2, 0.4), guide = FALSE) +  # turn off the alpha legend
      scale_color_viridis(limits = c(1, 500)) +
      theme_void() +
      coord_map() +
      labs(title = paste0('Animated Map of COVID-19 in the USA - Date ', start_date)) +
      guides(fill = FALSE) +  # leave off the fill legend
      theme(
        text = element_text(color = "#22211d"),
        plot.title = element_text(size = 16, hjust = 0.5, color = "#4e4d47",
                                  margin = margin(b = 0.1, t = 0.4, l = 2, unit = "cm"))
      )

    # bar chart of daily totals (the summary must feed the plot, otherwise it is discarded)
    p2 <- dt %>%
      group_by(Date) %>%
      summarise(Confirmed = sum(Confirmed)) %>%
      ungroup() %>%
      ggplot(aes(x = Date, y = Confirmed)) +
      geom_col() +
      scale_x_date(limits = c(as.Date(min.date), as.Date(max.date + 1)), expand = c(0, 0)) +
      scale_y_continuous(expand = c(0, 0)) +
      labs(x = "Date", y = "Confirmed") +
      theme(legend.position = "none",
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            panel.background = element_rect(fill = "grey99", colour = "grey80"),
            plot.title = element_text(hjust = 0.5))

    start_date <- start_date + 1

    # Print the two plots stacked in one frame
    grid.arrange(p1, p2, ncol = 1)
    # print(p1 + p2 + plot_layout(ncol = 1, heights = c(5, 1)))  # patchwork alternative
  }
}, movie.name = "anim.gif", interval = 1, ani.width = 900, ani.height = 900, outdir = getwd())
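For comparison, here is a minimal, untested gganimate sketch. If the county rows are first aggregated to one row per state per day, the duplicate-date problem mentioned above goes away and transition_time() can drive the animation directly (dt.state is a hypothetical helper data frame; state locations are approximated by averaging county coordinates):

# Aggregate counties to one row per state per day, then let gganimate interpolate
dt.state <- dt1 %>%
  group_by(Province_State, Date) %>%
  summarise(Confirmed = sum(Confirmed), Lat = mean(Lat), Long_ = mean(Long_)) %>%
  ungroup()

p <- ggplot() +
  geom_polygon(data = states, aes(x = long, y = lat, group = group), color = "white", alpha = 0.2) +
  geom_point(data = dt.state, aes(x = Long_, y = Lat, size = Confirmed, color = Confirmed)) +
  scale_color_viridis() +
  theme_void() +
  coord_map() +
  labs(title = 'US COVID-19 spread - {frame_time}') +
  transition_time(Date)

# animate(p, nframes = length(unique(dt.state$Date)), fps = 5)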
Worldwide
# data preparation
data_world <- data.confirmed.world_orig %>%
  rename(Province = 'Province.State', Country = 'Country.Region') %>%
  gather(Date, Confirmed, -Province, -Country, -Lat, -Long) %>%
  mutate(Country = case_when(
    Country %in% c("Mainland China", "Macau", "Hong Kong SAR", "Taipei and environs") ~ "China",
    Country == "US" ~ "United States",
    Country == "UK" ~ "United Kingdom",
    TRUE ~ as.character(Country))) %>%
  # Reorder data to show the biggest counts on top
  arrange(Country) %>%
  mutate(Country = factor(Country, unique(Country))) %>%
  filter(Long > -150)

world_map <- map_data("world")

ggplot() +
  geom_polygon(data = world_map, aes(x = long, y = lat, group = group), color = "white", alpha = 0.1) +
  geom_point(data = data_world, aes(x = Long, y = Lat, size = Confirmed, color = Confirmed, alpha = Confirmed)) +
  scale_size(limits = c(1, 800000), trans = "log10", range = c(.5, 4), guide_legend(title = "")) +
  scale_alpha_continuous(range = c(0.1, 0.4), guide = FALSE) +  # turn off the alpha legend
  scale_color_viridis(limits = c(1, 5000000)) +  # keep one color scale; the original added a second that overrode this
  # scale_size_continuous(name="Counts", trans="log2", range=c(.5,5), breaks=mybreaks) +
  # scale_alpha_continuous(name="Counts", range=c(0.2, .5), breaks=mybreaks) +
  guides(size = guide_legend("Counts")) +
  coord_fixed(1.3) +
  theme_void() +
  labs(title = paste0('Global COVID-19 Distribution Map (updated through ', max.date, ')')) +
  guides(fill = FALSE) +  # leave off the fill legend
  theme(
    text = element_text(color = "#22211d"),
    plot.title = element_text(size = 16, hjust = 0.5, color = "#4e4d47",
                              margin = margin(b = 0.1, t = 0.4, l = 2, unit = "cm"))
  )
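As a quick cross-check on the map, the ten countries with the most confirmed cases on the latest date can be tabulated. Note this sketch assumes data_world still carries the raw X-prefixed column names in its Date column, as produced by the preparation code above:

# Top 10 countries by confirmed cases on the most recent date
data_world %>%
  mutate(Date = mdy(substr(Date, 2, 8))) %>%
  filter(Date == max(Date, na.rm = TRUE)) %>%
  group_by(Country) %>%
  summarise(Confirmed = sum(Confirmed)) %>%
  arrange(desc(Confirmed)) %>%
  head(10)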
About the Author:
Beyond 8 hours - Computer, Sports, Family...