7. Exploratory Data Analysis - PI4 Computational Bootcamp

Note: Some of the commands below, particularly those accessing the terraref database from within RStudio, require being logged into the NDS Analytics Workbench. This requires signing up for the TERRA REF Alpha User program.

Objectives

Learning how to access data on terraref database via PostgreSQL
Learning how to manipulate dat using the dplyr library, in particular:
filtering, subsetting, summarizing, new variables with dplyr
converting data from wide to long with tidyr

Accessing TERRA REF via PostgreSQL

We want to access the TERRA REF database. We will do it using the dplyr, library which makes it easier to connect to databases and work with data inside the database. This can be particularly useful with large datasets, where processing can be done before reading the data into memory.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

You can then connect to the database using (note: this is one of those commands that require you to be working in the NDS Analytics Workbench):

bety_src <- src_postgres(dbname = "bety", 
                         password = 'DelchevskoOro', 
                         host = 'bety6.ncsa.illinois.edu', 
                         user = 'viewer', 
                         port = 5432)

filtering / subsetting

Let’s get some data:

traits <- tbl(bety_src, 'traits', n=Inf)

variables <- tbl(bety_src, 'variables') %>%
  mutate(variable_id = id, trait = name)

sites <- tbl(bety_src, 'sites') %>%
  mutate(site_id = id) %>% 
  dplyr::select(-geometry)

## Warning in postgresqlExecStatement(conn, statement, ...): RS-DBI driver
## warning: (unrecognized PostgreSQL field type geometry (id:18962) in column
## 17)

cultivars <- tbl(bety_src, 'cultivars') %>%
  mutate(cultivar_id = id, cultivar = name)

In the above code the symbol %>% is a pipe from the dplyr library, it passes the left hand side to the first argument slot of the right hand side.

Using filter we can pick what we see based on their values:

traits_variables <- traits %>%
  left_join(variables, by = 'variable_id') %>%
  left_join(sites, by = 'site_id') %>%
  left_join(cultivars, by = 'cultivar_id') %>%
  dplyr::select(date, sitename, cultivar, trait, mean) %>% 
  filter(trait == 'canopy_height') %>%
  collect()

traits_variables_season_2 <- traits_variables %>%
      filter(grepl("Season 2",sitename))

grouping and summarizing

One can combine the group_by and summarize functions to get a summary of the data we pulled:

traits_variables_season_2 %>%
  group_by(date) %>%
  summarize(mean = mean(mean), n =n())

## # A tibble: 7 x 3
##                  date     mean     n
##                <dttm>    <dbl> <int>
## 1 2016-08-24 05:00:00 16.09714   700
## 2 2016-08-31 05:00:00 31.79857   700
## 3 2016-09-08 05:00:00 50.62000   700
## 4 2016-09-15 05:00:00 65.55000   700
## 5 2016-09-22 05:00:00 74.89000   700
## 6 2016-09-29 05:00:00 75.52647   699
## 7 2016-10-06 05:00:00 79.28755   699

canopy_heights <- traits_variables_season_2 %>%
  group_by(date, cultivar) %>%
  summarize(mean = mean(mean), n =n())

Plotting with ggplot

The ggplot2 library (for more see here) can be used to create graphics.

library(ggplot2)

An example using the data we pulled before:

ggplot(data = traits_variables_season_2,
       aes(x = date, y = mean, group = cultivar)) +
  geom_smooth(se = FALSE, alpha = 0.25, size = 0.5) +
  geom_point(alpha = 0.25, size = 0.5, position = 'jitter')

## `geom_smooth()` using method = 'loess'

Another example:

### filter a subset of 25 genotypes
set.seed(1000) # to reproducibly return the same 25
height_season_2_subset <- traits_variables_season_2 %>%
  filter(cultivar %in% sample(unique(traits_variables_season_2$cultivar, 25)))

# how many unique values of cultivar_id?

#height_season_2_subset %>% group_by(cultivar_id, date) %>% summarize(n())


ggplot(data = height_season_2_subset,
       aes(x = date, y = mean, color = as.factor(cultivar))) +
  geom_smooth(se = FALSE, alpha = 0.25, size = 0.5) +
  geom_point(alpha = 0.25, size = 0.5)

## `geom_smooth()` using method = 'loess'

Faceting

Using ggplot2 we can partition a plot into a matrix of panels. Here’s an example:

ggplot(data = height_season_2_subset,
       aes(x = date, y = mean)) +
  geom_smooth(se = FALSE) +
  geom_point() +
  facet_wrap(~cultivar) +
  ggthemes::theme_tufte()

## `geom_smooth()` using method = 'loess'

Other plots

Here is an example of a histogram:

ggplot(data = height_season_2_subset,
       aes(mean, fill = date)) +
  geom_histogram() +
  facet_wrap(~cultivar) +
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Try making plots using the following functions: * geom_boxplot * geom_violin

Visualization in R using ggplot

Some references for using ggplot:

R for data science data-visualization
Software Carpentry 8