PI4 Computational Bootcamp

PI4 Computational Bootcamp / Recent content on PI4 Computational Bootcamp Hugo -- gohugo.io en-us Tue, 06 Jun 2017 00:00:00 +0000 10. Bayesian Analysis /2017/06/06/bayesian-analysis/ Tue, 06 Jun 2017 00:00:00 +0000 /2017/06/06/bayesian-analysis/ Bayesian regression library(rstanarm) options(mc.cores = parallel::detectCores()) glm1 <- stan_glmer(yield_annual ~ fertilizer_n + age + genus:fertilizer_n + genus:age + (1 | site ), #prior = normal(30, 4), data = grass_yields) glm1 plot(glm1) plot(glm1, plotfun = 'hist') pairs(glm1) posterior_interval(glm1) summary(glm1) library(sjPlot) sjPlot::sjp.lmer(glm1) http://topepo.github.io/caret/train-models-by-tag.html https://thinkinator.com/2016/01/12/r-users-will-now-inevitably-become-bayesians/ 11. Stochastic Simulation /2017/06/06/stochastic-simulation/ Tue, 06 Jun 2017 00:00:00 +0000 /2017/06/06/stochastic-simulation/ Note: Some of the commands below, particularly those accessing the terraref database from within RStudio, require being logged into the NDS Analytics Workbench. This requires signing up for the TERRA REF Alpha User program options(warn=-1) library(ggplot2) Stochastic Simulation (Bolker Ch5) A deterministic linear function: \[y_\rm{det}=a+bx\] Add normally distributed errors, modeling $\epsilon\sim(N(0,\sigma))$ \[Y\sim N(a+bx, \sigma)\] Or equivalently: \[Y\sim N(\mu, \sigma)\] \[\mu = a+bx\] Lets first plot the deterministic function as a line 9. Intro Statistics /2017/06/05/intro-statistics/ Mon, 05 Jun 2017 00:00:00 +0000 /2017/06/05/intro-statistics/ Summary statistics Objectives: * learn to summarize data with basic measures of central tendancy and spread * learn to fit parametric distributions to data Data analysis: Sorghum height data knitr::opts_chunk$set(cache = TRUE) library(dplyr) ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union library(ggplot2) theme_set(theme_bw()) #bety_src <- RPostgreSQL::dbConnect(odbc::odbc(), "TERRA-REF traits copy BETYdb") #bety_src <- src_postgres(dbname = "bety", password = 'bety', host = 'bety. 8. Probability /2017/06/04/probability/ Sun, 04 Jun 2017 00:00:00 +0000 /2017/06/04/probability/ Probability Distributions in R Objectives: Become familiar with common probability distributions in R Understand how probability distributions are related to each other Summarizing Data Expectation If the random variable $x$ can take values $x_1,...,x_N$ with equal probability, the expected value $E[x]$ of $x$ is defined to be: \[E[x]=\sum_{i=1}^N\frac{x_i}{N}\] Properties of the expected value: * $E[x+y]=E[x]+E[y]$ * $E[cx]=cE[x]$ Variance The variance of the random variable $x$ is defined to be: Projects /projects/ Thu, 01 Jun 2017 00:00:00 +0000 /projects/ Final group projects Each group has a github repository that contains code developed to answer a specific question with data from the TERRA REF project. Here are the project topics / questions and links to their repositories. Can the shortwave infrared solar spectrum be predicted from visible spectra? https://github.com/pi4-uiuc/Team_1 Can genotype be predicted from leaf traits? https://github.com/pi4-uiuc/TeamTwo Comparing growth rates of plants grown indoors versus outdoors https://github.com/pi4-uiuc/IndoorOutdoor Can plant growth in greenhouses predict how plants will grow in the field? 6. Accessing Sensor Data /2017/05/31/accessing-sensor-data/ Wed, 31 May 2017 00:00:00 +0000 /2017/05/31/accessing-sensor-data/ Hyperspectral Data Calibration Targets These were collected on April 15 2017 every ~15 minutes library(ncdf4) library(dplyr) hsi_calibration_dir <- '/data/terraref/sites/ua-mac/Level_1/hyperspectral/2017-04-15' hsi_calibration_files <- dir(hsi_calibration_dir, recursive = TRUE, full.names = TRUE) fileinfo <- bind_rows(lapply(hsi_calibration_files, file.info)) %>% mutate(size_gb = size/1073741824) calibration_nc <- nc_open(hsi_calibration_files[200]) a <- calibration_nc$var$rfl_img #calibration_nc$dim$x$len 1600 #calibration_nc$dim$y$len x_length <- round(calibration_nc$dim$x$len / 10) y_length <- round(calibration_nc$dim$y$len * 3/4) xstart <- ceiling(calibration_nc$dim$x$len / 2) - floor(x_length / 2) + 1 ystart <- ceiling(calibration_nc$dim$y$len / 2) - floor(y_length / 2) + 1 rfl <- ncvar_get(calibration_nc, 'rfl_img', #start = c(1, xstart, ystart), #count = c(955, x_length, y_length) start = c(2, 2, 2), count = c(1320, 10, 954) ) x <- ncvar_get(calibration_nc, 'x', start = 100, count = 160) y <- ncvar_get(calibration_nc, 'y', start = 100, count = 1324) lambda <- calibration_nc$dim$wavelength$vals for(i in 1 + 0:10*95){ image(x = x, y = y, z = rfl[i,,], xlab = 'x (m)', ylab = 'y (m)', col = rainbow(n=100), main = paste('wavelength', udunits2::ud. 7. Exploratory Data Analysis /2017/05/31/exploratory-data-analysis/ Wed, 31 May 2017 00:00:00 +0000 /2017/05/31/exploratory-data-analysis/ Note: Some of the commands below, particularly those accessing the terraref database from within RStudio, require being logged into the NDS Analytics Workbench. This requires signing up for the TERRA REF Alpha User program. Objectives Learning how to access data on terraref database via PostgreSQL Learning how to manipulate dat using the dplyr library, in particular: filtering, subsetting, summarizing, new variables with dplyr converting data from wide to long with tidyr 4. Best Practices and Collaborative Development /2017/05/30/scrum-and-best-practices/ Tue, 30 May 2017 00:00:00 +0000 /2017/05/30/scrum-and-best-practices/ Objectives: Understand best practices by Wilson et al Basic components of Reproducible Research Be familiar with concepts related to Scrum Best Practices in Scientific Computing Wilson et al 2014 Best practices in Scientific Computing Wilson et al paper: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745 Slideshow: http://swcarpentry.github.io/slideshows/best-practices/ Wilson et al 2017 Good enough practices in Scientific Computing Wilson et al paper: https://swcarpentry.github.io/good-enough-practices-in-scientific-computing/ Example: PEcAn Project Please start the day by reading this. 5. Databases and SQL /2017/05/30/databases-and-sql/ Tue, 30 May 2017 00:00:00 +0000 /2017/05/30/databases-and-sql/ Overview We started with an overview of data management, including the data lifecycle, metadata, and databases, using the TERRA REF to illustrate applications. Spreadsheets and Tables A database is a generic term for data. Often these are divided into ‘relational’ and ‘non-relational’… We will get to these later. First, lets start with Spreadsheets, which are a simple, and very common format for data. In fact, so far we have been working with tables. 3. Introduction to R /2017/05/27/intro-to-r/ Sat, 27 May 2017 00:00:00 +0000 /2017/05/27/intro-to-r/ This lesson is based on Getting Started with R and Rstudio (SWC 1-3). Getting started with R Why R? * by design “a programming language and environment for statistical computing” * Most common language used by academic statisticians. Thus, many methods are developed as R packages and not available elsewhere. * built around stats, with lots of handy functions * data analysis by design * formulas for statistical models * ggplot ‘grammar of graphics’ among best graphing programs (only get better w/ graphical design programs like photoshop/Inkscape) * dplyr and tidyr for data munging * Integrates with lower level languages (esp C, C++, FORTRAN; Rcpp package for integration w/ C++ excellent) * can do anything * has a reputation for being slow. 1. The Terminal (Unix Shell) /2017/05/26/the-terminal-unix-shell/ Fri, 26 May 2017 00:00:00 +0000 /2017/05/26/the-terminal-unix-shell/ “What is a command shell and why would I use one?” This tutorial is based on the Software Carpentry Unix Shell) [@swc_unix_shell] lesson, and will refer to it for background information. Today we will learn - how the shell relates to the keyboard, the screen, the operating system, and users’ programs." - when and why command-line interfaces should be used instead of graphical interfaces. - similarities and differences between a file and a directory. 2. Version Control and Collaborative development /2017/05/26/version-control-git-github/ Fri, 26 May 2017 00:00:00 +0000 /2017/05/26/version-control-git-github/ Version control makes it possible to have ‘unlimited undo’ and histories of your documents (and some smaller datasets) Having everything on a website like GitHub (or Bitbucket or GitLab) makes it easy to share with other people and collaborate. If you keep all of your important programs and files somewhere in the cloud, like on box.com, dropbox, github, it makes it easier to use heterogeneous environments. Version Control SWC Git Novice 1-6 Getting Started Introduce yourself (once you’ve set this on your machine, you don’t have to set it again unless you want to change something): Reading /reading/ Fri, 05 May 2017 00:00:00 +0000 /reading/ Day 1. Programming Fundamentals Wilson et al 2014 Best Practices for Scientific Computing Agile Manifesto Vincent Dresen “A successful Git branching model” Day 2: Getting started with R Introduction, Data Structures, Subsetting, Functions in “Advanced R” by Wickham “Git and GitHub” in “R packages” by Wickham. Day 3: Databases and Visualization LeBauer, David, et al. “BETYdb: a yield, trait, and ecosystem service database applied to second‐generation bioenergy feedstock production. Syllabus /syllabus/ Fri, 05 May 2017 00:00:00 +0000 /syllabus/ Instructors: David LeBauer Carl R Woese Institute for Genomic Biology and National Center for Supercomputing Applications, University of Illinois email:dlebauer@illinois.edu Neal Davis Department of Computer Science, University of Illinois email:davis68@illinois.edu Stefan Klajbor (TA) Department of Mathematics, University of Illinois email:klajbor2@illinois.edu Course Objectives A two week course designed to introduce Math graduate students with little or no programming experience to methods in data analysis and computation. /week2/02-pca/ Mon, 01 Jan 0001 00:00:00 +0000 /week2/02-pca/ Lesson 2: Principal component analysis Contents Introduction Data access and data description PCA Implementation Example: Iris data References Contributors As usual, a standard header to load necessary libraries: import pandas as pd import numpy as np import matplotlib.pyplot as plt %matplotlib inline Introduction Principle component analysis (PCA) is a well-known technique for reducing the number of dimensions in a data set. PCA takes a linear transformation of features of the original dataset to reduce dimensionality. /week2/04-kdtree/ Mon, 01 Jan 0001 00:00:00 +0000 /week2/04-kdtree/ Lesson 4: $k$-d tree for classification and regression Contents Introduction Example: Online news data Fitting a classification tree from an original data set Fitting a classification tree from PCA References Contributors As usual, a standard header to load necessary libraries: import pandas as pd import numpy as np from scipy import spatial import matplotlib.pyplot as plt %matplotlib inline Introduction $k$-d trees partition $k$-dimensional space in order to organize points and data access. /week2/05-svm/ Mon, 01 Jan 0001 00:00:00 +0000 /week2/05-svm/ Lesson 5: Support vector machines Contents Introduction SVM for binary-class classification The kernel function SVM for three-class classification Exercise References Contributors As usual, a standard header to load necessary libraries: import pandas as pd import numpy as np from sklearn import svm import matplotlib.pyplot as plt %matplotlib inline Introduction “Support vector machines are a set of supervised learning methods used for classification, regressions, and outliers detection.” (scikit-learn documentation) /week2/06-mc/ Mon, 01 Jan 0001 00:00:00 +0000 /week2/06-mc/ Lesson 6: Monte Carlo simulation Contents Monte Carlo simulation Sampling Applications Contributors First, a standard header to load necessary libraries: import numpy as np from numpy import random as npr import matplotlib.pyplot as plt %matplotlib inline Monte Carlo simulation Monte Carlo techniques are used to estimate quantities by random sampling. They are particularly useful in three cases (arranged in order of approximate complexity): Numerical integration (particularly high-dimensional integrals) Kinetic Monte Carlo (sampling events based on likelihoods) Markov chain Monte Carlo (MCMC), including the Metropolis algorithm (targeted physics simulations) > The Metropolis–Hastings algorithm can draw samples from any probability distribution P(x), provided you can compute the value of a function f(x) which is proportional to the density of P. /week2/07-kmclus/ Mon, 01 Jan 0001 00:00:00 +0000 /week2/07-kmclus/ Lesson 7: $k$-Means Clustering Contents Introduction $k$-means clustering method Evaluation: silhouette method Evaluation: gap statistics method Conclusion Contributors First, a standard header to load necessary libraries: import numpy as np from numpy import random as npr import matplotlib.pyplot as plt import matplotlib.cm as cm %matplotlib inline Introduction Clustering analysis is a type of unsupervised learning. Two common methods will be examined: $k$-means and hierarchical clustering. In this section, we will be introducing the $k$-means method as well as criteria to evaluate the performance of clustering. /week2/08-hierclus/ Mon, 01 Jan 0001 00:00:00 +0000 /week2/08-hierclus/ Lesson 8: Hierarchical Clustering Contents Introduction Hierarchical clustering method Dendrograms Evaluation: silhouette method Contributors First, a standard header to load necessary libraries: import numpy as np from numpy import random as npr import matplotlib.pyplot as plt import matplotlib.cm as cm %matplotlib inline Introduction How many clusters does the following data set have? from sklearn.datasets import make_circles noisy_circles = make_circles( n_samples=1500,factor=0.5,noise=0.05 ) plt.plot( noisy_circles[ 0 ][ :,0 ],noisy_circles[ 0 ][ :,1 ],'ko' ) plt. /week2/09-hiddenmarkov/ Mon, 01 Jan 0001 00:00:00 +0000 /week2/09-hiddenmarkov/ Lesson 9: Hidden Markov modeling Contents Introduction Example: Generating a hidden Markov process Hidden Markov Model Generate HMM process Estimate transition and emission matrix Estimate hidden states Estimate transition and emission matrix given observations and hidden states Contributors First, a standard header to load necessary libraries: import numpy as np from numpy import random as npr import matplotlib.pyplot as plt %matplotlib inline Introduction The goal of data mining is to tease out characteristic behaviors and interesting subsets and interactions. /week2/09a-capstone/ Mon, 01 Jan 0001 00:00:00 +0000 /week2/09a-capstone/ Lesson 09a: Data Mining Capstone Contents The data set The challenge Contributors The data set The MovieLens database represents an extraction from IMDB of movies, tag data, and ratings data. For a capstone exercise, use the data mining techniques we have covered to analyze the tagging/metadata and the ratings data in order to make predictions about movie and tag popularity. Two data sets are available: ml-latest-small ml-latest Start with the smaller data set (100,000 entries) to set up your approach. /week2/10-aws/ Mon, 01 Jan 0001 00:00:00 +0000 /week2/10-aws/ Lesson 10: Amazon Web Services Contents Cloud computing Amazon Web Services Elastic Compute Cloud Simple Storage Service Lambda Elastic MapReduce Contributors Cloud computing Cloud computing marks the rise of computing as a utility, a resource that can be metered and accessed essentially anywhere. Although the lines can be blurry, cloud computing services are typically bundled into four categories: cloud storage Dropbox and Box provide storage services, but naturally for enterprise customers there are many more competitors. /week2/11-hadoop-mr/ Mon, 01 Jan 0001 00:00:00 +0000 /week2/11-hadoop-mr/ Lesson 11: Apache Hadoop MapReduce Contents MapReduce algorithm Elastic MapReduce Hadoop Streaming example: Word count Contributors MapReduce algorithm The MapReduce algorithm is a specific low-communication parallel programming technique. It’s suitable for certain types of problems with iterated one-time local computational steps followed by a data trading step. Essentially, a Map step filters or sorts data, then a Reduce operation carries out a summarization procedure. Key to this is to stop thinking of machines carrying out tasks and instead just think in terms of functions. /week2/12-hadoop-fs/ Mon, 01 Jan 0001 00:00:00 +0000 /week2/12-hadoop-fs/ Lesson 12: Apache Hadoop Architecture Contents Apache Hadoop components Hadoop Distributed File System Hadoop Zoo Contributors Apache Hadoop components Apache Hadoop is composed of a collection of tools and services, and embedded in a wider environment as well (the Hadoop Zoo). Hadoop itself consists of: the MapReduce component, including the Mapper, Reducer, JobTracker, TaskTrackers, shufflers, sorters, etc. the Hadoop Distributed File System interfaces like Hadoop Streaming and Pig, etc. /week2/13-pig/ Mon, 01 Jan 0001 00:00:00 +0000 /week2/13-pig/ Lesson 13: Apache Pig Contents Apache Pig rationale Pig Latin Exercise Contributors Apache Pig rationale Apache Pig is a procedural SQL-ish language which lets Hadoop users work with the distributed data sets of HDFS. Everything in Pig is a tuple (or a variation on a tuple, like a bag or an atom). Pig Latin The best way to get a feel for Pig is to just run an example. Lesson 1: Data access and data cleaning /week2/data-access-cleaning/ Mon, 01 Jan 0001 00:00:00 +0000 /week2/data-access-cleaning/ Lesson 1: Data access and data cleaning Contents Data description and data access Data cleaning Basic statistical analysis Data smoothing Correlation between different variables Contributors First, a standard header to load necessary libraries: import pandas as pd import matplotlib.pyplot as plt %matplotlib inline Data description and data access We will utilize weather data for the Coleto Creek Reservoir from 2003–14, provided by the National Oceanic and Atmospheric Administration (NOAA).