Instructors:
David LeBauer
- Carl R Woese Institute for Genomic Biology and National Center for Supercomputing Applications, University of Illinois
- email: dlebauer@illinois.edu
Neal Davis
- Department of Computer Science, University of Illinois
- email: davis68@illinois.edu
Stefan Klajbor (TA)
- Department of Mathematics, University of Illinois
- email: klajbor2@illinois.edu
Course Objectives
A two-week course designed to introduce mathematics graduate students with little or no programming experience to methods in data analysis and computation. The goal is to prepare students to apply their understanding of math to solve problems in industry.
Pre-requisites
Accounts
Each student should register for the following accounts:
- github.com
- TERRA REF Alpha User program https://goo.gl/forms/M0ZEMi3PSLENhspl2
Students will then receive email notifications confirming that they are signed up for the following:
- National Data Service workbench http://www.pi4.ndslabs.org
- Slack (an invitation will be sent to you via email)
Familiarity with basic syntax and operations in R and Python
Although the course is aimed at students with limited experience using software, you are expected to complete two introductory courses in order to become familiar with the basic syntax and operations in R and Python. Two free courses are required*; these courses should take a few hours to complete:
- Introduction to R
- Introduction to Python for Data Science
Completion certificates must be mailed to the instructors by the start of the first day of class (Friday, May 26).
*Students who have significant experience with R and/or Python may elect to substitute a more advanced course.
Computers and Software
The only software requirement is a modern web browser.
The classroom is equipped with desktop computers, though students are encouraged to bring laptops. Much of the instruction and collaborative work will be done using the NDS Labs Workbench. The NDS Labs Workbench provides Shell, R, and Python editors as well as access to large datasets within a web browser.
Students are welcome and encouraged to run the software on their own computers; however, some of the software is challenging to install, and there will be limited time for instructors to assist with installation and configuration.
Code of Conduct
All participants must read and abide by our Code of Conduct.
Logistics
Location: 239 Altgeld Hall,
University of Illinois Department of Mathematics, 1409 West Green Street, Urbana, IL
Time: 9AM - 5PM
We will have a one-hour break each day for lunch. On days with a guest lecture, we will break from 11:30 to 12:00 so that students have time to purchase a lunch and bring it back to the classroom in time for the talk.
Dates: May 26 – June 9, 2017
- May 26: Computing Basics
- May 30-June 2: Data and Statistics in R
- June 5-June 8: Data and Machine Learning with Python
- June 9: Conclusion and Project Presentations
Daily Schedule:
Time | Activity |
---|---|
9:00–9:30 | Review, questions, overview |
9:30–10:30 | Topic 1 |
10:30–10:45 | Break |
10:45–12:00 | Topic 2 |
12:00–1:00 | Lunch, occasionally with guest lecture |
1:00–2:00 | Topic 3 |
2:00–3:00 | Topic 4 |
3:00–3:15 | Break |
3:15–5:00 | Group Projects |
Guest Presentations
All lectures will be from 12:00 to 1:00 unless otherwise noted.
- Tuesday, May 30: “Transitioning From Academia to Industry: Marketing your PhD” Aaron Saxton, PhD, Senior Data Scientist, Agrible Inc.
- Wednesday, May 31: “Data in Nuclear Engineering” Katy Huff, PhD, Department of Nuclear, Plasma, and Radiological Engineering, UIUC
- Friday, June 2: “TBD” Rob Kooper, Senior Research Programmer, National Center for Supercomputing Applications
Course Schedule
Day 1: Computing Fundamentals (LeBauer)
Friday May 26
- The Terminal (SWC The Unix Shell)
- file system navigation
- scripting
- control flow
- Version Control SWC Git Novice 1-6
- committing changes
- branching
- merging
- Collaborative Coding SWC Git Novice 7-14
- GitHub
- Code Reviews
- Software Development
- Reproducible Research
- Agile / Scrum
- Group Projects: Setup
- Overview of available data
- Overview of scientific questions
- Divide into Teams
- Setup GitHub repository
- Formulate questions and hypotheses
Day 2: Getting started with R (LeBauer)
Tuesday May 30
- Best Practices in Scientific Computing
- Getting Started with R and RStudio (SWC 1-3)
- Importing data
- vectorization
- Control Flow (if, else, for) SWC 7
- Writing Reports with R Markdown
- Data structures
- Spreadsheets DC lesson
- Relational Databases
- non-relational databases
- Raster data and databases
- Querying databases
- SQL
- Connecting from R using the dplyr package
- Connecting to BETYdb using the traits package
- Project
- curate data
- design data management plan
- identify data that is needed / open questions
The first half of the day will follow the R Novice Gapminder lesson http://swcarpentry.github.io/r-novice-gapminder/
Day 3: Databases and Visualization (LeBauer)
Wednesday May 31
- Intro to Agile Development (lecture)
- Data Cleaning
- Data Cleaning with OpenRefine DC lesson 1-4
- Data Cleaning in R
- Exploratory Analysis
- Summary Statistics
- Scatter Plots
- Data Curation
- Metadata and Vocabularies
- Publishing Data, Archives and Repositories
- Data Manipulation
- Visualization
- bestiary of plots, which plots for which data
- Turning tables into graphs (Gelman et al 2002)
- Beyond Bar and line graphs (Weissgerber et al 2015)
- Tufte, sparklines
- ggplot starting with SWC 8
- Project: Find data, clean, evaluate, and summarize, publish to GitHub
Day 4: Data Mining (Davis)
Thursday June 1
- Presumptive background: Intro to Python for Data Science (covers through NumPy)
Day 5: Data Mining (Davis)
Friday June 2
- $k$-nearest-neighbor
- $k$-d tree
- Support Vector Machines
- Monte Carlo sampling
Day 6: Probability and Statistics (LeBauer)
Monday June 5
- Probability Distributions
- Bestiary, meaning, PDFs (Bolker Ch4, Dietze EE509)
- Stochastic Simulation (Bolker Ch5)
- Summary statistics
- Estimates of central tendency, variance, shape
- Fitting PDFs
- parameter estimation
- goodness of fit (likelihood, AIC, BIC, DIC)
- Statistical Modeling
- Regression
- Functions
- Dynamic Models
Day 7: Statistics II (LeBauer)
Tuesday June 6
- Model Building
- Descriptive Analysis
- Hypothesis Driven Analysis
- Model Fitting
- Frequentist, Bayesian
- Inference and Prediction
- Multilevel modeling
- ANOVA (Gelman et al 2005)
- GLM
- Hierarchical Bayes (HB)
Day 8: Data Mining (Davis)
Wednesday June 7
- Hierarchical clustering
- Hidden Markov models
- Data Mining Project
Day 9: Cloud Computing & MapReduce (Davis)
Thursday June 8
- Amazon Web Services (Cloud computing)
- Hadoop MapReduce
- Hadoop Pig
Day 10: Project Wrap-up and Presentations
Friday June 9
- Morning: Group project completion and write-up.
- Afternoon: Group Presentations (15 min each, open to public).