Syllabus

Instructors:

David LeBauer

  • Carl R Woese Institute for Genomic Biology and National Center for Supercomputing Applications, University of Illinois
  • email: dlebauer@illinois.edu

Neal Davis

  • Department of Computer Science, University of Illinois
  • email: davis68@illinois.edu

Stefan Klajbor (TA)

  • Department of Mathematics, University of Illinois
  • email: klajbor2@illinois.edu

Course Objectives

This two-week course introduces mathematics graduate students with little or no programming experience to methods in data analysis and computation. The goal is to prepare students to apply their understanding of mathematics to solve problems in industry.

Prerequisites

Accounts

Each student should register for the following accounts:

Students will also receive email notifications confirming that they are signed up for the following:

Familiarity with basic syntax and operations in R and Python

Although the course is aimed at students with limited experience using software, you are expected to complete two introductory courses to become familiar with the basic syntax and operations in R and Python. Two free courses are required*; completion certificates must be mailed to the instructors by the start of the first day of class (Friday, May 26). These courses should take a few hours to complete:

  • Introduction to R
  • Introduction to Python for Data Science

*Students who have significant experience with R and/or Python may elect to substitute a more advanced course.

Computers and Software

The only software requirement is a modern web browser.

The classroom is equipped with desktop computers, though students are encouraged to bring laptops. Much of the instruction and collaborative work will be done using the NDS Labs Workbench, which provides Shell, R, and Python editors as well as access to large datasets within a web browser.

Students are welcome and encouraged to run the software on their own computers; however, some of the software is challenging to install, and instructors will have limited time to assist with installation and configuration.

Code of Conduct

All participants must read and abide by our Code of Conduct.

Logistics

Location: 239 Altgeld Hall

University of Illinois Department of Mathematics, 1409 West Green Street, Urbana, IL

Time: 9AM - 5PM

We will have a one-hour break each day for lunch. On days with a guest lecture, we will break from 11:30 to 12:00 so that students have time to purchase lunch and bring it back to the classroom in time for the talk.

Dates: May 26 – June 9, 2017

  • May 26: Computing Basics
  • May 30-June 2: Data and Statistics in R
  • June 5-June 8: Data and Machine Learning with Python
  • June 9: Conclusion and Project Presentations

Daily Schedule:

Time Activity
9:00–9:30 Review, questions, overview
9:30–10:30 Topic 1
10:30–10:45 Break
10:45–12:00 Topic 2
12:00–1:00 Lunch, occasionally with guest lecture
1:00–2:00 Topic 3
2:00–3:00 Topic 4
3:00–3:15 Break
3:15–5:00 Group Projects

Guest Presentations

All lectures will be from 12:00 to 1:00 unless otherwise noted.

  • Tuesday, May 30: “Transitioning From Academia to Industry: Marketing your PhD,” Aaron Saxton, PhD, Senior Data Scientist, Agrible Inc.
  • Wednesday, May 31: “Data in Nuclear Engineering,” Katy Huff, PhD, Department of Nuclear, Plasma, and Radiological Engineering, UIUC
  • Friday, June 2: “TBD” Rob Kooper, Senior Research Programmer, National Center for Supercomputing Applications

Course Schedule

Day 1: Computing Fundamentals (LeBauer)

Friday May 26

  1. The Terminal (SWC The Unix Shell)
    • file system navigation
    • scripting
    • control flow
  2. Version Control (SWC Git Novice 1-6)
    • committing changes
    • branching
    • merging
  3. Collaborative Coding (SWC Git Novice 7-14)
    • GitHub
    • Code Reviews
  4. Software Development
    • Reproducible Research
    • Agile / Scrum
  5. Group Projects: Setup
    • Overview of available data
    • Overview of scientific questions
    • Divide into Teams
    • Set up GitHub repository
    • Formulate questions and hypotheses

Day 2: Getting started with R (LeBauer)

Tuesday May 30

  1. Best Practices in Scientific Computing
  2. Getting Started with R and RStudio (SWC 1-3)
    • Importing data
    • vectorization
    • Control Flow (if, else, for) (SWC 7)
    • Writing Reports with R Markdown
  3. Data structures
    • Spreadsheets (DC lesson)
    • Relational Databases
    • non-relational databases
    • Raster data and databases
  4. Querying databases
    • SQL
    • Connecting from R using the dplyr package
    • Connecting to BETYdb using the traits package
  5. Project
    • curate data
    • design data management plan
    • identify data that is needed / open questions

The first half of the day will follow the R Novice Gapminder lesson http://swcarpentry.github.io/r-novice-gapminder/

Day 3: Databases and Visualization (LeBauer)

Wednesday May 31

  1. Intro to Agile Development (lecture)
  2. Data Cleaning
    • Data Cleaning with OpenRefine (DC lesson 1-4)
    • Data Cleaning in R
  3. Exploratory Analysis
    • Summary Statistics
    • Scatter Plots
  4. Data Curation
    • Metadata and Vocabularies
    • Publishing Data, Archives and Repositories
  5. Data Manipulation
  6. Visualization
  7. Project: Find data, clean, evaluate, and summarize, publish to GitHub

Day 4: Data Mining (Davis)

Thursday June 1

  1. Pandas
  2. Matplotlib
  3. Data cleaning
  4. Principal Component Analysis
  5. $k$-nearest-neighbor

Day 5: Data Mining (Davis)

Friday June 2

  1. $k$-nearest-neighbor
  2. $k$-d tree
  3. Support Vector Machines
  4. Monte Carlo sampling
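
As a preview of the kind of code this session works toward, the following is a minimal sketch of $k$-nearest-neighbor classification in plain Python. It is an illustration only, not course material: the function name `knn_predict`, the toy points, and the labels are all hypothetical, and the sketch uses a brute-force search rather than a $k$-d tree.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors (brute force: sort every distance).
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The predicted label is the most common label among those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labeled "a" and "b".
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2.0, 2.0)))
```

Sorting every distance costs more as the training set grows; a $k$-d tree, the next topic in the list, is one way to avoid comparing the query against every point.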

Day 6: Probability and Statistics (LeBauer)

Monday June 5

  1. Probability Distributions
    • Bestiary, meaning, PDFs (Bolker Ch4, Dietze EE509)
    • Stochastic Simulation (Bolker Ch5)
  2. Summary statistics
    • Estimates of central tendency, variance, shape
    • Fitting PDFs
      • parameter estimation
      • goodness of fit (likelihood, AIC, BIC, DIC)
  3. Statistical Modeling
    • Regression
    • Functions
    • Dynamic Models

Day 7: Statistics II (LeBauer)

Tuesday June 6

  1. Model Building
    • Descriptive Analysis
    • Hypothesis Driven Analysis
  2. Model Fitting
    • Frequentist, Bayesian
    • Inference and Prediction
  3. Multilevel modeling

Day 8: Data Mining (Davis)

Wednesday June 7

  1. Hierarchical clustering
  2. Hidden Markov models
  3. Data Mining Project

Day 9: Cloud Computing & MapReduce (Davis)

Thursday June 8

  1. Amazon Web Services (Cloud computing)
  2. Hadoop MapReduce
  3. Hadoop Pig
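
For orientation, the MapReduce pattern can be sketched in a few lines of plain Python as a word count. This is a single-process illustration of the map, shuffle, and reduce steps only; it does not use Hadoop or any Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and sample input are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample input.
counts = word_count(["big data big compute", "data everywhere"])
print(counts)
```

In a real cluster the map and reduce steps run in parallel on different machines and the framework performs the shuffle; here all three run sequentially to make the data flow visible.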

Day 10: Project Wrap-up and Presentations

Friday June 9

  • Morning: Group project completion and write-up.
  • Afternoon: Group Presentations (15 min each, open to public).