1. The Terminal (Unix Shell)

“What is a command shell and why would I use one?”

This tutorial is based on the Software Carpentry Unix Shell) [@swc_unix_shell] lesson, and will refer to it for background information.

Today we will learn - how the shell relates to the keyboard, the screen, the operating system, and users’ programs." - when and why command-line interfaces should be used instead of graphical interfaces. - similarities and differences between a file and a directory. - absolute and relative paths - steps in the shell’s read-run-print cycle. - learn about commands, flags, and filenames in a command-line call

Background

A shell also known as a ‘terminal’ or ‘command line interface’ (CLI). A CLI is different from a graphical user interface (GUI), in that the CLI only responds to text whereas a GUI can respond to text as well as mouse inputs.

Many GUI programs have a command line interface, although even if you don’t know it. In addition, the standard Unix shell provides access to a diverse range of standard programs. These make it easier to automate repetitive tasks as well as access other computers and servers. While the shell is powerful, at first it is unfamiliar with cryptic commands and operations.

The heart of a CLI is a read-evaluate-print loop, or REPL: when the user types a command and then presses the Enter (or Return) key, the computer reads it, executes it, and prints its output. The user then types another command, and so on until the user logs off.

Using Bash or any other shell sometimes feels more like programming than like using a mouse. Commands are terse (often only a couple of characters long), their names are frequently cryptic, and their output is lines of text rather than something visual like a graph. On the other hand, with only a few keystrokes, the shell allows us to combine existing tools into powerful pipelines and handle large volumes of data automatically. This automation not only makes us more productive but also improves the reproducibility of our workflows by allowing us to repeat them with few simple commands.

In addition, the command line is often the easiest way to interact with remote machines and supercomputers. Familiarity with the shell is near essential to run a variety of specialized tools and resources including high-performance computing systems. As clusters and cloud computing systems become more popular for scientific data crunching, being able to interact with the shell is becoming a necessary skill. We can build on the command-line skills covered here to tackle a wide range of scientific questions and computational challenges.

Navigating Files and Directories

In this section we will learn to move around on the computer, see what files and directories are there, and specify the location of a directory or file on your computer.

Moving around on the computer:

You will be used to navigating directories and files via a program such as Finder on Apple computers or File explorer on Windows. We can also navigate directories and look at files from the shell.

Lets get started. When you open a terminal, you will see something like:

dlebauer@dlebauer-desktop:~$

The $ is called the command prompt. The information before it provides information about the computer we are on, in this case the user and computer names. For the purposes of the tutorial, we will enter the command sh PS1='$ 'and press the ‘enter’ key so that only the$` shows up. This isn’t necessary to follow along, but it is useful to know that you can change the information that the command prompt provides. In fact, there are many ways to customize your Shell - the prompt as well as the way it presents text - that are beyond the scope of this course (but easy to find on Google).

First type the command whoami

whoami

This tells you the user’s name. You will likely not have the same answer as the computer gave when it ran the code in this book.

Now type the command pwd, which is short for ‘print working directory’

pwd

This tells you your current working directory. On OSX it may look like /Users/dlebauer/ while on Linux it may look like /home/dlebauer and on Windows C:\Users\David LeBauer.

What is in this direcotry?

The program ls will tell you:

ls

## 04-introduction-to-R.html
## 2017-05-26-the-terminal-unix-shell.Rmd
## 2017-05-26-version-control-and-collaborative-development.Rmd
## 2017-05-30-databases-and-sql_files
## 2017-05-30-databases-and-sql.Rmd
## 2017-05-30-introduction-to-r.html
## 2017-05-30-introduction-to-r.Rmd
## 2017-05-30-scrum-and-best-practices.html
## 2017-05-30-scrum-and-best-practices.Rmd
## 2017-05-31-accessing-sensor-data.Rmd
## 2017-05-31-exploratory-data-analysis.Rmd
## 2017-05-31-project-day-2-querying-data.md
## data
## data.csv
## doc
## reports
## results
## salix2.csv
## src

ls -F

Most command line programs have many options. Options are often specified as ‘flags’ (perhaps because - looks like a flag pole?)

Many shell programs provide common flags such as --help and --version (as well as synonyms -h and -v). Lets ask the shell for help with ls:

ls --help

## Usage: ls [OPTION]... [FILE]...
## List information about the FILEs (the current directory by default).
## Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.
## 
## Mandatory arguments to long options are mandatory for short options too.
##   -a, --all                  do not ignore entries starting with .
##   -A, --almost-all           do not list implied . and ..
##       --author               with -l, print the author of each file
##   -b, --escape               print C-style escapes for nongraphic characters
##       --block-size=SIZE      scale sizes by SIZE before printing them; e.g.,
##                                '--block-size=M' prints sizes in units of
##                                1,048,576 bytes; see SIZE format below
##   -B, --ignore-backups       do not list implied entries ending with ~
##   -c                         with -lt: sort by, and show, ctime (time of last
##                                modification of file status information);
##                                with -l: show ctime and sort by name;
##                                otherwise: sort by ctime, newest first
##   -C                         list entries by columns
##       --color[=WHEN]         colorize the output; WHEN can be 'never', 'auto',
##                                or 'always' (the default); more info below
##   -d, --directory            list directories themselves, not their contents
##   -D, --dired                generate output designed for Emacs' dired mode
##   -f                         do not sort, enable -aU, disable -ls --color
##   -F, --classify             append indicator (one of */=>@|) to entries
##       --file-type            likewise, except do not append '*'
##       --format=WORD          across -x, commas -m, horizontal -x, long -l,
##                                single-column -1, verbose -l, vertical -C
##       --full-time            like -l --time-style=full-iso
##   -g                         like -l, but do not list owner
##       --group-directories-first
##                              group directories before files;
##                                can be augmented with a --sort option, but any
##                                use of --sort=none (-U) disables grouping
##   -G, --no-group             in a long listing, don't print group names
##   -h, --human-readable       with -l and/or -s, print human readable sizes
##                                (e.g., 1K 234M 2G)
##       --si                   likewise, but use powers of 1000 not 1024
##   -H, --dereference-command-line
##                              follow symbolic links listed on the command line
##       --dereference-command-line-symlink-to-dir
##                              follow each command line symbolic link
##                                that points to a directory
##       --hide=PATTERN         do not list implied entries matching shell PATTERN
##                                (overridden by -a or -A)
##       --indicator-style=WORD  append indicator with style WORD to entry names:
##                                none (default), slash (-p),
##                                file-type (--file-type), classify (-F)
##   -i, --inode                print the index number of each file
##   -I, --ignore=PATTERN       do not list implied entries matching shell PATTERN
##   -k, --kibibytes            default to 1024-byte blocks for disk usage
##   -l                         use a long listing format
##   -L, --dereference          when showing file information for a symbolic
##                                link, show information for the file the link
##                                references rather than for the link itself
##   -m                         fill width with a comma separated list of entries
##   -n, --numeric-uid-gid      like -l, but list numeric user and group IDs
##   -N, --literal              print raw entry names (don't treat e.g. control
##                                characters specially)
##   -o                         like -l, but do not list group information
##   -p, --indicator-style=slash
##                              append / indicator to directories
##   -q, --hide-control-chars   print ? instead of nongraphic characters
##       --show-control-chars   show nongraphic characters as-is (the default,
##                                unless program is 'ls' and output is a terminal)
##   -Q, --quote-name           enclose entry names in double quotes
##       --quoting-style=WORD   use quoting style WORD for entry names:
##                                literal, locale, shell, shell-always, c, escape
##   -r, --reverse              reverse order while sorting
##   -R, --recursive            list subdirectories recursively
##   -s, --size                 print the allocated size of each file, in blocks
##   -S                         sort by file size
##       --sort=WORD            sort by WORD instead of name: none (-U), size (-S),
##                                time (-t), version (-v), extension (-X)
##       --time=WORD            with -l, show time as WORD instead of default
##                                modification time: atime or access or use (-u)
##                                ctime or status (-c); also use specified time
##                                as sort key if --sort=time
##       --time-style=STYLE     with -l, show times using style STYLE:
##                                full-iso, long-iso, iso, locale, or +FORMAT;
##                                FORMAT is interpreted like in 'date'; if FORMAT
##                                is FORMAT1<newline>FORMAT2, then FORMAT1 applies
##                                to non-recent files and FORMAT2 to recent files;
##                                if STYLE is prefixed with 'posix-', STYLE
##                                takes effect only outside the POSIX locale
##   -t                         sort by modification time, newest first
##   -T, --tabsize=COLS         assume tab stops at each COLS instead of 8
##   -u                         with -lt: sort by, and show, access time;
##                                with -l: show access time and sort by name;
##                                otherwise: sort by access time
##   -U                         do not sort; list entries in directory order
##   -v                         natural sort of (version) numbers within text
##   -w, --width=COLS           assume screen width instead of current value
##   -x                         list entries by lines instead of by columns
##   -X                         sort alphabetically by entry extension
##   -Z, --context              print any security context of each file
##   -1                         list one file per line
##       --help     display this help and exit
##       --version  output version information and exit
## 
## The SIZE argument is an integer and optional unit (example: 10K is 10*1024).
## Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers of 1000).
## 
## Using color to distinguish file types is disabled both by default and
## with --color=never.  With --color=auto, ls emits color codes only when
## standard output is connected to a terminal.  The LS_COLORS environment
## variable can change the settings.  Use the dircolors command to set it.
## 
## Exit status:
##  0  if OK,
##  1  if minor problems (e.g., cannot access subdirectory),
##  2  if serious trouble (e.g., cannot access command-line argument).
## 
## GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
## Full documentation at: <http://www.gnu.org/software/coreutils/ls>
## or available locally via: info '(coreutils) ls invocation'

For even more information, you can use man ls.

Looking around

You can use the command ls to look at different directories (without changing the current working directory) as follows:

ls
ls ~/
ls /data/terraref/sites/ua-mac

To change to the directory /data/terraref/sites/ua-mac you can use the cd command:

cd /data/terraref/sites/ua-mac
ls ## reveal the contents
cd ## shortcut brings you back home
pwd

Directories in the shell have a tree-like structure. You can use cd .. to move up one step in the tree, cd ../.. to move up two steps, cd ../../.. to move up three steps, and so on. Just typing in cd will bring you to your home directory (equivalent to cd ~/.

Environment Variables

Environment variables are variables that affect the way software interacts with the user and are valid systemwide. For example, try echoing the environment variable PATH as follows:

echo $PATH

You can see the environment variables by doing:

printenv

Useful Tools

Let’s look at some useful tools. First we need a file, so let’s go ahead and make a directory to put some data in:

mkdir project
mkdir project/data
cd project/data

Then we get some data from the BETYdb.org database.

Here we use the curl program, which is a command line tool for moving data around the internet. Here, we use curl with the -o $genus.csv flag to specify where to save output from the url.

### Get some crop trait and yield data
for genus in panicum miscanthus salix populus
do 
  curl -o $genus.csv https://www.betydb.org/search.csv?search=$genus
done

### Get some weather data
cp /data/terraref/sites/uiuc-energyfarm/raw_data/weather_stations/Weather*DayAvg.dat ~/project/data/

Let’s take a look at it, head prints out the first 10 lines while tail the last 10 lines:

head data.csv
tail data.csv

The -n flag can modify the number of lines shown:

head -n 10 data.csv

tail -n +11 data.csv

You can also display all of its contents onto the shell (not recommended for big files like this one):

cat data.csv

We can count the lines, words, and characters (in that order) of a file with:

wc -l data.csv

You can also save this data into a file:

wc -l data.csv > lengths.txt

and sort it:

sort lengths.txt

This doesn’t modify the file lengths.txt, so you need to save it to a new file:

sort lengths.txt > sorted_lengths.txt

Pipes

A pipe | is used to have the output of the command to its left be used as the input of the command to the right, for example:

tail -n +11 data.csv | head -n 4

compare this with:

tail -n +11 data.csv > miscanthus.csv

Find with `grep`

The command grep can be used to find a pattern in a file, for example to find the pattern yield in the file miscanthus.csv:

grep yield miscanthus.csv

Find and replace with `sed`

The following replaces “willow” for “Willow” in the file willow.csv. The -i flag is to make this interactive (try it with and without it in some small text file of your creation):

sed -i 's/willow/Willow/g' willow.csv

In sed -i s/find/replace/g, -i means change the file in place, s/ means ‘substitute’ and /g means replace all occurances.

Parse text with `cut`

The command cut can be used to remove sections of each line of a file:

cut -d, -f3,5 miscanthus.csv

For Loops

The shell has for loops to iterate actions. When you enter:

for genus in panicum miscanthus salix

it enters into a command entry mode in which you need to tell it what it does as it loops through panicum, miscanthus, salix. For example to echo these words into the shell:

for genus in panicum miscanthus salix ; do
    echo $genus
done

To echo the integers:

for $i in in {seq 1 10}; do
   echo $i
done

Shell Scripts

You can write short scripts to run commands on the shell. You can use any text editor, here we’ll use nano. Shell scripts end in .sh.

Create and open with nano a shell script:

nano getdata.sh

In the above, notice the terminal was “locked”, there was no command prompt. To avoid this you can use:

nano getdata.sh &

Now type the following in the file:

#!/bin/bash
for genus in panicum miscanthus salix populus
do 
  curl -o $genus https://www.betydb.org/search.csv?search=$genus
done

Save the file and exit (in nano the ^ means Ctrl).

Make the file executable using the program chmod. chmod modifies the permissions of the file. +x makes it executable (+rw makes it readable and writable)

chmod +x getdata.sh

Run the file:

./getdata.sh

To run the file in the background and continue using your terminal, you can run it using:

nohup ./getdata.sh &

To see its progress:

tail -f nohup.out

Using variables in your shell script.

You can write shell scripts that take input variables.

When you run a bash script, a variable can be added after the name of the script like myscript.sh var1 var2.

Create a file:

nano getonegenus.sh

Write the following. Here $1 represents the first input variable:

#!/bin/bash
curl -o $1.csv https://www.betydb.org/search.csv?search=$1

Save it and make it executable:

chmod +x getonegenus.sh

Now you can send the search term via the command line:

./getonegenus.sh salix

Time Saving Tips

You can press the Tab on your keyboard to autocomplete.
You can press Ctrl+A to go to the beginning of a line.
You can press the up arrow to view previous commands.
To clear the line use Ctrl+U
There are many more if you Google “Time-saving command line tips”

Regular Expressions (‘Regex’)

Too long for today, difficult to learn, but examples are Googleable, like

“Regex date range at end of string”

Real World Examples:

Advanced:

Problem: changing file naming conventions broke a computing pipeline

The recent calibration images brought to my attention that the SWIR naming convention has diverged from the VNIR. SWIR files now have an extra date string (right after the UUID) in their names, and also are missing an underscore before the english suffix. Was there a good reason for changing the naming convention? It would be helpful if SWIR used the same convention as VNIR. Otherwise we need to re-write the extractors to handle both conventions and be backwards compatible, etc.

Questions * What is the UUID? * Why is having an extra date a problem? * How many dates can you find in the following code snippet? Are they consistent?

zender@cg-gpu01:~/nco$ ls -l /projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/2017-04-15/2017-04-15__11-59-12-426
total 481696
-rw-r--r-- 1 dlebauer grp_202     55123 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11frameIndex.txt
-rw-r--r-- 1 dlebauer grp_202     27533 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11image.jpg
-rw-r--r-- 1 dlebauer grp_202 493129728 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11raw
-rw-r--r-- 1 dlebauer grp_202      3503 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11raw.hdr
-rw-r--r-- 1 dlebauer grp_202       869 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11settings.txt
-rw-r--r-- 1 dlebauer grp_202      3561 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_metadata.json


zender@cg-gpu01:~/nco$ ls -l /projects/arpae/terraref/sites/ua-mac/raw_data/VNIR/2017-04-15/2017-04-15__11-56-59-902/
total 5204960
-rw-r--r-- 1 dlebauer grp_202      40591 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_frameIndex.txt
-rw-r--r-- 1 dlebauer grp_202      69299 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_image.jpg
-rw-r--r-- 1 dlebauer grp_202       3605 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_metadata.json
-rw-r--r-- 1 dlebauer grp_202 5329664000 Apr 15 12:02 ca045a19-7b12-4627-b700-9f51f5829b64_raw
-rw-r--r-- 1 dlebauer grp_202      10257 Apr 15 12:01 ca045a19-7b12-4627-b700-9f51f5829b64_raw.hdr
-rw-r--r-- 1 dlebauer grp_202        872 Apr 15 12:02 ca045a19-7b12-4627-b700-9f51f5829b64_settings.txt

date is in directory name, date-time subdirectory, filename (for SWIR but not VNIR) and in the output from ls -l.
note that there is a ~6 minute discrepancy between SWIR directory and filename. File timestamp is in between.

Solution Part 1

Don’t change upstream formats / conventions without alerting downstream developers!

After a few similar issues we created a protocol:
upstream developers alert downstream developers before changing formats
downstream developers write tests to catch errors upstream before they cause errors (in most cases, for the pipeline to break)

This is a clear rule but difficult to enforce without extensive automated testing (in this case the tests are written, but the pipeline was lagging a few months behind and did not catch it).

Solution Part 2

When Part 1 fails, write a script to rename all of the files. Lets take a look.

#!/bin/bash

cd /projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/
for d in `/bin/ls -d 2016-11-0[89] 2016-11-[123]? 2016-12-?? 2017-??-??`; do 
    yyyymmdd=$d
    drc_top="/projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/${yyyymmdd}"
    cd ${drc_top}
    for drc_sub in `/bin/ls -d ${yyyymmdd}*` ; do
        echo "Renaming in directory ${drc_sub}..."
        cd ${drc_top}/${drc_sub}
        for fl in `/bin/ls *raw` ; do
          dt_sng=`expr match "${fl}" '.*\([0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]\)\.*'`
          uuid_sng=${fl:0:36}
        done # !fl
        echo "uuid=${uuid_sng}, dt=${dt_sng}"
        for sfx in frameIndex.txt image.jpg raw raw.hdr settings.txt ; do
          mv_cmd="/bin/mv ${uuid_sng}_${dt_sng}${sfx} ${uuid_sng}_${sfx}"
          echo "/bin/mv ${uuid_sng}_${dt_sng}${sfx} ${uuid_sng}_${sfx}"
          eval ${mv_cmd}
        done # !sfx
   done # !drc_sub
done # !d

Can you find:

How many for-loops are there? How deeply are they nested?
What do you think the following regex characters mean:
[0-9]?
[123]