12 min read

The Unix Shell

The Terminal (Unix Shell)

“What is a command shell and why would I use one?”

This tutorial is based on the Software Carpentry Unix Shell) lesson. We will refer to it for background information. As you become familiar with the Unix Shell, it will be worth reviewing some of the more advanced topics in the SWC lesson Shell Extras.

Today we will learn - how the shell relates to the keyboard, the screen, the operating system, and users’ programs." - when and why command-line interfaces should be used instead of graphical interfaces. - similarities and differences between a file and a directory. - absolute and relative paths - steps in the shell’s read-run-print cycle. - learn about commands, flags, and filenames in a command-line call

Background

A shell also known as a ‘terminal’ or ‘command line interface’ (CLI). A CLI is different from a graphical user interface (GUI), in that the CLI only responds to text whereas a GUI can respond to text as well as mouse inputs.

Many GUI programs have a command line interface, although even if you don’t know it. In addition, the standard Unix shell provides access to a diverse range of standard programs. These make it easier to automate repetitive tasks as well as access other computers and servers. While the shell is powerful, at first it is unfamiliar with cryptic commands and operations.

The heart of a CLI is a read-evaluate-print loop, or REPL: when the user types a command and then presses the Enter (or Return) key, the computer reads it, executes it, and prints its output. The user then types another command, and so on until the user logs off.

Using Bash or any other shell sometimes feels more like programming than like using a mouse. Commands are terse (often only a couple of characters long), their names are frequently cryptic, and their output is lines of text rather than something visual like a graph. On the other hand, with only a few keystrokes, the shell allows us to combine existing tools into powerful pipelines and handle large volumes of data automatically. This automation not only makes us more productive but also improves the reproducibility of our workflows by allowing us to repeat them with few simple commands.

In addition, the command line is often the easiest way to interact with remote machines and supercomputers. Familiarity with the shell is near essential to run a variety of specialized tools and resources including high-performance computing systems. As clusters and cloud computing systems become more popular for scientific data crunching, being able to interact with the shell is becoming a necessary skill. We can build on the command-line skills covered here to tackle a wide range of scientific questions and computational challenges.

Environment Variables

echo $PATH
printenv

Find with grep

Find and replace with sed

For Loops

for $i in in {seq 1 10}; do
   echo $i
done

Shell Scripts

Regular Expressions (‘Regex’)

Too long for today, difficult to learn, but examples are Googleable, like

“Regex date range at end of string”

Real World Examples:

Advanced:

Problem: changing file naming conventions broke a computing pipeline

The recent calibration images brought to my attention that the SWIR naming convention has diverged from the VNIR. SWIR files now have an extra date string (right after the UUID) in their names, and also are missing an underscore before the english suffix. Was there a good reason for changing the naming convention? It would be helpful if SWIR used the same convention as VNIR. Otherwise we need to re-write the extractors to handle both conventions and be backwards compatible, etc.

Questions * What is the UUID? * Why is having an extra date a problem? * How many dates can you find in the following code snippet? Are they consistent?

zender@cg-gpu01:~/nco$ ls -l /projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/2017-04-15/2017-04-15__11-59-12-426
total 481696
-rw-r--r-- 1 dlebauer grp_202     55123 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11frameIndex.txt
-rw-r--r-- 1 dlebauer grp_202     27533 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11image.jpg
-rw-r--r-- 1 dlebauer grp_202 493129728 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11raw
-rw-r--r-- 1 dlebauer grp_202      3503 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11raw.hdr
-rw-r--r-- 1 dlebauer grp_202       869 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11settings.txt
-rw-r--r-- 1 dlebauer grp_202      3561 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_metadata.json


zender@cg-gpu01:~/nco$ ls -l /projects/arpae/terraref/sites/ua-mac/raw_data/VNIR/2017-04-15/2017-04-15__11-56-59-902/
total 5204960
-rw-r--r-- 1 dlebauer grp_202      40591 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_frameIndex.txt
-rw-r--r-- 1 dlebauer grp_202      69299 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_image.jpg
-rw-r--r-- 1 dlebauer grp_202       3605 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_metadata.json
-rw-r--r-- 1 dlebauer grp_202 5329664000 Apr 15 12:02 ca045a19-7b12-4627-b700-9f51f5829b64_raw
-rw-r--r-- 1 dlebauer grp_202      10257 Apr 15 12:01 ca045a19-7b12-4627-b700-9f51f5829b64_raw.hdr
-rw-r--r-- 1 dlebauer grp_202        872 Apr 15 12:02 ca045a19-7b12-4627-b700-9f51f5829b64_settings.txt
  • date is in directory name, date-time subdirectory, filename (for SWIR but not VNIR) and in the output from ls -l.
  • note that there is a ~6 minute discrepancy between SWIR directory and filename. File timestamp is in between.

Solution Part 1

Don’t change upstream formats / conventions without alerting downstream developers!

  • After a few similar issues we created a protocol:
  • upstream developers alert downstream developers before changing formats
  • downstream developers write tests to catch errors upstream before they cause errors (in most cases, for the pipeline to break)

This is a clear rule but difficult to enforce without extensive automated testing (in this case the tests are written, but the pipeline was lagging a few months behind and did not catch it).

Solution Part 2

When Part 1 fails, write a script to rename all of the files. Lets take a look.

#!/bin/bash

cd /projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/
for d in `/bin/ls -d 2016-11-0[89] 2016-11-[123]? 2016-12-?? 2017-??-??`; do 
    yyyymmdd=$d
    drc_top="/projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/${yyyymmdd}"
    cd ${drc_top}
    for drc_sub in `/bin/ls -d ${yyyymmdd}*` ; do
        echo "Renaming in directory ${drc_sub}..."
        cd ${drc_top}/${drc_sub}
        for fl in `/bin/ls *raw` ; do
          dt_sng=`expr match "${fl}" '.*\([0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]\)\.*'`
          uuid_sng=${fl:0:36}
        done # !fl
        echo "uuid=${uuid_sng}, dt=${dt_sng}"
        for sfx in frameIndex.txt image.jpg raw raw.hdr settings.txt ; do
          mv_cmd="/bin/mv ${uuid_sng}_${dt_sng}${sfx} ${uuid_sng}_${sfx}"
          echo "/bin/mv ${uuid_sng}_${dt_sng}${sfx} ${uuid_sng}_${sfx}"
          eval ${mv_cmd}
        done # !sfx
   done # !drc_sub
done # !d
## sh: 3: cd: can't cd to /projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/
## /bin/ls: cannot access '2016-11-0[89]': No such file or directory
## /bin/ls: cannot access '2016-11-[123]?': No such file or directory
## /bin/ls: cannot access '2016-12-??': No such file or directory
## /bin/ls: cannot access '2017-??-??': No such file or directory

Can you find:

  • How many for-loops are they? How deeply are they nested?
  • What do you think the following regex characters mean:
  • [0-9]?
  • [123]