1. The Terminal (Unix Shell)

“What is a command shell and why would I use one?”

This tutorial is based on the Software Carpentry Unix Shell lesson [@swc_unix_shell] and refers to it for background information.

Today we will learn:

  • how the shell relates to the keyboard, the screen, the operating system, and users’ programs.
  • when and why command-line interfaces should be used instead of graphical interfaces.
  • the similarities and differences between a file and a directory.
  • absolute and relative paths.
  • the steps in the shell’s read-run-print cycle.
  • how commands, flags, and filenames fit together in a command-line call.

Background

A shell is also known as a ‘terminal’ or ‘command line interface’ (CLI). A CLI differs from a graphical user interface (GUI) in that a CLI responds only to text, whereas a GUI responds to mouse input as well as text.

Many GUI programs also have a command line interface, even if you don’t know it. In addition, the standard Unix shell provides access to a diverse range of standard programs. These make it easier to automate repetitive tasks and to access other computers and servers. While the shell is powerful, at first it feels unfamiliar, with cryptic commands and operations.

The heart of a CLI is a read-evaluate-print loop, or REPL: when the user types a command and then presses the Enter (or Return) key, the computer reads it, executes it, and prints its output. The user then types another command, and so on until the user logs off.
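The loop described above can be imitated in a few lines of shell. This is only a toy sketch, not how Bash itself is implemented: the function name toy_repl and the variable names are made up for illustration, and the "typed" commands are piped in rather than read from the keyboard.

```shell
# Toy read-evaluate-print loop: read a line, run it, let its output print.
toy_repl() {
  while IFS= read -r command_line; do   # read one line of input
    eval "$command_line"                # evaluate it; the command prints its output
  done                                  # loop until the input runs out
}

# Feed two "typed" commands to the loop and capture what it prints.
repl_output=$(printf '%s\n' 'echo hello' 'echo $((2 + 3))' | toy_repl)
echo "$repl_output"
```

A real shell does far more at each step (parsing, expanding variables, finding programs), but the read, run, print rhythm is the same.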

Using Bash or any other shell sometimes feels more like programming than like using a mouse. Commands are terse (often only a couple of characters long), their names are frequently cryptic, and their output is lines of text rather than something visual like a graph. On the other hand, with only a few keystrokes, the shell allows us to combine existing tools into powerful pipelines and handle large volumes of data automatically. This automation not only makes us more productive but also improves the reproducibility of our workflows by allowing us to repeat them with a few simple commands.

In addition, the command line is often the easiest way to interact with remote machines and supercomputers. Familiarity with the shell is near essential to run a variety of specialized tools and resources including high-performance computing systems. As clusters and cloud computing systems become more popular for scientific data crunching, being able to interact with the shell is becoming a necessary skill. We can build on the command-line skills covered here to tackle a wide range of scientific questions and computational challenges.

Looking around

You can use the command ls to look at different directories (without changing the current working directory) as follows:

ls
ls ~/
ls /data/terraref/sites/ua-mac

To change to the directory /data/terraref/sites/ua-mac you can use the cd command:

cd /data/terraref/sites/ua-mac
ls ## reveal the contents
cd ## shortcut brings you back home
pwd

Directories in the shell have a tree-like structure. You can use cd .. to move up one step in the tree, cd ../.. to move up two steps, cd ../../.. to move up three steps, and so on. Typing cd on its own brings you to your home directory (equivalent to cd ~/).
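To see absolute and relative paths side by side without touching real data, the following sketch builds a small directory tree in a scratch location. The directory names and the sandbox variable are assumptions made up for this example.

```shell
sandbox=$(mktemp -d)                  # scratch directory, created just for this demo
mkdir -p "$sandbox/project/data/raw"  # a small tree to move around in

cd "$sandbox/project/data/raw"        # absolute path: spelled out from the root /
cd ../..                              # relative path: two steps up from where we are
after_relative=$(pwd)

cd "$sandbox/project"                 # the same destination, given as an absolute path
after_absolute=$(pwd)

echo "$after_relative"
echo "$after_absolute"
```

Both echo lines should print the same directory: a relative path depends on where you currently are, while an absolute path means the same thing from anywhere.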

Environment Variables

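Environment variables are named values that affect the way software behaves and interacts with the user. They are set in your shell session and passed on to the programs you start from it. For example, try echoing the environment variable PATH, which lists the directories the shell searches for programs: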

echo $PATH

You can list all of the environment variables with:

printenv
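A variable you create in the shell only becomes an environment variable once it is exported; after that, programs started from the shell can see it. The sketch below shows the difference using a made-up variable name, TUTORIAL_GREETING, and a child sh process standing in for "another program".

```shell
unset TUTORIAL_GREETING        # hypothetical name; make sure we start clean
TUTORIAL_GREETING="hello"      # a plain shell variable, not yet exported

# A child process cannot see the variable yet.
before_export=$(sh -c 'echo "${TUTORIAL_GREETING:-not set}"')

export TUTORIAL_GREETING       # now it is part of the environment

# The same child process can see it after the export.
after_export=$(sh -c 'echo "${TUTORIAL_GREETING:-not set}"')

echo "child sees before export: $before_export"
echo "child sees after export:  $after_export"
```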

Useful Tools

Let’s look at some useful tools. First we need a file, so let’s go ahead and make a directory to put some data in:

mkdir project
mkdir project/data
cd project/data

Then we get some data from the BETYdb.org database.

Here we use curl, a command line tool for transferring data to and from URLs. The -o flag tells curl where to save the output; with -o $genus.csv, the results for each genus are saved to a file named after it.

### Get some crop trait and yield data
for genus in panicum miscanthus salix populus
do
  # quote the URL so the shell does not treat ? as a wildcard
  curl -o "$genus.csv" "https://www.betydb.org/search.csv?search=$genus"
done

### Get some weather data
cp /data/terraref/sites/uiuc-energyfarm/raw_data/weather_stations/Weather*DayAvg.dat ~/project/data/

Let’s take a look at it. head prints the first 10 lines of a file, while tail prints the last 10 lines:

head data.csv
tail data.csv

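The -n flag controls which lines are shown. head -n 10 prints the first 10 lines, while tail -n +11 prints everything from line 11 onward (without the +, tail -n 11 would print the last 11 lines):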

head -n 10 data.csv
tail -n +11 data.csv 
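The difference between a plain number and a +number is easiest to see on a file whose contents you already know. This sketch assumes nothing about the downloaded data: it writes the numbers 1 to 20 into a made-up scratch file, numbers.txt, and selects lines from it.

```shell
workdir=$(mktemp -d)                       # scratch directory for the demo
seq 1 20 > "$workdir/numbers.txt"          # one number per line, 1 through 20

first_three=$(head -n 3 "$workdir/numbers.txt")     # lines 1-3
last_three=$(tail -n 3 "$workdir/numbers.txt")      # lines 18-20
from_line_18=$(tail -n +18 "$workdir/numbers.txt")  # line 18 to the end

echo "$first_three"
echo "$last_three"
echo "$from_line_18"
```

In a 20-line file, the last three lines and "line 18 onward" happen to be the same lines, which is a handy way to check your understanding of the + form.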

You can also print its entire contents to the terminal (not recommended for big files like this one):

cat data.csv

The wc command counts the lines, words, and characters (in that order) of a file. With the -l flag it reports only the number of lines:

wc -l data.csv

You can also save this data into a file:

wc -l data.csv > lengths.txt

and sort it:

sort lengths.txt

sort prints the sorted lines to the terminal but doesn’t modify the file lengths.txt, so to keep the result you need to save it to a new file:

sort lengths.txt > sorted_lengths.txt
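Sorting becomes more interesting when lengths.txt holds the line counts of several files. The following is a small sketch under that assumption; the three text files are made up for the demo, and sort -n is used so that the counts are compared as numbers.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Three toy files with 3, 1, and 2 lines.
printf 'a\nb\nc\n' > three.txt
printf 'a\n'       > one.txt
printf 'a\nb\n'    > two.txt

# Record "count filename" for each file. The unquoted $(...) lets the
# shell trim any padding that wc puts around the number.
for f in three.txt one.txt two.txt; do
  echo $(wc -l < "$f") "$f"
done > lengths.txt

sort -n lengths.txt > sorted_lengths.txt   # lengths.txt itself is unchanged
cat sorted_lengths.txt
```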

Pipes

A pipe | sends the output of the command on its left to the command on its right as input, for example:

tail -n +11 data.csv | head -n 4 

Compare this with redirection using >, which saves the output to a file instead of passing it to another command:

tail -n +11 data.csv > miscanthus.csv
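Here is the same contrast on a file with known contents, as a minimal sketch. The file names numbers.txt and from_11.txt are made up; the first command passes lines along a pipe, and the second stores them in a file.

```shell
workdir=$(mktemp -d)
seq 1 20 > "$workdir/numbers.txt"          # one number per line, 1 through 20

# Pipe: lines 11-20 flow straight into head, which keeps the first four.
piped=$(tail -n +11 "$workdir/numbers.txt" | head -n 4)

# Redirect: lines 11-20 are written to a new file instead.
tail -n +11 "$workdir/numbers.txt" > "$workdir/from_11.txt"

# Pipes chain freely: count the saved lines, then strip wc's padding.
saved_lines=$(wc -l < "$workdir/from_11.txt" | tr -d ' ')

echo "$piped"
echo "lines saved: $saved_lines"
```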

Find with grep

The command grep can be used to find a pattern in a file, for example to find the pattern yield in the file miscanthus.csv:

grep yield miscanthus.csv
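grep is case-sensitive by default, which matters for column values such as yield versus Yield. The sketch below uses a tiny made-up CSV (not real BETYdb output) to show plain matching, the -i flag for ignoring case, and -c for counting matching lines.

```shell
workdir=$(mktemp -d)

# Toy data invented for this example.
cat > "$workdir/sample.csv" <<'EOF'
genus,trait,mean
Miscanthus,yield,12.5
Miscanthus,height,3.1
Salix,Yield,8.2
EOF

matches=$(grep yield "$workdir/sample.csv")               # exact case only
count_any_case=$(grep -i -c yield "$workdir/sample.csv")  # count, ignoring case

echo "$matches"
echo "lines matching in any case: $count_any_case"
```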

Find and replace with sed

The following replaces “willow” with “Willow” in the file willow.csv. The -i flag makes sed edit the file in place; without it, sed prints the result to the terminal and leaves the file unchanged (try it with and without -i on a small text file of your own creation):

sed -i 's/willow/Willow/g' willow.csv

In sed -i 's/find/replace/g', -i means change the file in place, s/ means ‘substitute’, and /g means replace all occurrences on each line rather than only the first.
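A safe way to experiment is to leave out -i, so that sed only prints its result and the file stays as it was. This sketch uses a made-up two-line file, toy.csv, to compare the substitution with and without the trailing g.

```shell
workdir=$(mktemp -d)
printf 'willow,1\nwillow and willow,2\n' > "$workdir/toy.csv"   # toy data

# With g: every occurrence on each line is replaced.
all_replaced=$(sed 's/willow/Willow/g' "$workdir/toy.csv")

# Without g: only the first occurrence on each line is replaced.
first_replaced=$(sed 's/willow/Willow/' "$workdir/toy.csv")

echo "$all_replaced"
echo "$first_replaced"

# No -i was used, so the file on disk still has the original text.
cat "$workdir/toy.csv"
```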

Parse text with cut

The command cut extracts selected sections from each line of a file. Here -d, sets the comma as the field delimiter and -f3,5 selects the third and fifth fields:

cut -d, -f3,5 miscanthus.csv
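To check which fields a given -f selection returns, it helps to try cut on a line or two where the columns are obvious. The header and values below are invented for illustration and are not the actual BETYdb column layout.

```shell
# Two toy CSV lines: a header and one record.
toy_rows=$(printf 'genus,trait,mean,units\nSalix,yield,8.2,Mg/ha\n')

# Keep the first and third comma-separated fields of every line.
selected=$(printf '%s\n' "$toy_rows" | cut -d, -f1,3)

# Keep a single field: the second one.
traits=$(printf '%s\n' "$toy_rows" | cut -d, -f2)

echo "$selected"
echo "$traits"
```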

For Loops

The shell has for loops to iterate actions. When you enter:

for genus in panicum miscanthus salix 

the shell enters a command entry mode in which you need to tell it what to do as it loops through panicum, miscanthus, and salix. For example, to echo these words to the terminal:

for genus in panicum miscanthus salix ; do
    echo $genus
done

To echo the integers from 1 to 10:

for i in $(seq 1 10); do
   echo $i
done
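Loops become useful once the body does more than echo. As a small sketch, the first loop below adds up the integers from 1 to 10, and the second builds a list of file names from the genus names used earlier; the variable names total and file_names are made up for the example.

```shell
# Add up the integers 1 through 10.
total=0
for i in $(seq 1 10); do
  total=$((total + i))
done
echo "sum of 1..10: $total"

# Build a list of file names, one per genus.
file_names=""
for genus in panicum miscanthus salix; do
  file_names="$file_names $genus.csv"
done
echo "files:$file_names"
```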

Shell Scripts

You can write short scripts to run commands in the shell. You can use any text editor; here we’ll use nano. By convention, shell script names end in .sh.

Create a shell script and open it with nano:

nano getdata.sh 

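In the above, notice that the terminal was “locked”: there was no command prompt, because nano runs inside the terminal until you exit it. A program that opens its own window, such as a graphical text editor, can instead be started in the background by adding & to the end of the command, which returns the prompt right away. This does not work for nano itself, which needs the terminal in the foreground.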

Now type the following in the file:

#!/bin/bash
for genus in panicum miscanthus salix populus
do
  # save each download as <genus>.csv; quote the URL so ? is not treated as a wildcard
  curl -o "$genus.csv" "https://www.betydb.org/search.csv?search=$genus"
done

Save the file and exit (in nano’s menu the ^ means Ctrl, so ^O saves and ^X exits).

Make the file executable using the program chmod, which modifies the permissions of a file. +x makes it executable (similarly, +rw makes it readable and writable):

chmod +x getdata.sh
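You can check the effect of chmod from the shell itself with the -x file test. The sketch below creates a throwaway script, hello.sh (a made-up name), and tests whether it is executable before and after chmod +x; it assumes a typical setup in which newly created files are not executable.

```shell
workdir=$(mktemp -d)
script="$workdir/hello.sh"
printf '#!/bin/sh\necho hello\n' > "$script"   # a throwaway script

# [ -x FILE ] succeeds only if FILE is executable by you.
if [ -x "$script" ]; then before="yes"; else before="no"; fi

chmod +x "$script"                             # add execute permission

if [ -x "$script" ]; then after="yes"; else after="no"; fi

echo "executable before chmod: $before"
echo "executable after chmod:  $after"
ls -l "$script"                                # x now appears in the permissions
```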

Run the file:

./getdata.sh

To run the file in the background and continue using your terminal, you can run it with nohup, which keeps it running after you log out and, by default, writes its output to nohup.out:

nohup ./getdata.sh &

To see its progress:

tail -f nohup.out
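The trailing & is what puts a command in the background. The following sketch uses a one-second sleep as a stand-in for a slow download, with a made-up log file name, to show that the shell carries on immediately and can later wait for the job to finish.

```shell
workdir=$(mktemp -d)

# Start a slow "job" in the background; & returns control right away.
( sleep 1; echo "finished" > "$workdir/job.log" ) &
job_pid=$!                                  # process ID of the background job

echo "still free to type while job $job_pid runs"

wait "$job_pid"                             # block until the job is done
job_result=$(cat "$workdir/job.log")
echo "job says: $job_result"
```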

Using variables in your shell script

You can write shell scripts that take input variables (arguments).

When you run a bash script, arguments can be added after the name of the script, like myscript.sh var1 var2. Inside the script they are available as $1, $2, and so on.

Create a file:

nano getonegenus.sh

Write the following. Here $1 represents the first input variable:

#!/bin/bash
curl -o "$1.csv" "https://www.betydb.org/search.csv?search=$1"

Save it and make it executable:

chmod +x getonegenus.sh

Now you can send the search term via the command line:

./getonegenus.sh salix
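A script that expects an argument is friendlier if it checks that one was given. The sketch below writes a hypothetical helper, show_genus.sh, that prints the file name and URL a download would use instead of contacting the server. It is run with sh explicitly so that it works even where the scratch directory does not allow executing files directly.

```shell
workdir=$(mktemp -d)
script="$workdir/show_genus.sh"     # hypothetical helper, not part of the lesson

cat > "$script" <<'EOF'
#!/bin/sh
# $# is the number of arguments; stop with a usage message if there are none.
if [ $# -lt 1 ]; then
  echo "usage: $0 GENUS" >&2
  exit 1
fi
# $1 is the first argument given on the command line.
echo "$1.csv <- https://www.betydb.org/search.csv?search=$1"
EOF

with_arg=$(sh "$script" salix)

# Run it again without an argument and record the exit status.
missing_status=0
sh "$script" 2>/dev/null || missing_status=$?

echo "$with_arg"
echo "exit status without an argument: $missing_status"
```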

Time Saving Tips

  • You can press Tab to autocomplete commands and file names.
  • You can press Ctrl+A to go to the beginning of a line.
  • You can press the up arrow to view previous commands.
  • To clear the line, use Ctrl+U.
  • There are many more if you Google “Time-saving command line tips”.

Regular Expressions (‘Regex’)

Regular expressions are too big a topic for today and take time to learn, but examples are easy to find by searching for phrases like

“Regex date range at end of string”

Real World Examples:

Advanced:

Problem: changing file naming conventions broke a computing pipeline

The recent calibration images brought to my attention that the SWIR naming convention has diverged from the VNIR. SWIR files now have an extra date string (right after the UUID) in their names, and also are missing an underscore before the english suffix. Was there a good reason for changing the naming convention? It would be helpful if SWIR used the same convention as VNIR. Otherwise we need to re-write the extractors to handle both conventions and be backwards compatible, etc.

Questions

  • What is the UUID?
  • Why is having an extra date a problem?
  • How many dates can you find in the following code snippet? Are they consistent?

zender@cg-gpu01:~/nco$ ls -l /projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/2017-04-15/2017-04-15__11-59-12-426
total 481696
-rw-r--r-- 1 dlebauer grp_202     55123 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11frameIndex.txt
-rw-r--r-- 1 dlebauer grp_202     27533 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11image.jpg
-rw-r--r-- 1 dlebauer grp_202 493129728 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11raw
-rw-r--r-- 1 dlebauer grp_202      3503 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11raw.hdr
-rw-r--r-- 1 dlebauer grp_202       869 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11settings.txt
-rw-r--r-- 1 dlebauer grp_202      3561 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_metadata.json


zender@cg-gpu01:~/nco$ ls -l /projects/arpae/terraref/sites/ua-mac/raw_data/VNIR/2017-04-15/2017-04-15__11-56-59-902/
total 5204960
-rw-r--r-- 1 dlebauer grp_202      40591 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_frameIndex.txt
-rw-r--r-- 1 dlebauer grp_202      69299 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_image.jpg
-rw-r--r-- 1 dlebauer grp_202       3605 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_metadata.json
-rw-r--r-- 1 dlebauer grp_202 5329664000 Apr 15 12:02 ca045a19-7b12-4627-b700-9f51f5829b64_raw
-rw-r--r-- 1 dlebauer grp_202      10257 Apr 15 12:01 ca045a19-7b12-4627-b700-9f51f5829b64_raw.hdr
-rw-r--r-- 1 dlebauer grp_202        872 Apr 15 12:02 ca045a19-7b12-4627-b700-9f51f5829b64_settings.txt
  • The date appears in the directory name, the date-time subdirectory name, the filename (for SWIR but not VNIR), and the output from ls -l.
  • Note that there is a ~6 minute discrepancy between the SWIR directory name and the filename. The file timestamp is in between.

Solution Part 1

Don’t change upstream formats / conventions without alerting downstream developers!

After a few similar issues we created a protocol:

  • upstream developers alert downstream developers before changing formats.
  • downstream developers write tests to catch upstream changes before they cause errors (which, in most cases, means the pipeline breaking).

This is a clear rule but difficult to enforce without extensive automated testing (in this case the tests had been written, but the pipeline was lagging a few months behind and did not catch the change).

Solution Part 2

When Part 1 fails, write a script to rename all of the files. Let’s take a look.

#!/bin/bash

cd /projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/
for d in `/bin/ls -d 2016-11-0[89] 2016-11-[123]? 2016-12-?? 2017-??-??`; do 
    yyyymmdd=$d
    drc_top="/projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/${yyyymmdd}"
    cd ${drc_top}
    for drc_sub in `/bin/ls -d ${yyyymmdd}*` ; do
        echo "Renaming in directory ${drc_sub}..."
        cd ${drc_top}/${drc_sub}
        for fl in `/bin/ls *raw` ; do
          dt_sng=`expr match "${fl}" '.*\([0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]\)\.*'`
          uuid_sng=${fl:0:36}
        done # !fl
        echo "uuid=${uuid_sng}, dt=${dt_sng}"
        for sfx in frameIndex.txt image.jpg raw raw.hdr settings.txt ; do
          mv_cmd="/bin/mv ${uuid_sng}_${dt_sng}${sfx} ${uuid_sng}_${sfx}"
          echo "/bin/mv ${uuid_sng}_${dt_sng}${sfx} ${uuid_sng}_${sfx}"
          eval ${mv_cmd}
        done # !sfx
   done # !drc_sub
done # !d

Can you find:

  • How many for-loops are there? How deeply are they nested?
  • What do you think the following regex characters mean:
  • [0-9]?
  • [123]
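To experiment with the string handling behind the rename without touching any real files, you can work on a single file name taken from the SWIR listing above. Note that this sketch is not the script's own method: the script uses expr match and ${fl:0:36}, whereas this version swaps in sed and prefix/suffix removal, and it only prints the mv command it would run.

```shell
# One SWIR file name from the listing above.
fl="f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11raw"

# The UUID contains hyphens but no underscores, so it is everything
# before the first underscore.
uuid_sng=${fl%%_*}

# Pull out the date string: four digits followed by five _NN groups.
dt_sng=$(printf '%s\n' "$fl" | sed -E 's/.*([0-9]{4}(_[0-9]{2}){5}).*/\1/')

# The suffix is whatever follows the date string.
sfx=${fl##*"$dt_sng"}

# New name in the VNIR style: UUID, underscore, suffix.
new_name="${uuid_sng}_${sfx}"

echo "uuid=${uuid_sng}, dt=${dt_sng}, sfx=${sfx}"
echo "would run: mv ${fl} ${new_name}"
```

Printing the command first, as the original script also does with echo, is a cheap way to check a bulk rename before letting it change anything.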