The Terminal (Unix Shell)
“What is a command shell and why would I use one?”
This tutorial is based on the Software Carpentry Unix Shell lesson. We will refer to it for background information. As you become familiar with the Unix shell, it is worth reviewing some of the more advanced topics in the SWC lesson Shell Extras.
Today we will learn:
- how the shell relates to the keyboard, the screen, the operating system, and users’ programs
- when and why command-line interfaces should be used instead of graphical interfaces
- the similarities and differences between a file and a directory
- absolute and relative paths
- the steps in the shell’s read-run-print cycle
- commands, flags, and filenames in a command-line call
Background
A shell is often referred to as a ‘terminal’ or ‘command line interface’ (CLI). A CLI differs from a graphical user interface (GUI) in that the CLI responds only to text, whereas a GUI can respond to text as well as mouse input.
Many GUI programs also have a command line interface, even if you don’t know it. In addition, the standard Unix shell provides access to a diverse range of standard programs. These make it easier to automate repetitive tasks as well as access other computers and servers. While the shell is powerful, at first it can feel unfamiliar, with cryptic commands and operations.
The heart of a CLI is a read-evaluate-print loop, or REPL: when the user types a command and then presses the Enter (or Return) key, the computer reads it, executes it, and prints its output. The user then types another command, and so on until the user logs off.
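A minimal sketch of a few passes through this loop, which also shows the parts of a typical command-line call. The commands are standard; the variable name `greeting` is only for illustration.

```shell
# The shell reads this line, runs the echo program, and prints its output.
echo "hello shell"

# A command-line call typically has three parts: command, flags, arguments.
# Here ls is the command, -a is a flag (also show hidden entries),
# and / is the argument (the directory to list).
ls -a /

# $(...) captures what a command prints so it can be stored in a variable.
greeting=$(echo "hello shell")
echo "Captured: ${greeting}"
```

After each command finishes, the shell prints a new prompt and waits for the next line.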
Using Bash or any other shell sometimes feels more like programming than like using a mouse. Commands are terse (often only a couple of characters long), their names are frequently cryptic, and their output is lines of text rather than something visual like a graph. On the other hand, with only a few keystrokes, the shell allows us to combine existing tools into powerful pipelines and handle large volumes of data automatically. This automation not only makes us more productive but also improves the reproducibility of our workflows by allowing us to repeat them with a few simple commands.
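As a small illustration of combining existing tools into a pipeline, the sketch below counts how often each word occurs in a short list and keeps the most frequent one. The word list and the variable name `top_word` are invented for the example.

```shell
# Each | sends the output of the command on its left to the command on its right.
# sort groups identical lines, uniq -c counts each group,
# sort -rn orders the counts from largest to smallest, head -n 1 keeps the top line.
top_word=$(printf 'apple\nbanana\napple\ncherry\napple\nbanana\n' \
  | sort | uniq -c | sort -rn | head -n 1)

# Reports apple with a count of 3 (the exact spacing depends on the uniq implementation).
echo "$top_word"
```

None of these tools knows about the others; the shell connects them, so the same pattern works on much larger inputs.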
In addition, the command line is often the easiest way to interact with remote machines and supercomputers. Familiarity with the shell is nearly essential for running a variety of specialized tools and resources, including high-performance computing systems. As clusters and cloud computing systems become more popular for scientific data crunching, being able to interact with the shell is becoming a necessary skill. We can build on the command-line skills covered here to tackle a wide range of scientific questions and computational challenges.
Environment Variables
echo $PATH
printenv
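The two commands above print one variable and the whole environment, respectively. The following sketch shows the difference between a plain shell variable and an exported environment variable; the names `project` and `PROJECT_DIR` and the path are hypothetical.

```shell
# A plain shell variable is visible only inside the current shell.
project="terraref"

# export turns a variable into an environment variable,
# which child processes inherit.
export PROJECT_DIR="/tmp/${project}"

# A child shell sees the exported variable but not the plain one;
# ${project:-unset} prints "unset" when project is not defined.
child_view=$(sh -c 'echo "${PROJECT_DIR}:${project:-unset}"')
echo "$child_view"   # prints /tmp/terraref:unset
```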
Find with grep
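A minimal sketch of searching a file with `grep`, assuming a small made-up file `samples.txt` created in a scratch directory:

```shell
# Create a scratch file to search (contents are invented for the example).
workdir=$(mktemp -d)
printf 'sample_A raw\nsample_B processed\nsample_C raw\n' > "${workdir}/samples.txt"

# Print every line that contains the text "raw".
grep "raw" "${workdir}/samples.txt"

# -c counts matching lines instead of printing them.
raw_count=$(grep -c "raw" "${workdir}/samples.txt")

# -v inverts the match: print lines that do NOT contain "raw".
not_raw=$(grep -v "raw" "${workdir}/samples.txt")

echo "lines with raw: ${raw_count}; other lines: ${not_raw}"
rm -r "${workdir}"
```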
Find and replace with sed
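A minimal sketch of substitution with `sed`, using an invented sentence. The `s/old/new/` command replaces the first match on each line; adding `g` replaces every match.

```shell
line="SWIR files differ from VNIR files; SWIR names carry a date"

# Replace only the first occurrence of SWIR on the line.
first_only=$(echo "$line" | sed 's/SWIR/swir/')

# The trailing g (global) replaces every occurrence on the line.
all_matches=$(echo "$line" | sed 's/SWIR/swir/g')

echo "$first_only"
echo "$all_matches"
```

Here `sed` reads from a pipe and writes to the screen, so no file is modified.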
For Loops
for i in $(seq 1 10); do
echo $i
done
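A common use of a loop is to repeat a command for every file that matches a pattern. The sketch below assumes three empty, made-up `.txt` files in a scratch directory.

```shell
# Create a scratch directory with a few example files.
workdir=$(mktemp -d)
touch "${workdir}/a.txt" "${workdir}/b.txt" "${workdir}/c.txt"

# The shell expands *.txt into the list of matching files before the loop starts.
count=0
for f in "${workdir}"/*.txt; do
  echo "Found $(basename "$f")"
  count=$((count + 1))
done

echo "Processed ${count} files"
rm -r "${workdir}"
```

Quoting `"$f"` keeps the loop working when a filename contains spaces.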
Shell Scripts
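A shell script is a text file containing commands that the shell runs in order. The sketch below writes a tiny script and runs it; the file name `greet.sh` and its contents are hypothetical.

```shell
# Write a two-line script into a scratch directory.
workdir=$(mktemp -d)
cat > "${workdir}/greet.sh" <<'EOF'
#!/bin/bash
# $1 is the first argument given to the script; "world" is the fallback.
echo "Hello, ${1:-world}!"
EOF

# Mark the script as executable so it can also be run as ./greet.sh
chmod +x "${workdir}/greet.sh"

# Run the script with one argument and capture what it prints.
greeting=$(bash "${workdir}/greet.sh" shell)
echo "$greeting"   # prints Hello, shell!
rm -r "${workdir}"
```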
Regular Expressions (‘Regex’)
Too long for today, difficult to learn, but examples are Googleable, like
Real World Examples:
Advanced:
Problem: changing file naming conventions broke a computing pipeline
The recent calibration images brought to my attention that the SWIR naming convention has diverged from the VNIR. SWIR files now have an extra date string (right after the UUID) in their names, and also are missing an underscore before the English suffix. Was there a good reason for changing the naming convention? It would be helpful if SWIR used the same convention as VNIR. Otherwise we need to re-write the extractors to handle both conventions and be backwards compatible, etc.
Questions
- What is the UUID?
- Why is having an extra date a problem?
- How many dates can you find in the following code snippet? Are they consistent?
zender@cg-gpu01:~/nco$ ls -l /projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/2017-04-15/2017-04-15__11-59-12-426
total 481696
-rw-r--r-- 1 dlebauer grp_202 55123 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11frameIndex.txt
-rw-r--r-- 1 dlebauer grp_202 27533 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11image.jpg
-rw-r--r-- 1 dlebauer grp_202 493129728 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11raw
-rw-r--r-- 1 dlebauer grp_202 3503 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11raw.hdr
-rw-r--r-- 1 dlebauer grp_202 869 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11settings.txt
-rw-r--r-- 1 dlebauer grp_202 3561 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_metadata.json
zender@cg-gpu01:~/nco$ ls -l /projects/arpae/terraref/sites/ua-mac/raw_data/VNIR/2017-04-15/2017-04-15__11-56-59-902/
total 5204960
-rw-r--r-- 1 dlebauer grp_202 40591 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_frameIndex.txt
-rw-r--r-- 1 dlebauer grp_202 69299 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_image.jpg
-rw-r--r-- 1 dlebauer grp_202 3605 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_metadata.json
-rw-r--r-- 1 dlebauer grp_202 5329664000 Apr 15 12:02 ca045a19-7b12-4627-b700-9f51f5829b64_raw
-rw-r--r-- 1 dlebauer grp_202 10257 Apr 15 12:01 ca045a19-7b12-4627-b700-9f51f5829b64_raw.hdr
-rw-r--r-- 1 dlebauer grp_202 872 Apr 15 12:02 ca045a19-7b12-4627-b700-9f51f5829b64_settings.txt
- The date appears in the directory name, the date-time subdirectory, the filename (for SWIR but not VNIR), and the output of `ls -l`.
- Note that there is a ~6 minute discrepancy between the SWIR subdirectory name and the filename; the file timestamp falls in between.
Solution Part 1
Don’t change upstream formats / conventions without alerting downstream developers!
- After a few similar issues we created a protocol:
  - upstream developers alert downstream developers before changing formats
  - downstream developers write tests that catch upstream changes before they cause errors downstream (in most cases, a broken pipeline)
This is a clear rule but difficult to enforce without extensive automated testing (in this case the tests had been written, but the pipeline was lagging a few months behind and did not catch the change).
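As a rough sketch of what such a test could look like (not the project's actual test), the function below checks whether a filename follows a `<UUID>_<suffix>` pattern. The function name `follows_convention` is hypothetical, and the pattern is an assumption inferred from the VNIR listing above.

```shell
# Return success if the name is a UUID, one underscore, and a known suffix.
follows_convention() {
  echo "$1" | grep -Eq '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}_(frameIndex\.txt|image\.jpg|metadata\.json|raw|raw\.hdr|settings\.txt)$'
}

# One name from each listing above.
vnir="ca045a19-7b12-4627-b700-9f51f5829b64_raw.hdr"
swir="f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11raw.hdr"

for name in "$vnir" "$swir"; do
  if follows_convention "$name"; then
    echo "OK:  $name"
  else
    echo "BAD: $name"
  fi
done
```

Run against newly arrived data, a check like this would flag the SWIR name before any downstream extractor tried to parse it.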
Solution Part 2
When Part 1 fails, write a script to rename all of the files. Let’s take a look.
#!/bin/bash
# Strip the extra date string from SWIR filenames so they match the VNIR convention.
cd /projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/
# Loop over the date directories selected by the patterns below.
for d in `/bin/ls -d 2016-11-0[89] 2016-11-[123]? 2016-12-?? 2017-??-??`; do
    yyyymmdd=$d
    drc_top="/projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/${yyyymmdd}"
    cd ${drc_top}
    # Loop over the date-time subdirectories inside each date directory.
    for drc_sub in `/bin/ls -d ${yyyymmdd}*` ; do
        echo "Renaming in directory ${drc_sub}..."
        cd ${drc_top}/${drc_sub}
        # Use the raw file's name to extract the date string and the UUID.
        for fl in `/bin/ls *raw` ; do
            dt_sng=`expr match "${fl}" '.*\([0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]\)\.*'`
            uuid_sng=${fl:0:36}
        done # !fl
        echo "uuid=${uuid_sng}, dt=${dt_sng}"
        # Rename each file: drop the date string and keep one underscore before the suffix.
        for sfx in frameIndex.txt image.jpg raw raw.hdr settings.txt ; do
            mv_cmd="/bin/mv ${uuid_sng}_${dt_sng}${sfx} ${uuid_sng}_${sfx}"
            echo "/bin/mv ${uuid_sng}_${dt_sng}${sfx} ${uuid_sng}_${sfx}"
            eval ${mv_cmd}
        done # !sfx
    done # !drc_sub
done # !d
## sh: 3: cd: can't cd to /projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/
## /bin/ls: cannot access '2016-11-0[89]': No such file or directory
## /bin/ls: cannot access '2016-11-[123]?': No such file or directory
## /bin/ls: cannot access '2016-12-??': No such file or directory
## /bin/ls: cannot access '2017-??-??': No such file or directory
Can you find:
- How many for-loops are there? How deeply are they nested?
- What do you think the following regex characters mean:
  - [0-9]
  - ?
  - [123]
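Returning to the renaming script in Solution Part 2: the sketch below performs the same kind of rename in a scratch directory so that it can be tried safely. It is not the original script. It assumes empty stand-in files named like the SWIR listing, and it swaps in shell globbing and `sed -E` in place of parsing `ls` output and `expr match`.

```shell
# Build a scratch directory holding one SWIR-style file set.
workdir=$(mktemp -d)
uuid="f130c910-7887-49b0-97bb-db49e8c85e63"
stamp="2017_04_15_12_05_11"
for sfx in frameIndex.txt image.jpg raw raw.hdr settings.txt; do
  touch "${workdir}/${uuid}_${stamp}${sfx}"
done

# Rename <uuid>_<stamp><suffix> to <uuid>_<suffix>.
renamed=0
for old in "${workdir}"/*_[0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]*; do
  [ -e "$old" ] || continue   # skip if the pattern matched nothing
  base=$(basename "$old")
  # Replace "_YYYY_MM_DD_hh_mm_ss" with a single underscore.
  new=$(echo "$base" | sed -E 's/_[0-9]{4}(_[0-9]{2}){5}/_/')
  echo "mv ${base} ${new}"
  mv "$old" "${workdir}/${new}"
  renamed=$((renamed + 1))
done

# Show the result, then clean up.
listing=$(ls "${workdir}")
echo "$listing"
rm -r "${workdir}"
```

Printing each `mv` before running it, as the original script also does, leaves a record of what was changed. Commenting out the `mv` line turns the loop into a dry run.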