“What is a command shell and why would I use one?”
This tutorial is based on the Software Carpentry Unix Shell lesson [@swc_unix_shell] and refers to it for background information.
Today we will learn:
- how the shell relates to the keyboard, the screen, the operating system, and users’ programs
- when and why command-line interfaces should be used instead of graphical interfaces
- similarities and differences between a file and a directory
- absolute and relative paths
- steps in the shell’s read-run-print cycle
- commands, flags, and filenames in a command-line call
Background
A shell is often referred to as a ‘terminal’ or ‘command line interface’ (CLI). A CLI is different from a graphical user interface (GUI) in that the CLI only responds to text, whereas a GUI can respond to text as well as mouse inputs.
Many GUI programs also have a command line interface, even if you don’t know it. In addition, the standard Unix shell provides access to a diverse range of standard programs. These make it easier to automate repetitive tasks as well as access other computers and servers. While the shell is powerful, it can feel unfamiliar at first, with cryptic commands and operations.
The heart of a CLI is a read-evaluate-print loop, or REPL: when the user types a command and then presses the Enter (or Return) key, the computer reads it, executes it, and prints its output. The user then types another command, and so on until the user logs off.
Using Bash or any other shell sometimes feels more like programming than like using a mouse. Commands are terse (often only a couple of characters long), their names are frequently cryptic, and their output is lines of text rather than something visual like a graph. On the other hand, with only a few keystrokes, the shell allows us to combine existing tools into powerful pipelines and handle large volumes of data automatically. This automation not only makes us more productive but also improves the reproducibility of our workflows by allowing us to repeat them with a few simple commands.
In addition, the command line is often the easiest way to interact with remote machines and supercomputers. Familiarity with the shell is near essential to run a variety of specialized tools and resources including high-performance computing systems. As clusters and cloud computing systems become more popular for scientific data crunching, being able to interact with the shell is becoming a necessary skill. We can build on the command-line skills covered here to tackle a wide range of scientific questions and computational challenges.
Looking around
You can use the command ls
to look at different directories (without changing the current working directory) as follows:
ls
ls ~/
ls /data/terraref/sites/ua-mac
To change to the directory /data/terraref/sites/ua-mac
you can use the cd
command:
cd /data/terraref/sites/ua-mac
ls ## reveal the contents
cd ## shortcut brings you back home
pwd
Directories in the shell have a tree-like structure. You can use cd ..
to move up one step in the tree, cd ../..
to move up two steps, cd ../../..
to move up three steps, and so on. Just typing in cd
will bring you to your home directory (equivalent to cd ~/).
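The difference between absolute and relative paths can be sketched with a throwaway directory tree. This is a minimal illustration, assuming mktemp is available; the directory names sites and ua-mac are reused from above purely as examples and nothing here touches the real data.

```shell
# Build a small throwaway tree (names are illustrative only)
top=$(mktemp -d)
mkdir -p "$top/sites/ua-mac"

cd "$top/sites/ua-mac"   # absolute path: spelled out from the root, /
start=$(pwd -P)          # -P prints the physical path, with symlinks resolved

cd ..                    # relative path: one step up, into sites
cd ua-mac                # relative path: back down again
back=$(pwd -P)

cd ../..                 # two steps up, to the top of the throwaway tree
top_seen=$(pwd -P)

echo "started in:  $start"
echo "returned to: $back"
echo "two up:      $top_seen"
```

An absolute path means the same thing from anywhere, while a relative path depends on the current working directory.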
Environment Variables
Environment variables are named values that the shell passes on to the programs it starts, and they affect how that software behaves. For example, try echoing the environment variable PATH, which lists the directories the shell searches for commands,
as follows:
echo $PATH
You can list all of the environment variables with:
printenv
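To see how a variable becomes part of the environment, the following sketch compares a plain shell variable with an exported one. The names local_note and PROJECT_NAME are made up for this illustration.

```shell
# A plain shell variable is visible only in the current shell
local_note="hello"
# An exported variable is also passed to programs started from this shell
export PROJECT_NAME="demo"   # PROJECT_NAME is a made-up name

# sh -c starts a child process; the single quotes stop the parent shell
# from expanding the variables before the child sees them
child_sees=$(sh -c 'echo "note=[$local_note] project=[$PROJECT_NAME]"')
echo "$child_sees"
```

Only the exported variable reaches the child process, which is why settings such as PATH are exported.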
Useful Tools
Let’s look at some useful tools. First we need a file, so let’s go ahead and make a directory to put some data in:
mkdir project
mkdir project/data
cd project/data
Then we get some data from the BETYdb.org database.
We use the curl
program, a command line tool for transferring data to and from URLs. The -o $genus.csv
option tells curl
where to save the output from the URL.
### Get some crop trait and yield data
for genus in panicum miscanthus salix populus
do
    curl -o "$genus.csv" "https://www.betydb.org/search.csv?search=$genus"
done
### Get some weather data
cp /data/terraref/sites/uiuc-energyfarm/raw_data/weather_stations/Weather*DayAvg.dat ~/project/data/
Let’s take a look at it. head
prints the first 10 lines of a file, while tail
prints the last 10 lines:
head data.csv
tail data.csv
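The -n flag controls which lines are shown: head -n 10 prints the first 10 lines (the default), and tail -n +11 prints everything from line 11 onward: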
head -n 10 data.csv
tail -n +11 data.csv
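The two forms of -n are easy to confuse, so here is a small sketch on a generated 20-line file in which line N contains the number N. The temporary file stands in for data.csv.

```shell
# Make a 20-line file where line N contains the number N
tmp=$(mktemp)
seq 1 20 > "$tmp"

first3=$(head -n 3 "$tmp")      # the first three lines
last3=$(tail -n 3 "$tmp")       # the last three lines
from18=$(tail -n +18 "$tmp")    # from line 18 to the end of the file

echo "$first3"
echo "$last3"
echo "$from18"
rm "$tmp"
```

With a plain number, tail counts back from the end of the file; with a leading +, it counts from the start.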
You can also print all of its contents to the terminal (not recommended for big files like this one):
cat data.csv
We can count the lines of a file with wc -l (plain wc, without flags, reports lines, words, and bytes, in that order):
wc -l data.csv
You can also save this output to a file with the > redirection operator:
wc -l data.csv > lengths.txt
and sort it:
sort lengths.txt
This doesn’t modify the file lengths.txt
, so you need to save it to a new file:
sort lengths.txt > sorted_lengths.txt
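One detail worth knowing when sorting counts: by default sort compares lines as text, so numbers need the -n flag. The sketch below uses three hand-written numbers in place of real wc output, and also shows the difference between > (overwrite) and >> (append).

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Stand-ins for line counts such as those written by wc -l
echo 10  >  lengths.txt    # > creates the file, or overwrites it
echo 2   >> lengths.txt    # >> appends to the end
echo 100 >> lengths.txt

sort lengths.txt    > sorted_text.txt      # text order
sort -n lengths.txt > sorted_numeric.txt   # numeric order

cat sorted_text.txt
cat sorted_numeric.txt
```

In text order 100 comes before 2 because the character 1 sorts before 2; with -n the values are compared as numbers.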
Pipes
A pipe |
sends the output of the command on its left to the command on its right as input, for example:
tail -n +11 data.csv | head -n 4
Compare this with redirection, which writes the output to a file instead of passing it to another command:
tail -n +11 data.csv > miscanthus.csv
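As a sketch of what the pipe saves, the following produces lines 11 to 14 of a 20-line stream twice: once with pipes and once with intermediate files. The numbers from seq stand in for the rows of data.csv.

```shell
# With pipes: no intermediate file is created
middle=$(seq 1 20 | tail -n +11 | head -n 4)
echo "$middle"

# Without pipes: the same steps, using temporary files
tmp=$(mktemp)
seq 1 20 > "$tmp"
tail -n +11 "$tmp" > "$tmp.part"
middle_via_files=$(head -n 4 "$tmp.part")
rm "$tmp" "$tmp.part"
```

Both approaches give the same four lines; the pipe avoids creating and cleaning up files that are only needed for a moment.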
Find with grep
The command grep
prints the lines of a file that match a pattern, for example to find the pattern yield
in the file miscanthus.csv
:
grep yield miscanthus.csv
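A few commonly used grep flags can be tried on a tiny made-up table. The column names and values below are invented for illustration and are not the BETYdb layout.

```shell
tmp=$(mktemp)
# A tiny made-up table; not the real BETYdb columns
printf 'genus,trait,mean\nMiscanthus,yield,12.1\nMiscanthus,height,3.2\nSalix,Yield,8.4\n' > "$tmp"

exact=$(grep -c 'yield' "$tmp")       # -c counts matching lines (case-sensitive)
anycase=$(grep -ci 'yield' "$tmp")    # -i ignores upper and lower case
numbered=$(grep -n 'height' "$tmp")   # -n prefixes each match with its line number

echo "$exact $anycase"
echo "$numbered"
rm "$tmp"
```

The case-sensitive search misses the row that spells the trait as Yield, which is a common source of surprises in real data.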
Find and replace with sed
The following replaces “willow” with “Willow” in the file willow.csv
. The -i flag edits the file in place; without it, sed prints the result to the screen and leaves the file unchanged (try it with and without -i on a small text file of your own):
sed -i 's/willow/Willow/g' willow.csv
In sed -i s/find/replace/g
, -i
means change the file in place, s/
means ‘substitute’ and /g
means replace all occurrences on each line, not just the first.
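The effect of each part can be checked on a small throwaway file. This sketch uses -i.bak rather than a bare -i, an assumption made for portability: both GNU and BSD sed accept that form, and it keeps a backup copy of the original.

```shell
tmp=$(mktemp)
printf 'willow,1\nwillow willow,2\n' > "$tmp"

# Without -i the result goes to standard output and the file is untouched
preview=$(sed 's/willow/Willow/g' "$tmp")
unchanged=$(cat "$tmp")

# Without /g only the first match on each line is replaced
first_only=$(sed 's/willow/Willow/' "$tmp")

# With -i.bak the file is edited in place; the original is kept as a .bak copy
sed -i.bak 's/willow/Willow/g' "$tmp"
edited=$(cat "$tmp")
backup=$(cat "$tmp.bak")

echo "$first_only"
echo "$edited"
rm "$tmp" "$tmp.bak"
```

Previewing without -i first is a cheap way to check a substitution before letting it modify a file.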
Parse text with cut
The command cut
can be used to extract sections from each line of a file. Here -d, sets the comma as the field delimiter and -f3,5 selects fields 3 and 5:
cut -d, -f3,5 miscanthus.csv
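Here is a short sketch of field selection on a made-up two-line table (the column names are invented for illustration).

```shell
tmp=$(mktemp)
# Made-up rows with five comma-separated fields
printf 'id,genus,species,site,yield\n1,Miscanthus,giganteus,uiuc,12.1\n' > "$tmp"

cols=$(cut -d, -f3,5 "$tmp")     # fields 3 and 5
range=$(cut -d, -f1-2 "$tmp")    # fields 1 through 2

echo "$cols"
echo "$range"
rm "$tmp"
```

Note that cut splits on every comma, so it does not handle CSV files in which quoted fields themselves contain commas.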
For Loops
The shell has for
loops to iterate actions. When you enter:
for genus in panicum miscanthus salix
the shell waits for more input (the prompt changes to >), and you need to tell it what to do on each pass through panicum
, miscanthus
, salix
. For example to echo these words into the shell:
for genus in panicum miscanthus salix ; do
    echo "$genus"
done
To echo the integers from 1 to 10:
for i in $(seq 1 10); do
    echo "$i"
done
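Loops become more useful when they run over files. The following sketch first creates one small file per genus and then loops over the files matched by a wildcard; the file contents are placeholders, not real data.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Loop over a list of words and create one small file per word
for genus in panicum miscanthus salix; do
    echo "genus"   >  "$genus.csv"    # header line
    echo "$genus"  >> "$genus.csv"    # one placeholder data line
done

# Loop over the files matched by a wildcard (matches come in sorted order)
summary=""
for f in *.csv; do
    n=$(wc -l < "$f")
    # ${f%.csv} strips the .csv ending; $((n)) drops any padding wc may add
    summary="$summary${f%.csv}:$((n)) "
done
echo "$summary"
```

Quoting "$f" and "$genus.csv" keeps the loop working even if a file name contains spaces.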
Shell Scripts
You can write short scripts to run commands on the shell. You can use any text editor; here we’ll use nano
. By convention, shell script names end in .sh
.
Create and open with nano
a shell script:
nano getdata.sh
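While nano is open it takes over the terminal, so there is no command prompt until you exit the editor. Ending a command with & starts it in the background and returns the prompt straight away, which is useful for long-running or graphical programs, but it does not work for a terminal editor such as nano, which needs the terminal and is stopped if started this way.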
Now type the following in the file:
#!/bin/bash
for genus in panicum miscanthus salix populus
do
    curl -o "$genus.csv" "https://www.betydb.org/search.csv?search=$genus"
done
Save the file with ^O and exit with ^X (in nano the ^
means Ctrl
).
Make the file executable using the program chmod
. chmod
modifies the permissions of the file. +x
makes it executable (+rw
makes it readable and writable).
chmod +x getdata.sh
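The effect of chmod +x can be observed directly. In this sketch hello.sh is a made-up script name, and it uses #!/bin/sh rather than #!/bin/bash only so that it runs on systems without bash.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A two-line script; hello.sh is a made-up name
printf '#!/bin/sh\necho "hello from a script"\n' > hello.sh

# [ -x FILE ] is true when the file is executable by the current user
if [ -x hello.sh ]; then before="executable"; else before="not executable"; fi
chmod +x hello.sh
if [ -x hello.sh ]; then after="executable"; else after="not executable"; fi

ls -l hello.sh          # the x characters in the first column show the change
output=$(./hello.sh)
echo "$before -> $after: $output"
```

Without the execute permission, the shell refuses to run ./hello.sh with a "Permission denied" error.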
Run the file:
./getdata.sh
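To run the file in the background and continue using your terminal, start it with nohup and a trailing &. nohup also keeps the script running after you log out and, when output would otherwise go to the terminal, saves it in nohup.out: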
nohup ./getdata.sh &
To see its progress:
tail -f nohup.out
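The background mechanics can be tried without nohup. This sketch starts a short stand-in job (a one-second sleep instead of a real download), keeps its process ID, and waits for it to finish; job.log is a made-up file name.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Start a slow command in the background; its output goes to a log file
( sleep 1; echo "finished" ) > job.log 2>&1 &
job_pid=$!                 # $! holds the process ID of the last background job

echo "job $job_pid is running while the shell stays usable"
wait "$job_pid"            # block until the background job ends
job_status=$?              # exit status of the job; 0 means success
result=$(cat job.log)
echo "status=$job_status log=$result"
```

In a script, wait is what lets you launch several jobs in parallel and then continue only once they are all done.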
Using variables in your shell script.
You can write shell scripts that take arguments.
When you run a bash script, arguments are added after the name of the script, like myscript.sh var1 var2
.
Create a file:
nano getonegenus.sh
Write the following. Here $1
represents the first argument:
#!/bin/bash
curl -o "$1.csv" "https://www.betydb.org/search.csv?search=$1"
Save it and make it executable:
chmod +x getonegenus.sh
Now you can send the search term via the command line:
./getonegenus.sh salix
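Scripts that take arguments are more robust when they check their input. The sketch below writes a hypothetical script, showargs.sh, that requires one argument and accepts an optional second one; it only prints what it would do and makes no network request. It is run through sh so that no chmod is needed.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Write a small script; showargs.sh is a made-up name for this sketch
cat > showargs.sh <<'EOF'
#!/bin/sh
# $# is the number of arguments; stop with a usage message if there are none
if [ "$#" -lt 1 ]; then
    echo "usage: $0 GENUS [FORMAT]" >&2
    exit 1
fi
genus=$1
format=${2:-csv}          # the second argument is optional, csv by default
echo "would save to $genus.$format"
EOF

one=$(sh showargs.sh salix)
two=$(sh showargs.sh salix json)
none_status=0
sh showargs.sh 2>/dev/null || none_status=$?   # no arguments: the script exits with 1

echo "$one"
echo "$two"
echo "exit status with no arguments: $none_status"
```

A non-zero exit status lets a calling script or loop notice that something went wrong.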
Time Saving Tips
- You can press the Tab key to autocomplete.
- You can press Ctrl+A to go to the beginning of a line.
- You can press the up arrow to view previous commands.
- To clear the line, use Ctrl+U.
- There are many more if you Google “Time-saving command line tips”.
Regular Expressions (‘Regex’)
Regular expressions are too big a topic for today and take time to learn, but examples are easy to find with a web search.
Real World Examples:
Advanced:
Problem: changing file naming conventions broke a computing pipeline
The recent calibration images brought to my attention that the SWIR naming convention has diverged from the VNIR. SWIR files now have an extra date string (right after the UUID) in their names, and also are missing an underscore before the english suffix. Was there a good reason for changing the naming convention? It would be helpful if SWIR used the same convention as VNIR. Otherwise we need to re-write the extractors to handle both conventions and be backwards compatible, etc.
Questions
- What is the UUID?
- Why is having an extra date a problem?
- How many dates can you find in the following code snippet? Are they consistent?
zender@cg-gpu01:~/nco$ ls -l /projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/2017-04-15/2017-04-15__11-59-12-426
total 481696
-rw-r--r-- 1 dlebauer grp_202 55123 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11frameIndex.txt
-rw-r--r-- 1 dlebauer grp_202 27533 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11image.jpg
-rw-r--r-- 1 dlebauer grp_202 493129728 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11raw
-rw-r--r-- 1 dlebauer grp_202 3503 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11raw.hdr
-rw-r--r-- 1 dlebauer grp_202 869 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_2017_04_15_12_05_11settings.txt
-rw-r--r-- 1 dlebauer grp_202 3561 Apr 15 12:03 f130c910-7887-49b0-97bb-db49e8c85e63_metadata.json
zender@cg-gpu01:~/nco$ ls -l /projects/arpae/terraref/sites/ua-mac/raw_data/VNIR/2017-04-15/2017-04-15__11-56-59-902/
total 5204960
-rw-r--r-- 1 dlebauer grp_202 40591 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_frameIndex.txt
-rw-r--r-- 1 dlebauer grp_202 69299 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_image.jpg
-rw-r--r-- 1 dlebauer grp_202 3605 Apr 15 12:00 ca045a19-7b12-4627-b700-9f51f5829b64_metadata.json
-rw-r--r-- 1 dlebauer grp_202 5329664000 Apr 15 12:02 ca045a19-7b12-4627-b700-9f51f5829b64_raw
-rw-r--r-- 1 dlebauer grp_202 10257 Apr 15 12:01 ca045a19-7b12-4627-b700-9f51f5829b64_raw.hdr
-rw-r--r-- 1 dlebauer grp_202 872 Apr 15 12:02 ca045a19-7b12-4627-b700-9f51f5829b64_settings.txt
- The date is in the directory name, the date-time subdirectory, the filename (for SWIR but not VNIR), and the output from ls -l.
- Note that there is a ~6 minute discrepancy between the SWIR directory name and the filename. The file timestamp is in between.
Solution Part 1
Don’t change upstream formats / conventions without alerting downstream developers!
- After a few similar issues we created a protocol:
- upstream developers alert downstream developers before changing formats
- downstream developers write tests to catch upstream changes before they cause errors (in most cases, a broken pipeline)
This is a clear rule but difficult to enforce without extensive automated testing (in this case the tests were written, but the pipeline was lagging a few months behind and did not catch the change).
Solution Part 2
When Part 1 fails, write a script to rename all of the files. Let’s take a look.
#!/bin/bash
cd /projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/
for d in `/bin/ls -d 2016-11-0[89] 2016-11-[123]? 2016-12-?? 2017-??-??`; do
    yyyymmdd=$d
    drc_top="/projects/arpae/terraref/sites/ua-mac/raw_data/SWIR/${yyyymmdd}"
    cd ${drc_top}
    for drc_sub in `/bin/ls -d ${yyyymmdd}*` ; do
        echo "Renaming in directory ${drc_sub}..."
        cd ${drc_top}/${drc_sub}
        for fl in `/bin/ls *raw` ; do
            dt_sng=`expr match "${fl}" '.*\([0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]\)\.*'`
            uuid_sng=${fl:0:36}
        done # !fl
        echo "uuid=${uuid_sng}, dt=${dt_sng}"
        for sfx in frameIndex.txt image.jpg raw raw.hdr settings.txt ; do
            mv_cmd="/bin/mv ${uuid_sng}_${dt_sng}${sfx} ${uuid_sng}_${sfx}"
            echo "/bin/mv ${uuid_sng}_${dt_sng}${sfx} ${uuid_sng}_${sfx}"
            eval ${mv_cmd}
        done # !sfx
    done # !drc_sub
done # !d
Can you find:
- How many for-loops are there? How deeply are they nested?
- What do you think the following regex characters mean:
[0-9]
?[123]
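To experiment safely with this kind of renaming, the sketch below recreates the SWIR file names from the listing as empty files in a throwaway directory and renames them to the VNIR convention. It is an alternative illustration, not the script shown above: it uses shell parameter expansion instead of expr and eval, and it assumes the UUID and date string are already known.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Empty stand-in files that follow the SWIR naming pattern from the listing
uuid="f130c910-7887-49b0-97bb-db49e8c85e63"
stamp="2017_04_15_12_05_11"
for sfx in frameIndex.txt image.jpg raw raw.hdr settings.txt; do
    touch "${uuid}_${stamp}${sfx}"
done
touch "${uuid}_metadata.json"       # already follows the VNIR convention

# Rename: drop the date string and restore the underscore before the suffix
for old in "${uuid}_${stamp}"*; do
    sfx=${old#"${uuid}_${stamp}"}    # strip the leading UUID and date string
    new="${uuid}_${sfx}"
    echo "mv $old $new"              # print the plan so it can be checked
    mv "$old" "$new"
done
ls
```

Running the loop with only the echo line first (a dry run) is a useful habit before renaming thousands of real files.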