Introduction

Overview

Teaching: 30 min
Exercises: 10 min

Questions

What is a command shell and why would I use one?

Objectives

Explain how the shell relates to the keyboard, the screen, the operating system, and users’ programs.

Explain when and why command-line interfaces should be used instead of graphical interfaces.

Open your terminal

To start we will open a terminal.

Go to the link given to you at the workshop
Paste the notebook link next to your name into your browser
Select “Terminal” from the “JupyterLab” launcher (or blue button with a plus in the upper left corner)
After you have done this, put up a green sticky note if you see a flashing box next to a $

What am I seeing: when the shell is first opened, you are presented with a prompt, indicating that the shell is waiting for input.

The Shell

The shell is a program where users can type commands. With the shell, it’s possible to invoke complicated programs like climate modeling software or simple commands that create an empty directory with only one line of code. The most popular Unix shell is Bash (the Bourne Again SHell, so called because it’s derived from a shell written by Stephen Bourne). Bash is the default shell on most modern implementations of Unix and in most packages that provide Unix-like tools for Windows. Note that ‘Git Bash’ is a piece of software that enables Windows users to use a Bash-like interface when interacting with Git.

Using the shell will take some effort and some time to learn. While a GUI presents you with choices to select, CLI choices are not automatically presented to you, so you must learn a few commands like new vocabulary in a language you’re studying. However, unlike a spoken language, a small number of “words” (i.e. commands) gets you a long way, and we’ll cover those essential few today.

The grammar of a shell allows you to combine existing tools into powerful pipelines and handle large volumes of data automatically. Sequences of commands can be written into a script, improving the reproducibility of workflows.
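As a minimal sketch of what such a pipeline can look like (the scratch directory and file names are made up for illustration), the commands below create three empty files, then send the output of ls through grep to count the names ending in .txt. The | symbol passes the output of one program to the next.

```shell
# Create a throwaway directory holding three hypothetical files
workdir=$(mktemp -d)
touch "$workdir/sample_a.txt" "$workdir/sample_b.txt" "$workdir/notes.md"

# Pipeline: ls prints the file names, grep -c counts the lines ending in .txt
txt_count=$(ls "$workdir" | grep -c '\.txt$')
echo "Text files found: $txt_count"

# Tidy up the scratch directory
rm -r "$workdir"
```

Saving these same lines in a file would turn them into a script that can be rerun later.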

In addition, the command line is often the easiest way to interact with remote machines and supercomputers. Familiarity with the shell is near essential to run a variety of specialized tools and resources including high-performance computing systems. As clusters and cloud computing systems become more popular for scientific data crunching, being able to interact with the shell is becoming a necessary skill. We can build on the command-line skills covered here to tackle a wide range of scientific questions and computational challenges.

The Prompt

The shell typically uses $ as the prompt, but may use a different symbol. In the examples for this lesson, we’ll show the prompt as $. Most importantly, do not type the prompt when typing commands. Only type the command that follows the prompt. This rule applies both in these lessons and in lessons from other sources. Also note that after you type a command, you have to press the Enter key to execute it.

The prompt is followed by a text cursor, a character that indicates the position where your typing will appear. The cursor is usually a flashing or solid block, but it can also be an underscore or a pipe. You may have seen it in a text editor program, for example.

Note that your prompt might look a little different. In particular, most popular shell environments by default put your user name and the host name before the $. Such a prompt might look like, e.g.:

student@workshop-1:~$

Read Evaluate Print Loop

There are many ways for a user to interact with a computer. For example, we often use a Graphical User Interface (GUI). With a GUI we might move the mouse to a folder icon and click (or tap, on a touch screen) to show the contents of that folder. In a Command-Line Interface (CLI) the user can do all of the same actions (e.g. show the contents of a folder). On the command line the user passes commands to the computer as lines of text. Below are the steps in a Read Evaluate Print Loop (REPL):

1. The shell presents a prompt (like $).
2. The user types a command and presses the Enter key.
3. The computer reads it.
4. The computer executes it and prints its output (if any).
5. Loop from step 4 back to step 1.
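The loop described above can be imitated in a few lines of shell. This is only a toy sketch, not how Bash is actually implemented: toy_repl is a hypothetical name, and the commands are fed in with printf instead of being typed at a keyboard.

```shell
# toy_repl is a made-up helper that mimics the loop; it is not a real shell
toy_repl() {
  while true; do
    printf '$ ' >&2                 # present a prompt (sent to stderr)
    IFS= read -r line || break      # read one line; stop when the input ends
    if [ "$line" = "exit" ]; then   # let the word "exit" end the loop
      break
    fi
    eval "$line"                    # execute the command, which prints its output
  done                              # go back and present the prompt again
}

# Feed the loop two commands and then "exit"
printf 'echo hello\necho world\nexit\n' | toy_repl
```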

Reasons to learn about the shell

Many bioinformatics tools can process large data only in their command-line version, not in the GUI.
The shell makes your work less boring (for example, when repeating the same set of tasks on a large number of files).
The shell makes your work less error-prone.
The shell makes your work more reproducible.
Many bioinformatics tasks require large amounts of computing power.

Let’s call some programs

The most basic command is to call a program to perform its default action. For example, call the program whoami to return your username.
whoami
You can also call a program and pass arguments to the program, for example this command to find which shell we are using:
echo $SHELL
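To make the idea of arguments more concrete, here is a small sketch using echo, which prints back whatever arguments it receives. The variable names greeting and message are made up for this example. The shell replaces a name such as $SHELL or $greeting with the variable's value before echo runs.

```shell
# echo receives four arguments and prints them separated by spaces
echo hello from the shell

# Store some text in a variable (greeting is a made-up name)
greeting="hello again"

# The shell substitutes the value of greeting before calling echo
message=$(echo "$greeting")
echo "$message"
```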

Glance at the Filesystem

The ls command will list the contents of your current directory (directory is synonymous with folder). Any line that starts with # will not be executed. We can write comments to ourselves by starting the line with #.

# call ls to list current directory
ls 

# pass one or more paths of files or directories as argument(s)
ls /home/student/

In a GUI you may customize your finder/file browser based on how you like to search. In general, if you can do something in a GUI, there is a way to do it with text on the command line. I like to see my most recently changed files first. I also like to see the date they were edited.

# call ls on the home directory and show the most recently changed files first (with the `-t` option/flag)
ls -t /home/student/

# add the `-l` flag to show who owns the file, the file size, and the date it was last edited
ls -t -l /home/student/

# add flags to distinguish folders from files (`-F`) and to show "human readable" file sizes (`-h`)
ls -t -l -F -h /home/student/

# combine short flags for faster typing
ls -lthF /home/student/

The basic syntax of a Unix command is:

call the program
pass any flags/options
pass any “order dependent arguments”
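The three parts can be labeled on a real command. The sketch below uses a scratch directory with made-up file names, and also checks that writing the flags separately or combined gives the same listing.

```shell
# Scratch directory with one file and one folder (hypothetical names)
demo_dir=$(mktemp -d)
touch "$demo_dir/reads.txt"
mkdir "$demo_dir/results"

# program: ls   flags: -l -F   order-dependent argument: "$demo_dir"
separate=$(ls -l -F "$demo_dir")

# The same two flags combined behind a single dash
combined=$(ls -lF "$demo_dir")
echo "$combined"

# Compare the two listings
if [ "$separate" = "$combined" ]; then
  same_output="yes"
else
  same_output="no"
fi
echo "Same output: $same_output"
```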

Getting help

ls has lots of other options. There are common ways to find out how to use a command and what options it accepts (depending on your environment). Today we will call the Unix command man and pass the name of the program whose manual we want as the argument to man.

$ man ls

Usage: ls [OPTION]... [FILE]...
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.

Mandatory arguments to long options are mandatory for short options too.
  -a, --all                  do not ignore entries starting with .
  -A, --almost-all           do not list implied . and ..
      --author               with -l, print the author of each file
  -b, --escape               print C-style escapes for nongraphic characters

Help menus show you the basic syntax of the command. Optional elements are shown in square brackets. Ellipses indicate that you can type more than one of the elements.

Help menus show both the long and short versions of the flags. Use the short option when typing commands directly into the shell to minimize keystrokes and get your task done faster. Use the long option in scripts for clarity: a script is typed once but read many times.
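As a small check of the long and short forms, the sketch below runs ls with -a and with --all on a scratch directory (the file names are made up). It assumes GNU ls, whose help text is shown above; other implementations, such as the ls on macOS, may not accept --all.

```shell
# Scratch directory with one hidden and one visible file (hypothetical names)
opts_dir=$(mktemp -d)
touch "$opts_dir/.hidden" "$opts_dir/visible.txt"

# Short form: quick to type at the prompt
short_out=$(ls -a "$opts_dir")

# Long form (assumes GNU ls): easier to read in a script
long_out=$(ls --all "$opts_dir")

echo "$long_out"
```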

When man does not work

If man PROGRAM does not show you a help menu there are other common ways to show help menus that you can try.

Call the program with only the --help flag. If the help text opens in a pager (as man pages do), type q to exit the help screen.

some bioinformatics programs will show a help menu if you call the tool without any flags or arguments (e.g. samtools).
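For example, the --help route could look like the sketch below. It assumes GNU ls; not every program or platform supports --help. The variable names are made up, and sed is used only to keep the first three lines of a long help text.

```shell
# Ask ls for its own help text (assumes GNU ls)
help_text=$(ls --help)

# Keep only the first three lines so the output stays short
first_lines=$(printf '%s\n' "$help_text" | sed -n '1,3p')
echo "$first_lines"
```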

Command not found

If the shell can’t find a program whose name is the command you typed, it will print an error message such as:
$ ks
ks: command not found
This might happen if the command was mistyped or if the program corresponding to that command is not installed. When you get an error message, stay calm and read it through a couple of times. Error messages can seem awkwardly worded at first, but they can really help guide your debugging.
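One way to check whether a program is available before relying on it is the standard command -v. In the sketch below, no_such_program_xyz is a deliberately made-up name standing in for a mistyped command, and the variable names are illustrative. A command that cannot be found leaves the exit status 127.

```shell
# no_such_program_xyz is a made-up name that should not exist on any system
missing_status=0
no_such_program_xyz 2>/dev/null || missing_status=$?
echo "Exit status: $missing_status"

# command -v succeeds only if the shell can find the program
if command -v ls >/dev/null 2>&1; then
  ls_status="found"
else
  ls_status="missing"
fi
echo "ls is $ls_status"
```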

The Cloud

There are a number of reasons why accessing a remote machine is invaluable to any scientist working with large datasets. In the early history of computing, working on a remote machine was standard practice because computers were bulky and expensive. Today we work on laptops or desktops that are more powerful than the sum of the world’s computing capacity 20 years ago, but many analyses (especially in genomics) are too large to run on these laptops/desktops. These analyses require larger machines, often several of them linked together, where remote access is the only practical solution.

Schools and research organizations often link many computers into one High Performance Computing (HPC) cluster on or near the campus. Another model that is becoming common is to “rent” space on a cluster(s) owned by a large company (Amazon, Google, Microsoft, etc). In recent years, computational power has become a commodity and entire companies have been built around a business model that allows you to “rent” one or more linked computers for as long as you require, at lower cost than owning the cluster (depending on how often it is used vs idle, etc). This is the basic principle behind the cloud. You define your computational requirements and off you go.

The cloud is a part of our everyday life (e.g. using Amazon, Google, Netflix, or an ATM involves remote computing). The topic is fascinating, but this lesson allots only a few minutes to it, so let’s get back to the workshop.

For this workshop, starting a VM and setting up your working environment has been done for you. Going forward, reach out to your organization’s system administrators for suggestions about your cluster. To read more on your own, here are lessons about working on the cloud and a local HPC. Additional lessons that you can run from a remote computer:

HUMAN GENOMIC DATA & SECURITY

Note that if you are working with human genomics data there might be ethical and legal considerations that affect your choice of cloud resources. The terms of use, and/or the legislation under which you are handling the genomic data, might impose heightened information security measures for the computing environment in which you intend to process it. This topic is too broad to discuss in detail here, but in general terms you should think through the technical and procedural measures needed to ensure that the confidentiality and integrity of the human data you work with are not breached. If there are laws that govern these issues in the jurisdiction in which you work, be sure that the cloud service provider you use can certify that they support the necessary measures. Also note that there might be restrictions on using cloud service providers that operate in jurisdictions other than your own, arising either from the terms under which the research subjects consented to the use of their data or from the jurisdiction under which you operate. Do consult the legal office of your institution for guidance when processing human genomic data.

Key Points

Many bioinformatics tools can process large data only in their command-line version, not in the GUI.

The shell makes your work less boring (for example, when repeating the same set of tasks on a large number of files).

The shell makes your work less error-prone.

The shell makes your work more reproducible.

Many bioinformatics tasks require large amounts of computing power.

NYGC Sequence Informatics Workshop
