3. Searching files with GREP

Searching files is one of the most common things you’ll be doing on the command line. Is this word contained in this document? Does this gene appear in this smetagenome? How many sequences are there in this fasta file? How often does this annotation appear in this fasta file?

The main tool used to search documents is grep. grep stands for **g**lobally search for a **r**egular **e**xpression and **p**rint matching lines.

3.1. Basic grep searches

Let’s start by changing into the my_work directory and searching through some files there.

cd ~/my_work

Let’s look for the word “favorite” in the “favorites” document. The syntax for grep is grep [optional-arguments] <pattern> <file>. We’ll just start with a basic search:

grep "favorite" favorites

Output:

Iced coffee is my favorite drink.
My favorite movie is Speed Racer.
My favorite food is a burrito.

Not that exciting since every line contains the word favorite. Instead, let’s try searching for our favorite drink:

grep "favorite drink" favorites

Output:

Iced coffee is my favorite drink.

We can also output the line number that the match appears on with the -n option:

grep -n "favorite drink" favorites

Output:

1:Iced coffee is my favorite drink.

If we instead wanted to look for all the lines that DID NOT contain the word “drink,” we can pass the -v option to grep to invert the search. That is, it will return all the lines that don’t contain the search phrase:

grep -v "favorite drink" favorites

Output:

My favorite movie is Speed Racer.
My favorite food is a burrito.

We can also combine the two arguments:

grep -nv "favorite drink" favorites

Output:

2:My favorite movie is Speed Racer.
3:My favorite food is a burrito.

Another very useful option is -c. This counts the number of matches in the file. For instance, to count the number of times the word “favorite” appears in favorites, we can do:

grep -c "favorite" favorites

Output:

3

We can combine c and v to count the number of lines a match does NOT appear on. So, to count the number of lines “drink” doesn’t appear on in favorites, we can do:

grep -cv "drink" favorites
2

3.1.1. Exercises

  • What line number does “meta” appear on in combined_sequences.fasta?

  • What line number does “test” first appear on in all_seqs.combined.fa?

3.2. Using grep with fasta files

One of the advantages of working with fasta (and fastq) files is that they have a known structure: lines that start with > denote sequence headers, and the lines after that are the nucleotides that make up that sequence. Because we know that sequences start with >, we can search for the sequence headers directly with grep. Let’s look for the sequences in combined_sequences.fasta.

grep ">" combined_sequences.fasta

Output:

>meta1_1
>test_sequence_1
>test_sequence_14

So we see that we found the sequence headers in the combined_sequences.fasta file. If all we were interested in was the number of sequences contained in the file, we could use the -c flag from above:

grep -c ">" combined_sequences.fasta

Output:

3

Something to watch out for, though, is that sometimes people might use the > character elsewhere in the fasta header. In that case, the grep -c count will be inflated because it is counting the extra characters. A solution to this is to use the special ^ character at the beginning of your grep search pattern. The ^ character is referred to as the “anchor” and indicates that grep should only look at the beginning of each line for the pattern. When applied to our above grep search, this means that we will only look for the > character at the beginning of a line:

grep -c ^">" combined_sequences.fasta

Output:

3

This is a very useful search pattern and one you’ll use often to count the number of sequences in a fasta file. Note that it didn’t make a difference here, but there have been times where it has.

An additional very useful argument to use with grep when searching fasta files is the -A n flag. -A stands for “after context” and prints out the next n lines after each match.

Let’s try it out. Let’s look for “drink” again in favorites, but this time, let’s pull the line after it as well:

grep -A 1 "drink" favorites

Output:

Iced coffee is my favorite drink.
My favorite movie is Speed Racer.

This is very useful for working with fasta files because we often want to search for certain sequences by their header, pull out their nucleotide sequences, and store them somewhere else. Here, we’re going to search for the word “meta” in all_seqs.combined.fa and output the line following:

grep -A 1 "meta" all_seqs.combined.fa

Output:

>meta1_1
GGGCCGTGGCGTCATCACGTCGATCAATTTCCTCGAGGAGAACGGCGCGTATGACGGCGTGGACTATGTGTCGTACGACGTGCTCGGCGACGTGGTGTGC
>meta1_2
GAGCCGGGTGTGGGATGTGCCGGGAGAGGTATCATTACTTCTATTAATTTATTGGAACAGCTTGGTGCCTATGCGGATAGCGAGGAGCTGGATTATATAT
>meta1_3
CCTGATCTGCAACGAACGTCAAACCGACCGTGAGCTCGACCTTGCTGAAGCCCTGGCTTCCAGGCTCAATTCCAAGCTCATCCATTTCGTGCCGCGCGAC
>meta1_4
TTCCATCAACATGCTGGAAAACCTCGGTGCCTACGAAGAATCCGAAGGACTGGACTACGCTTTCTATGACGTCCTCGGTGACGTTGTTTGCGGCGGATTC
>meta1_5
GATGGCGCTCTATGCCGCCAACAACATCGCCAAGGGCATCCTGAAATATGCCCATTCGGGCGGGGTGCGGCTTGGGGGCCTGATCTGCAACGAACGTCAA
>meta1_6
CCTCGACTTCGTTTTCTACGACGTTCTGGGCGACGTCGTGTGTGGCGGATTCGCTATGCCGATTCGTGAAAACAAGGCGCAGGAAATCTACATTGTTGTA
>meta1_7
TGGGATACGGCGAAGTCAAGTGCGTCGAGTCTGGTGGTCCTGAGCCCGGTGTCGGATGTGCTGGTCGTGGCGTGATCACCGCGATCAACTTCCTCGAAGA
>meta1_8
TGGAGGATGTGCGCAAGGAGGGGTACGGAGGCACCATGTGCACGGAATCGGGAGGCCCTGAACCGGGAGTGGGTTGCGCGGGGCGCGGCATCATCACTTC
>meta1_9
TGGAACTGGCGGAGGCGCTGGCCAAGAAGTTGGGCACTCAGCTGATCTACTTTGTGCCGCGCGACAATGTGGTGCAGCATGCCGAGCTGCGGCGCATGAC
>meta1_10
AAATCGGCGTTTTTTCAACTGCTTAATTACAACCAATAGATAGGAGAACCCGATTATGACTAAAAAACTGAGACAAATCGCATTTTACGGCAAAGGAGGA

And just like before, we can use the > to store these sequences in a new file called meta_sequences.fa:

grep -A 1 "meta" all_seqs.combined.fa > meta_sequences.fa

3.2.1. Exercise

  • Count the number of sequences in all_seqs.combined.fa.

  • Search all_seqs.combined.fa for all sequences whose number starts with 2. How many were there? Save them to a file named seqs_2.fasta.

  • We used the -A 1 flag earlier to pull the sequence after each match. What assumption are we making about the structure of the file when we do this?

Note

This is a lot to take in and use right away, but there are a ton of resources to help you remember and work things out. There are also the man pages that you can access in the terminal by typing man grep (or whatever other command you have). This help text will show you the usage of the command and the different arguments. And of course there is Google. You’ll almost certainly be able to find an answer to your question online. However, it’s good practice to try and use man to figure out yourself so build up your “problem solving” muscle.