4. Introduction of BLAST¶

4.1. Authored by: Jia Liu, Paul Villanueva, and Adina Howe; MODIFIED from a previous tutorial ¶

5. Section 1. Learning objectives¶

This is a tutorial for working with sequencing data and running a program in shell.

5.1. Objectives¶

The learning objectives for this tutorial are as follows:

To be able to navigate and read a sequencing file in shell.
To be able to download, install, and run a command line bioinformatics tool.
To be able to search a BLAST output file.

5.2. Pre-requisites¶

You will need to know some things prior to this tutorial:

Access and login to an Amazon EC2 instance or similar ubuntu-based server
A brief introduction to shell

5.3. Mission¶

Your mission, if you choose to accept it, is to identify nitrogen fixation genes (nifh genes) from sequencing DNA from soils.

Details:: You have been delivered three dogma-changing metagenomes (sequencing datasets) originating from three different Iowa crop soils (corn, soybean, and prairie). You are interested in identifying nitrogen fixation genes that are associated with native bacteria in these soils. Nitrogen fixation is a natural process performed by bacteria that converts nitrogen in the atmosphere into a form that is usable for plants. If we can optimize natural nitrogen fixation, our hope is to reduce nitrogen fertilizer inputs that may contribute to the eutrophication of downstream waters (e.g., dead zones in the Gulf of Mexico).
Methods:: Use BLAST (basic local alignment search tool) to compare sequences from soil metagenome files with a set of know nifh gene sequences.

6. Section 2. Set up BLAST¶

6.1. What is BLAST?¶

From Wikipedia:

“BLAST (basic local alignment search tool) is an algorithm and program for aligning and comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences.

A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query) with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold.”

6.2. Preparation¶

We need to download and install the BLAST package before using it. I am going to go back to the home directory, make a new folder tools, and set up BLAST there:

cd ~
mkdir tools
ls    # you should be able to see your tools folder

We will navigate to the tools folder:

cd tools

6.3. Download BLAST¶

Let’s download the compressed BLAST package after navigating to the folder you prefered by:

wget https://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi-blast-2.11.0+-x64-linux.tar.gz

Find more BLAST package download and install information from here.

6.4. Install BLAST¶

We can install BLAST by extracting the downloaded compressed package with tar command:

tar -xvf ncbi-blast-2.11.0+-x64-linux.tar.gz    # extract from the compressed package

Note

Command syntax: utility/command flag argument

Utility/Command: The command you run to do your work, e.g., tar
Flag: Options or preferences for commands; commands may use default settings if flags are not defined; Flag normall starts with one or two dashes - which depends on the command, e.g., -xvf
Argument: Arguments tell the command exactly what you want for a specific action. E.g., ncbi-blast-2.11.0+-x64-linux.tar.gz

Yay, we have a brand new BLAST installed! Let’s try it out:

blastn -help

Oops… Command 'blastn' not found... But we just did install BLAST! It seems like something is going wrong. Let’s see if we can find the path/location of blastn with which command:

which blastn

No path found. I already checked the ncbi-blast-2.11.0+ folder and know where blastn is stored. Let’s try to call “help” for blastn with its path:

~/tools/ncbi-blast-2.11.0+/bin/blastn -help

A lot of helpful information is showing. It worked this time, but we need to use the whole path of blastn. Hey, we don’t need to use the whole path for grep, or ls, or pwd. Why only BLAST?

There is an environment variable PATH. If you add the path of executable code files to PATH, the system will automatically know where to find this executable script when you run it. So that we can run these scripts on the command line in other directories without knowing and typing the whole path to them. Let’s see if bash knows where to find grep:

which grep

It does! Because the path of grep is already in PATH variable.

We want to add the path of blastn to the environment PATH. This will allow us to use blastn on the command line in other directories without knowing and typing the whole path to blastn.

Let’s navigate to the actual executable code scripts folder of BLAST, and add the path of these scripts to PATH:

# navigate to the bin folder under BLAST
cd ncbi-blast-2.11.0+/bin
ls

pwd
export PATH=$PATH:/home/ubuntu/tools/ncbi-blast-2.11.0+/bin     # Add the path of the tool to environment PATH
echo $PATH

Awesome! Now we have our BLAST tool set up. We will start the process of sequence analysis with BLAST in the next section.

Before moving on, these are some basic introduction about BLAST commands:

BLAST command	query	database
blastn	nucleotide	nucleotide
blastp	protein	protein
blastx	translated nucleotide	protein
tblastn	protein	translated nucleotide

7. Section 3. BLAST Analysis¶

7.1. Overview¶

7.1.1. Run BLAST on ncbi website¶

Suppose after days of hard working, you generated the nucleotide sequence of a gene that you are interested:

>gene_1
AAGTCCACCACCTCCCAAAATACGCTCGCCGCGCTGGTCGACCTCGGCCAGAGAATTCTCATCGTCGGCT
GCGACCCCAAAGCCGACTCCACCCGCCTGATCCTGAACTCGAAGGCTCAGGATACTGTCCTGCATCTGGC
GGCACAGGAAGGTTCGGTGGAAGATCTCGAGCTGCAGGACGTGCTCAAGATTGGCTACAGAGGCATCAAA
TGTGTGGAGTCCGGCGGCCCGGAGCCGGGTGTTGGCTGCGCCGGCCGAGGCGTCATCACATCGATCAACT
TCCTCGAGGAGAACGGCGCATATGACGATGTCGACTATGTCTCCTACGACGTGCTGGGTGACGTGGTTTG
CGGCGGTTTCGCGATGCCGATCCGCGAGGGCAAGGCGCAGGAAATCTACATCGTTATGTCCGGGGAGATG
ATGGCGCTCTATGCCGCCAACAATATCGCCAAGGGCATCCTGAAATATGCCCATTCGGGCGGGGTGCGGC
TTGGGGGCCTGATCTGCAACGAACGTCAAACTGACCGTGAGCTCGACCTTGCTGAAGCCCTGGCTTCCAG
GCTCAATTCCAAGCTCATCCATTTCGTGCCGCGCGACAACATCGTTCAGCACGCCGAACTCAGGAAGATG
TCAGTTATCCAGTATGCCCCGGATTCCAAGCAGGCCGGGGAGTACCGCGCACTGGCCGAGAAGATCCATG
CCAATTCTGGCCAGGGCACCATCCCGACGCCGATCACCATGGACG

You would like to know what gene it is, which organism it is from… You can easily get all these information through a BLAST on the ncbi website. Let’s copy and paste this gene, and try the online BLAST.

It takes a little while to BLAST one gene against ncbi nt database. nt is a big nucleotide collection consists of annotated nucleotide sequences from multiple sources. But maybe you are interested in searching a big amount of gene sequences from a specific database that you defined. It may be trivial to do this search on the BLAST web browser then. That’s when the command line BLAST may become more helpful. And today we will go through this process together.

7.1.2. Command line BLAST¶

Navigate to the data folder that we will be working in today:

cd ~/blast_crash_course/materials/blast-tutorial/data/
ls
ls -R # list recursively directory tree

The strucure of this folder is as below:

We will go though BLAST by doing two projects today:

Project 1. Understand BLAST with small datasets

Align single_gene_mylab.fa to nifh-ref.fa

Project 2. Apply BLAST on real datasets

Align corn.fa, prairie.fa, and soybean.fa to nifh-ref.fa separately

7.2. Project 1. Understand BLAST with small datasets¶

Navigate to small_blast directory by:

cd /home/ubuntu/blast_crash_course/materials/blast-tutorial/data/small_blast

7.2.1. Task 1.1. Explore the data¶

What are in this folder?

Let’s see what we have in this folder:

ls
ls *    # List all subdirectories

gene_db/: a folder contains a nucleotide sequence file of 50 nifh genes you collected from literatures nifh-ref.fa — database
query_gene/: a folder contains a nucleotide sequence file of the gene you generated from the lab single_gene_mylab.fa — query
Objective: Compare and align sequences from single_gene_mylab.fa to nifh-ref.fa database

How does sequence file look like?

Let’s look at the query file and the gene file:

head gene_db/nifh-ref.fa
head query_gene/single_gene_mylab.fa

fasta: One of the most common sequence file types. Fasta files may have extension as .fasta or .fna or .fa

Learn more about sequencing file format from here.

How many sequences are there in the fasta file?

we know that each sequence name will start with a special character, >. So we can

find lines that start with >
count the total number of these lines

Remember we can find specific characters in a file with grep? For example, to find all instances of CACGCTGGCGGCGCTGGCCGAGTTGGGTCACAAGATCCTGATTGTGG in the gene_db/nifh-ref.fa file, we could:

grep "CACGCTGGCGGCGCTGGCCGAGTTGGGTCACAAGATCCTGATTGTGG" gene_db/nifh-ref.fa

^> matches line which starts with >, the charter(s) after ^. To get the total number of sequences in a fasta file, we are going to find the lines that start with > pattern with grep "^>", and pipe (|) these lines to the next command wc -l for counting:

grep "^>" gene_db/nifh-ref.fa | wc -l

7.2.2. Task 1.2. Make database¶

Making a database for genes is a smart way to index the gene sequences so that it can search and compare metagenome sequences (query) with gene sequences much faster.

To make a database of the 50 nifh gene sequences:

# Usage: makeblastdb -in <nifh-db> -dbtype nucl -out <nifh-db>

cd /home/ubuntu/blast_crash_course/materials/blast-tutorial/data/small_blast
makeblastdb -in gene_db/nifh-ref.fa -dbtype nucl -out gene_db/nifh-ref.fa

# the database looks like:
ls gene_db/

7.2.3. Task 1.3. BLAST¶

Now we can BLAST the one gene we got from the lab to nifh gene database. I would suggest something like the following variables, note that the -outfmt 6 flag identifies a tabular output:

# Usage: blastn -query <metag file> -db <db file> -out <name of output> -outfmt 6

cd /home/ubuntu/blast_crash_course/materials/blast-tutorial/data/small_blast

blastn -query query_gene/single_gene_mylab.fa -db gene_db/nifh-ref.fa -out query_gene/single_gene_mylab.fa-blast-output.txt -outfmt 6

We can have a good understanding of -outfmt 6 format from this webpage.

You just did your first BLAST in shell, congratulations! Let’s warm up with some exercises and then we will work with some real sequencing data.

7.2.4. Exercise 1.1¶

Navigate to the folder /home/ubuntu/blast_crash_course/materials/blast-tutorial/data/big_blast for Project 2 first, then list files and directories in this folder:

cd /home/ubuntu/blast_crash_course/materials/blast-tutorial/data/big_blast
ls
ls *    # List all subdirectories

Exercise:

1.1.1. Show the last 40 lines of nifh-ref.fa in nifh-db folder. (Hint: head command reads the file from the beginning and tail command reads the file from the ending)

1.1.2. Show the first 8 lines of corn.fa in metags folder

1.1.3. Get the total number of nitrogen fixation gene sequences in nifh-ref.fa file

1.1.4. Get the total number of sequences in corn.fa, prairie.fa, and soybean.fa from metags folder separately

Solution:

1.1.1. Command:

tail -40 nifh_db/nifh-ref.fa

1.1.2. Command:

head -8 metags/corn.fa

1.1.3. There are 50 nifh gene seqeuces in nifh-db/nifh-ref.fa. Command:

grep "^>" nifh_db/nifh-ref.fa | wc -l

1.1.4. There are 8948 sequences in metags/corn.fa. Command:

grep "^>" metags/corn.fa | wc -l
grep "^>" metags/prairie.fa | wc -l
grep "^>" metags/soybean.fa | wc -l

Bonus: We can efficiently get the total number of sequences for every fasta files in metags/ folder using a for loop. This can be helpful when there are a large number of metagenome fasta files in the folder:

for x in metags/*.fa; do echo $x; grep "^>" $x | wc -l; done

7.3. Project 2. Apply BLAST on real datasets¶

Instead of just one gene as shown in Project 1, we may need to compare a large amount of sequences with a database in reality. In this project, we will try to identify the existence of 50 nifh genes in three different soil metagenomes. To do that, we will align sequences from metagenome against the nifh gene database.

You should already be in the folder of Project 2 if you just finished “Exercise 1.1”. If not, navigate to big_blast directory by:

cd /home/ubuntu/blast_crash_course/materials/blast-tutorial/data/big_blast

7.3.1. Task 2.1. Explore the data¶

What are in this folder?

Let’s see what we have in this folder:

ls
ls *    # List all subdirectories

nifh_db/: a folder contains nucleotide sequence file for nitrogen fixation genes nifh-ref.fa — database
metags/: a folder contains metagenome files from three different Iowa crop soils (corn.fa, soybean.fa, and prairie.fa) — query
Objective: Identify the existence of nitrogen fixation genes in soil metagenome files

In reality, database is not limited to functional genes of bacteria that we are using here. It can be a database of nucleotide or amino acid sequences that you would like to compare your query sequences with, for example: reference genome. At NCBI, nr is a database of protein sequences and nt is a nucleotide database. You can find more information on how to download these databases from here.

The same with query: query is not limited to metagenomes as shown in this tutorial. It can be any nucleotide or amino acid sequences, such as gene sequences generated from your lab.

How many sequences are there in the fasta file?

You already have some information about data in this folder after “Exercise 1.1”:

There are 50 nifh gene seqeuces in nifh-db/nifh-ref.fa
There are 8948, 9297, 8963 sequences in metags/corn.fa, metags/prairie.fa, and metags/soybean.fa separately

7.3.2. Task 2.2. Make database¶

We want to find the existence of nifh genes in soil metagenome samples. Making a database for nifh genes is a smart way to index the gene sequences so that it can search and compare metagenome sequences (query) with nifh gene sequences much faster.

To make a database of nifh gene sequences:

# Usage: makeblastdb -in <nifh-db> -dbtype nucl -out <nifh-db>

cd /home/ubuntu/blast_crash_course/materials/blast-tutorial/data/big_blast
makeblastdb -in nifh_db/nifh-ref.fa -dbtype nucl -out nifh_db/nifh-ref.fa

# the database looks like:
ls nifh_db/

7.3.3. Task 2.3. BLAST¶

Now we can BLAST the metagenome files to nifh gene database. I would suggest something like the following variables, note that the -outfmt 6 flag identifies a tabular output:

# Usage: blastn -query <metag file> -db <db file> -out <name of output> -outfmt 6

cd /home/ubuntu/blast_crash_course/materials/blast-tutorial/data/big_blast

blastn -query metags/corn.fa -db nifh_db/nifh-ref.fa -out metags/corn.fa-blast-output.txt -outfmt 6

We just BLAST metags/corn.fa to nifh_db/nifh-ref.fa. Now let’s try to BLAST the other two metagenome files to nifh gene database:

blastn -query metags/prairie.fa -db nifh_db/nifh-ref.fa -out metags/prairie.fa-blast-output.txt -outfmt 6

blastn -query metags/soybean.fa -db nifh_db/nifh-ref.fa -out metags/soybean.fa-blast-output.txt -outfmt 6

Bonus: This can be done with a for loop:

for x in metags/*.fa; do blastn -query $x -db nifh_db/nifh-ref.fa -out $x-blast-output.txt -outfmt 6; done

7.3.4. Task 2.4. Play around BLAST output¶

Let’s first navigate to the folder with BLAST output:

cd metags
ls

We may be interested in:

How does the BLAST output file look like:
```
head corn.fa-blast-output.txt
```

How many times in total did sequences from corn metagenomes hit nifh genes:

# how many lines are there in corn blast output file:
wc -l corn.fa-blast-output.txt

How many uniq nifh genes were hit by sequences from soybean metagenome:

cut -f 2 soybean.fa-blast-output.txt | sort | uniq | wc -l
cut -f 2 soybean.fa-blast-output.txt | sort | uniq -c | sort

How many uniq sequences from prairie metagenome can hit nifh genes:

cut -f 1 prairie.fa-blast-output.txt | sort | uniq | wc -l

How to get the sseqid, sstart, and send column (column 2, 9, 10) from the soybean blast output file:
```
cut -f 2,9,10 soybean.fa-blast-output.txt
```
How many times was gene gi|909637271|emb|LN713523.1| hit by sequences from corn metagenome sample:
```
grep "gi|909637271|emb|LN713523.1|" corn.fa-blast-output.txt | wc -l
```

8. Conclusion¶

You now have the foundation for having some sequencing data that you need to compare to any database. You should be able to do some basic search to BLAST output file. Note, that you can do this for specific genes and also genomes…! Now, go forth and conquer!

Acknowledgement:

This work was supported by the National Science Foundation Directorate of Biological Sciences under awards DEB 1737758 and DEB 1737765.