Biopython - Sequence Alignments

Sequence alignment is the process of arranging two or more sequences (DNA, RNA, or protein) in a specific order to identify regions of similarity between them.

Identifying the similar regions lets us infer a lot of information, such as which traits are conserved between species, how closely related different species are genetically, and how species evolve. Biopython provides extensive support for sequence alignment.

Let us learn some of the important features provided by Biopython in this chapter −

Parsing Sequence Alignment

Biopython provides a module, Bio.AlignIO, to read and write sequence alignments. In bioinformatics, there are many formats for storing sequence alignment data, just as there are for the sequence data covered earlier. Bio.AlignIO provides an API similar to Bio.SeqIO, except that Bio.SeqIO works on sequence data and Bio.AlignIO works on sequence alignment data.

Before we begin, let us download a sample sequence alignment file from the Internet.

To download the sample file, follow the below steps −

Step 1 − Open your favorite browser and go to the http://pfam.xfam.org/family/browse website. It shows all the Pfam families in alphabetical order.

Step 2 − Choose any family with a small seed count. It contains minimal data and makes the alignment easy to work with. Here, we have selected PF18225; clicking it opens http://pfam.xfam.org/family/PF18225, which shows complete details about the family, including its sequence alignments.

Step 3 − Go to the alignment section and download the sequence alignment file in Stockholm format (PF18225_seed.txt).

Let us read the downloaded sequence alignment file using Bio.AlignIO as below −

Import the Bio.AlignIO module

>>> from Bio import AlignIO

Read the alignment using the read method. The read method reads the single alignment available in the given file. If the file contains many alignments, we can use the parse method instead. The parse method returns an iterable of alignment objects, similar to the parse method in the Bio.SeqIO module.

>>> alignment = AlignIO.read(open("PF18225_seed.txt"), "stockholm")

Print the alignment object.

>>> print(alignment)
SingleLetterAlphabet() alignment with 6 rows and 65 columns
MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVA...EGP B7RZ31_9GAMM/59-123 
AKQRGIAGLEEWLHRLDHSEAIPIFLIDEAGKDLLEREVPADIT...KKP A0A0C3NPG9_9PROT/58-119 
ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMA...KKP A0A143HL37_9GAMM/57-121 
TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMA...NKP A0A0X3UC67_9GAMM/57-121 
AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIM...NRK B3PFT7_CELJU/62-126 
AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVA...NRT K4KEM7_SIMAS/61-125
>>>

We can also check the sequences (SeqRecord objects) in the alignment, as shown below −

>>> for record in alignment:
...     print(record.seq)
... 
MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVATVANQLRGRKRRAFARHREGP 
AKQRGIAGLEEWLHRLDHSEAIPIFLIDEAGKDLLEREVPADITA---RLDRRREHGEHGVRKKP 
ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMAPMLIALNYRNRESHAQVDKKP 
TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMAPLFKVLSFRNREDQGLVNNKP 
AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIMVLAPRLTAKHPYDKVQDRNRK 
AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVADLMRKLDLDRPFKKLERKNRT 
>>>
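The same reading steps can be tried without downloading anything. The sketch below is a minimal, self-contained illustration: it assumes Biopython is installed, and the two-sequence Stockholm text (seqA, seqB) is made-up data, not part of the Pfam file. It reads the alignment from an in-memory handle and inspects its size, record identifiers, and first column.

```python
from io import StringIO

from Bio import AlignIO

# Hypothetical two-sequence alignment in Stockholm format (not Pfam data).
stockholm_text = "\n".join([
    "# STOCKHOLM 1.0",
    "seqA ACCGGT",
    "seqB AC--GT",
    "//",
]) + "\n"

# AlignIO.read accepts any text handle, so StringIO stands in for a file.
alignment = AlignIO.read(StringIO(stockholm_text), "stockholm")

num_rows = len(alignment)                        # number of sequences
num_columns = alignment.get_alignment_length()   # number of alignment columns
record_ids = [record.id for record in alignment]
first_column = alignment[:, 0]                   # column 0 as a string

print(num_rows, num_columns)
print(record_ids)
print(first_column)
```

Each row of the alignment is a SeqRecord, so attributes such as id and seq are available exactly as they are for records returned by Bio.SeqIO.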

Multiple Alignments

In general, most sequence alignment files contain a single alignment, and the read method is enough to parse them. In a multiple sequence alignment, two or more sequences are compared for the best subsequence matches between them, and the result is stored as one alignment in a file.

If the input file contains more than one sequence alignment, we need to use the parse method instead of the read method, as shown below −

>>> from Bio import AlignIO
>>> alignments = AlignIO.parse(open("PF18225_seed.txt"), "stockholm")
>>> print(alignments)
<generator object parse at 0x000001CD1C7E0360>
>>> for alignment in alignments:
...     print(alignment)
...
SingleLetterAlphabet() alignment with 6 rows and 65 columns
MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVA...EGP B7RZ31_9GAMM/59-123 
AKQRGIAGLEEWLHRLDHSEAIPIFLIDEAGKDLLEREVPADIT...KKP A0A0C3NPG9_9PROT/58-119 
ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMA...KKP A0A143HL37_9GAMM/57-121 
TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMA...NKP A0A0X3UC67_9GAMM/57-121 
AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIM...NRK B3PFT7_CELJU/62-126 
AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVA...NRT K4KEM7_SIMAS/61-125
>>>

Here, the parse method returns an iterable that yields the alignments one at a time as we loop over it.
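To see the difference between read and parse on input that really holds several alignments, here is a small sketch. It assumes Biopython is installed; the Stockholm text with two concatenated alignments is invented for illustration. It collects the shape of each alignment yielded by parse and then shows that read rejects the same input.

```python
from io import StringIO

from Bio import AlignIO

# Two hypothetical alignments concatenated in one Stockholm-format text.
stockholm_text = "\n".join([
    "# STOCKHOLM 1.0",
    "seqA ACCGGT",
    "seqB AC--GT",
    "//",
    "# STOCKHOLM 1.0",
    "seqC TTGA",
    "seqD TTCA",
    "seqE T-GA",
    "//",
]) + "\n"

# parse() yields one alignment at a time; record (rows, columns) for each.
shapes = [
    (len(alignment), alignment.get_alignment_length())
    for alignment in AlignIO.parse(StringIO(stockholm_text), "stockholm")
]
print(shapes)

# read() expects exactly one alignment and raises ValueError otherwise.
try:
    AlignIO.read(StringIO(stockholm_text), "stockholm")
    read_failed = False
except ValueError:
    read_failed = True
print(read_failed)
```

Because parse is lazy, the alignments are only read from the handle as the loop or comprehension consumes them.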

Pairwise Sequence Alignment

Pairwise sequence alignment compares only two sequences at a time and provides the best possible alignments between them. Pairwise alignment is easy to understand, and the resulting alignment is straightforward to interpret.

Biopython provides a special module, Bio.pairwise2, to find alignments using the pairwise method. It uses a dynamic programming algorithm to find the best-scoring alignments, and its results are on par with other software.

Let us write an example that aligns two simple, hypothetical sequences using the pairwise2 module. This will help us understand the concept of sequence alignment and how to program it using Biopython.

Step 1

Import the module pairwise2 with the command given below −

>>> from Bio import pairwise2

Step 2

Create two sequences, seq1 and seq2 −

>>> from Bio.Seq import Seq 
>>> seq1 = Seq("ACCGGT") 
>>> seq2 = Seq("ACGT")

Step 3

Call the pairwise2.align.globalxx method with seq1 and seq2 to find the alignments, using the line of code below −

>>> alignments = pairwise2.align.globalxx(seq1, seq2)

Here, the globalxx method performs the actual work and finds all the best possible alignments of the given sequences. Bio.pairwise2 provides a whole set of such methods, named according to the convention below, to find alignments in different scenarios.

<sequence alignment type>XY

Here, the sequence alignment type is either global or local. A global alignment takes the entire length of both sequences into consideration. A local alignment looks for the best-matching subsequences of the given sequences instead. This is more work but gives a better idea of the similarity between sequences that match only in part.

X refers to the match score. The possible values are x (identical characters score 1, all others score 0), m (match and mismatch scores supplied by the user), d (a user-provided dictionary giving the score for each pair of characters), and c (a user-defined function providing a custom scoring algorithm).

Y refers to the gap penalty. The possible values are x (no gap penalties), s (the same open and extend penalties for both sequences), d (different open and extend penalties for each sequence), and c (a user-defined function providing custom gap penalties).

So, localds is also a valid method. It finds alignments using the local alignment technique, a user-provided dictionary for match scores, and the same user-provided gap penalties for both sequences.

>>> test_alignments = pairwise2.align.localds(seq1, seq2, blosum62, -10, -1)

Here, blosum62 refers to the BLOSUM62 substitution matrix, which provides the match scores. It is not defined by the pairwise2 module and has to be loaded separately (in recent Biopython releases, with Bio.Align.substitution_matrices.load("BLOSUM62")). -10 is the gap open penalty and -1 is the gap extension penalty.
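The naming convention can be made concrete with a few lines of plain Python. The helper below, describe_alignment_function, is hypothetical and not part of Biopython; it only decodes a name such as globalxx or localds into the alignment type and the meaning of its two trailing letters.

```python
# Meanings of the two trailing letters in a pairwise2 function name.
MATCH_CODES = {
    "x": "identical characters score 1, all others 0",
    "m": "user-supplied match and mismatch scores",
    "d": "user-supplied dictionary of scores per character pair",
    "c": "user-supplied scoring function",
}
GAP_CODES = {
    "x": "no gap penalties",
    "s": "same open/extend penalties for both sequences",
    "d": "separate open/extend penalties for each sequence",
    "c": "user-supplied gap penalty function",
}


def describe_alignment_function(name):
    """Hypothetical helper: explain a name such as 'globalxx' or 'localds'."""
    for mode in ("global", "local"):
        if name.startswith(mode) and len(name) == len(mode) + 2:
            match_code, gap_code = name[len(mode):]
            if match_code in MATCH_CODES and gap_code in GAP_CODES:
                return mode, MATCH_CODES[match_code], GAP_CODES[gap_code]
    raise ValueError(f"not a recognised alignment function name: {name!r}")


print(describe_alignment_function("globalxx"))
print(describe_alignment_function("localds"))
```

Reading names this way also explains the extra arguments each method takes: localds, for example, needs the dictionary (d) followed by one open and one extend penalty (s).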

Step 4

Loop over the alignments and print each individual alignment.

>>> for alignment in alignments:
...     print(alignment)
...
('ACCGGT', 'A-C-GT', 4.0, 0, 6)
('ACCGGT', 'AC--GT', 4.0, 0, 6)
('ACCGGT', 'A-CG-T', 4.0, 0, 6)
('ACCGGT', 'AC-G-T', 4.0, 0, 6)
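The score of 4.0 can be checked by hand with a small dynamic programming table. The function below is a plain-Python sketch of Needleman-Wunsch style global scoring with a single per-position gap score. It is an illustration of the idea only, not Biopython's own implementation, and the name global_score and its parameters are chosen here for the example.

```python
def global_score(seq_a, seq_b, match=1.0, mismatch=0.0, gap=0.0):
    """Best global alignment score; `gap` is added once per gap position."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0.0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two characters
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]


# Same scoring as globalxx: match 1, mismatch 0, no gap penalty.
print(global_score("ACCGGT", "ACGT"))  # 4.0
```

With these settings the score is simply the number of matched characters, which is why all four alignments above share the value 4.0. Passing a negative gap value, for example gap=-1.0, lowers the score by one for each of the two gap positions.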

Step 5

The Bio.pairwise2 module provides a formatting method, format_alignment, to better visualize the result −

>>> from Bio.pairwise2 import format_alignment
>>> alignments = pairwise2.align.globalxx(seq1, seq2)
>>> for alignment in alignments:
...     print(format_alignment(*alignment))
...

ACCGGT 
| | || 
A-C-GT 
   Score=4 
   
ACCGGT 
||  ||
AC--GT 
   Score=4 

ACCGGT 
| || | 
A-CG-T 
   Score=4 

ACCGGT 
|| | | 
AC-G-T 
   Score=4

>>>

Biopython also provides another module for sequence alignment, Bio.Align. This module provides a different API that simplifies setting parameters such as the algorithm, mode, match score, and gap penalties. A quick look at its PairwiseAligner object is as follows −

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> print(aligner)
Pairwise sequence aligner with parameters
   match score: 1.000000
   mismatch score: 0.000000
   target open gap score: 0.000000
   target extend gap score: 0.000000
   target left open gap score: 0.000000
   target left extend gap score: 0.000000
   target right open gap score: 0.000000
   target right extend gap score: 0.000000
   query open gap score: 0.000000
   query extend gap score: 0.000000
   query left open gap score: 0.000000
   query left extend gap score: 0.000000
   query right open gap score: 0.000000
   query right extend gap score: 0.000000
   mode: global
>>>
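As a sketch of how this object is used, the example below aligns the same two hypothetical sequences as before. It assumes a Biopython release that includes Bio.Align.PairwiseAligner, and it sets every scoring parameter explicitly rather than relying on the defaults printed above, which may differ between releases. Recent Biopython releases mark Bio.pairwise2 as deprecated in favor of this class.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
# Same scoring as globalxx: match 1, mismatch 0, no gap penalties.
aligner.match_score = 1.0
aligner.mismatch_score = 0.0
aligner.open_gap_score = 0.0
aligner.extend_gap_score = 0.0

# score() computes only the best score, without building alignments.
score = aligner.score("ACCGGT", "ACGT")
print(score)

# align() returns the optimal alignments; take the first one.
alignments = aligner.align("ACCGGT", "ACGT")
best = alignments[0]
print(best.score)
print(best)
```

Switching to local alignment only requires setting aligner.mode = "local"; the scoring attributes stay the same.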

Support for Sequence Alignment Tools

Biopython provides an interface to many sequence alignment tools through the Bio.Align.Applications module. Some of the tools are listed below −

ClustalW

MUSCLE

EMBOSS needle and water

Let us write a simple example in Biopython that creates a sequence alignment with the most popular alignment tool, ClustalW.

Step 1 − Download the ClustalW program and install it. Also, update the system PATH with the “clustal” installation path.

Step 2 − Import ClustalwCommandline from the Bio.Align.Applications module.

>>> from Bio.Align.Applications import ClustalwCommandline

Step 3 − Set cmd by calling ClustalwCommandline with the input file, opuntia.fasta, available in the Biopython package.

>>> cmd = ClustalwCommandline("clustalw2",
... infile="/path/to/biopython/sample/opuntia.fasta")
>>> print(cmd)
clustalw2 -infile=/path/to/biopython/sample/opuntia.fasta

Step 4 − Calling cmd() runs the clustalw command and writes the resulting alignment file, opuntia.aln.

>>> stdout, stderr = cmd()

Step 5 − Read and print the alignment file as below −

>>> from Bio import AlignIO
>>> align = AlignIO.read("/path/to/biopython/sample/opuntia.aln", "clustal")
>>> print(align)
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191
>>>
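Recent Biopython releases deprecate the command-line wrappers in Bio.Align.Applications and suggest calling the tools through the standard subprocess module instead. The sketch below shows one way to do that for the same command. The helper names build_clustalw_command and run_clustalw are hypothetical, the executable name clustalw2 and the -infile= option are taken from the command printed in Step 3, and the run is skipped when the executable is not on the PATH.

```python
import shutil
import subprocess


def build_clustalw_command(infile, executable="clustalw2"):
    """Return the argument list for a ClustalW run on `infile`."""
    return [executable, f"-infile={infile}"]


def run_clustalw(infile, executable="clustalw2"):
    """Run ClustalW and return its stdout, or None if it is not installed."""
    if shutil.which(executable) is None:
        return None
    completed = subprocess.run(
        build_clustalw_command(infile, executable),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout


command = build_clustalw_command("opuntia.fasta")
print(" ".join(command))
```

After a successful run, the resulting opuntia.aln file can be read with AlignIO.read and the "clustal" format, exactly as in Step 5.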