Biopython - Sequence Alignments
Sequence alignment is the process of arranging two or more sequences (DNA, RNA or protein) in a specific order to identify the regions of similarity between them.
Identifying the similar regions enables us to infer a lot of information, such as which traits are conserved between species, how genetically close different species are, how species evolve, etc. Biopython provides extensive support for sequence alignment.
Let us learn some of the important features provided by Biopython in this chapter −
Parsing Sequence Alignment
Biopython provides a module, Bio.AlignIO, to read and write sequence alignments. In bioinformatics, there are many formats available for specifying sequence alignment data, just as there are for the sequence data learned earlier. Bio.AlignIO provides an API similar to Bio.SeqIO, except that Bio.SeqIO works on sequence data and Bio.AlignIO works on sequence alignment data.
Before starting to learn, let us download a sample sequence alignment file from the Internet.
To download the sample file, follow the below steps −
Step 1 − Open your favorite browser and go to the Pfam website. It will show all the Pfam families in alphabetical order.
Step 2 − Choose any one family with a small number of seed sequences. It contains minimal data and lets us work easily with the alignment. Here, we have selected PF18225; clicking it opens the family page, which shows complete details about it, including sequence alignments.
Step 3 − Go to the alignment section and download the sequence alignment file in Stockholm format (PF18225_seed.txt).
Let us try to read the downloaded sequence alignment file using Bio.AlignIO as below −
Import Bio.AlignIO module
>>> from Bio import AlignIO
Read the alignment using the read method. The read method reads the single alignment available in the given file. If the given file contains many alignments, we can use the parse method instead. The parse method returns an iterable of alignment objects, similar to the parse method in the Bio.SeqIO module.
>>> alignment = AlignIO.read(open("PF18225_seed.txt"), "stockholm")
Print the alignment object.
>>> print(alignment)
SingleLetterAlphabet() alignment with 6 rows and 65 columns
MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVA...EGP B7RZ31_9GAMM/59-123
AKQRGIAGLEEWLHRLDHSEAIPIFLIDEAGKDLLEREVPADIT...KKP A0A0C3NPG9_9PROT/58-119
ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMA...KKP A0A143HL37_9GAMM/57-121
TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMA...NKP A0A0X3UC67_9GAMM/57-121
AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIM...NRK B3PFT7_CELJU/62-126
AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVA...NRT K4KEM7_SIMAS/61-125
>>>
We can also check the sequences (SeqRecord) available in the alignment, as below −
>>> for align in alignment:
...     print(align.seq)
...
MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVATVANQLRGRKRRAFARHREGP
AKQRGIAGLEEWLHRLDHSEAIPIFLIDEAGKDLLEREVPADITA---RLDRRREHGEHGVRKKP
ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMAPMLIALNYRNRESHAQVDKKP
TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMAPLFKVLSFRNREDQGLVNNKP
AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIMVLAPRLTAKHPYDKVQDRNRK
AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVADLMRKLDLDRPFKKLERKNRT
>>>
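To try the read method without downloading anything, the following minimal sketch parses a tiny Stockholm alignment held in a string. The alignment text and the names seqA and seqB are made up for illustration and are not Pfam data; the sketch assumes Biopython is installed.

```python
from io import StringIO

from Bio import AlignIO

# Hypothetical two-row alignment in Stockholm format (illustrative, not Pfam data).
STOCKHOLM_TEXT = """# STOCKHOLM 1.0
seqA  ACGT-ACGT
seqB  ACGTTACGT
//
"""

# read() expects exactly one alignment in the handle.
alignment = AlignIO.read(StringIO(STOCKHOLM_TEXT), "stockholm")

row_count = len(alignment)                       # number of sequences (rows)
column_count = alignment.get_alignment_length()  # number of columns
row_ids = [record.id for record in alignment]    # each row is a SeqRecord

print(row_count, column_count, row_ids)
```

Each row behaves like an ordinary SeqRecord, so attributes such as id and seq are available exactly as they are for records read with Bio.SeqIO.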
Multiple Alignments
In general, most sequence alignment files contain a single alignment, and the read method is enough to parse them. In a multiple sequence alignment, two or more sequences are compared for the best subsequence matches between them, and a single file may hold more than one such alignment.
If the input file contains more than one sequence alignment, we need to use the parse method instead of the read method, as shown below −
>>> from Bio import AlignIO
>>> alignments = AlignIO.parse(open("PF18225_seed.txt"), "stockholm")
>>> print(alignments)
<generator object parse at 0x000001CD1C7E0360>
>>> for alignment in alignments:
...     print(alignment)
...
SingleLetterAlphabet() alignment with 6 rows and 65 columns
MQNTPAERLPAIIEKAKSKHDINVWLLDRQGRDLLEQRVPAKVA...EGP B7RZ31_9GAMM/59-123
AKQRGIAGLEEWLHRLDHSEAIPIFLIDEAGKDLLEREVPADIT...KKP A0A0C3NPG9_9PROT/58-119
ARRHGQEYFQQWLERQPKKVKEQVFAVDQFGRELLGRPLPEDMA...KKP A0A143HL37_9GAMM/57-121
TRRHGPESFRFWLERQPVEARDRIYAIDRSGAEILDRPIPRGMA...NKP A0A0X3UC67_9GAMM/57-121
AINRNTQQLTQDLRAMPNWSLRFVYIVDRNNQDLLKRPLPPGIM...NRK B3PFT7_CELJU/62-126
AVNATEREFTERIRTLPHWARRNVFVLDSQGFEIFDRELPSPVA...NRT K4KEM7_SIMAS/61-125
>>>
Here, the parse method returns an iterable of alignment objects, which can be iterated over to get the actual alignments.
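The difference between read and parse is easiest to see on input that really holds several alignments. The sketch below uses a made-up string with two Stockholm records (the sequence names are hypothetical) and assumes Biopython is installed.

```python
from io import StringIO

from Bio import AlignIO

# Two hypothetical alignments stored back to back in Stockholm format.
TWO_ALIGNMENTS = """# STOCKHOLM 1.0
seqA  ACGT-A
seqB  ACGTTA
//
# STOCKHOLM 1.0
seqC  GG-CC
seqD  GGTCC
seqE  GGTC-
//
"""

# parse() yields one alignment object per record in the input.
alignments = list(AlignIO.parse(StringIO(TWO_ALIGNMENTS), "stockholm"))
shapes = [(len(a), a.get_alignment_length()) for a in alignments]
print(shapes)

# read() insists on exactly one alignment and raises ValueError otherwise.
try:
    AlignIO.read(StringIO(TWO_ALIGNMENTS), "stockholm")
    read_failed = False
except ValueError:
    read_failed = True
print(read_failed)
```

Because parse returns a generator, wrapping it in list() is only sensible for small inputs; for large files, iterate over the generator directly.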
Pairwise Sequence Alignment
Pairwise sequence alignment compares only two sequences at a time and provides the best possible alignments between them. A pairwise alignment is easy to understand, and the resulting alignment is easy to interpret.
Biopython provides a special module, Bio.pairwise2, to identify alignments using the pairwise method. It uses a dynamic programming algorithm to find the alignments, and its results are on par with other software.
Let us write an example to find the sequence alignment of two simple and hypothetical sequences using the pairwise2 module. This will help us understand the concept of sequence alignment and how to program it using Biopython.
Step 1
Import the module pairwise2 with the command given below −
>>> from Bio import pairwise2
Step 2
Create two sequences, seq1 and seq2 −
>>> from Bio.Seq import Seq
>>> seq1 = Seq("ACCGGT")
>>> seq2 = Seq("ACGT")
Step 3
Call the method pairwise2.align.globalxx with seq1 and seq2 to find the alignments, using the line of code below −
>>> alignments = pairwise2.align.globalxx(seq1, seq2)
Here, the globalxx method performs the actual work and finds all the best possible alignments of the given sequences. Bio.pairwise2 provides a whole set of methods whose names follow the convention below, covering different alignment scenarios.
<sequence alignment type>XY
Here, the sequence alignment type is either global or local. A global alignment takes the entire length of both sequences into consideration. A local alignment looks only for the best-matching subsequences of the given sequences, which can give a better idea of the similarity when only parts of the sequences are related.
X refers to the match score. The possible values are x (no parameters; identical characters score 1, all others 0), m (a match score for identical characters and a mismatch score otherwise), d (a user-provided dictionary giving the score for each pair of characters) and finally c (a user-defined function providing a custom scoring algorithm).
Y refers to the gap penalty. The possible values are x (no gap penalties), s (the same open and extend gap penalties for both sequences), d (different open and extend gap penalties for each sequence) and finally c (a user-defined function providing custom gap penalties).
So, localds is also a valid method, which finds the sequence alignment using the local alignment technique, a user-provided dictionary for match scores and user-provided gap open and extension penalties shared by both sequences.
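>>> test_alignments = pairwise2.align.localds(seq1, seq2, blosum62, -10, -1)
Here, blosum62 refers to a substitution matrix that provides the match scores. It is not defined by the pairwise2 module itself and must be loaded beforehand, for example with Bio.Align.substitution_matrices.load("BLOSUM62") in recent Biopython releases. -10 refers to the gap open penalty and -1 refers to the gap extension penalty.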
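A self-contained version of the localds call might look like the sketch below. It assumes a Biopython release that provides Bio.Align.substitution_matrices and still ships Bio.pairwise2 (newer releases flag pairwise2 as deprecated in favor of Bio.Align.PairwiseAligner). Plain strings are used instead of Seq objects for brevity.

```python
from Bio import pairwise2
from Bio.Align import substitution_matrices

# Load the BLOSUM62 substitution matrix that supplies the match scores.
blosum62 = substitution_matrices.load("BLOSUM62")

seq1 = "ACCGGT"
seq2 = "ACGT"

# local alignment, d = scores from a matrix, s = same gap penalties for both
# sequences: -10 to open a gap, -1 to extend it.
test_alignments = pairwise2.align.localds(seq1, seq2, blosum62, -10, -1)

# Each result is a tuple: (aligned seq1, aligned seq2, score, begin, end).
best_score = test_alignments[0][2]
print(len(test_alignments), best_score)
```

BLOSUM62 is a protein matrix; it works here only because A, C, G and T are also valid amino acid letters, so the scores are not biologically meaningful for these DNA-like strings.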
Step 4
Loop over the iterable alignments object, get each individual alignment object and print it.
>>> for alignment in alignments:
...     print(alignment)
...
('ACCGGT', 'A-C-GT', 4.0, 0, 6)
('ACCGGT', 'AC--GT', 4.0, 0, 6)
('ACCGGT', 'A-CG-T', 4.0, 0, 6)
('ACCGGT', 'AC-G-T', 4.0, 0, 6)
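The score of 4.0 can be reproduced by hand. The sketch below is a plain-Python dynamic programming table for the globalxx scoring scheme (match = 1, mismatch = 0, no gap penalties). It is an illustration of the idea, not Biopython's actual implementation, and the function name global_xx_score is made up for this example.

```python
def global_xx_score(seq_a, seq_b):
    """Best global alignment score with match = 1, mismatch = 0, free gaps."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            match = 1 if seq_a[i - 1] == seq_b[j - 1] else 0
            score[i][j] = max(
                score[i - 1][j - 1] + match,  # align the two characters
                score[i - 1][j],              # gap in seq_b (no penalty)
                score[i][j - 1],              # gap in seq_a (no penalty)
            )
    return score[-1][-1]


print(global_xx_score("ACCGGT", "ACGT"))  # prints 4
```

With this scoring scheme the result equals the length of the longest common subsequence, which is why all four alignments above share the same score.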
Step 5
The Bio.pairwise2 module provides a formatting method, format_alignment, to better visualize the result −
>>> from Bio.pairwise2 import format_alignment
>>> alignments = pairwise2.align.globalxx(seq1, seq2)
>>> for alignment in alignments:
...     print(format_alignment(*alignment))
...
ACCGGT
| | ||
A-C-GT
  Score=4
ACCGGT
||  ||
AC--GT
  Score=4
ACCGGT
| || |
A-CG-T
  Score=4
ACCGGT
|| | |
AC-G-T
  Score=4
>>>
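When only one representative alignment is needed, the alignment functions accept the keyword one_alignment_only. A minimal sketch, again assuming a Biopython release that still ships Bio.pairwise2:

```python
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

# Ask for a single best alignment instead of all equally good ones.
alignments = pairwise2.align.globalxx("ACCGGT", "ACGT", one_alignment_only=True)

# format_alignment takes the fields of one alignment tuple as arguments.
report = format_alignment(*alignments[0])
print(report)
```

This keeps memory use down when the sequences are long and many alignments tie for the best score.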
Biopython also provides another module for sequence alignment, Align. This module provides a different API that simplifies the setting of parameters like algorithm, mode, match score, gap penalties, etc. A simple look into the aligner object is as follows −
>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> print(aligner)
Pairwise sequence aligner with parameters
match score: 1.000000
mismatch score: 0.000000
target open gap score: 0.000000
target extend gap score: 0.000000
target left open gap score: 0.000000
target left extend gap score: 0.000000
target right open gap score: 0.000000
target right extend gap score: 0.000000
query open gap score: 0.000000
query extend gap score: 0.000000
query left open gap score: 0.000000
query left extend gap score: 0.000000
query right open gap score: 0.000000
query right extend gap score: 0.000000
mode: global
>>>
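The aligner can be applied to the same two hypothetical sequences used earlier. The sketch below first uses the default parameters and then switches a second aligner to local mode; the specific scores chosen for the local aligner (2, -1, -2, -0.5) are arbitrary values for illustration.

```python
from Bio import Align

# Default aligner: global mode, match = 1, mismatch = 0, no gap penalties.
aligner = Align.PairwiseAligner()
global_score = aligner.score("ACCGGT", "ACGT")
first = next(iter(aligner.align("ACCGGT", "ACGT")))
print(global_score)
print(first)

# A second aligner configured through attributes instead of method names.
local_aligner = Align.PairwiseAligner()
local_aligner.mode = "local"
local_aligner.match_score = 2
local_aligner.mismatch_score = -1
local_aligner.open_gap_score = -2
local_aligner.extend_gap_score = -0.5
local_score = local_aligner.score("ACCGGT", "ACGT")
print(local_score)
```

With the default parameters the aligner uses the same scoring as pairwise2.align.globalxx, so the global score matches the 4.0 obtained before.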
Support for Sequence Alignment Tools
Biopython provides an interface to many sequence alignment tools through the Bio.Align.Applications module. Some of the tools are listed below −
ClustalW
MUSCLE
EMBOSS needle and water
Let us write a simple example in Biopython to create a sequence alignment using the most popular alignment tool, ClustalW.
Step 1 − Download the ClustalW program from its website and install it. Also, update the system PATH with the ClustalW installation path.
Step 2 − Import ClustalwCommandline from the module Bio.Align.Applications.
>>> from Bio.Align.Applications import ClustalwCommandline
Step 3 − Set cmd by calling ClustalwCommandline with the input file, opuntia.fasta, available in the Biopython package.
>>> cmd = ClustalwCommandline("clustalw2", infile="/path/to/biopython/sample/opuntia.fasta")
>>> print(cmd)
clustalw2 -infile=/path/to/biopython/sample/opuntia.fasta
Step 4 − Calling cmd() will run the clustalw command and write the resulting alignment file, opuntia.aln.
>>> stdout, stderr = cmd()
Step 5 − Read and print the alignment file as below −
>>> from Bio import AlignIO
>>> align = AlignIO.read("/path/to/biopython/sample/opuntia.aln", "clustal")
>>> print(align)
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191
>>>
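The same ClustalW run can also be started with the standard subprocess module, which is the route the Biopython documentation appears to recommend in recent releases, where the command-line wrapper classes are marked as deprecated. The sketch below is an assumption-laden illustration: the helper build_clustalw_command is a made-up name, the relative file name opuntia.fasta is a placeholder, and the tool is only run if both the executable and the input file are actually present.

```python
import shutil
import subprocess
from pathlib import Path


def build_clustalw_command(infile, executable="clustalw2"):
    """Return the argument list for a basic ClustalW run (hypothetical helper)."""
    return [executable, f"-infile={infile}"]


# Placeholder input path; adjust it to where opuntia.fasta lives on your system.
command = build_clustalw_command("opuntia.fasta")
print(" ".join(command))

# Run the tool only when the executable and the input file both exist.
if shutil.which(command[0]) is not None and Path("opuntia.fasta").exists():
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    print(completed.stdout)
else:
    print("clustalw2 or opuntia.fasta not found; skipping the run")
```

After a successful run, the resulting opuntia.aln file can be read with AlignIO.read and the "clustal" format, exactly as in Step 5.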