Biopython - Sequence
A sequence is a series of letters used to represent an organism’s protein, DNA or RNA. In Biopython, it is represented by the Seq class, which is defined in the Bio.Seq module.
Let’s create a simple sequence in Biopython as shown below −
>>> from Bio.Seq import Seq
>>> seq = Seq("AGCT")
>>> seq
Seq('AGCT')
>>> print(seq)
AGCT
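Here, we have created a simple sequence, AGCT. Read as a protein, each letter represents an amino acid: Alanine, Glycine, Cysteine and Threonine. Read as DNA, the same letters would be the bases adenine, guanine, cytosine and thymine; the letters alone do not say which is meant.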
Each Seq object has two important attributes −
data − the actual sequence string (AGCT)
alphabet − used to represent the type of sequence, e.g. DNA sequence, RNA sequence, etc. By default, it is generic and does not denote any particular sequence type.
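As a minimal sketch that relies on neither attribute, the underlying letters can be recovered by converting the Seq object with str(). The variable names are illustrative only, and the example passes no alphabet so that it should also run on Biopython 1.78 and later, where Bio.Alphabet is no longer available.

```python
from Bio.Seq import Seq

seq = Seq("AGCT")

# str() gives back the plain Python string held by the Seq object.
text = str(seq)

print(type(text).__name__)  # str
print(text)                 # AGCT
```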
Alphabet Module
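Seq objects contain an alphabet attribute that specifies the sequence type, its letters and the possible operations. The alphabet classes are defined in the Bio.Alphabet module. Note that Bio.Alphabet was removed in Biopython 1.78, so the examples in this section apply to earlier releases. The alphabet of a sequence can be inspected as below −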
>>> from Bio.Seq import Seq
>>> myseq = Seq("AGCT")
>>> myseq
Seq('AGCT')
>>> myseq.alphabet
Alphabet()
The Alphabet module provides the following classes to represent different types of sequences.
Alphabet − Base class for all types of alphabets.
SingleLetterAlphabet − Generic alphabet with letters of size one. It derives from Alphabet, and the other alphabet types listed here derive from it.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import single_letter_alphabet
>>> test_seq = Seq("AGTACACTGGT", single_letter_alphabet)
>>> test_seq
Seq('AGTACACTGGT', SingleLetterAlphabet())
ProteinAlphabet − Generic single letter protein alphabet.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_protein
>>> test_seq = Seq("AGTACACTGGT", generic_protein)
>>> test_seq
Seq('AGTACACTGGT', ProteinAlphabet())
NucleotideAlphabet − Generic single letter nucleotide alphabet.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_nucleotide
>>> test_seq = Seq("AGTACACTGGT", generic_nucleotide)
>>> test_seq
Seq('AGTACACTGGT', NucleotideAlphabet())
DNAAlphabet − Generic single letter DNA alphabet.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> test_seq = Seq("AGTACACTGGT", generic_dna)
>>> test_seq
Seq('AGTACACTGGT', DNAAlphabet())
RNAAlphabet − Generic single letter RNA alphabet.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_rna
>>> test_seq = Seq("AGTACACTGGT", generic_rna)
>>> test_seq
Seq('AGTACACTGGT', RNAAlphabet())
The Bio.Alphabet.IUPAC module provides the basic sequence types defined by IUPAC. It contains the following classes −
IUPACProtein (protein) − IUPAC protein alphabet of 20 standard amino acids.
ExtendedIUPACProtein (extended_protein) − Extended uppercase IUPAC protein single letter alphabet including X.
IUPACAmbiguousDNA (ambiguous_dna) − Uppercase IUPAC ambiguous DNA.
IUPACUnambiguousDNA (unambiguous_dna) − Uppercase IUPAC unambiguous DNA (GATC).
ExtendedIUPACDNA (extended_dna) − Extended IUPAC DNA alphabet.
IUPACAmbiguousRNA (ambiguous_rna) − Uppercase IUPAC ambiguous RNA.
IUPACUnambiguousRNA (unambiguous_rna) − Uppercase IUPAC unambiguous RNA (GAUC).
Consider a simple example for the IUPACProtein class as shown below −
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq("AGCT", IUPAC.protein)
>>> protein_seq
Seq('AGCT', IUPACProtein())
>>> protein_seq.alphabet
IUPACProtein()
Biopython also exposes its bioinformatics-related configuration data through the Bio.Data module. For example, IUPACData.protein_letters holds the letters of the IUPACProtein alphabet.
>>> from Bio.Data import IUPACData
>>> IUPACData.protein_letters
'ACDEFGHIKLMNPQRSTVWY'
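One use of these letter tables is a quick sanity check on raw input. The sketch below is illustrative only: invalid_letters is a hypothetical helper written for this example, not a Biopython function. It reports which characters of a string fall outside the 20 standard amino acid codes.

```python
from Bio.Data import IUPACData


def invalid_letters(sequence, allowed=IUPACData.protein_letters):
    """Return the sorted characters of `sequence` that are not in `allowed`.

    Hypothetical helper for illustration; not part of Biopython.
    """
    return sorted(set(sequence.upper()) - set(allowed))


# A, G, C and T are all standard amino acid codes, so nothing is reported.
print(invalid_letters("AGCT"))          # []

# U is not one of the 20 standard codes.
print(invalid_letters("AGUACACUGGU"))   # ['U']
```

An empty list means every letter belongs to the chosen alphabet. Passing a different letter string as allowed checks against another alphabet.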
Basic Operations
This section briefly explains the basic operations available in the Seq class. Sequences behave much like Python strings, so we can apply string operations such as slicing, counting, concatenation, find, split and strip to them.
The snippets below show each operation together with its output.
To get the first value in the sequence.
>>> seq_string = Seq("AGCTAGCT")
>>> seq_string[0]
'A'
To print the first two values.
>>> seq_string[0:2]
Seq('AG')
To print all the values.
>>> seq_string[:]
Seq('AGCTAGCT')
To perform length and count operations.
>>> len(seq_string)
8
>>> seq_string.count("A")
2
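Indexing, slicing, len() and count() combine naturally. As a small sketch, the fraction of G and C letters in the sequence can be computed from counts alone. The variable names are illustrative, and no alphabet is passed so the code should not depend on Bio.Alphabet.

```python
from Bio.Seq import Seq

seq_string = Seq("AGCTAGCT")

first = seq_string[0]      # indexing returns a plain one-letter string
prefix = seq_string[0:2]   # slicing returns a new Seq object

# count() counts non-overlapping occurrences, like str.count().
gc_fraction = (seq_string.count("G") + seq_string.count("C")) / len(seq_string)

print(first)        # A
print(prefix)       # AG
print(gc_fraction)  # 0.5
```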
To add two sequences.
>>> from Bio.Alphabet import generic_dna, generic_protein
>>> seq1 = Seq("AGCT", generic_dna)
>>> seq2 = Seq("TCGA", generic_dna)
>>> seq1 + seq2
Seq('AGCTTCGA', DNAAlphabet())
Here, seq1 and seq2 are both generic DNA sequences, so you can add them to produce a new sequence. You can’t add sequences with incompatible alphabets, such as a DNA sequence and a protein sequence, as shown below −
>>> dna_seq = Seq("AGTACACTGGT", generic_dna)
>>> protein_seq = Seq("AGUACACUGGU", generic_protein)
>>> dna_seq + protein_seq
.....
.....
TypeError: Incompatible alphabets DNAAlphabet() and ProteinAlphabet()
>>>
To add more than two sequences, first store them in a Python list, then walk through the list with a for loop and add each one to a running result that starts as an empty sequence, as shown below −
>>> from Bio.Alphabet import generic_dna
>>> seqs = [Seq("AGCT", generic_dna), Seq("TCGA", generic_dna), Seq("AAA", generic_dna)]
>>> for s in seqs:
...     print(s)
...
AGCT
TCGA
AAA
>>> final_seq = Seq("", generic_dna)
>>> for s in seqs:
...     final_seq = final_seq + s
...
>>> final_seq
Seq('AGCTTCGAAAA', DNAAlphabet())
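The same accumulation can be sketched without alphabets, which should also work on Biopython releases that no longer ship Bio.Alphabet. The names fragments, final_seq and joined are illustrative. The built-in sum() is shown as a shorter alternative; it repeats the + operation starting from an empty Seq.

```python
from Bio.Seq import Seq

fragments = [Seq("AGCT"), Seq("TCGA"), Seq("AAA")]

# Explicit loop: start from an empty sequence and append each fragment.
final_seq = Seq("")
for fragment in fragments:
    final_seq = final_seq + fragment

# Equivalent one-liner: sum() applies + repeatedly, starting from Seq("").
joined = sum(fragments, Seq(""))

print(final_seq)  # AGCTTCGAAAA
print(joined)     # AGCTTCGAAAA
```

The starting value must be an empty sequence. Starting from Seq(" ") would leave a stray space at the front of the result.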
The following snippets cover further operations.
To change the case of a sequence.
>>> from Bio.Alphabet import generic_rna
>>> rna = Seq("agct", generic_rna)
>>> rna.upper()
Seq('AGCT', RNAAlphabet())
To use Python’s membership and identity operators.
>>> rna = Seq("agct", generic_rna)
>>> "a" in rna
True
>>> "A" in rna
False
>>> rna1 = Seq("AGCT", generic_dna)
>>> rna is rna1
False
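The identity operator only asks whether two names refer to the same object, so it is False even for sequences with identical letters. The sketch below contrasts membership, identity and a letter-by-letter comparison. Comparing through str() is an assumption made here to keep the example independent of how a given Biopython version defines == for Seq objects.

```python
from Bio.Seq import Seq

rna = Seq("agct")
rna1 = Seq("AGCT")

has_lower_a = "a" in rna    # membership is case-sensitive
has_upper_a = "A" in rna
same_object = rna is rna1   # identity: two distinct objects

# Compare the letters themselves, ignoring case, via plain strings.
same_letters = str(rna.upper()) == str(rna1)

print(has_lower_a, has_upper_a, same_object, same_letters)
# True False False True
```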
To find a single letter or a run of letters inside the given sequence.
>>> protein_seq = Seq("AGUACACUGGU", generic_protein)
>>> protein_seq.find("G")
1
>>> protein_seq.find("GG")
8
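As with Python strings, find() reports only the first match and returns -1 when there is none. A short sketch, with illustrative variable names and no alphabet argument, that also collects every position of a letter:

```python
from Bio.Seq import Seq

seq = Seq("AGUACACUGGU")

first_g = seq.find("G")    # index of the first G
first_gg = seq.find("GG")  # index where the first GG starts
missing = seq.find("T")    # -1 signals that T does not occur

# find() stops at the first hit; iterate to list every position.
all_g = [i for i, letter in enumerate(seq) if letter == "G"]

print(first_g, first_gg, missing)  # 1 8 -1
print(all_g)                       # [1, 8, 9]
```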
To perform a splitting operation.
>>> protein_seq = Seq("AGUACACUGGU", generic_protein)
>>> protein_seq.split("A")
[Seq('', ProteinAlphabet()), Seq('GU', ProteinAlphabet()), Seq('C', ProteinAlphabet()), Seq('CUGGU', ProteinAlphabet())]
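The empty first element in the output above appears because the sequence starts with the separator. A minimal sketch, with illustrative names and no alphabet argument, of splitting and then discarding empty pieces:

```python
from Bio.Seq import Seq

seq = Seq("AGUACACUGGU")

# split() returns a list of Seq objects, one per piece between separators.
parts = seq.split("A")
as_text = [str(part) for part in parts]

# Drop the empty piece produced by the leading separator.
non_empty = [str(part) for part in parts if len(part) > 0]

print(as_text)    # ['', 'GU', 'C', 'CUGGU']
print(non_empty)  # ['GU', 'C', 'CUGGU']
```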
To perform strip operations on the sequence.
>>> strip_seq = Seq(" AGCT ")
>>> strip_seq
Seq(' AGCT ')
>>> strip_seq.strip()
Seq('AGCT')
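Like the string method, strip() also accepts the characters to remove, and lstrip() trims only the left end. The sketch below uses a made-up sequence padded with N purely for illustration.

```python
from Bio.Seq import Seq

padded = Seq(" AGCT ")
trimmed = padded.strip()        # whitespace removed from both ends

masked = Seq("NNAGCTNN")        # hypothetical sequence padded with N
core = masked.strip("N")        # remove a specific character from both ends
left_only = masked.lstrip("N")  # remove it from the left end only

print(trimmed)    # AGCT
print(core)       # AGCT
print(left_only)  # AGCTNN
```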