- Biopython - Testing Techniques
- Biopython - Machine Learning
- Biopython - Cluster Analysis
- Biopython - Plotting
- Biopython - Phenotype Microarray
- Biopython - Genome Analysis
- Biopython - Population Genetics
- Biopython - BioSQL Module
- Biopython - Motif Objects
- Biopython - PDB Module
- Biopython - Entrez Database
- Biopython - Overview of BLAST
- Biopython - Sequence Alignments
- Sequence I/O Operations
- Advanced Sequence Operations
- Biopython - Sequence
- Creating Simple Application
- Biopython - Installation
- Biopython - Introduction
- Biopython - Home
Biopython Resources
Selected Reading
- Who is Who
- Computer Glossary
- HR Interview Questions
- Effective Resume Writing
- Questions and Answers
- UPSC IAS Exams Notes
Biopython - Sequence I/O Operations
Biopython provides a module, Bio.SeqIO to read and write sequences from and to a file (any stream) respectively. It supports nearly all file formats available in bioinformatics. Most of the software provides different approach for different file formats. But, Biopython consciously follows a single approach to present the parsed sequence data to the user through its SeqRecord object.
Let us learn more about SeqRecord in the following section.
SeqRecord
Bio.SeqRecord module provides SeqRecord to hold meta information of the sequence as well as the sequence data itself as given below −
seq − It is an actual sequence.
id − It is the primary identifier of the given sequence. The default type is string.
name − It is the Name of the sequence. The default type is string.
description − It displays human readable information about the sequence.
annotations − It is a dictionary of additional information about the sequence.
The SeqRecord can be imported as specified below
from Bio.SeqRecord import SeqRecord
Let us understand the nuances of parsing the sequence file using real sequence file in the coming sections.
Parsing Sequence File Formats
This section explains about how to parse two of the most popular sequence file formats, FASTA and GenBank.
FASTA
FASTA is the most basic file format for storing sequence data. Originally, FASTA is a software package for sequence apgnment of DNA and protein developed during the early evolution of Bioinformatics and used mostly to search the sequence similarity.
Biopython provides an example FASTA file and it can be accessed at
Download and save this file into your Biopython sample directory as ‘orchid.fasta’.
Bio.SeqIO module provides parse() method to process sequence files and can be imported as follows −
from Bio.SeqIO import parse
parse() method contains two arguments, first one is file handle and second is file format.
>>> file = open( path/to/biopython/sample/orchid.fasta ) >>> for record in parse(file, "fasta"): ... print(record.id) ... gi|2765658|emb|Z78533.1|CIZ78533 gi|2765657|emb|Z78532.1|CCZ78532 .......... .......... gi|2765565|emb|Z78440.1|PPZ78440 gi|2765564|emb|Z78439.1|PBZ78439 >>>
Here, the parse() method returns an iterable object which returns SeqRecord on every iteration. Being iterable, it provides lot of sophisticated and easy methods and let us see some of the features.
next()
next() method returns the next item available in the iterable object, which we can be used to get the first sequence as given below −
>>> first_seq_record = next(SeqIO.parse(open( path/to/biopython/sample/orchid.fasta ), fasta )) >>> first_seq_record.id gi|2765658|emb|Z78533.1|CIZ78533 >>> first_seq_record.name gi|2765658|emb|Z78533.1|CIZ78533 >>> first_seq_record.seq Seq( CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC , SingleLetterAlphabet()) >>> first_seq_record.description gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA >>> first_seq_record.annotations {} >>>
Here, seq_record.annotations is empty because the FASTA format does not support sequence annotations.
pst comprehension
We can convert the iterable object into pst using pst comprehension as given below
>>> seq_iter = SeqIO.parse(open( path/to/biopython/sample/orchid.fasta ), fasta ) >>> all_seq = [seq_record for seq_record in seq_iter] >>> len(all_seq) 94 >>>
Here, we have used len method to get the total count. We can get sequence with maximum length as follows −
>>> seq_iter = SeqIO.parse(open( path/to/biopython/sample/orchid.fasta ), fasta ) >>> max_seq = max(len(seq_record.seq) for seq_record in seq_iter) >>> max_seq 789 >>>
We can filter the sequence as well using the below code −
>>> seq_iter = SeqIO.parse(open( path/to/biopython/sample/orchid.fasta ), fasta ) >>> seq_under_600 = [seq_record for seq_record in seq_iter if len(seq_record.seq) < 600] >>> for seq in seq_under_600: ... print(seq.id) ... gi|2765606|emb|Z78481.1|PIZ78481 gi|2765605|emb|Z78480.1|PGZ78480 gi|2765601|emb|Z78476.1|PGZ78476 gi|2765595|emb|Z78470.1|PPZ78470 gi|2765594|emb|Z78469.1|PHZ78469 gi|2765564|emb|Z78439.1|PBZ78439 >>>
Writing a collection of SqlRecord objects (parsed data) into file is as simple as calpng the SeqIO.write method as below −
file = open("converted.fasta", "w) SeqIO.write(seq_record, file, "fasta")
This method can be effectively used to convert the format as specified below −
file = open("converted.gbk", "w) SeqIO.write(seq_record, file, "genbank")
GenBank
It is a richer sequence format for genes and includes fields for various kinds of annotations. Biopython provides an example GenBank file and it can be accessed at
Download and save file into your Biopython sample directory as ‘orchid.gbk’
Since, Biopython provides a single function, parse to parse all bioinformatics format. Parsing GenBank format is as simple as changing the format option in the parse method.
The code for the same has been given below −
>>> from Bio import SeqIO >>> from Bio.SeqIO import parse >>> seq_record = next(parse(open( path/to/biopython/sample/orchid.gbk ), genbank )) >>> seq_record.id Z78533.1 >>> seq_record.name Z78533 >>> seq_record.seq Seq( CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC , IUPACAmbiguousDNA()) >>> seq_record.description C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA >>> seq_record.annotations { molecule_type : DNA , topology : pnear , data_file_spanision : PLN , date : 30-NOV-2006 , accessions : [ Z78533 ], sequence_version : 1, gi : 2765658 , keywords : [ 5.8S ribosomal RNA , 5.8S rRNA gene , internal transcribed spacer , ITS1 , ITS2 ], source : Cypripedium irapeanum , organism : Cypripedium irapeanum , taxonomy : [ Eukaryota , Viridiplantae , Streptophyta , Embryophyta , Tracheophyta , Spermatophyta , Magnopophyta , Lipopsida , Asparagales , Orchidaceae , Cypripedioideae , Cypripedium ], references : [ Reference(title = Phylogenetics of the sppper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences , ...), Reference(title = Direct Submission , ...) ] }Advertisements