Biopython - Entrez Database
Entrez is an online search system provided by NCBI. It gives access to nearly all known molecular biology databases through an integrated global query that supports Boolean operators and field search. A query returns results from all the databases, including the number of hits in each database and records with links to the originating database.
Some of the popular databases that can be accessed through Entrez are listed below:
PubMed
PubMed Central
Nucleotide (GenBank Sequence Database)
Protein (Sequence Database)
Genome (Whole Genome Database)
Structure (Three Dimensional Macromolecular Structure)
Taxonomy (Organisms in GenBank)
SNP (Single Nucleotide Polymorphism)
UniGene (Gene Oriented Clusters of Transcript Sequences)
CDD (Conserved Protein Domain Database)
3D Domains (Domains from Entrez Structure)
In addition to the above databases, Entrez provides many more databases that support field search.
Biopython provides an Entrez-specific module, Bio.Entrez, to access the Entrez databases. This chapter shows how to access Entrez using Biopython.
Database Connection Steps
To use Entrez from Biopython, import the following module:
>>> from Bio import Entrez
Next, set your email address so that NCBI can identify who is making the requests:
>>> Entrez.email = "<youremail>"
Then set the Entrez tool parameter, which names the calling program. By default, it is Biopython.
>>> Entrez.tool = "Demoscript"
Now call the einfo function. Given a database name, it reports index term counts, the last update, and the available links for that database; called without arguments, as here, it lists the available databases:
>>> info = Entrez.einfo()
The einfo function returns a handle, and the handle's read method gives access to the raw response, as shown below:
>>> data = info.read()
>>> print(data)
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
<eInfoResult>
<DbList>
   <DbName>pubmed</DbName>
   <DbName>protein</DbName>
   <DbName>nuccore</DbName>
   <DbName>ipg</DbName>
   <DbName>nucleotide</DbName>
   <DbName>nucgss</DbName>
   <DbName>nucest</DbName>
   <DbName>structure</DbName>
   <DbName>sparcle</DbName>
   <DbName>genome</DbName>
   <DbName>annotinfo</DbName>
   <DbName>assembly</DbName>
   <DbName>bioproject</DbName>
   <DbName>biosample</DbName>
   <DbName>blastdbinfo</DbName>
   <DbName>books</DbName>
   <DbName>cdd</DbName>
   <DbName>clinvar</DbName>
   <DbName>clone</DbName>
   <DbName>gap</DbName>
   <DbName>gapplus</DbName>
   <DbName>grasp</DbName>
   <DbName>dbvar</DbName>
   <DbName>gene</DbName>
   <DbName>gds</DbName>
   <DbName>geoprofiles</DbName>
   <DbName>homologene</DbName>
   <DbName>medgen</DbName>
   <DbName>mesh</DbName>
   <DbName>ncbisearch</DbName>
   <DbName>nlmcatalog</DbName>
   <DbName>omim</DbName>
   <DbName>orgtrack</DbName>
   <DbName>pmc</DbName>
   <DbName>popset</DbName>
   <DbName>probe</DbName>
   <DbName>proteinclusters</DbName>
   <DbName>pcassay</DbName>
   <DbName>biosystems</DbName>
   <DbName>pccompound</DbName>
   <DbName>pcsubstance</DbName>
   <DbName>pubmedhealth</DbName>
   <DbName>seqannot</DbName>
   <DbName>snp</DbName>
   <DbName>sra</DbName>
   <DbName>taxonomy</DbName>
   <DbName>biocollections</DbName>
   <DbName>unigene</DbName>
   <DbName>gencoll</DbName>
   <DbName>gtr</DbName>
</DbList>
</eInfoResult>
The data is in XML format. To get it as a Python object instead, pass the handle returned by Entrez.einfo() to the Entrez.read method:
>>> info = Entrez.einfo()
>>> record = Entrez.read(info)
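Here, record is a dictionary with a single key, DbList, as shown below. (The u prefix in the outputs of this chapter marks Python 2 unicode strings; Python 3 prints the same strings without it.)
>>> record.keys()
[u'DbList']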
Accessing the DbList key returns the list of database names shown below:
>>> record[u'DbList']
['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest',
'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject',
'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap',
'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene',
'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc',
'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound',
'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy',
'biocollections', 'unigene', 'gencoll', 'gtr']
In short, the Entrez module parses the XML returned by the Entrez search system and provides it as Python dictionaries and lists.
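To make the XML-to-dictionary step concrete, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree on a shortened copy of the einfo response shown earlier. It is not how Bio.Entrez is implemented (the module ships its own parser); the helper name einfo_to_dict and the three-entry sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hand-written copy of the einfo response (three databases only).
SAMPLE_XML = """<eInfoResult>
  <DbList>
    <DbName>pubmed</DbName>
    <DbName>protein</DbName>
    <DbName>nuccore</DbName>
  </DbList>
</eInfoResult>"""


def einfo_to_dict(xml_text):
    """Collect every DbName element into a {'DbList': [...]} dictionary."""
    root = ET.fromstring(xml_text)
    names = [node.text for node in root.findall("./DbList/DbName")]
    return {"DbList": names}


record = einfo_to_dict(SAMPLE_XML)
print(record["DbList"])  # ['pubmed', 'protein', 'nuccore']
```

In real code, Entrez.read remains the better choice, since it handles every Entrez response type rather than this one layout.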
Search Database
To search any one of the Entrez databases, use the Bio.Entrez.esearch() function, as shown below:
>>> info = Entrez.esearch(db="pubmed", term="genome")
>>> record = Entrez.read(info)
>>> print(record)
DictElement({u'Count': '1146113', u'RetMax': '20', u'IdList':
['30347444', '30347404', '30347317', '30347292', '30347286', '30347249',
'30347194', '30347187', '30347172', '30347088', '30347075', '30346992',
'30346990', '30346982', '30346980', '30346969', '30346962', '30346954',
'30346941', '30346939'], u'TranslationStack': [DictElement({u'Count':
'927819', u'Field': 'MeSH Terms', u'Term': '"genome"[MeSH Terms]',
u'Explode': 'Y'}, attributes={}), DictElement({u'Count': '422712',
u'Field': 'All Fields', u'Term': '"genome"[All Fields]', u'Explode': 'N'},
attributes={}), 'OR', 'GROUP'], u'TranslationSet': [DictElement({u'To':
'"genome"[MeSH Terms] OR "genome"[All Fields]', u'From': 'genome'},
attributes={})], u'RetStart': '0', u'QueryTranslation':
'"genome"[MeSH Terms] OR "genome"[All Fields]'}, attributes={})
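Behind the scenes, a search like this one is an HTTP request to NCBI's E-utilities service. The sketch below only builds such a request URL, without sending it, to show how the db and term arguments and the email and tool settings travel as query parameters. The helper name build_esearch_url and the example address are hypothetical, and the exact set of parameters that Bio.Entrez adds may differ.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db, term, email, tool="Biopython", retmax=20):
    """Return an esearch request URL (hypothetical helper, sends nothing)."""
    params = {
        "db": db,          # database to search
        "term": term,      # query text
        "retmax": retmax,  # maximum number of IDs to return
        "tool": tool,      # name of the calling program
        "email": email,    # contact address of the caller
    }
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


url = build_esearch_url("pubmed", "genome", "you@example.org")
print(url)
```

The retmax value corresponds to the RetMax field in the result, which is why only 20 of the matching IDs appear in IdList.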
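If the search term matches nothing in the chosen database, the result has a count of zero and carries warning and error lists, as in this search of blastdbinfo for "books":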
>>> info = Entrez.esearch(db="blastdbinfo", term="books")
>>> record = Entrez.read(info)
>>> print(record)
DictElement({u'Count': '0', u'RetMax': '0', u'IdList': [], u'WarningList':
DictElement({u'OutputMessage': ['No items found.'], u'PhraseIgnored': [],
u'QuotedPhraseNotFound': []}, attributes={}), u'ErrorList':
DictElement({u'FieldNotFound': [], u'PhraseNotFound': ['books']},
attributes={}), u'TranslationSet': [], u'RetStart': '0',
u'QueryTranslation': '(books[All Fields])'}, attributes={})
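Since a search with no hits still returns a normal record, calling code has to check the count itself. The following sketch shows one way to do that, using plain dictionaries trimmed from the two results above in place of live Entrez.read output. The function name summarize_search is hypothetical. Note that the counts arrive as strings and need converting before any numeric comparison.

```python
def summarize_search(record):
    """Describe an esearch result dictionary in one line."""
    count = int(record["Count"])  # counts arrive as strings
    if count == 0:
        # WarningList is only present when something went wrong.
        warnings = record.get("WarningList", {}).get("OutputMessage", [])
        if warnings:
            return "no hits: " + "; ".join(warnings)
        return "no hits"
    first_ids = ", ".join(record["IdList"][:3])
    return f"{count} hits, first IDs: {first_ids}"


# Trimmed copies of the two results shown above.
found = {
    "Count": "1146113",
    "RetMax": "20",
    "IdList": ["30347444", "30347404", "30347317"],
}
empty = {
    "Count": "0",
    "RetMax": "0",
    "IdList": [],
    "WarningList": {"OutputMessage": ["No items found."]},
}

print(summarize_search(found))  # 1146113 hits, first IDs: 30347444, 30347404, 30347317
print(summarize_search(empty))  # no hits: No items found.
```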
To search across all databases at once, use Entrez.egquery. It is similar to Entrez.esearch, except that you specify only the search term and omit the database parameter.
>>> info = Entrez.egquery(term="entrez")
>>> record = Entrez.read(info)
>>> for row in record["eGQueryResult"]:
...     print(row["DbName"], row["Count"])
...
pubmed 458
pmc 12779
mesh 1
...
biosample 7
biocollections 0
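A common follow-up is to rank the databases by hit count. The sketch below does this on a hand-copied subset of the rows printed above, standing in for record["eGQueryResult"]. It assumes each row has DbName and Count keys with string values, and it skips any count that is not a plain number as a precaution.

```python
# Hand-copied subset of the rows printed above.
rows = [
    {"DbName": "pubmed", "Count": "458"},
    {"DbName": "pmc", "Count": "12779"},
    {"DbName": "mesh", "Count": "1"},
    {"DbName": "biosample", "Count": "7"},
    {"DbName": "biocollections", "Count": "0"},
]

# Convert the string counts to integers and drop databases with no hits.
hits = [
    (row["DbName"], int(row["Count"]))
    for row in rows
    if row["Count"].isdigit() and int(row["Count"]) > 0
]

# Sort from the most hits to the fewest.
ranked = sorted(hits, key=lambda pair: pair[1], reverse=True)

for name, count in ranked:
    print(name, count)
```

Sorting the raw strings would give the wrong order ("7" sorts after "458"), so the conversion to int matters here.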
Fetch Records
Entrez provides a special method, efetch, to download the full details of a record from Entrez. Consider the following simple example:
>>> handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="fasta")
Now, we can read the record from the handle using the SeqIO module:
>>> from Bio import SeqIO
>>> record = SeqIO.read(handle, "fasta")
>>> record
SeqRecord(seq=Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA',
SingleLetterAlphabet()), id='EU490707.1', name='EU490707.1',
description='EU490707.1 Selenipedium aequinoctiale maturase K (matK) gene,
partial cds; chloroplast', dbxrefs=[])