Beautiful Soup - Parsing Only a Section of a Document
There are many situations where you need only one kind of information from a document, for example only the <a> tags. The SoupStrainer class in Beautiful Soup 4 lets you parse only a specific part of an incoming document.
To use it, create a SoupStrainer and pass it to the BeautifulSoup constructor as the parse_only argument. Note that parse_only does not work with the html5lib parser, which always parses the whole document.
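As a minimal sketch of this pattern, the example below parses a small, made-up HTML snippet and keeps only its <a> tags. The sample markup and the names sample_html, only_links and links are illustrative assumptions, not part of the original tutorial.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup, used only for illustration
sample_html = """
<html><body>
<h1>Products</h1>
<p>See <a href="/shoes">shoes</a> and <a href="/hats">hats</a>.</p>
</body></html>
"""

# Keep only <a> tags while parsing
only_links = SoupStrainer("a")
soup = BeautifulSoup(sample_html, "html.parser", parse_only=only_links)

# The tree holds the two links; the <h1> and <p> tags were never built
links = [a["href"] for a in soup.find_all("a")]
print(links)
print(soup.find("h1"))
```

Because the filtering happens during parsing, searching the resulting soup for <h1> or <p> finds nothing, even though both tags are present in the input.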
SoupStrainer
A SoupStrainer tells BeautifulSoup which parts of the document to extract, and the parse tree then consists of only those elements. If the information you need is confined to a specific portion of the HTML, this saves parsing time and memory and speeds up later searches.
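from bs4 import BeautifulSoup, SoupStrainer

# Keep only <span> elements whose id is "products_pst"
product = SoupStrainer("span", {"id": "products_pst"})
soup = BeautifulSoup(html, "html.parser", parse_only=product)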
The lines of code above parse only the product titles from a product site, assuming the titles sit inside a <span> tag with the id products_pst.
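To see the effect on a concrete input, here is a small runnable sketch of the same strainer. The page fragment, the product name and the extra logo and price spans are invented for illustration. It also shows that everything nested inside a matching element is kept.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical product page fragment, used only for illustration
html = """
<div><span id="logo">Shop</span></div>
<div><span id="products_pst"><b>Blue Kettle</b> (1.5 l)</span></div>
<div><span id="price">19.99</span></div>
"""

product = SoupStrainer("span", {"id": "products_pst"})
soup = BeautifulSoup(html, "html.parser", parse_only=product)

# Only the matching <span> is in the tree, together with everything inside it
spans = soup.find_all("span")
title = spans[0].get_text()
print(len(spans), title)
```

The surrounding <div> tags and the other two spans never reach the tree, while the <b> tag survives because it is a child of the matching span.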
Similarly, we can use other SoupStrainer objects to parse specific information from an HTML document. Some examples are given below −
from bs4 import BeautifulSoup, SoupStrainer

# Parse only <a> tags
only_a_tags = SoupStrainer("a")

# Parse only the elements whose id is one of the values listed below
# (my_document is assumed to hold the HTML markup as a string)
parse_only = SoupStrainer(id=["first", "third", "my_unique_id"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

# Parse only the strings that are shorter than 10 characters
def is_short_string(string):
    # Guard against None before calling len()
    return string is not None and len(string) < 10

only_short_strings = SoupStrainer(string=is_short_string)
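The following sketch runs the id-based and string-based strainers against a tiny document so the result can be inspected. The contents of my_document and the names by_id, kept_ids and short_soup are assumptions made for this illustration. The string predicate includes a None check as a defensive measure.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical document, used only for illustration
my_document = """
<div id="first">one</div>
<div id="second">two</div>
<p id="third">three</p>
<span id="my_unique_id">four</span>
"""

# Keep only the elements whose id appears in the list
by_id = SoupStrainer(id=["first", "third", "my_unique_id"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=by_id)
kept_ids = [tag["id"] for tag in soup.find_all(True)]
print(kept_ids)

# Keep only strings shorter than 10 characters
def is_short_string(string):
    return string is not None and len(string) < 10

short_soup = BeautifulSoup(
    my_document, "html.parser", parse_only=SoupStrainer(string=is_short_string)
)
print(list(short_soup.stripped_strings))
```

A list passed as an attribute value matches an element if any one of the listed values matches, which is why the element with the id second is left out of the first tree.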