Beautiful Soup - Searching the tree
Beautiful Soup provides many methods for searching a parse tree. The two most commonly used are find() and find_all().
Before talking about find() and find_all(), let us see some examples of different filters you can pass into these methods.
Kinds of Filters
These methods accept several kinds of filters, and understanding them is crucial because the same filters are used again and again throughout the search API. A filter can match on a tag’s name, on its attributes, on the text of a string, or on a combination of these.
A string
One of the simplest types of filter is a string. Pass a string to a search method and Beautiful Soup will match tags whose name is exactly that string.
The code below finds all the <p> tags in the document −
>>> from bs4 import BeautifulSoup
>>> # The outputs shown assume the lxml parser is used.
>>> markup = BeautifulSoup('<p>Top Three</p><p><pre>Programming Languages are:</pre></p><p><b>Java, Python, Cplusplus</b></p>', 'lxml')
>>> markup.find_all('p')
[<p>Top Three</p>, <p></p>, <p><b>Java, Python, Cplusplus</b></p>]
Regular Expression
If you pass a compiled regular expression, Beautiful Soup filters tag names against it using the pattern's search() method. The code below finds every tag whose name starts with "p". The re module must be imported first.
>>> import re
>>> from bs4 import BeautifulSoup
>>> # The outputs shown assume the lxml parser is used.
>>> markup = BeautifulSoup('<p>Top Three</p><p><pre>Programming Languages are:</pre></p><p><b>Java, Python, Cplusplus</b></p>', 'lxml')
>>> markup.find_all(re.compile('^p'))
[<p>Top Three</p>, <p></p>, <pre>Programming Languages are:</pre>, <p><b>Java, Python, Cplusplus</b></p>]
List
You can search for several tags at once by passing a list. The code below finds all the <b> and <pre> tags −
>>> markup.find_all(['pre', 'b'])
[<pre>Programming Languages are:</pre>, <b>Java, Python, Cplusplus</b>]
True
The value True matches every tag it can find, but none of the text strings on their own −
>>> markup.find_all(True)
[<html><body><p>Top Three</p><p></p><pre>Programming Languages are:</pre><p><b>Java, Python, Cplusplus</b></p></body></html>,
<body><p>Top Three</p><p></p><pre>Programming Languages are:</pre><p><b>Java, Python, Cplusplus</b></p></body>,
<p>Top Three</p>,
<p></p>,
<pre>Programming Languages are:</pre>,
<p><b>Java, Python, Cplusplus</b></p>,
<b>Java, Python, Cplusplus</b>]
To print only the tag names from the above soup −
>>> for tag in markup.find_all(True):
...     print(tag.name)
...
html
body
p
p
pre
p
b
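The filters above all match on tag names. As a minimal sketch of the other cases mentioned earlier (attributes, string text, and a mix of these), the following uses a small piece of markup invented for this illustration; the class and id values are assumptions, not part of the earlier examples.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration only.
html = (
    '<p class="intro" id="first">Top Three</p>'
    '<p class="body">Programming Languages are:</p>'
    '<p class="body"><b>Java, Python, Cplusplus</b></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Filter on an attribute value; class_ avoids the Python keyword "class".
body_paragraphs = soup.find_all("p", class_="body")

# Filter on an arbitrary attribute through the attrs dictionary.
first = soup.find_all(attrs={"id": "first"})

# Filter on text; with string= alone the results are strings, not tags.
matching_text = soup.find_all(string=re.compile("Python"))

# Mixed filter: a tag name plus a pattern for the tag's string.
tags_with_text = soup.find_all("b", string=re.compile("^Java"))

print(len(body_paragraphs))
print(first[0].get_text())
print(matching_text)
print(tags_with_text)
```

Note that string= on its own returns the matching strings, while combining it with a tag name returns the tags that contain them.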
find_all()
You can use find_all() to extract all occurrences of a particular tag from the page response −
Syntax
find_all(name, attrs, recursive, string, limit, **kwargs)
Let us extract some interesting data from IMDb's “Top Rated Movies” chart.
>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"
>>> content = requests.get(url)
>>> soup = BeautifulSoup(content.text, 'html.parser')
>>> # Extract the page title
>>> print(soup.find('title'))
<title>IMDb Top 250 - IMDb</title>
>>> # Extract the main heading
>>> for heading in soup.find_all('h1'):
...     print(heading.text)
...
Top Rated Movies
>>> # Extract the sub-headings
>>> for heading in soup.find_all('h3'):
...     print(heading.text)
...
IMDb Charts
You Have Seen
IMDb Charts
Top India Charts
Top Rated Movies by Genre
Recently Viewed
As the output shows, find_all() returns every item that matches the search criteria we define. All the filters that work with find_all() also work with find() and with the other searching methods, such as find_parents() or find_next_siblings().
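The syntax line lists limit and recursive without showing them in use. The sketch below illustrates both on a small document invented for this purpose, so it runs without a network request; the markup is an assumption and is unrelated to the IMDb page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration only.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><div><h3>One</h3><h3>Two</h3></div><h3>Three</h3></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# No extra arguments: every <h3> in the document, in document order.
all_h3 = [h.get_text() for h in soup.find_all("h3")]

# limit=2: stop searching once two matches have been collected.
first_two = [h.get_text() for h in soup.find_all("h3", limit=2)]

# recursive=False: consider only the direct children of <body>.
direct = [h.get_text() for h in soup.body.find_all("h3", recursive=False)]

print(all_h3)
print(first_two)
print(direct)
```

With recursive=False the two headings nested inside the <div> are skipped, because they are grandchildren of <body> rather than direct children.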
find()
As we have seen, find_all() scans the entire document and returns every match, but sometimes only one result is needed. If you know that the document contains only one <body> tag, searching the entire document for more is a waste of time. One option is to call find_all() with limit=1 every time; the other is to use the find() method, which does the same thing −
Syntax
find(name, attrs, recursive, string, **kwargs)
So the two calls below find the same element −
>>> soup.find_all('title', limit=1)
[<title>IMDb Top 250 - IMDb</title>]
>>> soup.find('title')
<title>IMDb Top 250 - IMDb</title>
In the above outputs, we can see that find_all() returns a list containing a single item, whereas find() returns the item itself.
Another difference between the find() and find_all() methods is −
>>> soup.find_all('h2')
[]
>>> soup.find('h2')
If find_all() can’t find anything, it returns an empty list, whereas find() returns None.
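Because find() can return None, accessing an attribute such as .text on its result raises AttributeError when nothing matched. A minimal sketch of guarding against that follows; the helper name first_text and the markup are hypothetical, introduced only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration only.
soup = BeautifulSoup(
    "<html><head><title>Demo</title></head><body></body></html>",
    "html.parser",
)

def first_text(soup, tag_name):
    """Return the text of the first matching tag, or None if there is none."""
    tag = soup.find(tag_name)
    if tag is None:  # find() found nothing
        return None
    return tag.get_text()

title_text = first_text(soup, "title")
missing_text = first_text(soup, "h2")

print(title_text)
print(missing_text)
```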
find_parents() and find_parent()
Unlike the find_all() and find() methods, which traverse the tree downwards and look at a tag’s descendants, the find_parents() and find_parent() methods do the opposite: they traverse the tree upwards and look at a tag’s (or a string’s) parents.
Syntax
find_parents(name, attrs, string, limit, **kwargs)
find_parent(name, attrs, string, **kwargs)
>>> a_string = soup.find(string="The Godfather")
>>> a_string
'The Godfather'
>>> a_string.find_parents('a')
[<a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>]
>>> a_string.find_parent('a')
<a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
>>> a_string.find_parent('tr')
<tr>
<td class="posterColumn">
<span data-value="2" name="rk"></span>
<span data-value="9.149038526210072" name="ir"></span>
<span data-value="6.93792E10" name="us"></span>
<span data-value="1485540" name="nv"></span>
<span data-value="-1.850961473789928" name="ur"></span>
<a href="/title/tt0068646/"> <img alt="The Godfather" height="67" src="https://m.media-amazon.com/images/M/MV5BM2MyNjYxNmUtYTAwNi00MTYxLWJmNWYtYzZlODY3ZTk3OTFlXkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UY67_CR1,0,45,67_AL_.jpg" width="45"/>
</a> </td>
<td class="titleColumn">
2.
<a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
<span class="secondaryInfo">(1972)</span>
</td>
<td class="ratingColumn imdbRating">
<strong title="9.1 based on 1,485,540 user ratings">9.1</strong>
</td>
<td class="ratingColumn">
<span class="seen-widget seen-widget-tt0068646 pending" data-titleid="tt0068646">
<span class="boundary">
<span class="popover">
<span class="delete"> </span><ol><li>1<li>2<li>3<li>4<li>5<li>6<li>7<li>8<li>9<li>10</li>0</li></li></li></li></li></li></li></li></li></ol> </span>
</span>
<span class="inline">
<span class="pending"></span>
<span class="unseeable">NOT YET RELEASED</span>
<span class="unseen"> </span>
<span class="rating"></span>
<span class="seen">Seen</span>
</span>
</span>
</td>
<td class="watchlistColumn">
<span class="wlb_ribbon" data-recordmetrics="true" data-tconst="tt0068646"></span>
</td>
</tr>
>>> a_string.find_parents('td')
[<td class="titleColumn">
2.
<a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
<span class="secondaryInfo">(1972)</span>
</td>]
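The same upward search can be tried offline. The sketch below uses a tiny table row invented for this illustration (the film title and href are placeholders, not IMDb data) and shows that find_parents() returns the nearest ancestor first, while find_parent() returns None when no ancestor matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration only.
html = (
    '<table><tr><td class="titleColumn">'
    '<a href="/title/demo/">Sample Film</a>'
    "</td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

# Start from a string, not a tag.
a_string = soup.find(string="Sample Film")

# Closest enclosing <a> tag.
link = a_string.find_parent("a")

# Every enclosing <td> or <tr>, nearest ancestor first.
ancestors = [tag.name for tag in a_string.find_parents(["td", "tr"])]

# No <div> encloses the string, so the result is None.
no_match = a_string.find_parent("div")

print(link["href"])
print(ancestors)
print(no_match)
```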
There are eight other similar methods −
find_next_siblings(name, attrs, string, limit, **kwargs)
find_next_sibling(name, attrs, string, **kwargs)
find_previous_siblings(name, attrs, string, limit, **kwargs)
find_previous_sibling(name, attrs, string, **kwargs)
find_all_next(name, attrs, string, limit, **kwargs)
find_next(name, attrs, string, **kwargs)
find_all_previous(name, attrs, string, limit, **kwargs)
find_previous(name, attrs, string, **kwargs)
Where,
find_next_siblings() and find_next_sibling() iterate over the siblings of an element that come after it in the tree.
find_previous_siblings() and find_previous_sibling() iterate over the siblings that come before the current element.
find_all_next() and find_next() iterate over all the tags and strings that come after the current element in the document.
find_all_previous() and find_previous() iterate over all the tags and strings that come before the current element in the document.
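A minimal sketch of these methods, assuming a short list invented for this illustration, shows the difference between the sibling searches (same parent only) and the next/previous searches (anything earlier or later in the document).

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration only.
html = "<ul><li>One</li><li>Two</li><li>Three</li></ul><p>After</p>"
soup = BeautifulSoup(html, "html.parser")

# Use the second <li> as the starting point.
second = soup.find_all("li")[1]

# Siblings share the same parent (<ul>) as the starting element.
later_siblings = [t.get_text() for t in second.find_next_siblings("li")]
earlier_sibling = second.find_previous_sibling("li").get_text()

# find_next() is not limited to siblings: <p> lies outside the <ul>.
next_paragraph = second.find_next("p").get_text()

# find_all_previous() walks backwards through the document,
# so it reaches the first <li> and then the enclosing <ul>.
earlier_tags = [t.name for t in second.find_all_previous(["li", "ul"])]

print(later_siblings)
print(earlier_sibling)
print(next_paragraph)
print(earlier_tags)
```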
CSS selectors
The Beautiful Soup library supports the most commonly used CSS selectors. You can search for elements using CSS selectors with the help of the select() method.
Here are some examples −
>>> soup.select("title")
[<title>IMDb Top 250 - IMDb</title>, <title>IMDb Top Rated Movies</title>]
>>> soup.select("p:nth-of-type(1)")
[<p>The Top Rated Movie list only includes theatrical features.</p>, <p class="imdb-footer__copyright _2-iNNCFskmr4l2OFN2DRsf">© 1990-2019 by IMDb.com, Inc.</p>]
>>> len(soup.select("p:nth-of-type(1)"))
2
>>> len(soup.select("a"))
609
>>> len(soup.select("p"))
2
>>> soup.select("html head title")
[<title>IMDb Top 250 - IMDb</title>, <title>IMDb Top Rated Movies</title>]
>>> soup.select("head > title")
[<title>IMDb Top 250 - IMDb</title>]
>>> # Print the HTML code of the tenth li element
>>> soup.select("li:nth-of-type(10)")
[<li class="subnav_item_main">
<a href="/search/title?genres=film_noir&sort=user_rating,desc&title_type=feature&num_votes=25000,">Film-Noir
</a> </li>]
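The selectors above depend on the live IMDb page, whose markup changes over time. As a self-contained sketch, the following applies select() and select_one() to a small document invented for this illustration; the ids and class names are assumptions chosen for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration only.
html = (
    "<html><head><title>Demo Page</title></head><body>"
    '<ul id="menu"><li class="item">Home</li>'
    '<li class="item active">Charts</li>'
    '<li class="item">About</li></ul>'
    "<p>First paragraph.</p><p>Second paragraph.</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Child combinator: <title> directly inside <head>.
titles = [t.get_text() for t in soup.select("head > title")]

# Descendant combinator with an id selector.
menu_items = [li.get_text() for li in soup.select("#menu li")]

# Class selector; select_one() returns the first match or None.
active = soup.select_one("li.active").get_text()

# Positional pseudo-class: the second <li> among its siblings.
second_item = [li.get_text() for li in soup.select("li:nth-of-type(2)")]

# The first <p> among its siblings.
first_paragraph = [p.get_text() for p in soup.select("p:nth-of-type(1)")]

print(titles, menu_items, active, second_item, first_paragraph)
```

Like find_all(), select() always returns a list, while select_one() plays the role that find() plays for the filter-based searches.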