Beautiful Soup - Navigating by Tags
In this chapter, we shall discuss how to navigate a document by its tags.
Below is our html document −
>>> html_doc = """
<html><head><title>Tutorials Point</title></head>
<body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>
<p class="prog">Top 5 most used Programming Languages are:
<a href="https://www.tutorialspoint.com/java/java_overview.htm" class="prog" id="pnk1">Java</a>,
<a href="https://www.tutorialspoint.com/cprogramming/index.htm" class="prog" id="pnk2">C</a>,
<a href="https://www.tutorialspoint.com/python/index.htm" class="prog" id="pnk3">Python</a>,
<a href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" class="prog" id="pnk4">JavaScript</a> and
<a href="https://www.tutorialspoint.com/ruby/index.htm" class="prog" id="pnk5">C</a>; as per online survey.</p>
<p class="prog">Programming Languages</p>
"""
>>>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>>
Based on the above document, we will try to move from one part of the document to another.
Going down
Tags are among the most important elements of an HTML document. A tag may contain other tags or strings (the tag’s children). Beautiful Soup provides several ways to navigate and iterate over a tag’s children.
Navigating using tag names
The easiest way to navigate the parse tree is to refer to a tag by its name. If you want the <head> tag, use soup.head −
>>> soup.head
<head><title>Tutorials Point</title></head>
>>> soup.title
<title>Tutorials Point</title>
To get a specific tag nested inside another (like the first <b> tag inside the <body> tag), chain the tag names −
>>> soup.body.b
<b>The Biggest Online Tutorials Library, It's all Free</b>
Using a tag name as an attribute will give you only the first tag by that name −
>>> soup.a
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="pnk1">Java</a>
To get all the tags with that name, use the find_all() method −
>>> soup.find_all("a")
[<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="pnk1">Java</a>,
 <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="pnk2">C</a>,
 <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="pnk3">Python</a>,
 <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="pnk4">JavaScript</a>,
 <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="pnk5">C</a>]
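The difference between attribute access and find_all() can be checked with a small script. This is only a sketch: the two-link document and the names doc, first_link and all_links are made up for the illustration and are not part of the chapter's html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link document, used only for this illustration.
doc = '<p>Languages: <a id="first">Java</a>, <a id="second">C</a></p>'
soup = BeautifulSoup(doc, "html.parser")

first_link = soup.a               # attribute access: only the first <a>
all_links = soup.find_all("a")    # find_all(): every <a>, in document order

print(first_link["id"])                      # first
print([a.get_text() for a in all_links])     # ['Java', 'C']
print(first_link is all_links[0])            # True

# A tag name that does not occur in the document gives None.
print(soup.table)                            # None
```

Because a missing tag yields None, chained access such as soup.table.tr fails with an AttributeError, so a None check before chaining is usually worthwhile.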
.contents and .children
A tag’s children are available in a list called .contents −
>>> Htag = soup.head
>>> Htag
<head><title>Tutorials Point</title></head>
>>>
>>> Htag.contents
[<title>Tutorials Point</title>]
>>>
>>> Ttag = Htag.contents[0]
>>> Ttag
<title>Tutorials Point</title>
>>> Ttag.contents
['Tutorials Point']
The BeautifulSoup object itself has children. In this case, the <html> tag is a child of the BeautifulSoup object; the other child is the whitespace string that precedes it −
>>> len(soup.contents)
2
>>> soup.contents[1].name
'html'
A string does not have .contents, because it can’t contain anything −
>>> text = Ttag.contents[0]
>>> text.contents
Traceback (most recent call last):
  ...
AttributeError: 'NavigableString' object has no attribute 'contents'
Instead of getting them as a list, use the .children generator to iterate over a tag’s children −
>>> for child in Ttag.children:
...     print(child)
...
Tutorials Point
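As a rough side-by-side of the two attributes, the following sketch uses a made-up three-item list (the markup and the names items and names are assumptions for the example only). It shows .contents being indexed like any list and .children being consumed in a loop.

```python
from bs4 import BeautifulSoup

# Hypothetical list markup, used only for this illustration.
soup = BeautifulSoup("<ul><li>Java</li><li>C</li><li>Python</li></ul>", "html.parser")
items = soup.ul

# .contents is a plain list, so it supports len() and indexing.
print(len(items.contents))    # 3
print(items.contents[1])      # <li>C</li>

# .children yields the same direct children one at a time.
names = [child.get_text() for child in items.children]
print(names)                  # ['Java', 'C', 'Python']
```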
.descendants
The .descendants attribute allows you to iterate over all of a tag’s children recursively: its direct children, the children of its direct children, and so on −
>>> for child in Htag.descendants:
...     print(child)
...
<title>Tutorials Point</title>
Tutorials Point
The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag’s child. The BeautifulSoup object has only two direct children (the <html> tag and the whitespace string before it), but it has a whole lot of descendants −
>>> len(list(soup.children))
2
>>> len(list(soup.descendants))
33
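The gap between children and descendants is easy to see on a tiny tree. The sketch below assumes a made-up <div> with two paragraphs (not the chapter's html_doc) and also shows one common way to keep only the tags among the descendants.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment, used only for this illustration.
soup = BeautifulSoup("<div><p>One <b>bold</b></p><p>Two</p></div>", "html.parser")
div = soup.div

# Direct children: the two <p> tags.
print(len(list(div.children)))       # 2

# Descendants: <p>, 'One ', <b>, 'bold', <p>, 'Two'.
print(len(list(div.descendants)))    # 6

# Strings are descendants too; filter on Tag to keep only the tags.
tag_names = [d.name for d in div.descendants if isinstance(d, Tag)]
print(tag_names)                     # ['p', 'b', 'p']
```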
.string
If the tag has only one child, and that child is a NavigableString, the child is made available as .string −
>>> Ttag.string
'Tutorials Point'
If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child −
>>> Htag.contents
[<title>Tutorials Point</title>]
>>>
>>> Htag.string
'Tutorials Point'
However, if a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None −
>>> print(soup.html.string)
None
.strings and .stripped_strings
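The three .string cases described above can be reproduced in a few lines. This is a sketch on a made-up fragment; the name paragraphs is hypothetical, and get_text() is shown only as one way to collect text when .string is None.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for this illustration.
soup = BeautifulSoup("<div><p><b>only</b></p><p>one <i>two</i></p></div>", "html.parser")
paragraphs = soup.find_all("p")

# Case 1: the single child is a string.
print(paragraphs[0].b.string)      # only

# Case 2: the single child is a tag that itself has a .string.
print(paragraphs[0].string)        # only

# Case 3: more than one child, so .string is None.
print(paragraphs[1].string)        # None

# get_text() still joins all the text inside the tag.
print(paragraphs[1].get_text())    # one two
```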
If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator −
>>> for string in soup.strings:
...     print(repr(string))
...
'\n'
'Tutorials Point'
'\n'
'\n'
"The Biggest Online Tutorials Library, It's all Free"
'\n'
'Top 5 most used Programming Languages are:\n'
'Java'
',\n'
'C'
',\n'
'Python'
',\n'
'JavaScript'
' and\n'
'C'
'; as per online survey.'
'\n'
'Programming Languages'
'\n'
To remove extra whitespace, use the .stripped_strings generator −
>>> for string in soup.stripped_strings:
...     print(repr(string))
...
'Tutorials Point'
"The Biggest Online Tutorials Library, It's all Free"
'Top 5 most used Programming Languages are:'
'Java'
','
'C'
','
'Python'
','
'JavaScript'
'and'
'C'
'; as per online survey.'
'Programming Languages'
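To see what .stripped_strings drops, the sketch below runs both generators over a made-up paragraph with deliberately messy whitespace (the markup and the names raw and clean are assumptions for the example).

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with extra whitespace, used only for this illustration.
soup = BeautifulSoup("<p>\n  Java,\n  <b> C </b>\n</p>", "html.parser")

raw = list(soup.p.strings)             # every string, whitespace included
clean = list(soup.p.stripped_strings)  # trimmed; whitespace-only strings skipped

print(len(raw))    # 3
print(clean)       # ['Java,', 'C']
```

Strings made up entirely of whitespace are ignored by .stripped_strings, and the remaining strings have their leading and trailing whitespace removed.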
Going up
In a “family tree” analogy, every tag and every string has a parent: the tag that contains it.
.parent
To access an element’s parent, use the .parent attribute −
>>> Ttag = soup.title
>>> Ttag
<title>Tutorials Point</title>
>>> Ttag.parent
<head><title>Tutorials Point</title></head>
In our html_doc, the title string itself has a parent: the <title> tag that contains it −
>>> Ttag.string.parent
<title>Tutorials Point</title>
The parent of a top-level tag like <html> is the BeautifulSoup object itself −
>>> htmltag = soup.html
>>> type(htmltag.parent)
<class 'bs4.BeautifulSoup'>
The .parent of a BeautifulSoup object is defined as None −
>>> print(soup.parent)
None
.parents
To iterate over all of an element’s parents, use the .parents attribute −
>>> link = soup.a
>>> link
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="pnk1">Java</a>
>>>
>>> for parent in link.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
body
html
[document]
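The upward walk can also be collected into a list, which makes the chain of ancestors easy to inspect. The sketch below uses a made-up one-link document; the names link and ancestors are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for this illustration.
soup = BeautifulSoup("<html><body><p><a>Java</a></p></body></html>", "html.parser")
link = soup.a

# .parent is the immediate container of the element.
print(link.parent.name)    # p

# .parents walks upward until it reaches the BeautifulSoup object,
# whose name is the special value '[document]'.
ancestors = [parent.name for parent in link.parents]
print(ancestors)           # ['p', 'body', 'html', '[document]']

# The BeautifulSoup object itself has no parent.
print(soup.parent)         # None
```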
Going sideways
Below is one simple document −
>>> sibpng_soup = BeautifulSoup("<a><b>TutorialsPoint</b><c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c></a>", "lxml")
>>> print(sibpng_soup.prettify())
<html>
 <body>
  <a>
   <b>
    TutorialsPoint
   </b>
   <c>
    <strong>
     The Biggest Online Tutorials Library, It's all Free
    </strong>
   </c>
  </a>
 </body>
</html>
In the above document, the <b> and <c> tags are at the same level, and both are children of the same <a> tag. The <b> and <c> tags are therefore siblings.
.next_sibling and .previous_sibling
Use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:
>>> sibpng_soup.b.next_sibling
<c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c>
>>>
>>> sibpng_soup.c.previous_sibling
<b>TutorialsPoint</b>
The <b> tag has a .next_sibling but no .previous_sibling, as there is nothing before the <b> tag on the same level of the tree. Likewise, the <c> tag has a .previous_sibling but no .next_sibling.
>>> print(sibpng_soup.b.previous_sibling)
None
>>> print(sibpng_soup.c.next_sibling)
None
The two strings are not siblings, as they don’t have the same parent.
>>> sibpng_soup.b.string
'TutorialsPoint'
>>>
>>> print(sibpng_soup.b.string.next_sibling)
None
.next_siblings and .previous_siblings
To iterate over a tag’s siblings, use .next_siblings and .previous_siblings.
>>> for sibling in soup.a.next_siblings:
...     print(repr(sibling))
...
',\n'
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="pnk2">C</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="pnk3">Python</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="pnk4">JavaScript</a>
' and\n'
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="pnk5">C</a>
'; as per online survey.'
>>> for sibling in soup.find(id="pnk3").previous_siblings:
...     print(repr(sibling))
...
',\n'
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="pnk2">C</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="pnk1">Java</a>
'Top 5 most used Programming Languages are:\n'
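The sibling rules above can be verified with a short script. This is a sketch on made-up markup (the names soup and spaced are hypothetical). The second half shows a related point worth knowing: when tags are separated by whitespace, the immediate sibling is that whitespace string, and find_next_sibling() can be used to skip to the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two sibling tags, used only for this illustration.
soup = BeautifulSoup("<a><b>one</b><c>two</c></a>", "html.parser")

print(soup.b.next_sibling)            # <c>two</c>
print(soup.c.previous_sibling)        # <b>one</b>
print(soup.b.previous_sibling)        # None
print(soup.b.string.next_sibling)     # None: the string is the only child of <b>

# With whitespace between tags, the next sibling is the whitespace string.
spaced = BeautifulSoup("<p>x</p>\n<p>y</p>", "html.parser")
print(repr(spaced.p.next_sibling))            # '\n'
print(spaced.p.find_next_sibling("p"))        # <p>y</p>
```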
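Since the sibling iterators return strings as well as tags, a common follow-up is to keep only the tags. The sketch below assumes a made-up paragraph with three links; the ids l1 to l3 and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical paragraph, used only for this illustration.
doc = '<p>Top: <a id="l1">Java</a>, <a id="l2">C</a> and <a id="l3">Python</a>.</p>'
soup = BeautifulSoup(doc, "html.parser")

# Everything after the first link on the same level: strings and tags alike.
following = list(soup.a.next_siblings)
print(len(following))     # 5

# Keep only the tags among the following siblings.
later_ids = [s["id"] for s in following if isinstance(s, Tag)]
print(later_ids)          # ['l2', 'l3']

# .previous_siblings walks backward, nearest sibling first.
earlier_ids = [s["id"] for s in soup.find(id="l3").previous_siblings
               if isinstance(s, Tag)]
print(earlier_ids)        # ['l2', 'l1']
```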
Going back and forth
Now let us get back to the first few lines of our previous “html_doc” example −
<html><head><title>Tutorials Point</title></head>
<body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>
An HTML parser takes the above string of characters and turns it into a series of events like “open an <html> tag”, “open a <head> tag”, “open a <title> tag”, “add a string”, “close the </title> tag”, “close the </head> tag”, “open a <body> tag”, “open a <p> tag” and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.
.next_element and .previous_element
The .next_element attribute of a tag or string points to whatever was parsed immediately afterwards. Sometimes it gives the same result as .next_sibling, but the two are not the same thing. Below is the final <a> tag in our “html_doc” example document.
>>> last_a_tag = soup.find("a", id="pnk5")
>>> last_a_tag
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="pnk5">C</a>
>>> last_a_tag.next_sibling
'; as per online survey.'
However the .next_element of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of that sentence: it is the word “C”:
>>> last_a_tag.next_element
'C'
This behavior occurs because, in the original markup, the letter “C” appeared before that semicolon. The parser encountered an <a> tag, then the letter “C”, then the closing </a> tag, then the semicolon and the rest of the sentence. The semicolon is on the same level as the <a> tag, but the letter “C” was encountered first.
The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one.
>>> last_a_tag.previous_element
' and\n'
>>>
>>> last_a_tag.previous_element.next_element
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="pnk5">C</a>
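The contrast between parse order and tree level can be reproduced on a single made-up sentence. In this sketch the markup and the name link are assumptions for the example; the point is that .next_element steps into the tag while .next_sibling steps over it.

```python
from bs4 import BeautifulSoup

# Hypothetical sentence with one link, used only for this illustration.
soup = BeautifulSoup("<p><a>C</a>; as per survey.</p>", "html.parser")
link = soup.a

# Same level of the tree: the text that follows the closing </a>.
print(repr(link.next_sibling))                 # '; as per survey.'

# Parse order: the string inside the <a> tag was parsed right after the tag.
print(repr(link.next_element))                 # 'C'

# One more step in parse order reaches the sibling text.
print(repr(link.next_element.next_element))    # '; as per survey.'

# Going backward, the <p> tag was parsed immediately before the <a> tag.
print(link.previous_element.name)              # p
```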
.next_elements and .previous_elements
Use these iterators to move forward or backward through the document in the order in which it was parsed.
>>> for element in last_a_tag.next_elements:
...     print(repr(element))
...
'C'
'; as per online survey.'
'\n'
<p class="prog">Programming Languages</p>
'Programming Languages'
'\n'
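A compact way to watch the parse order is to label each element as it goes by. The sketch below uses a made-up two-paragraph fragment; the helper label() and the other names are hypothetical and exist only for this example.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment, used only for this illustration.
soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
first_p, last_p = soup.find_all("p")

def label(element):
    """Describe a tag by its name in angle brackets and a string by its text."""
    return f"<{element.name}>" if isinstance(element, Tag) else str(element)

# Everything parsed after the first <p>: its text, the second <p>, that <p>'s text.
forward = [label(e) for e in first_p.next_elements]
print(forward)    # ['one', '<p>', 'two']

# Stepping backward from the second <p>, the first item is the string 'one'.
nearest_before = next(iter(last_p.previous_elements))
print(repr(nearest_before))    # 'one'
```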