Beautiful Soup - Navigating by Tags

In this chapter, we discuss how to navigate a parsed document by its tags.

Below is our HTML document:


>>> html_doc = """
<html><head><title>Tutorials Point</title></head>
<body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>
<p class="prog">Top 5 most used Programming Languages are:
<a href="https://www.tutorialspoint.com/java/java_overview.htm" class="prog" id="pnk1">Java</a>,
<a href="https://www.tutorialspoint.com/cprogramming/index.htm" class="prog" id="pnk2">C</a>,
<a href="https://www.tutorialspoint.com/python/index.htm" class="prog" id="pnk3">Python</a>,
<a href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" class="prog" id="pnk4">JavaScript</a> and
<a href="https://www.tutorialspoint.com/ruby/index.htm" class="prog" id="pnk5">C</a>;
as per online survey.</p>
<p class="prog">Programming Languages</p>
"""
>>>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')

Using the above document, we will move from one part of the parse tree to another.

Going down

Tags are among the most important elements of an HTML document. A tag may contain other tags and strings; these are the tag’s children. Beautiful Soup provides several ways to navigate and iterate over a tag’s children.

Navigating using tag names

The easiest way to navigate the parse tree is to use the name of the tag you want. If you want the <head> tag, use soup.head:


>>> soup.head
<head><title>Tutorials Point</title></head>
>>> soup.title
<title>Tutorials Point</title>

You can chain tag names to zoom in on a particular part of the tree. For example, to get the first <b> tag beneath the <body> tag:


>>> soup.body.b
<b>The Biggest Online Tutorials Library, It's all Free</b>

Using a tag name as an attribute will give you only the first tag by that name −


>>> soup.a
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="pnk1">Java</a>

To get all the tags with a given name, rather than only the first, use the find_all() method:


>>> soup.find_all("a")
[<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="pnk1">Java</a>, <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="pnk2">C</a>, <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="pnk3">Python</a>, <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="pnk4">JavaScript</a>, <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="pnk5">C</a>]
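The difference between attribute access and find_all() can be seen side by side. The following is a minimal sketch on a small made-up document (the variable doc and its ids are assumptions for illustration, not part of html_doc). It also shows that asking for a tag that does not exist returns None rather than raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link document, used only for this illustration.
doc = '<p><a id="first">Java</a> and <a id="second">Python</a></p>'
soup = BeautifulSoup(doc, "html.parser")

first_link = soup.a             # attribute access: first <a> only
same_link = soup.find("a")      # find() returns that same first <a>
all_links = soup.find_all("a")  # find_all(): a list of every <a>
missing = soup.table            # no <table> in the document

print(first_link is same_link)
print([a["id"] for a in all_links])
print(missing)
```

Because a missing tag yields None, chained access such as soup.table.tr fails with an AttributeError, so it is worth checking for None before going deeper.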

.contents and .children

A tag’s children are available in a list called .contents:


>>> Htag = soup.head
>>> Htag
<head><title>Tutorials Point</title></head>
>>>
>>> Htag.contents
[<title>Tutorials Point</title>]
>>>
>>> Ttag = Htag.contents[0]
>>> Ttag
<title>Tutorials Point</title>
>>> Ttag.contents
['Tutorials Point']

The BeautifulSoup object itself has children. In this case it has two: the newline string that opens html_doc, followed by the <html> tag:


>>> len(soup.contents)
2
>>> soup.contents[1].name
'html'

A string does not have .contents, because it can’t contain anything −


>>> text = Ttag.contents[0]
>>> text.contents
Traceback (most recent call last):
  ...
AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting the children as a list, you can iterate over them with the .children generator:


>>> for child in Ttag.children:
...     print(child)
...
Tutorials Point
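Both .contents and .children include whitespace strings that sit between tags, which is easy to overlook. The sketch below, on a hypothetical <ol> snippet that is not part of html_doc, keeps only the children that are tags by testing each one with isinstance.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical list with newlines between the items (illustration only).
doc = "<ol>\n<li>Java</li>\n<li>Python</li>\n</ol>"
soup = BeautifulSoup(doc, "html.parser")

# Every direct child, including the newline strings between the tags.
all_children = list(soup.ol.children)

# Only the direct children that are tags.
tag_children = [child for child in soup.ol.children if isinstance(child, Tag)]

print(len(all_children))
print([child.name for child in tag_children])
```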

.descendants

The .descendants attribute lets you iterate over all of a tag’s children recursively: its direct children, the children of its direct children, and so on:


>>> for child in Htag.descendants:
...     print(child)
...
<title>Tutorials Point</title>
Tutorials Point

The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag’s child. The BeautifulSoup object here has only two direct children (the newline string that opens html_doc and the <html> tag), but it has many more descendants:


>>> len(list(soup.children))
2
>>> len(list(soup.descendants))
33
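To make the gap between children and descendants concrete, the following sketch separates a tag’s descendants into tags and strings. The snippet in doc is a made-up example, assumed only for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical nested snippet (illustration only).
doc = "<div><p>Hello <b>world</b></p></div>"
soup = BeautifulSoup(doc, "html.parser")
div = soup.div

children = list(div.children)        # direct children only: the <p> tag
descendants = list(div.descendants)  # everything nested below <div>

# Split the descendants by type, in document order.
tag_names = [d.name for d in descendants if isinstance(d, Tag)]
strings = [str(d) for d in descendants if isinstance(d, NavigableString)]

print(len(children), len(descendants))
print(tag_names)
print(strings)
```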

.string

If the tag has only one child, and that child is a NavigableString, the child is made available as .string −


>>> Ttag.string
'Tutorials Point'

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child −


>>> Htag.contents
[<title>Tutorials Point</title>]
>>>
>>> Htag.string
'Tutorials Point'

However, if a tag contains more than one thing, it is not clear what .string should refer to, so .string is defined to be None:


>>> print(soup.html.string)
None
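When .string is None because a tag has several children, get_text() is one way to still obtain the combined text. The sketch below uses a hypothetical paragraph (not taken from html_doc) to contrast the two.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph mixing strings and tags (illustration only).
doc = "<p>Top languages: <a>Java</a> and <a>Python</a></p>"
soup = BeautifulSoup(doc, "html.parser")

paragraph = soup.p
single = soup.a.string            # <a> has one string child
ambiguous = paragraph.string      # <p> has several children, so None
combined = paragraph.get_text()   # concatenates every string below <p>

print(repr(single))
print(ambiguous)
print(repr(combined))
```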

.strings and .stripped_strings

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator −


>>> for string in soup.strings:
...     print(repr(string))
...
'\n'
'Tutorials Point'
'\n'
'\n'
"The Biggest Online Tutorials Library, It's all Free"
'\n'
'Top 5 most used Programming Languages are:\n'
'Java'
',\n'
'C'
',\n'
'Python'
',\n'
'JavaScript'
' and\n'
'C'
';\nas per online survey.'
'\n'
'Programming Languages'
'\n'

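To remove extra whitespace, use the .stripped_strings generator. Strings that consist entirely of whitespace are skipped, and whitespace at the beginning and end of each remaining string is removed:


>>> for string in soup.stripped_strings:
...     print(repr(string))
...
'Tutorials Point'
"The Biggest Online Tutorials Library, It's all Free"
'Top 5 most used Programming Languages are:'
'Java'
','
'C'
','
'Python'
','
'JavaScript'
'and'
'C'
';\nas per online survey.'
'Programming Languages'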
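A common use of .stripped_strings is to rebuild readable text from a document. The sketch below, on a made-up snippet, joins the pieces with a space. Note that only leading and trailing whitespace is removed from each string, so the run of spaces inside the paragraph survives. Calling get_text() with a separator and strip=True should give the same result.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with stray whitespace (illustration only).
doc = "<div>\n  <h1> Title </h1>\n  <p>First   paragraph.</p>\n</div>"
soup = BeautifulSoup(doc, "html.parser")

# Whitespace-only strings are dropped; the rest are trimmed at both ends.
pieces = list(soup.stripped_strings)

# Join the trimmed pieces into a single line of text.
text = " ".join(pieces)

print(pieces)
print(text)
```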

Going up

In the “family tree” analogy, every tag and every string has a parent: the tag that contains it.

.parent

To access an element’s parent, use the .parent attribute.


>>> Ttag = soup.title
>>> Ttag
<title>Tutorials Point</title>
>>> Ttag.parent
<head><title>Tutorials Point</title></head>

In our html_doc, the title string itself has a parent: the <title> tag that contains it:


>>> Ttag.string.parent
<title>Tutorials Point</title>

The parent of a top-level tag like <html> is the BeautifulSoup object itself:


>>> htmltag = soup.html
>>> type(htmltag.parent)
<class 'bs4.BeautifulSoup'>

The .parent of a BeautifulSoup object is defined as None:


>>> print(soup.parent)
None

.parents

To iterate over all of an element’s parents, from the nearest one up to the BeautifulSoup object, use the .parents attribute.


>>> link = soup.a
>>> link
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="pnk1">Java</a>
>>>
>>> for parent in link.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
body
html
[document]
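The .parents generator is handy for reporting where an element sits in the tree. The sketch below collects the names of all ancestors of a link and, as an alternative to looping, uses find_parent() to jump to the nearest ancestor with a given name. The document in doc is a made-up example, assumed only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical nested document (illustration only).
doc = "<html><body><div><p><a>link</a></p></div></body></html>"
soup = BeautifulSoup(doc, "html.parser")
link = soup.a

# Names of every ancestor, nearest first; the last entry is the
# BeautifulSoup object itself, whose name is "[document]".
path = [parent.name for parent in link.parents]

# Nearest enclosing <div>, without writing the loop by hand.
nearest_div = link.find_parent("div")

print(" < ".join(path))
print(nearest_div.name)
```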

Going sideways

Below is one simple document −


>>> sibling_soup = BeautifulSoup("<a><b>TutorialsPoint</b><c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c></a>", 'html.parser')
>>> print(sibling_soup.prettify())
<a>
 <b>
  TutorialsPoint
 </b>
 <c>
  <strong>
   The Biggest Online Tutorials Library, It's all Free
  </strong>
 </c>
</a>

In the above document, the <b> and <c> tags are at the same level: both are direct children of the same <a> tag. The <b> and <c> tags are therefore siblings.

.next_sibling and .previous_sibling

Use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:


>>> sibling_soup.b.next_sibling
<c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c>
>>>
>>> sibling_soup.c.previous_sibling
<b>TutorialsPoint</b>

The <b> tag has a .next_sibling but no .previous_sibling, because nothing comes before the <b> tag on the same level of the tree. For the same reason, the <c> tag has a .previous_sibling but no .next_sibling.


>>> print(sibling_soup.b.previous_sibling)
None
>>> print(sibling_soup.c.next_sibling)
None

The two strings are not siblings, as they don’t have the same parent.


>>> sibling_soup.b.string
'TutorialsPoint'
>>>
>>> print(sibling_soup.b.string.next_sibling)
None
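In real documents, tags are usually separated by whitespace, so the .next_sibling of a tag is often a string rather than the next tag. The sketch below shows this on a hypothetical list (not part of the documents above) and uses find_next_sibling() to skip directly to the next tag with a given name.

```python
from bs4 import BeautifulSoup

# Hypothetical list with a newline between the items (illustration only).
doc = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
soup = BeautifulSoup(doc, "html.parser")
first_item = soup.li

# The immediate sibling is the newline string between the two <li> tags.
raw_sibling = first_item.next_sibling

# find_next_sibling() skips strings and returns the next matching tag.
next_item = first_item.find_next_sibling("li")

print(repr(raw_sibling))
print(next_item)
```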

.next_siblings and .previous_siblings

To iterate over a tag’s siblings, use .next_siblings and .previous_siblings. Note that .previous_siblings yields the nearest sibling first, so the results come in reverse document order.


>>> for sibling in soup.a.next_siblings:
...     print(repr(sibling))
...
',\n'
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="pnk2">C</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="pnk3">Python</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="pnk4">JavaScript</a>
' and\n'
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="pnk5">C</a>
';\nas per online survey.'
>>> for sibling in soup.find(id="pnk3").previous_siblings:
...     print(repr(sibling))
...
',\n'
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="pnk2">C</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="pnk1">Java</a>
'Top 5 most used Programming Languages are:\n'
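Since the sibling generators yield strings as well as tags, it is often useful to filter them. The following sketch keeps only the tag siblings on either side of a link. The document and its ids (l1, l2, l3) are hypothetical and used only for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraph with three links (illustration only).
doc = '<p>Langs: <a id="l1">Java</a>, <a id="l2">C</a>, <a id="l3">Python</a>.</p>'
soup = BeautifulSoup(doc, "html.parser")
middle = soup.find(id="l2")

# All following siblings: the ", " string, the third link, and ".".
everything_after = list(middle.next_siblings)

# Only the siblings that are tags, on each side of the middle link.
tags_after = [s for s in middle.next_siblings if isinstance(s, Tag)]
tags_before = [s for s in middle.previous_siblings if isinstance(s, Tag)]

print(len(everything_after))
print([t["id"] for t in tags_after])
print([t["id"] for t in tags_before])
```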

Going back and forth

Now let us return to the first few lines of our previous “html_doc” example:


<html><head><title>Tutorials Point</title></head>
<body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>

An HTML parser takes this string of characters and turns it into a series of events like “open an <html> tag”, “open a <head> tag”, “open a <title> tag”, “add a string”, “close the </title> tag”, “close the </head> tag”, “open a <p> tag” and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.

.next_element and .previous_element

The .next_element attribute of a tag or string points to whatever was parsed immediately afterwards. It can look similar to .next_sibling, but it is usually quite different. Below is the final <a> tag in our “html_doc” example document, together with its .next_sibling:


>>> last_a_tag = soup.find("a", id="pnk5")
>>> last_a_tag
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="pnk5">C</a>
>>> last_a_tag.next_sibling
';\nas per online survey.'

However, the .next_element of that <a> tag, the thing that was parsed immediately after the opening <a> tag, is not the rest of that sentence: it is the letter “C”:


>>> last_a_tag.next_element
'C'

This is because, in the original markup, the letter “C” appeared before the semicolon. The parser encountered an <a> tag, then the letter “C”, then the closing </a> tag, then the semicolon and the rest of the sentence. The semicolon is on the same level as the <a> tag, but the letter “C” was encountered first.

The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one.


>>> last_a_tag.previous_element
' and\n'
>>>
>>> last_a_tag.previous_element.next_element
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="pnk5">C</a>

.next_elements and .previous_elements

Use these iterators to move forward or backward through the document in the order in which it was parsed:


>>> for element in last_a_tag.next_elements:
...     print(repr(element))
...
'C'
';\nas per online survey.'
'\n'
<p class="prog">Programming Languages</p>
'Programming Languages'
'\n'
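The contrast between sibling navigation and parse-order navigation can be checked on a smaller document. The sketch below uses a hypothetical two-paragraph snippet (not html_doc): .next_sibling stays on the same level of the tree, while .next_element and .next_elements first descend into the tag and then continue through the rest of the document.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical two-paragraph document (illustration only).
doc = "<p><a>C</a>; the end</p><p>Next</p>"
soup = BeautifulSoup(doc, "html.parser")
a_tag = soup.a

same_level = a_tag.next_sibling    # the string that follows </a>
parsed_next = a_tag.next_element   # the string inside <a>, parsed first

# Everything parsed after the opening <a>: show tags by name, strings as text.
order = [el.name if isinstance(el, Tag) else str(el)
         for el in a_tag.next_elements]

print(repr(same_level))
print(repr(parsed_next))
print(order)
```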