Beautiful Soup - Kinds of objects
  • Date: 2024-12-22




When we pass an HTML document or string to the BeautifulSoup constructor, Beautiful Soup converts the complex HTML page into a tree of Python objects. Below we discuss the four major kinds of objects:

    Tag

    NavigableString

    BeautifulSoup

    Comments
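
As a quick orientation, the sketch below parses a small piece of markup and prints the class name of each of the four kinds of objects. The sample markup is made up for illustration, and the built-in "html.parser" is an assumption chosen so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical sample markup: one tag holding a text node and a comment.
markup = '<p class="intro">Hello<!-- a note --></p>'
soup = BeautifulSoup(markup, "html.parser")

p = soup.p               # a Tag
text, note = p.contents  # a NavigableString and a Comment

kinds = [type(obj).__name__ for obj in (soup, p, text, note)]
print(kinds)  # ['BeautifulSoup', 'Tag', 'NavigableString', 'Comment']
```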

Tag Objects

An HTML tag is used to define various types of content. A Tag object in Beautiful Soup corresponds to an HTML or XML tag in the actual page or document.


>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
>>> tag = soup.html
>>> type(tag)
<class 'bs4.element.Tag'>

Tags have many attributes and methods; the two most important features of a tag are its name and its attributes.

Name (tag.name)

Every tag has a name, which can be accessed through the '.name' suffix. tag.name returns the type of the tag.


>>> tag.name
'html'

If we change the tag's name, the change is reflected in the HTML markup generated by Beautiful Soup.


>>> tag.name = "Strong"
>>> tag
<Strong><body><b class="boldest">TutorialsPoint</b></body></Strong>
>>> tag.name
'Strong'
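
The renaming behaviour can also be checked outside the interactive prompt. The following is a minimal sketch that assumes the built-in "html.parser" (which does not wrap the fragment in html and body tags) and renames the b tag itself rather than the html tag.

```python
from bs4 import BeautifulSoup

# "html.parser" is an assumption; it leaves the fragment unwrapped.
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', "html.parser")

tag = soup.b
original_name = tag.name  # 'b'

# Renaming the tag changes the markup that Beautiful Soup generates.
tag.name = "strong"
renamed_markup = str(soup)

print(original_name)
print(renamed_markup)  # <strong class="boldest">TutorialsPoint</strong>
```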

Attributes (tag.attrs)

A tag object can have any number of attributes. The tag <b class="boldest"> has an attribute 'class' whose value is "boldest". Each attribute has a name and a value. You can access attributes either by key, treating the tag like a dictionary (like accessing "class" in the above example), or directly as a dictionary through ".attrs".


>>> tutorialsP = BeautifulSoup("<span class='tutorialsP'></span>", 'lxml')
>>> tag2 = tutorialsP.span
>>> tag2['class']
['tutorialsP']

We can add, remove, and modify a tag's attributes.


>>> tag2['class'] = 'Online-Learning'
>>> tag2['style'] = '2007'
>>>
>>> tag2
<span class="Online-Learning" style="2007"></span>
>>> del tag2['style']
>>> tag2
<span class="Online-Learning"></span>
>>> b_tag = soup.b
>>> b_tag['SecondAttribute'] = '2'
>>> del b_tag['class']
>>> b_tag
<b SecondAttribute="2">TutorialsPoint</b>
>>>
>>> del b_tag['SecondAttribute']
>>> b_tag
<b>TutorialsPoint</b>
>>> tag2['class']
'Online-Learning'
>>> tag2['style']
KeyError: 'style'
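
Because indexing a missing attribute raises KeyError, it can be convenient to look attributes up defensively. The sketch below shows one way to do that with ".attrs", ".get()" and ".has_attr()"; the markup and the "id" value are hypothetical, and "html.parser" is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two attributes on one tag.
soup = BeautifulSoup('<span class="tutorialsP" id="s1"></span>', "html.parser")
span = soup.span

# .attrs exposes all attributes as a dictionary.
attrs_snapshot = dict(span.attrs)

# .get() returns None (or a supplied default) instead of raising KeyError.
missing_style = span.get("style")
fallback_style = span.get("style", "none")

# .has_attr() tests for presence without reading the value.
has_id = span.has_attr("id")

print(attrs_snapshot, missing_style, fallback_style, has_id)
```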

Multi-valued attributes

Some HTML attributes can have multiple values. The most common is the class attribute, which can hold multiple CSS classes. Others include 'rel', 'rev', 'headers', 'accesskey' and 'accept-charset'. Beautiful Soup presents the value of a multi-valued attribute as a list.


>>> from bs4 import BeautifulSoup
>>>
>>> css_soup = BeautifulSoup('<p class="body"></p>')
>>> css_soup.p['class']
['body']
>>>
>>> css_soup = BeautifulSoup('<p class="body bold"></p>')
>>> css_soup.p['class']
['body', 'bold']

However, if an attribute looks like it has more than one value but is not defined as multi-valued by any version of the HTML standard, Beautiful Soup leaves the attribute alone:


>>> id_soup = BeautifulSoup('<p id="body bold"></p>')
>>> id_soup.p['id']
'body bold'
>>> type(id_soup.p['id'])
<class 'str'>

Multiple attribute values are consolidated into a single space-separated value when you turn a tag into a string.


>>> rel_soup = BeautifulSoup("<p> tutorialspoint Main <a rel='Index'> Page</a></p>")
>>> rel_soup.a['rel']
['Index']
>>> rel_soup.a['rel'] = ['Index', 'Online Library, Its all Free']
>>> print(rel_soup.p)
<p> tutorialspoint Main <a rel="Index Online Library, Its all Free"> Page</a></p>
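
Since a multi-valued attribute is an ordinary Python list, it can also be edited in place before the tag is serialized. A minimal sketch, assuming "html.parser" and made-up class names:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body"></p>', "html.parser")
p = soup.p

# The class value is a list, so list methods apply.
p["class"].append("bold")
classes = list(p["class"])

# On output, the list is joined back into one space-separated value.
serialized = str(p)

print(classes)
print(serialized)  # <p class="body bold"></p>
```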

By using 'get_attribute_list', you get a value that is always a list, irrespective of whether the attribute is multi-valued or not.


>>> id_soup.p.get_attribute_list('id')
['body bold']
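
To illustrate why a uniform return type is useful, the sketch below reads a multi-valued attribute and a single-valued one through the same call. The markup is hypothetical and "html.parser" is an assumption.

```python
from bs4 import BeautifulSoup

# One tag carrying both a multi-valued (class) and a single-valued (id) attribute.
soup = BeautifulSoup('<p id="body bold" class="body bold"></p>', "html.parser")
p = soup.p

class_values = list(p.get_attribute_list("class"))  # split on whitespace
id_values = list(p.get_attribute_list("id"))        # kept as one string

# Both results are lists, so the same loop handles either attribute.
for name, values in (("class", class_values), ("id", id_values)):
    print(name, len(values), values)
```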

However, if you parse the document as 'xml', there are no multi-valued attributes:


>>> xml_soup = BeautifulSoup('<p class="body bold"></p>', 'xml')
>>> xml_soup.p['class']
'body bold'
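
The same single-string behaviour can be requested from an HTML parser as well. The sketch below assumes a Beautiful Soup release that accepts the "multi_valued_attributes" constructor argument (documented for version 4.8 and later) and uses "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

markup = '<p class="body bold"></p>'

# Default HTML behaviour: class is split into a list.
as_list = list(BeautifulSoup(markup, "html.parser").p["class"])

# Assumption: bs4 >= 4.8, where multi_valued_attributes=None disables splitting.
as_string = BeautifulSoup(
    markup, "html.parser", multi_valued_attributes=None
).p["class"]

print(as_list)
print(as_string)
```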

NavigableString

A NavigableString object represents the text contents of a tag. To access it, use ".string" on the tag.


>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>")
>>>
>>> soup.string
'Hello, Tutorialspoint!'
>>> type(soup.string)
<class 'bs4.element.NavigableString'>

You can't edit a string in place, but you can replace it with another string.


>>> soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>")
>>> soup.string.replace_with("Online Learning!")
'Hello, Tutorialspoint!'
>>> soup.string
'Online Learning!'
>>> soup
<html><body><h2 id="message">Online Learning!</h2></body></html>
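
A NavigableString is a subclass of Python's str, so it can be converted to a plain string when the link back to the parse tree is no longer needed. The sketch below shows the conversion alongside "replace_with"; it assumes "html.parser", which does not add html and body wrappers.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h2 id="message">Hello, Tutorialspoint!</h2>', "html.parser")
text = soup.h2.string

is_str = isinstance(text, str)  # NavigableString subclasses str
plain = str(text)               # plain copy, detached from the tree

# replace_with() swaps the node in the tree and returns the old node.
old = text.replace_with("Online Learning!")
updated_markup = str(soup)

print(plain, "->", soup.h2.string)
print(updated_markup)
```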

BeautifulSoup

The BeautifulSoup object represents the whole document that we parse when scraping a web resource. Most of the time, it can be treated as a Tag object.


>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>")
>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> soup.name
'[document]'
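
To make the "treated as a Tag" point concrete, the sketch below checks that the BeautifulSoup object is an instance of Tag and that tag-style navigation works on it directly. "html.parser" is an assumption.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup('<h2 id="message">Hello, Tutorialspoint!</h2>', "html.parser")

# BeautifulSoup is a subclass of Tag, so Tag methods are available on it.
is_tag = isinstance(soup, Tag)
document_name = soup.name  # the special name '[document]'

# Searching from the document object works like searching from any tag.
heading_id = soup.find("h2")["id"]

print(is_tag, document_name, heading_id)
```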

Comments

A Comment object represents a comment in the web document. It is just a special type of NavigableString.


>>> soup = BeautifulSoup('<p><!-- Everything inside it is COMMENTS --></p>')
>>> comment = soup.p.string
>>> type(comment)
<class 'bs4.element.Comment'>
>>> print(soup.p.prettify())
<p>
 <!-- Everything inside it is COMMENTS -->
</p>
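
Because Comment is a distinct class, comments can be told apart from ordinary text with an isinstance check. The following sketch collects the comments in a hypothetical snippet and removes them from the tree; "html.parser" is an assumption.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup mixing visible text with a comment.
soup = BeautifulSoup("<p>Visible<!-- hidden note --></p>", "html.parser")

# Select only the strings that are Comment instances.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]

# extract() removes each comment node from the tree.
for c in comments:
    c.extract()
cleaned_markup = str(soup)

print(comment_texts)
print(cleaned_markup)  # <p>Visible</p>
```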

NavigableString Objects

NavigableString objects represent the text within tags, rather than the tags themselves.
