Beautiful Soup - Kinds of objects
When we pass an HTML document or string to the BeautifulSoup constructor, Beautiful Soup converts the complex HTML page into a tree of Python objects. Below we discuss the four major kinds of objects:
Tag
NavigableString
BeautifulSoup
Comments
Tag Objects
An HTML tag is used to define various types of content. A Tag object in Beautiful Soup corresponds to an HTML or XML tag in the actual page or document.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
>>> tag = soup.html
>>> type(tag)
<class 'bs4.element.Tag'>
Tags have many attributes and methods. The two most important features of a tag are its name and its attributes.
Name (tag.name)
Every tag has a name, which can be accessed by adding ".name" as a suffix. tag.name returns the type of the tag as a string.
>>> tag.name
'html'
If we change the tag's name, the change is reflected in the HTML markup generated by Beautiful Soup.
>>> tag.name = "Strong"
>>> tag
<Strong><body><b class="boldest">TutorialsPoint</b></body></Strong>
>>> tag.name
'Strong'
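The renaming behaviour can also be tried on the b tag itself. The following is a minimal sketch, assuming Python's built-in html.parser instead of lxml (html.parser does not wrap the fragment in html and body tags); the variable names original_name and renamed_markup are illustrative only.

```python
from bs4 import BeautifulSoup

# Assumption: html.parser is used here, so no <html>/<body> wrapper is added.
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'html.parser')
tag = soup.b

original_name = tag.name       # the name of the tag as parsed
tag.name = 'strong'            # renaming the tag changes the generated markup
renamed_markup = str(soup)

print(original_name)
print(renamed_markup)
```

The attributes and the contents of the tag are untouched by the rename; only the opening and closing tag names change.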
Attributes (tag.attrs)
A tag object can have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". Each attribute is stored as a name-value pair. You can access an attribute by treating the tag like a dictionary and looking up its key (like accessing "class" in the example above), or access the whole attribute dictionary directly through ".attrs".
>>> tutorialsP = BeautifulSoup("<span class='tutorialsP'></span>", 'lxml')
>>> tag2 = tutorialsP.span
>>> tag2['class']
['tutorialsP']
We can make all kinds of modifications to a tag's attributes (add, remove, modify).
>>> tag2['class'] = 'Online-Learning'
>>> tag2['style'] = '2007'
>>> tag2
<span class="Online-Learning" style="2007"></span>
>>> del tag2['style']
>>> tag2
<span class="Online-Learning"></span>
>>> b_tag = soup.b
>>> b_tag['SecondAttribute'] = '2'
>>> del b_tag['class']
>>> b_tag
<b SecondAttribute="2">TutorialsPoint</b>
>>> del b_tag['SecondAttribute']
>>> b_tag
<b>TutorialsPoint</b>
>>> tag2['class']
'Online-Learning'
>>> tag2['style']
KeyError: 'style'
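As a self-contained sketch of the dictionary-style access described above, the snippet below adds, modifies and removes attributes on one tag and shows how .get() avoids the KeyError. The markup, the attribute values and the variable names (all_attrs, missing, raised) are made up for illustration, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Illustrative markup; html.parser is assumed so the example has no extra dependency.
soup = BeautifulSoup('<span class="tutorialsP" id="intro"></span>', 'html.parser')
span = soup.span

all_attrs = dict(span.attrs)         # snapshot of the attribute dictionary
span['title'] = 'Online-Learning'    # add a new attribute
span['id'] = 'overview'              # modify an existing attribute
del span['class']                    # remove an attribute

missing = span.get('style')          # .get() returns None for a missing attribute
try:
    span['style']                    # direct lookup of a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(all_attrs)
print(span)
```

Using .get() is usually the safer choice when an attribute may or may not be present on the tags being scraped.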
Multi-valued attributes
Some HTML attributes can have multiple values. The most commonly used is the class attribute, since a tag can have more than one CSS class. Others include 'rel', 'rev', 'headers', 'accesskey' and 'accept-charset'. Beautiful Soup presents the value of a multi-valued attribute as a list.
>>> from bs4 import BeautifulSoup
>>>
>>> css_soup = BeautifulSoup('<p class="body"></p>')
>>> css_soup.p['class']
['body']
>>>
>>> css_soup = BeautifulSoup('<p class="body bold"></p>')
>>> css_soup.p['class']
['body', 'bold']
However, if an attribute looks like it has more than one value but is not defined as multi-valued by any version of the HTML standard, Beautiful Soup leaves the attribute alone:
>>> id_soup = BeautifulSoup('<p id="body bold"></p>')
>>> id_soup.p['id']
'body bold'
>>> type(id_soup.p['id'])
<class 'str'>
When you turn a tag back into a string, multiple attribute values are consolidated into one.
>>> rel_soup = BeautifulSoup("<p> tutorialspoint Main <a rel='Index'> Page</a></p>")
>>> rel_soup.a['rel']
['Index']
>>> rel_soup.a['rel'] = ['Index', 'Online Library, Its all Free']
>>> print(rel_soup.p)
<p> tutorialspoint Main <a rel="Index Online Library, Its all Free"> Page</a></p>
By using get_attribute_list(), you get a value that is always a list, irrespective of whether the attribute is multi-valued or not.
>>> id_soup.p.get_attribute_list('id')
['body bold']
However, if you parse the document as ‘xml’, there are no multi-valued attributes −
>>> xml_soup = BeautifulSoup('<p class="body bold"></p>', 'xml')
>>> xml_soup.p['class']
'body bold'
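The difference between multi-valued and single-valued attributes can be seen side by side on one tag. This is a small sketch with made-up markup, assuming html.parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Illustrative markup: "class" is multi-valued in HTML, "id" is not.
soup = BeautifulSoup('<p class="body bold" id="main text"></p>', 'html.parser')
p = soup.p

class_value = p['class']                        # multi-valued attribute: a list
id_value = p['id']                              # single-valued attribute: one string
id_as_list = p.get_attribute_list('id')         # always a list, even for "id"
class_as_list = p.get_attribute_list('class')   # unchanged for multi-valued attributes

print(class_value, id_value, id_as_list, class_as_list)
```

get_attribute_list() is convenient when the same code has to handle both kinds of attribute without checking the type of the value first.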
NavigableString
A NavigableString object represents the text contained in a tag. To access that text, use ".string" on the tag.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>")
>>>
>>> soup.string
'Hello, Tutorialspoint!'
>>> type(soup.string)
<class 'bs4.element.NavigableString'>
You can't edit an existing string in place, but you can replace it with another string using replace_with().
>>> soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'lxml')
>>> soup.string.replace_with("Online Learning!")
'Hello, Tutorialspoint!'
>>> soup.string
'Online Learning!'
>>> soup
<html><body><h2 id="message">Online Learning!</h2></body></html>
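The sketch below puts the two points about NavigableString together: it behaves like an ordinary Python string, and replace_with() swaps it out of the tree and hands back the old string. It assumes html.parser, and the variable names (text, plain, old) are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<h2 id="message">Hello, Tutorialspoint!</h2>', 'html.parser')
text = soup.h2.string

is_navigable = isinstance(text, NavigableString)
is_str = isinstance(text, str)    # NavigableString is a subclass of str
plain = str(text)                 # a plain copy that is not tied to the parse tree

# replace_with() removes the old string from the tree and returns it.
old = text.replace_with('Online Learning!')

print(old)
print(soup)
```

Converting to a plain str is worth doing when the text is kept after the soup object is no longer needed, since a NavigableString keeps a reference to the tree it came from.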
BeautifulSoup
The BeautifulSoup object is created when we parse a web resource, and it represents the complete document we are trying to scrape. Most of the time, it can be treated as a Tag object.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>")
>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> soup.name
'[document]'
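To illustrate what treating the document as a tag means in practice, here is a minimal sketch (assuming html.parser; the variable names are illustrative) showing that the BeautifulSoup object passes an isinstance check against Tag and supports the same searching calls.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup('<h2 id="message">Hello, Tutorialspoint!</h2>', 'html.parser')

is_tag = isinstance(soup, Tag)    # BeautifulSoup is a subclass of Tag
doc_name = soup.name              # special placeholder name for the whole document
heading = soup.find('h2')         # Tag-style searching works on the document object

print(is_tag, doc_name, heading['id'])
```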
Comments
The Comment object represents a comment in the web document. It is just a special type of NavigableString.
>>> soup = BeautifulSoup('<p><!-- Everything inside it is COMMENTS --></p>')
>>> comment = soup.p.string
>>> type(comment)
<class 'bs4.element.Comment'>
>>> print(soup.p.prettify())
<p>
 <!-- Everything inside it is COMMENTS -->
</p>
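Because Comment is a kind of NavigableString, code that collects text usually needs to tell the two apart. The following sketch, with made-up markup and html.parser assumed, separates comments from visible text using isinstance checks; the list names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Illustrative markup with one visible string and one comment inside the same tag.
soup = BeautifulSoup('<p>Visible text<!-- hidden note --></p>', 'html.parser')

comments = [str(node) for node in soup.p.children
            if isinstance(node, Comment)]

# Every Comment is also a NavigableString, so comments must be excluded explicitly.
visible = [str(node) for node in soup.p.children
           if isinstance(node, NavigableString) and not isinstance(node, Comment)]

print(comments)
print(visible)
```

The comment's value contains only the text between the comment delimiters, including the surrounding spaces.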
NavigableString Objects
NavigableString objects represent the text within tags, rather than the tags themselves.