Beautiful Soup - Souping the Page
In the previous code example, we parsed the document by passing a string to the BeautifulSoup constructor. Another way is to pass an open filehandle.
    from bs4 import BeautifulSoup

    # Parse from an open filehandle
    with open("example.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")

    # Parse from a string
    soup = BeautifulSoup("<html>data</html>", "html.parser")
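As a small sketch of the two approaches side by side, the snippet below writes a throwaway file and parses the same markup once from the open filehandle and once from a string. The file location and the markup are made up for illustration, and the built-in "html.parser" is passed explicitly (an assumption, so that the result does not depend on which third-party parsers are installed).

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Illustrative markup; not taken from a real page.
markup = "<html><body><p>data</p></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical example.html created only for this demonstration.
    path = os.path.join(tmp_dir, "example.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(markup)

    # 1. Pass an open filehandle to the constructor.
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

# 2. Pass the markup as a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

print(soup_from_file.p.string)
print(str(soup_from_file) == str(soup_from_string))
```

The constructor reads the whole file while it builds the tree, so the soup object remains usable after the with block has closed the file.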
First the document is converted to Unicode, and HTML entities are converted to Unicode characters:
    import bs4

    html = "<b>tutorialspoint</b>, <i>&web scraping &data science;</i>"
    soup = bs4.BeautifulSoup(html, "lxml")
    print(soup)
Output
    <html><body><b>tutorialspoint</b>, <i>&amp;web scraping &amp;data science;</i></body></html>
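The entity conversion is easier to see with well-formed entities. The following is a minimal sketch with made-up markup, using the built-in "html.parser" as an assumption. The text read back from the tree contains ordinary Unicode characters rather than the entity names.

```python
from bs4 import BeautifulSoup

# Illustrative markup containing named HTML entities.
html = "<p>caf&eacute; &amp; cr&egrave;me &copy; 2024</p>"
soup = BeautifulSoup(html, "html.parser")

# get_text() returns the text of the tag as a Unicode string.
text = soup.p.get_text()
print(text)
print(isinstance(text, str))
```

When the tree is printed back as markup, bare ampersands and angle brackets are escaped again, which is why a stray & in the input shows up as &amp; in the output.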
Beautiful Soup then parses the document with an HTML parser, unless you explicitly tell it to use an XML parser.
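The parser is chosen with the second argument of the constructor. The sketch below parses the same made-up markup with the built-in "html.parser" and then with "xml". The XML parser depends on the lxml package, so the example assumes it may be missing and falls back quietly in that case.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative markup; the tag names are invented for this example.
markup = "<item><name>Soup</name></item>"

# Parse as HTML with the parser from the Python standard library.
html_soup = BeautifulSoup(markup, "html.parser")
print(html_soup.is_xml)
print(html_soup.find("name").string)

# Parse as XML; this needs lxml, so handle the case where it is absent.
try:
    xml_soup = BeautifulSoup(markup, "xml")
except FeatureNotFound:
    xml_soup = None  # lxml is not installed

if xml_soup is not None:
    print(xml_soup.is_xml)
```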
HTML tree Structure
Before we look into the different components of an HTML page, let us first understand the HTML tree structure.
The root element of the document tree is html. Every element below it can have parents, children and siblings, which are determined by its position in the tree. To move among HTML elements, attributes and text, you move among the nodes of this tree.
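These relationships map directly onto attributes of the parsed objects. The following is a minimal sketch on a made-up document (written without whitespace between tags, so that every sibling is a tag), using the built-in "html.parser" as an assumption.

```python
from bs4 import BeautifulSoup

# Illustrative document with no whitespace between the tags.
html = ("<html><head><title>Demo</title></head>"
        "<body><h1>Heading</h1><p>Text</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

h1 = soup.h1

# Parent: the element that directly contains h1.
print(h1.parent.name)                                  # body

# Children: the elements directly inside body.
child_names = [child.name for child in soup.body.children]
print(child_names)                                     # ['h1', 'p']

# Siblings: elements that share the same parent.
print(h1.next_sibling.name)                            # p
print(soup.p.previous_sibling.name)                    # h1

# The html tag sits directly under the BeautifulSoup object itself.
print(soup.html.parent.name)                           # [document]
```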
Let us suppose the webpage is built from the following HTML document:
    <html><head><title>TutorialsPoint</title></head><body><h1>Tutorialspoint Online Library</h1><p><b>It's all Free</b></p></body></html>
For the above HTML document, this simply means we have the following tree structure:

    html
      head
        title
      body
        h1
        p
          b
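One way to check this structure is to walk the parsed document and print each tag at its depth. The sketch below uses a well-formed copy of the document above. The helper name outline is hypothetical, and the built-in "html.parser" is an assumption.

```python
from bs4 import BeautifulSoup, Tag

html = ("<html><head><title>TutorialsPoint</title></head>"
        "<body><h1>Tutorialspoint Online Library</h1>"
        "<p><b>It's all Free</b></p></body></html>")
soup = BeautifulSoup(html, "html.parser")


def outline(node, depth=0):
    """Return one indented line per tag found below `node` (hypothetical helper)."""
    lines = []
    for child in node.children:
        # Skip text nodes; only tags form the element tree.
        if isinstance(child, Tag):
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(soup)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one step from a parent to its children.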