Beautiful Soup - Souping the Page
  • Date: 2024-12-22


In the previous code example, we parsed the document by passing it to the BeautifulSoup constructor as a string. Another way is to pass the document through an open file handle.


from bs4 import BeautifulSoup

# Parse from an open file handle
with open("example.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Parse from a string
soup = BeautifulSoup("<html>data</html>", "html.parser")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:


import bs4

html = "<b>tutorialspoint</b>, <i>&web scraping &data science;</i>"
soup = bs4.BeautifulSoup(html, "lxml")  # requires the lxml package
print(soup)

Output


<html><body><b>tutorialspoint</b>, <i>&amp;web scraping &amp;data science;</i></body></html>
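The entity conversion described above is easier to see with well-formed named entities. The following is a minimal sketch, assuming the built-in "html.parser" backend; the markup string is made up for illustration and is not part of the tutorial.

```python
from bs4 import BeautifulSoup

# Illustrative markup containing named entities (not from the tutorial).
markup = "<p>Sacr&eacute; bleu! &copy; 2024 &lt;tag&gt;</p>"
soup = BeautifulSoup(markup, "html.parser")

# While parsing, entities are decoded into plain Unicode characters.
text = soup.p.get_text()
print(text)      # Sacré bleu! © 2024 <tag>

# On output, only characters that would break the markup are escaped again.
rendered = str(soup)
print(rendered)  # <p>Sacré bleu! © 2024 &lt;tag&gt;</p>
```

Note that the text held in the tree is ordinary Unicode; escaping happens only when the tree is turned back into a string.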

BeautifulSoup then parses the data using an HTML parser, unless you explicitly tell it to use an XML parser.
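As a rough illustration of the difference, the sketch below parses the same small snippet with an HTML parser and then with the XML parser. It assumes the built-in "html.parser" backend for HTML; the "xml" feature needs the lxml package, so that step is guarded in case lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<a><b /></a>"

# HTML parsing: <b /> is not an empty element in HTML,
# so it is treated as an ordinary open/close pair.
html_soup = BeautifulSoup(markup, "html.parser")
print(html_soup)  # <a><b></b></a>

# XML parsing: requested explicitly with the "xml" feature (needs lxml).
try:
    xml_soup = BeautifulSoup(markup, "xml")
    print(xml_soup)
except FeatureNotFound:
    print("lxml is not installed, so the XML parser is unavailable")
```

Naming the parser explicitly also keeps results consistent across machines, since the default choice depends on which parser libraries are installed.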

HTML Tree Structure

Before we look into the different components of an HTML page, let us first understand the HTML tree structure.

[Figure: HTML tree structure]

The root element in the document tree is html. Every other element can have parents, children and siblings, which are determined by its position in the tree structure. To move among HTML elements, attributes and text, you have to move among the nodes of this tree.
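Moving between nodes can be sketched with the navigation attributes Beautiful Soup exposes on each tag. The markup below is a made-up example with no whitespace between tags, so that every child and sibling is a tag rather than a text node; "html.parser" is assumed as the backend.

```python
from bs4 import BeautifulSoup

# Illustrative markup (not from the tutorial), written without whitespace
# between tags so no whitespace-only text nodes appear in the tree.
markup = "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

h1 = soup.h1

parent_name = h1.parent.name                              # up one level
child_names = [child.name for child in soup.body.children]  # down one level
sibling_text = h1.next_sibling.get_text()                 # sideways

# The html element's parent is the BeautifulSoup object itself.
root_parent = soup.html.parent.name

print(parent_name)   # body
print(child_names)   # ['h1', 'p', 'p']
print(sibling_text)  # First
print(root_parent)   # [document]
```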

Let us suppose the webpage is as shown below −

Tutorialspoint Online Library

This translates to an HTML document as follows −


<html><head><title>TutorialsPoint</title></head><body><h1>Tutorialspoint Online Library</h1><p><b>It's all Free</b></p></body></html>

This simply means that, for the above HTML document, we have an HTML tree structure as follows −

[Figure: HTML document tree]
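One way to inspect this tree is to walk it and print each tag indented by its depth. The sketch below uses the sample document above, written as well-formed markup. The helper name outline is hypothetical, and "html.parser" is assumed as the backend.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = (
    "<html><head><title>TutorialsPoint</title></head>"
    "<body><h1>Tutorialspoint Online Library</h1>"
    "<p><b>It's all Free</b></p></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")


def outline(node, depth=0):
    """Return one indented line per tag below node (hypothetical helper)."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes, keep elements only
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


tree = outline(soup)
print("\n".join(tree))
# html
#   head
#     title
#   body
#     h1
#     p
#       b
```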