Beautiful Soup - Souping the Page

In the previous code example, we parsed the document by passing it to the BeautifulSoup constructor as a string. Another way is to pass the document as an open filehandle.


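from bs4 import BeautifulSoup

# Parse a document from an open filehandle
with open("example.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Parse a document from a string
soup = BeautifulSoup("<html>data</html>", "html.parser")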

First the document is converted to Unicode, and HTML entities are converted to Unicode characters:


import bs4
html = '''<b>tutorialspoint</b>, <i>&web scraping &data science;</i>'''
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup)

Output


<html><body><b>tutorialspoint</b>, <i>&web scraping &data science;</i></body></html>
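The entity conversion described above is easier to see with markup that contains real HTML entities. The following is a minimal sketch, assuming the built-in html.parser and an illustrative sample string that is not part of the example above −

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not from the tutorial) with two entities
markup = "<p>Sacr&eacute; bleu! Tom &amp; Jerry</p>"
soup = BeautifulSoup(markup, "html.parser")

# Entities are replaced by the Unicode characters they stand for
text = soup.p.get_text()
print(text)  # Sacré bleu! Tom & Jerry
```

Here &eacute; becomes é and &amp; becomes a plain ampersand in the parsed text.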

Beautiful Soup then parses the document using an HTML parser, unless you explicitly tell it to use an XML parser.
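One visible difference between the two kinds of parser is how tag names are treated. The sketch below assumes an illustrative sample string; it uses the built-in html.parser for HTML and requests XML parsing with the "xml" feature, which depends on the lxml package being installed −

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative markup (an assumption, not from the tutorial)
markup = "<Root><Item>one</Item></Root>"

# HTML parsing: tag names are normalised to lower case
html_soup = BeautifulSoup(markup, "html.parser")
print(html_soup.find("item"))  # <item>one</item>
print(html_soup.find("Item"))  # None

# XML parsing has to be requested explicitly and needs lxml
try:
    xml_soup = BeautifulSoup(markup, "xml")
    print(xml_soup.find("Item"))  # tag name keeps its original case
except FeatureNotFound:
    xml_soup = None
    print("lxml is not installed, so the XML parser is unavailable")
```

Naming the parser explicitly also keeps results consistent across machines that have different parser libraries installed.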

HTML tree Structure

Before we look into the different components of an HTML page, let us first understand the HTML tree structure.

The root element of the document tree is html. Every other element can have a parent, children and siblings, which are determined by its position in the tree structure. To move among HTML elements, attributes and text, you have to move among the nodes of this tree.
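As a rough sketch of moving among nodes, the snippet below reads the parent, sibling, children, attribute and text of a few elements. The sample markup and the variable names are assumptions made for illustration −

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not from the tutorial)
doc = '<html><body><h1>Heading</h1><p id="intro">Some <b>bold</b> text</p></body></html>'
soup = BeautifulSoup(doc, "html.parser")

h1 = soup.h1
parent_name = h1.parent.name          # element that contains h1
sibling_name = h1.next_sibling.name   # element right after h1
child_names = [child.name for child in soup.body.children]

intro_id = soup.p["id"]               # attribute access
bold_text = soup.p.b.string           # text inside <b>

print(parent_name)   # body
print(sibling_name)  # p
print(child_names)   # ['h1', 'p']
print(intro_id)      # intro
print(bold_text)     # bold
```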

Let us suppose we have a webpage with a title, a heading and a short paragraph, which translates to an HTML document as follows −


<html><head><title>TutorialsPoint</title></head><body><h1>Tutorialspoint Online Library</h1><p><b>It's all Free</b></p></body></html>

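This simply means that, for the above HTML document, html is the root of the tree, with head and body as its children. head contains title, while body contains h1 and p, and p in turn contains b.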
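The same structure can be printed from the parsed document. The following is a minimal sketch, assuming a well-formed version of the document (with an opening body tag) and a hypothetical helper named outline that is not part of Beautiful Soup −

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ("<html><head><title>TutorialsPoint</title></head>"
        "<body><h1>Tutorialspoint Online Library</h1>"
        "<p><b>It's all Free</b></p></body></html>")

def outline(node, depth=0):
    """Return one indented line per element, nested under its parent."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes
            lines.extend(outline(child, depth + 1))
    return lines

soup = BeautifulSoup(html, "html.parser")
tree_lines = outline(soup.html)
print("\n".join(tree_lines))
# html
#   head
#     title
#   body
#     h1
#     p
#       b
```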
