Python Reading HTML Pages
  • Date: 2024-09-17


Python can read HTML pages with a library known as BeautifulSoup. Using this library, we can search for the values of HTML tags and get specific data such as the title of the page and the list of headers in the page.

Install Beautifulsoup

Use the Anaconda package manager to install the package and its dependencies. The package is published under the name beautifulsoup4.

conda install beautifulsoup4
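
As a quick check that the installation worked, the package can be imported under its module name bs4 and asked for its version. This is a minimal sketch; the one-line HTML string is only a placeholder and is not part of the original tutorial.

```python
import bs4
from bs4 import BeautifulSoup

# Report which version of the package was installed.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)

# Parse a tiny placeholder document to confirm the built-in parser works.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
parsed_text = soup.p.string
print(parsed_text)
```

If the import fails, the package is most likely missing from the active environment.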

Reading the HTML file

In the example below, we request a URL and load the response into the Python environment. We then pass it to BeautifulSoup with the html.parser argument to parse the entire HTML file. Finally, we print the first few characters of the formatted page.

from urllib.request import urlopen  # Python 3 name; this module was urllib2 in Python 2
from bs4 import BeautifulSoup

# Fetch the html file
response = urlopen("http://tutorialspoint.com/python/python_overview.htm")
html_doc = response.read()

# Parse the html file
soup = BeautifulSoup(html_doc, "html.parser")

# Format the parsed html file
strhtm = soup.prettify()

# Print the first few characters
print(strhtm[:225])

When we execute the above code, it produces the following result.

<!DOCTYPE html>
<!--[if IE 8]><html class="ie ie8"> <![endif]-->
<!--[if IE 9]><html class="ie ie9"> <![endif]-->
<!--[if gt IE 9]><!-->
<html>
 <!--<![endif]-->
 <head>
  <!-- Basic -->
  <meta charset="utf-8"/>
  <title>
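
The parse-and-format step can also be tried without network access. The sketch below assumes a small, made-up HTML string in place of the downloaded page; BeautifulSoup accepts either the bytes returned by response.read() or an ordinary string.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real tutorial page).
html_doc = (
    "<html><head><title>Sample Page</title></head>"
    "<body><p>Hello</p></body></html>"
)

# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns the document as a string with one tag or text node per line.
pretty = soup.prettify()

# Show only the first few lines, as the example above does with characters.
first_lines = pretty.splitlines()[:4]
for line in first_lines:
    print(line)
```

Slicing by lines instead of characters avoids cutting a tag in half, which is why the sketch uses splitlines() rather than strhtm[:225].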

Extracting Tag Value

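The following code extracts the value of the first instance of a tag. Accessing a tag name as an attribute of the soup object, such as soup.title, returns the first matching tag, and .string returns the text inside it. When a tag contains more than one child, .string is None.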

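from urllib.request import urlopen  # Python 3 name; this module was urllib2 in Python 2
from bs4 import BeautifulSoup

response = urlopen("http://tutorialspoint.com/python/python_overview.htm")
html_doc = response.read()

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title)         # the whole first <title> tag
print(soup.title.string)  # only the text inside it
print(soup.a.string)      # text of the first <a> tag, or None
print(soup.b.string)      # text of the first <b> tag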

When we execute the above code, it produces the following result.

<title>Python Overview</title>
Python Overview
None
Python is Interpreted
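
The None in the output above comes from how .string is defined. The sketch below reproduces it on a made-up HTML snippet (the tags and text are assumptions for illustration, not the contents of the real page) and shows get_text() as one way to collect the text anyway.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <a> holds an image and a text node.
html_doc = """
<html><head><title>Sample Page</title></head>
<body>
<a href="/home"><img src="logo.png"/>Home</a>
<b>First bold</b>
<b>Second bold</b>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string   # single text child, so the text is returned
link_string = soup.a.string      # two children (<img> and text), so this is None
link_text = soup.a.get_text()    # concatenates all text inside the tag
first_bold = soup.b.string       # only the first <b> is consulted
missing = soup.table             # no <table> in the document, so this is None

print(title_text)
print(link_string)
print(link_text)
print(first_bold)
print(missing)
```

Because a missing tag also yields None, a chained call such as soup.table.string would raise AttributeError, so checking for None first is a reasonable precaution.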

Extracting All Tags

The following code extracts the value of every instance of a tag. The find_all() method returns all matching tags in document order.

from urllib.request import urlopen  # Python 3 name; this module was urllib2 in Python 2
from bs4 import BeautifulSoup

response = urlopen("http://tutorialspoint.com/python/python_overview.htm")
html_doc = response.read()
soup = BeautifulSoup(html_doc, "html.parser")

# Print the text of every <b> tag in the page
for x in soup.find_all("b"):
    print(x.string)

When we execute the above code, it produces the following result.

Python is Interpreted
Python is Interactive
Python is Object-Oriented
Python is a Beginner's Language
Easy-to-learn
Easy-to-read
Easy-to-maintain
A broad standard library
Interactive Mode
Portable
Extendable
Databases
GUI Programming
Scalable
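
The same method can collect the list of headers in a page, since find_all() also accepts a list of tag names. This is a minimal sketch on a made-up HTML snippet; the heading texts are placeholders and not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a few headers and bold items.
html_doc = """
<html><body>
<h1>Python Overview</h1>
<h2>History</h2>
<p><b>Easy-to-learn</b> and <b>Easy-to-read</b></p>
<h2>Features</h2>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# All <b> tags, in document order.
bold_items = [tag.get_text(strip=True) for tag in soup.find_all("b")]

# Passing a list matches any of the named tags; tag.name tells them apart.
headers = [
    (tag.name, tag.get_text(strip=True))
    for tag in soup.find_all(["h1", "h2", "h3"])
]

print(bold_items)
for level, text in headers:
    print(level, text)
```

Here get_text(strip=True) is used instead of .string so that headers containing nested tags still yield their text.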