Python - Reading HTML Pages
Python has a library known as BeautifulSoup for reading HTML pages. Using this library, we can search for the values of HTML tags and get specific data like the title of the page and the list of headers in the page.
Install Beautifulsoup
Use the Anaconda package manager to install the required package and its dependent packages.
conda install beautifulsoup4
Reading the HTML file
In the example below, we request a URL and load its contents into the Python environment. We then pass the html.parser argument to BeautifulSoup to parse the entire HTML document. Finally, we print the first few lines of the formatted page.
import urllib.request  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

# Fetch the html file
response = urllib.request.urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()

# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')

# Format the parsed html file
strhtm = soup.prettify()

# Print the first few characters
print(strhtm[:225])
When we execute the above code, it produces the following result.
<!DOCTYPE html>
<!--[if IE 8]><html class="ie ie8"> <![endif]-->
<!--[if IE 9]><html class="ie ie9"> <![endif]-->
<!--[if gt IE 9]><!-->
<html>
 <!--<![endif]-->
 <head>
  <!-- Basic -->
  <meta charset="utf-8"/>
  <title>
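The example above assumes the request succeeds. As a minimal sketch, the fetch and parse steps could be wrapped in a helper that returns None when the page cannot be loaded. The function name fetch_soup and the 10-second timeout are illustrative assumptions, not part of the original example.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Return a parsed BeautifulSoup tree for url, or None if the fetch fails.

    fetch_soup is a hypothetical helper name used only for this sketch.
    """
    try:
        # The context manager closes the connection once the body is read.
        with urlopen(url, timeout=timeout) as response:
            html_doc = response.read()
    except (OSError, ValueError) as err:
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError is raised for malformed URLs such as a missing scheme.
        print("Could not fetch", url, "-", err)
        return None
    return BeautifulSoup(html_doc, "html.parser")
```

A caller would then check the result before using it, for example by testing `soup is not None` before calling `soup.prettify()`.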
Extracting Tag Value
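We can extract the value of the first instance of a tag by accessing the tag name as an attribute of the parsed document, as in the following code.
import urllib.request  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)         # the first <title> tag, including its markup
print(soup.title.string)  # the text inside the <title> tag
print(soup.a.string)      # the text inside the first <a> tag
print(soup.b.string)      # the text inside the first <b> tag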
When we execute the above code, it produces the following result.
<title>Python Overview</title>
Python Overview
None
Python is Interpreted
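The None in the output above comes from soup.a.string. One common reason for this is that .string returns None when a tag contains more than one child, such as a link wrapping an image and a label. The sketch below illustrates that behavior on a small inline document; the HTML snippet is invented for illustration and is not taken from the page fetched above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this sketch.
html_doc = """
<html><head><title>Sample Page</title></head>
<body>
<a href="/home"><img src="logo.png"/> <span>Home</span></a>
<b>First bold</b>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string            # single text child, so the text is returned
link_string = soup.a.string               # several children, so this is None
link_text = soup.a.get_text(strip=True)   # collects all nested text instead
link_href = soup.a.get("href")            # attribute values are read with get()

print(title_text)
print(link_string)
print(link_text)
print(link_href)
```

When a tag may contain nested markup, get_text() is usually the safer way to read its text.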
Extracting All Tags
We can extract the values of all instances of a tag by using the find_all() method, as in the following code.
import urllib.request  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the text of every <b> tag in the page
for x in soup.find_all('b'):
    print(x.string)
When we execute the above code, it produces the following result.
Python is Interpreted
Python is Interactive
Python is Object-Oriented
Python is a Beginner's Language
Easy-to-learn
Easy-to-read
Easy-to-maintain
A broad standard library
Interactive Mode
Portable
Extendable
Databases
GUI Programming
Scalable
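The same approach can be used to collect the list of headers in a page. The sketch below passes a list of tag names to find_all() and runs on a small inline document so that it works without a network connection. The HTML snippet and the choice of h1 and h2 tags are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this sketch.
html_doc = """
<html><body>
<h1>Python Overview</h1>
<h2>History</h2>
<p>Some <b>bold</b> text.</p>
<h2>Features</h2>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all() accepts a list of tag names and returns matches in document order.
headers = [tag.get_text(strip=True) for tag in soup.find_all(["h1", "h2"])]

print(headers)
```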