Beautiful Soup - Installation
Because BeautifulSoup is not part of the Python standard library, we need to install it first. We are going to install the BeautifulSoup 4 library (also known as BS4), which is the latest version.
To isolate our working environment so as not to disturb the existing setup, let us first create a virtual environment.
Creating a virtual environment (optional)
A virtual environment allows us to create an isolated working copy of python for a specific project without affecting the outside setup.
The best way to install any Python package is with pip. If pip is not already installed (you can check by running "pip --version" at your command or shell prompt), you can install it with the command below −
Linux environment
$sudo apt-get install python-pip
Windows environment
To install pip on Windows, do the following −
Download get-pip.py (from its official download location or from GitHub) to your computer.
Open the command prompt and navigate to the folder containing the get-pip.py file.
Run the following command −
>python get-pip.py
That's it, pip is now installed on your Windows machine.
You can verify the pip installation by running the command below −
>pip --version
pip 19.2.3 from c:\users\yadur\appdata\local\programs\python\python37\lib\site-packages\pip (python 3.7)
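The same check can be made from Python itself. The following is a minimal sketch, assuming Python 3.7 or later; the helper name pip_available is ours and is not part of pip. It runs pip as a module of the current interpreter, so the answer refers to the Python you are actually using rather than to whichever pip comes first on the PATH.

```python
import subprocess
import sys


def pip_available():
    """Return True if `python -m pip --version` succeeds for this interpreter."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "--version"],
        capture_output=True,  # keep pip's output out of the console
        text=True,
    )
    return result.returncode == 0


if pip_available():
    print("pip is installed for", sys.executable)
else:
    print("pip is not installed for", sys.executable)
```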
Installing virtual environment
Run the command below in your command prompt −
>pip install virtualenv
The command below creates a virtual environment ("myEnv") in your current directory −
>virtualenv myEnv
To activate your virtual environment, run the following command −
>myEnv\Scripts\activate
Once the environment is activated, the prompt shows "(myEnv)" as a prefix, which tells us that we are working inside the virtual environment "myEnv".
To leave the virtual environment, run deactivate.
(myEnv) C:\Users\yadur>deactivate
C:\Users\yadur>
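Besides looking at the prompt prefix, a script can check whether it is running inside a virtual environment. The following is a minimal sketch; the function name in_virtual_environment is hypothetical and chosen only for illustration.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; the standard venv module
    # and newer virtualenv releases make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Inside a virtual environment:", in_virtual_environment())
```

Run it once before and once after activating "myEnv" to see the value change.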
Now that our virtual environment is ready, let us install BeautifulSoup.
Installing BeautifulSoup
As BeautifulSoup is not a standard library, we need to install it. We are going to use the BeautifulSoup 4 package (known as bs4).
Linux Machine
To install bs4 on Debian or Ubuntu Linux using the system package manager, run the command below −
$sudo apt-get install python-bs4 (for Python 2.x)
$sudo apt-get install python3-bs4 (for Python 3.x)
If you have problems installing with the system packager, you can install bs4 using easy_install or pip instead.
$easy_install beautifulsoup4
$pip install beautifulsoup4
(You may need to use easy_install3 or pip3 respectively if you are using Python 3)
Windows Machine
Installing beautifulsoup4 on Windows is very simple, especially if you already have pip installed.
>pip install beautifulsoup4
beautifulsoup4 is now installed on our machine. Let us talk about some problems you may encounter after installation.
Problems after installation
On a Windows machine, you might encounter errors caused by the wrong version being installed, mainly −
ImportError "No module named HTMLParser": you are running the Python 2 version of the code under Python 3.
ImportError "No module named html.parser": you are running the Python 3 version of the code under Python 2.
The best way out of both situations is to completely remove the existing installation and then re-install BeautifulSoup.
If you get the SyntaxError "Invalid syntax" on the line ROOT_TAG_NAME = u'[document]', then you need to convert the Python 2 code to Python 3, either by installing the package −
$ python3 setup.py install
or by manually running Python's 2to3 conversion script on the bs4 directory −
$ 2to3-3.2 -w bs4
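The ImportError cases above usually come down to which interpreter bs4 was installed for. The following is a minimal diagnostic sketch, assuming Python 3.4 or later; the helper name diagnose_bs4 is hypothetical. It reports the running Python version and where, if anywhere, that interpreter finds bs4.

```python
import importlib.util
import sys


def diagnose_bs4():
    """Describe whether the current interpreter can locate the bs4 package."""
    version = "Python %d.%d" % (sys.version_info[0], sys.version_info[1])
    spec = importlib.util.find_spec("bs4")  # looks the package up without importing it
    if spec is None:
        return version + ": bs4 is not installed for this interpreter"
    return version + ": bs4 found at " + str(spec.origin)


print(diagnose_bs4())
```

If the reported location belongs to a different Python installation than the one you expected, re-installing with that interpreter's own pip is the likely fix.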
Installing a Parser
By default, Beautiful Soup supports the HTML parser included in Python's standard library. It also supports several external third-party Python parsers, such as the lxml parser or the html5lib parser.
To install the lxml or html5lib parser, use the commands below −
Linux Machine
$apt-get install python-lxml
$apt-get install python-html5lib
Windows Machine
>pip install lxml
>pip install html5lib
Generally, users choose lxml for speed. Using lxml or html5lib is recommended if you are on an older version of Python 2 (before 2.7.3) or Python 3 (before 3.2.2), because the built-in HTML parser in those older versions is not very good.
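The parser is chosen by the name passed as the second argument to BeautifulSoup. The following is a minimal sketch, assuming Python 3.4 or later, that picks whichever parser is installed; the helper name pick_parser and the preference order (lxml first) are our own choices, not something Beautiful Soup prescribes.

```python
import importlib.util


def pick_parser():
    """Return the name of the first available parser, preferring lxml."""
    for module_name in ("lxml", "html5lib"):
        # find_spec returns None when the module is not installed
        if importlib.util.find_spec(module_name) is not None:
            return module_name
    # Fall back to the parser shipped with Python's standard library
    return "html.parser"


print("Parser to pass to BeautifulSoup:", pick_parser())
```

The result can then be used as BeautifulSoup(markup, pick_parser()). Keep in mind that different parsers may build different trees from invalid markup, so fixing one parser name explicitly gives more reproducible results.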
Running Beautiful Soup
It is time to test our Beautiful Soup package on an HTML page (the code below uses https://www.tutorialspoint.com/index.htm, but you can choose any other web page) and extract some information from it.
In the code below, we extract the title of the web page −
from bs4 import BeautifulSoup
import requests

url = "https://www.tutorialspoint.com/index.htm"
# Download the page
req = requests.get(url)
# Parse the HTML with the parser from Python's standard library
soup = BeautifulSoup(req.text, "html.parser")
# Print the <title> element
print(soup.title)
Output
<title>H2O, Colab, Theano, Flutter, KNime, Mean.js, Weka, Solidity, Org.Json, AWS QuickSight, JSON.Simple, Jackson Annotations, Passay, Boon, MuleSoft, Nagios, Matplotlib, Java NIO, PyTorch, SLF4J, Parallax Scrolling, Java Cryptography</title>
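When no network connection is available, the same call can be tried on an HTML string. The following is a minimal sketch, assuming bs4 is installed; the markup is made up for illustration. It also shows the difference between the title tag and the text inside it.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page
html = """
<html>
  <head><title>Sample page</title></head>
  <body><p>Hello</p></body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title          # the whole <title> element
title_text = soup.title.string  # only the text inside it

print(title_tag)
print(title_text)
```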
One common task is to extract all the URLs within a web page. For that, we just need to add the lines of code below −
for link in soup.find_all('a'):
    print(link.get('href'))
Output
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/questions/index.php
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/current_affairs.htm
https://www.tutorialspoint.com/upsc_ias_exams.htm
https://www.tutorialspoint.com/tutor_connect/index.php
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/videotutorials/index.php
https://store.tutorialspoint.com
https://www.tutorialspoint.com/gate_exams_tutorials.htm
https://www.tutorialspoint.com/html_online_training/index.asp
https://www.tutorialspoint.com/css_online_training/index.asp
https://www.tutorialspoint.com/3d_animation_online_training/index.asp
https://www.tutorialspoint.com/swift_4_online_training/index.asp
https://www.tutorialspoint.com/blockchain_online_training/index.asp
https://www.tutorialspoint.com/reactjs_online_training/index.asp
https://www.tutorix.com
https://www.tutorialspoint.com/videotutorials/top-courses.php
https://www.tutorialspoint.com/the_full_stack_web_development/index.asp
….
….
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/free_web_graphics.htm
https://www.tutorialspoint.com/online_file_conversion.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/free_online_whiteboard.htm
http://www.tutorialspoint.com
https://www.facebook.com/tutorialspointindia
https://plus.google.com/u/0/+tutorialspoint
http://www.twitter.com/tutorialspoint
http://www.linkedin.com/company/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/index.htm
/about/about_privacy.htm#cookies
/about/faq.htm
/about/about_helping.htm
/about/contact_us.htm
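The output mixes absolute URLs with relative ones such as /about/faq.htm. The following is a minimal sketch, using only the standard library, of turning every link into an absolute URL. The hrefs list is a hand-picked sample standing in for the values returned by link.get('href'), and the variable names are ours.

```python
from urllib.parse import urljoin

base_url = "https://www.tutorialspoint.com/index.htm"

# Sample values standing in for the results of link.get('href')
hrefs = [
    "https://www.tutorialspoint.com/codingground.htm",
    "/about/faq.htm",
    "/about/about_privacy.htm#cookies",
    None,  # get('href') returns None for an <a> tag without an href attribute
]

# Skip missing values and resolve the rest against the page URL
absolute_links = [urljoin(base_url, href) for href in hrefs if href]

for link in absolute_links:
    print(link)
```

Absolute URLs pass through urljoin unchanged, so the same call can be applied to every link without checking its form first.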
Similarly, we can extract other useful information using beautifulsoup4.
Now let us understand more about "soup" in the above example.