We can get data from web pages with Beautiful Soup.
It lets us parse the DOM and extract the data we want.
In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.
Searching by CSS Class
We can get an element with the given CSS class with Beautiful Soup.
For example, we can write:
from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all("a", class_="sister"))
We get all the a
tags with class sister
, so we see:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
printed.
Also, we can search with a regex:
from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(class_=re.compile("itl")))
We get all the elements with class having the 'itl'
substring, so we get:
[<p class="title"><b>The Dormouse's story</b></p>]
printed.
Also, we can set a function:
from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
def has_six_characters(css_class):
return css_class is not None and len(css_class) == 6
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(class_=has_six_characters))
We set the class_
parameter to the has_six_characters
function, so we can get all the elements with classes with 6 characters.
So we see:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
printed.
The class
attribute can have more than one value, and we can search for them all.
For example, we can write:
from bs4 import BeautifulSoup
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(css_soup.find_all("p", class_="body strikeout"))
to search for nodes with the class
set to body strikeout
.
They have to be in the same order.
The string
Argument
We can also search for string content instead of tags.
So we can write:
from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(string=["Tillie", "Elsie", "Lacie"]))
Then we see:
[u'Elsie', u'Lacie', u'Tillie']
logged.
Conclusion
We can search for elements with the CSS class and strings with Beautiful Soup.