Web Scraping with Beautiful Soup — CSS Class and Strings

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to search HTML documents by CSS class and by string content with Beautiful Soup.

Searching by CSS Class

We can find all the elements with a given CSS class by passing the class_ keyword argument to find_all.

For example, we can write:

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all("a", class_="sister"))

This returns all the a tags with the class sister, so the following is printed:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
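The keyword is spelled class_ because class is a reserved word in Python. As a minimal sketch, assuming a small made-up two-link snippet rather than the full document above, the same search can also be written with the attrs dictionary:

```python
from bs4 import BeautifulSoup

# Hypothetical two-link snippet, used only for illustration.
html = '<a class="sister" id="link1">Elsie</a><a class="brother" id="link2">Tom</a>'
soup = BeautifulSoup(html, 'html.parser')

# Search with the class_ keyword argument.
by_keyword = soup.find_all("a", class_="sister")

# The same search expressed through the attrs dictionary.
by_attrs = soup.find_all("a", attrs={"class": "sister"})

print(by_keyword == by_attrs)
print([tag["id"] for tag in by_keyword])
```

Both calls should find the same single link, so either spelling can be used where it reads better.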

Also, we can search with a regex:

from bs4 import BeautifulSoup
import re

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(class_=re.compile("itl")))

This returns every element whose class contains the substring itl, so the following is printed:

[<p class="title"><b>The Dormouse's story</b></p>]
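The pattern is searched for anywhere in the class name, so anchors can narrow the match. The following is a minimal sketch, assuming a made-up three-element snippet, that keeps only classes starting with st:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
html = '<p class="title">A</p><p class="story">B</p><a class="sister">C</a>'
soup = BeautifulSoup(html, 'html.parser')

# ^ anchors the pattern to the start of the class name,
# so "story" matches but "title" and "sister" do not.
starts_with_st = soup.find_all(class_=re.compile(r"^st"))
print(starts_with_st)
```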

We can also pass a function that takes a class value and returns True when it matches:

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(class_=has_six_characters))

We set the class_ parameter to the has_six_characters function, so we get all the elements whose class name is exactly 6 characters long. The following is printed:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
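Any predicate can be used in the same way. The following is a minimal sketch, assuming a made-up snippet and a hypothetical starts_with_s helper, that keeps the same None guard as has_six_characters so that tags without a class attribute are skipped safely:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration; the b tag has no class.
html = ('<p class="story">A</p><a class="sister">B</a>'
        '<b>no class</b><p class="title">C</p>')
soup = BeautifulSoup(html, 'html.parser')

def starts_with_s(css_class):
    # Guard against None before calling a string method on the value.
    return css_class is not None and css_class.startswith("s")

matches = soup.find_all(class_=starts_with_s)
print([tag.name for tag in matches])
```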

A single tag can have more than one value in its class attribute. We can match any one of those classes, or the exact string value of the whole attribute.

For example, we can write:

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(css_soup.find_all("p", class_="body strikeout"))

This finds the nodes whose class attribute is exactly body strikeout.

The classes have to be listed in the same order as in the document, so class_="strikeout body" finds nothing.
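To illustrate how matching behaves on a multi-valued class attribute, here is a minimal sketch using the same one-tag document. The select call is an alternative that ignores class order; it assumes the soupsieve package is available, which is normally installed together with bs4.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')

# Matching a single class finds the tag even though it has two classes.
single = css_soup.find_all("p", class_="strikeout")

# The full-string match is order-sensitive, so the reversed order finds nothing.
reversed_order = css_soup.find_all("p", class_="strikeout body")

# A CSS selector matches both classes regardless of their order.
selected = css_soup.select("p.strikeout.body")

print(len(single), len(reversed_order), len(selected))
```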

The `string` Argument

We can also search for string content instead of tags.

So we can write:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(string=["Tillie", "Elsie", "Lacie"]))

This returns the matching strings in document order, so the following is printed:

['Elsie', 'Lacie', 'Tillie']
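The string argument also accepts a regular expression, and it can be combined with a tag name to get the tags back instead of the text. The following is a minimal sketch, assuming a made-up two-link snippet:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
html = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# With string alone, the matching text nodes are returned.
names = soup.find_all(string=re.compile(r"^L"))

# With a tag name as well, the tags whose text matches are returned.
tags = soup.find_all("a", string="Elsie")

print([str(name) for name in names])
print([tag["id"] for tag in tags])
```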

Conclusion

We can search for elements by CSS class and by string content with Beautiful Soup.

By John Au-Yeung
