Web Scraping with Beautiful Soup — Equality, Copies, and Parsing Part of a Document

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

Comparing Objects for Equality

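Beautiful Soup treats two Tag or NavigableString objects as equal when they represent the same HTML markup, even if they sit in different parts of the tree.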

For example, we can write:

from bs4 import BeautifulSoup
markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, 'html.parser')
first_b, second_b = soup.find_all('b')
print(first_b == second_b)
print(first_b.previous_element == second_b.previous_element)

The first print prints True because the two b elements have the same structure and content.

The second print prints False because the element just before each b element is different: the string "I want " comes before the first one and " and more " comes before the second.
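Since == only compares markup, it cannot tell us whether two variables point at the very same node in the tree. A minimal sketch of that check, using Python's is operator (the variable names same_markup, same_object and first_is_soup_b are only for illustration):

```python
from bs4 import BeautifulSoup

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, "html.parser")
first_b, second_b = soup.find_all("b")

# == compares the markup the two tags represent
same_markup = first_b == second_b

# `is` compares object identity: these are two separate nodes in the tree
same_object = first_b is second_b

# soup.b looks up the first b tag, which is the same node as first_b
first_is_soup_b = first_b is soup.b

print(same_markup, same_object, first_is_soup_b)
```

Here is is the right tool when we need to know whether we are holding one particular node rather than a look-alike.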

Copying Beautiful Soup Objects

We can make a copy of any Tag or NavigableString object.

To do this, we call copy.copy() from the copy module in Python's standard library:

from bs4 import BeautifulSoup
import copy

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, 'html.parser')
p_copy = copy.copy(soup.p)
print(p_copy)

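The copy is considered to be equal to the original because it represents the same markup, but it is a separate object that is no longer attached to the original tree.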
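To see how a copy relates to the original, we can compare the two and look at the copy's parent. This is a small sketch; original_p and the copy_* names are illustrative only:

```python
import copy

from bs4 import BeautifulSoup

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, "html.parser")

original_p = soup.p
p_copy = copy.copy(original_p)

# The copy represents the same markup, so it compares equal
copy_is_equal = p_copy == original_p

# It is a different object from the original tag
copy_is_same_object = p_copy is original_p

# The copy is detached from the tree, so it has no parent
copy_parent = p_copy.parent

print(copy_is_equal, copy_is_same_object, copy_parent)
```

Because the copy is detached, changing it leaves the original soup untouched.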

Parsing Only Part of a Document

If we only need a few elements from a document, we can describe them with a SoupStrainer object.

For example, we can write:

from bs4 import BeautifulSoup, SoupStrainer

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
only_a_tags = SoupStrainer("a")
only_tags_with_id_link2 = SoupStrainer(id="link2")

def is_short_string(string):
    return string is not None and len(string) < 10
only_short_strings = SoupStrainer(string=is_short_string)

print(only_a_tags)
print(only_tags_with_id_link2)
print(only_short_strings)

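A SoupStrainer describes the elements we want to keep.

We can build one from a tag name, from an attribute such as id, or from a function that decides which strings match.

A strainer does nothing on its own. To apply it, we pass it to the BeautifulSoup constructor as the parse_only argument.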

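The three print calls show what each strainer matches. The output below comes from Python 2 (note the u prefix), and the exact format varies between Beautiful Soup versions:

a|{}
None|{'id': u'link2'}
None|{'string': <function is_short_string at 0x00000000036FC908>}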
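Creating a strainer does not filter anything until it is handed to the parser. As a rough sketch of that step, we can pass a strainer through parse_only so that only the a tags end up in the tree (the links list is just an illustrative name):

```python
from bs4 import BeautifulSoup, SoupStrainer

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Keep only the a tags while parsing
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

# Collect the link targets from the tags that survived the filter
links = [a["href"] for a in soup.find_all("a")]
print(links)

# Everything else, such as the p tags, was never added to the tree
print(soup.find("p"))
```

Note that parse_only is ignored by the html5lib parser, so this approach needs html.parser or lxml.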

Conclusion

We can parse part of a document, compare parsed objects for equality, and copy objects with Beautiful Soup.

By John Au-Yeung
