Web Scraping with Beautiful Soup — Siblings and Selectors


We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to move through a parsed HTML document with .next_element and .previous_element, and how to search it with find_all().

`.next_element` and `.previous_element`

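We can move through the document in parse order with the .next_element and .previous_element properties. They point to whatever was parsed immediately after or before a node, which isn't necessarily a sibling.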

For example, we can write:

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
last_a_tag = soup.find("a", id="link3")
print(last_a_tag.next_element)

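We get the a element with the ID link3.

Then the next_element property gives us whatever was parsed immediately after its opening tag, which is the string inside the tag rather than the text that follows the closing </a>.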

So we see:

Tillie

printed.
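To make the difference from sibling navigation concrete, here is a minimal sketch on a shortened, hypothetical snippet (the markup and variable names are assumptions for illustration). It compares .next_element with .next_sibling on the same tag:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for illustration.
snippet = '<p>Start <a id="x">Link</a> end</p>'
soup = BeautifulSoup(snippet, 'html.parser')
a_tag = soup.find("a", id="x")

# next_element: the node parsed right after the <a> start tag,
# which is the string inside the tag.
inside = a_tag.next_element

# next_sibling: the next node at the same level of the tree,
# which is the text after the closing </a>.
after = a_tag.next_sibling

print(repr(inside))
print(repr(after))
```

The first value is the link text and the second is the trailing text of the paragraph, so the two properties only agree when a tag has no contents.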

We can also get the node that was parsed immediately before a tag with the previous_element property:

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
last_a_tag = soup.find("a", id="link3")
print(last_a_tag.previous_element)

The node parsed just before the a tag is the text between the second and third links, so we see:

and

printed.
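When one step isn't enough, Beautiful Soup also provides the .next_elements and .previous_elements generators, which keep walking in parse order. A minimal sketch, assuming a hypothetical one-line snippet rather than the document above:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for illustration.
snippet = '<p>Start <a id="x">Link</a> end</p>'
soup = BeautifulSoup(snippet, 'html.parser')
a_tag = soup.find("a", id="x")

# previous_element: the node parsed right before the <a> tag.
before = a_tag.previous_element

# next_elements yields every node parsed after the tag, in order:
# first the string inside the tag, then the text after it.
following = [str(node) for node in a_tag.next_elements]

print(repr(before))
print(following)
```

The generator stops on its own once it reaches the end of the document, so it is safe to use in a plain for loop.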

`find_all()`

We can find all elements that match a given filter, such as a tag name, with the find_all method.

For example, we can write:

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all("title"))

to get all the title elements, so we see:

[<title>The Dormouse's story</title>]

printed.
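find_all returns a list of tags, so we can loop over it to pull out the data we want. The sketch below uses a shortened, hypothetical snippet; it also shows two other forms of the call, a list of tag names and the limit argument:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
snippet = """<p>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<b>bold</b>
</p>"""
soup = BeautifulSoup(snippet, 'html.parser')

# Every <a> tag, in document order.
links = soup.find_all("a")
hrefs = [link["href"] for link in links]
texts = [link.get_text() for link in links]

# A list of names matches tags with any of those names.
a_or_b = [tag.name for tag in soup.find_all(["a", "b"])]

# limit stops the search after the given number of matches.
first_only = soup.find_all("a", limit=1)

print(hrefs)
print(texts)
print(a_or_b)
print(len(first_only))
```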

We can also narrow the search by CSS class by passing the class name as the second argument. For example, we can write:

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all("p", "title"))

Then we get the p element with the class title:

[<p class="title"><b>The Dormouse's story</b></p>]

printed.
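The string in the second position is shorthand for a class filter. As a minimal sketch on a hypothetical snippet, the same search can be written with the class_ keyword argument, which also works without a tag name:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
snippet = (
    '<p class="title">Heading</p>'
    '<p class="story">Body</p>'
    '<b class="title">Bold</b>'
)
soup = BeautifulSoup(snippet, 'html.parser')

# Positional form: tag name, then CSS class.
by_position = soup.find_all("p", "title")

# Keyword form: class_ has a trailing underscore because
# class is a reserved word in Python.
by_keyword = soup.find_all("p", class_="title")

# Without a tag name, any tag with the class matches.
any_tag = [tag.name for tag in soup.find_all(class_="title")]

print(by_position == by_keyword)
print(any_tag)
```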

The Keyword Arguments

Keyword arguments that find_all doesn't recognize are treated as filters on the attributes with those names.

For example, we can write:

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(id='link2'))

to get a list containing the a element with the ID link2.
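Two related forms are worth knowing here. The sketch below, on a hypothetical snippet, uses find to get a single tag instead of a list and passes True to match any tag that has the attribute at all:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
snippet = '<a id="link1">Elsie</a><a id="link2">Lacie</a><a>No id</a>'
soup = BeautifulSoup(snippet, 'html.parser')

# find_all always returns a list, even for a single match.
matches = soup.find_all(id="link2")

# find returns the first matching tag, or None if nothing matches.
single = soup.find(id="link2")
missing = soup.find(id="nope")

# True matches every tag that has an id attribute, whatever its value.
with_id = [tag["id"] for tag in soup.find_all(id=True)]

print(len(matches))
print(single.get_text())
print(missing)
print(with_id)
```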

We can also pass in a regex object to select nodes:

from bs4 import BeautifulSoup
import re

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(href=re.compile("elsie")))

We get all the elements whose href attribute contains the substring elsie, because Beautiful Soup tests each value with the regex's search method.

So we get:

[<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

printed.
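Since the pattern is applied with search, it can be anchored or use alternation like any other regex, and it can be combined with other filters. A minimal sketch on a shortened, hypothetical snippet:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
snippet = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(snippet, 'html.parser')

# Match hrefs that end in /elsie or /tillie.
pattern = re.compile(r"/(elsie|tillie)$")
ids = [tag["id"] for tag in soup.find_all("a", href=pattern)]

# Several keyword filters must all match the same tag.
both = soup.find_all(href=re.compile("example"), id="link2")

print(ids)
print([tag.get_text() for tag in both])
```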

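Some attribute names, such as data-foo, aren't valid Python keyword argument names, so we can't pass them directly.

To search on them, we pass a dictionary as the attrs argument: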

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
print(soup.find_all(attrs={"data-foo": "value"}))

We get the nodes with the data-foo attribute set to value.

So we see:

[<div data-foo="value">foo!</div>]

printed.
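The values in the attrs dictionary accept the same kinds of filters as keyword arguments. As a minimal sketch on a hypothetical snippet, an exact string matches one value, while True matches any tag that has the attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
snippet = (
    '<div data-foo="value">one</div>'
    '<div data-foo="other">two</div>'
    '<div>three</div>'
)
soup = BeautifulSoup(snippet, 'html.parser')

# Exact match on the attribute value.
exact = [tag.get_text() for tag in soup.find_all(attrs={"data-foo": "value"})]

# True matches any tag that has the attribute, whatever its value.
present = [tag.get_text() for tag in soup.find_all(attrs={"data-foo": True})]

print(exact)
print(present)
```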

To search for nodes with a given name attribute value, we also use attrs, because find_all reserves the name argument for the tag name. We can write:

from bs4 import BeautifulSoup

name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
print(name_soup.find_all(attrs={"name": "email"}))

Then we get:

[<input name="email"/>]

printed.
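To see why the dictionary is needed here, the sketch below runs both forms on a hypothetical form snippet. Passing name as a keyword argument searches for tags called email, so it finds nothing:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
snippet = '<form><input name="email"/><input name="phone"/></form>'
soup = BeautifulSoup(snippet, 'html.parser')

# name= is the tag-name filter, so this looks for <email> tags.
by_tag_name = soup.find_all(name="email")

# attrs filters on the HTML name attribute instead.
by_attribute = soup.find_all("input", attrs={"name": "email"})

print(by_tag_name)
print(len(by_attribute))
```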

Conclusion

We can get nodes at various locations and with various attributes with Beautiful Soup.


By John Au-Yeung
