We can get data from web pages with Beautiful Soup.
It lets us parse the DOM and extract the data we want.
In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.
find_parents() and find_parent()
We can find the parent elements of a given element with the find_parents method. The find_parent method returns only the first matching parent element.
For example, we can write:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
a_string = soup.find(string="Lacie")
print(a_string.find_parents("a"))
And we get:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
logged.
We get the node with the string "Lacie". Then we get its parent a elements with the find_parents method.
If we replace find_parents with find_parent, then we get:
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
printed.
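As a minimal sketch, reusing the soup and a_string variables from the snippet above, we could also ask for a specific parent tag directly; the expected results are shown as comments:
# continuing from the snippet above
print(a_string.find_parent("a"))
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
print(a_string.find_parent("p")["class"])
# ['story']
find_parent stops at the nearest ancestor that matches the filter, so passing "p" skips past the a element and returns the enclosing paragraph.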
find_next_siblings() and find_next_sibling()
We can call the find_next_siblings and find_next_sibling methods to get the siblings that come after a given element.
For instance, we can write:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a
print(first_link.find_next_siblings("a"))
And then we get the a siblings that come after the first a element.
And so we see:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
logged.
If we call find_next_sibling on first_link, then we get:
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
find_previous_siblings() and find_previous_sibling()
We can find the siblings that come before an element with the find_previous_siblings and find_previous_sibling methods.
For instance, we can write:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
last_link = soup.find("a", id="link3")
print(last_link.find_previous_siblings("a"))
Then we call find_previous_siblings to get all the previous links.
So we get:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
displayed.
find_previous_sibling returns the first result only.
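For instance, a small sketch continuing from the snippet above:
# continuing from the snippet above
print(last_link.find_previous_sibling("a"))
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>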
find_all_next() and find_next()
We can call the find_all_next method to get all the nodes that come after the given node in the document, not just its siblings.
For example, we can write:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a
print(first_link.find_all_next(string=True))
Then we get:
['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.', '\n', '...', '\n']
returned.
find_next returns only the first node that comes after the given node.
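A minimal sketch continuing from the snippet above shows the difference; unlike the sibling methods, find_next is not limited to siblings, so it can reach the last p element from first_link:
# continuing from the snippet above
print(first_link.find_next(string=True))
# Elsie
print(first_link.find_next("p"))
# <p class="story">...</p>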
Conclusion
We can get the parent, sibling, and following nodes of an element with Beautiful Soup.