We can get data from web pages with Beautiful Soup.
It lets us parse the DOM and extract the data we want.
In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.
find_parents() and find_parent()
We can find the parent elements of a given element with the find_parents method. The find_parent method returns only the first matching parent element.
For example, we can write:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
a_string = soup.find(string="Lacie")
print(a_string.find_parents("a"))
And we get:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
logged.
We get the node with the string "Lacie". Then we get its parent a elements with the find_parents method.
If we replace find_parents with find_parent, then we get:
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
printed.
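As a minimal sketch, reusing the soup and a_string variables from the snippet above, we could also ask for a specific parent tag directly; the expected results are shown as comments:
# continuing from the snippet above
print(a_string.find_parent("a"))
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
print(a_string.find_parent("p")["class"])
# ['story']
find_parent stops at the nearest ancestor that matches the filter, so passing "p" skips past the a element and returns the enclosing paragraph.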
find_next_siblings() and find_next_sibling()
We can call the find_next_siblings and find_next_sibling methods to get the siblings that come after a given element.
For instance, we can write:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a
print(first_link.find_next_siblings("a"))
And then we get the a siblings that come after the first a element.
And so we see:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
logged.
If we call find_next_sibling on first_link, then we get:
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
find_previous_siblings() and find_previous_sibling()
We can find the siblings that come before an element with the find_previous_siblings and find_previous_sibling methods.
For instance, we can write:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
last_link = soup.find("a", id="link3")
print(last_link.find_previous_siblings("a"))
Then we call find_previous_siblings to get all the previous links.
So we get:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
displayed.
find_previous_sibling returns the first result only.
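For instance, a small sketch continuing from the snippet above:
# continuing from the snippet above
print(last_link.find_previous_sibling("a"))
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>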
find_all_next() and find_next()
We can call the find_all_next method to get all the nodes that come after the given node in the document, not just its siblings.
For example, we can write:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a
print(first_link.find_all_next(string=True))
Then we get:
['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.', '\n', '...', '\n']
returned.
find_next returns only the first node that comes after the given node.
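A minimal sketch continuing from the snippet above shows the difference; unlike the sibling methods, find_next is not limited to siblings, so it can reach the last p element from first_link:
# continuing from the snippet above
print(first_link.find_next(string=True))
# Elsie
print(first_link.find_next("p"))
# <p class="story">...</p>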
Conclusion
We can get the parent, sibling, and following nodes of an element with Beautiful Soup.