We can get data from web pages with Beautiful Soup.
It lets us parse the DOM and extract the data we want.
In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.
find_all_previous()
and find_previous()
We can get all the nodes that comes before a given node with the find_all_previous
method.
For example, if we have:
from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a
print(first_link.find_all_previous('p'))
Then we see:
[<p class="story">Once upon a time there were three little sisters; and their names weren<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> andn<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;nand they lived at the bottom of a well.</p>, <p class="title"><b>The Dormouse's story</b></p>]
printed.
We get all the p
elements that comes before the first a
element.
The find_previous
method returns the first node only.
CSS Selectors
We can find elements by tags:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.select("title"))
print(soup.select("p:nth-of-type(3)"))
print(soup.select("body a"))
print(soup.select("html head title"))
print(soup.select("head > title"))
print(soup.select("p > a"))
print(soup.select(".sister"))
print(soup.select("#link1"))
Then we get the elements with the given CSS selectors with the soup.select
method.
Modifying the Tree
We can change the text content of an element by writing:
from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.string = "New link text."
print(tag)
We get the a
element with soup.a
.
Then we set the string
property to set the text content.
And then we see print the tag
and see:
<a href="http://example.com/">New link text.</a>
append()
We can add to a tag’s content with the append
method.
For example, we can write:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.append("Bar")
print(soup.a.contents)
Then we add 'Bar'
to the a
element as the child of a
.
So soup.a.contents
is:
[u'Foo', u'Bar']
extend()
The extend
method adds every elemnt of a list to a tag.
For instance., we can write:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.extend([' ', 'bar', ' ', 'baz'])
print(soup.a)
And we get:
<a>Foo bar baz</a>
as the result.
NavigableString()
and .new_tag()
We can add navigable strings into an element.
For example, we can write:
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
print(tag)
print(tag.contents)
And we get:
<b>Hello there</b>
for tag
and:
[u'Hello', u' there']
for tag.contents
.
Also, we can add a comment node with the Comment
class:
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
new_comment = Comment("Nice to see you.")
tag.append(new_comment)
print(tag)
print(tag.contents)
Then tag
is:
<b><!--Nice to see you.--></b>
and tag.contents
is:
[u'Nice to see you.']
Conclusion
We can get elements and add nodes to other nodes with Beautiful Soup.