We can get data from web pages with Beautiful Soup.
It lets us parse the DOM and extract the data we want.
In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.
CData
We can insert a CData block into a document with Beautiful Soup.
For example, we can write:
from bs4 import BeautifulSoup, CData
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
We replaced the comment inside the b tag with the CData block, so the print function will print:
<b>
<![CDATA[A CDATA block]]>
</b>
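To confirm what the replacement did to the tree, we can check the node types before and after. This is a minimal sketch that reuses the markup from above; the variable names was_comment and is_cdata are only illustrative.

```python
from bs4 import BeautifulSoup, CData, Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

# The only child of <b> is a comment node.
comment = soup.b.string
was_comment = isinstance(comment, Comment)

# Swap the comment for a CData node and look at the child again.
comment.replace_with(CData("A CDATA block"))
is_cdata = isinstance(soup.b.string, CData)

print(was_comment, is_cdata)
print(str(soup.b))
```

Both Comment and CData are special kinds of strings in Beautiful Soup, which is why soup.b.string returns them.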
Going Down
We can navigate down from a tag to the tags nested inside it by using their names as attributes.
For example, we can write:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.head)
print(soup.title)
The first print call gets the head element.
And the second print call gets the title element.
So we get:
<head><title>The Dormouse's story</title></head>
and:
<title>The Dormouse's story</title>
respectively.
We can also get the b element by writing:
print(soup.body.b)
to get the first b element in body.
So we get:
<b>The Dormouse's story</b>
printed.
And we can write:
print(soup.a)
to get the first a element.
So we get:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
printed.
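Attribute access like soup.a only ever returns the first match, and it behaves like a call to find. The sketch below illustrates this on a shortened version of the document; the names first_link and missing are only illustrative, and the table tag is chosen simply because it does not appear in the markup.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# soup.a is shorthand for finding the first <a> tag in the document.
first_link = soup.a
print(first_link is soup.find("a"))
print(first_link["id"])

# A tag name that is not in the document gives None instead of raising.
missing = soup.table
print(missing)
```

Because a missing tag gives None, chained access such as soup.table.tr would raise an AttributeError, so it can be worth checking for None first.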
We can use the find_all method to find all elements that match the given tag name.
For example, we can write:
print(soup.find_all('a'))
And we get:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
printed.
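Since find_all returns a list of tags, we can loop over the result to pull out the data we want. The following is a small sketch on a shortened document; the links and second names are only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Map each link's text to its href attribute.
links = {a.get_text(): a["href"] for a in soup.find_all("a")}
print(links)

# Keyword arguments filter on attributes, here the id attribute.
second = soup.find_all("a", id="link2")
print(second)
```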
.contents and .children
We can get the children of a tag as a list with the contents property.
For example, we can write:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
print(head_tag.contents)
And we see:
[<title>The Dormouse's story</title>]
printed.
We can get the content of the title tag by writing:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
title_tag = head_tag.contents[0]
print(title_tag.contents)
We get the head element with soup.head.
And we get its first child, the title tag, with head_tag.contents[0].
And we get the title tag’s content with title_tag.contents.
So we see:
["The Dormouse's story"]
printed.
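The contents list holds strings as well as tags, so the whitespace between tags can show up as children too. Here is a minimal sketch of keeping only the tags, assuming the html.parser parser and a small made-up document; the names children and tags_only are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = "<body>\n<p>One</p>\n<p>Two</p>\n</body>"
soup = BeautifulSoup(html_doc, "html.parser")

# contents includes the newline strings between the <p> tags.
children = soup.body.contents
print(len(children))

# Keep only the children that are tags, dropping the text nodes.
tags_only = [child for child in children if isinstance(child, Tag)]
print([tag.name for tag in tags_only])
```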
We can also loop through the title_tag’s content with a for loop:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
title_tag = head_tag.contents[0]
for child in title_tag.children:
    print(child)
Then we see ‘The Dormouse’s story’ printed.
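The children property gives us an iterator over the same nodes that contents holds in a list. A short sketch on a made-up snippet shows the two agree; the names p_tag, child_names and same_nodes are only illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")
p_tag = soup.p

# Iterate over the direct children and collect their tag names.
child_names = [child.name for child in p_tag.children]
print(child_names)

# Turning the iterator into a list gives the same nodes as contents.
same_nodes = list(p_tag.children) == p_tag.contents
print(same_nodes)
```

Using children is convenient when we only want to loop once, while contents is the better fit when we need indexing or the length.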
.descendants
We can get the descendants of an element with the descendants property.
For example, we can write:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
for child in head_tag.descendants:
    print(child)
Then we see:
<title>The Dormouse's story</title>
The Dormouse's story
printed.
We get the title element and then the string inside it, because descendants walks the whole subtree under the tag rather than only its direct children.
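To make the difference between direct children and descendants concrete, we can count both for the same node. This is a minimal sketch on a cut-down document, assuming the html.parser parser; the names direct_children, all_descendants and tag_names are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# The soup object has a single direct child: the html tag.
direct_children = list(soup.children)

# descendants also visits head, title and the title's string.
all_descendants = list(soup.descendants)
print(len(direct_children), len(all_descendants))

# Keep only the tags among the descendants, in document order.
tag_names = [node.name for node in all_descendants if isinstance(node, Tag)]
print(tag_names)
```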
Conclusion
Beautiful Soup can insert CData blocks into a document and navigate down to a tag’s children and descendants.