We can get data from web pages with Beautiful Soup.
It lets us parse the DOM and extract the data we want.
In this article, we’ll look at how to manipulate HTML documents with Beautiful Soup.
extract()
The extract() method removes a tag or string from the tree and returns it.
For example, we can write:
from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
i_tag = soup.i.extract()
print(i_tag)
print(a_tag)
Then i_tag is the removed element:
<i>example.com</i>
and a_tag no longer contains it:
<a href="http://example.com/">I linked to </a>
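Because extract() hands back the node it removed, that node can be attached somewhere else later. The following is a minimal sketch reusing the markup above; the names detached and restored are only illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

# Remove the <i> tag and keep the returned node.
i_tag = soup.i.extract()

# The extracted node is detached, so it has no parent.
detached = i_tag.parent is None

# Attach it again at the end of the <a> tag.
soup.a.append(i_tag)
restored = str(soup.a)
print(detached)
print(restored)
```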
decompose()
The decompose() method removes a tag from the tree and completely destroys it and its contents.
So if we write:
from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
i_tag = soup.i.decompose()
print(i_tag)
print(a_tag)
Then i_tag is None, because decompose() does not return the removed tag, and a_tag is:
<a href="http://example.com/">I linked to </a>
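If we keep our own reference to a tag before destroying it, we can check afterwards that it is gone. This sketch assumes Beautiful Soup 4.9.0 or later, where tags have a decomposed property; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

# Keep our own reference, since decompose() itself returns None.
i_tag = soup.i
i_tag.decompose()

# Assumption: Beautiful Soup 4.9.0+ for the .decomposed property.
was_destroyed = i_tag.decomposed

# The tree no longer has an <i> tag to find.
remaining_i = soup.i
print(was_destroyed, remaining_i)
```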
replace_with()
The replace_with() method removes a node from the tree and replaces it with the tag or string of our choice.
For instance, we can use it by writing:
from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)
print(a_tag)
Then a_tag is now:
<a href="http://example.com/">I linked to <b>example.net</b></a>
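The replacement does not have to be a tag. As a small sketch of the string case, we can pass plain text, and replace_with() gives back the node it removed; old_tag is just an illustrative name.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

# Swap the <i> tag for a plain string and keep the node that was removed.
old_tag = soup.i.replace_with("example.net")

print(old_tag)
print(soup.a)
```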
wrap()
The wrap() method wraps an element in the tag we specify.
It returns the new wrapper.
For example, we can write:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
soup.p.string.wrap(soup.new_tag("b"))
soup.p.wrap(soup.new_tag("div"))
print(soup)
The soup is now:
<div><p><b>I wish I was bold.</b></p></div>
after we called wrap() twice: once to wrap a b around the string inside p, and once to wrap a div around the p element.
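Since wrap() returns the new wrapper, we can hold on to it and keep working with it. This is a minimal sketch; the class name note is made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')

# Build a wrapper with an attribute; "note" is a hypothetical class name.
div = soup.new_tag("div", attrs={"class": "note"})

# wrap() returns the wrapper that now contains the <p> element.
wrapper = soup.p.wrap(div)

print(wrapper.name)
print(soup)
```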
unwrap()
The unwrap() method is the opposite of wrap(): it replaces a tag with whatever is inside it.
For instance, we can write:
from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
a_tag.i.unwrap()
print(a_tag)
Then the new value of a_tag is:
<a href="http://example.com/">I linked to example.com</a>
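A common use of unwrap() is stripping formatting tags while keeping their text. The sketch below uses made-up markup and combines unwrap() with find_all(), which returns a list that is safe to loop over while we modify the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two kinds of formatting tags.
markup = '<p>Some <b>bold</b> and <i>italic</i> text.</p>'
soup = BeautifulSoup(markup, 'html.parser')

# Replace every <b> and <i> tag with its own contents.
for tag in soup.find_all(['b', 'i']):
    tag.unwrap()

flattened = str(soup.p)
print(flattened)
```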
Output
We can pretty print our HTML with the prettify() method.
For example, we can write:
from bs4 import BeautifulSoup
markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.prettify())
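Then we see:
<html>
 <head>
  <body>
   <a href="http://example.com/">
    I linked to
    <i>
     example.com
    </i>
   </a>
  </body>
 </head>
</html>
displayed. The body ends up nested inside head because the markup never closes the head tag and html.parser does not repair it.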
We can also prettify child nodes.
For example, we can write:
from bs4 import BeautifulSoup
markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.a.prettify())
Then this is printed:
<a href="http://example.com/">
 I linked to
 <i>
  example.com
 </i>
</a>
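prettify() returns an ordinary string, so we can process it like any other text. As a rough sketch, we can split it into lines and strip the indentation to see one tag or string per line; the names pretty and lines are only illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

# prettify() gives back a str with one tag or string per line.
pretty = soup.a.prettify()

# Drop the indentation to inspect the individual pieces.
lines = [line.strip() for line in pretty.splitlines()]
print(lines)
```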
Non-pretty Printing
We can just print the node objects without prettifying.
For example, we can write:
from bs4 import BeautifulSoup
markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup)
print(soup.a)
Then the first print call outputs:
<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></head></html>
and the second print call outputs:
<a href="http://example.com/">I linked to <i>example.com</i></a>
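When we need the markup as a value instead of printing it, str() gives the same compact text, and encode() gives bytes. A minimal sketch, with as_text and as_bytes as illustrative names:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

# str() returns the same compact markup that print() shows.
as_text = str(soup.a)

# encode() returns the markup as bytes in the given encoding.
as_bytes = soup.a.encode("utf-8")
print(as_text)
print(as_bytes)
```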
Conclusion
We can remove, replace, wrap, and unwrap nodes, and print them with or without pretty formatting, with Beautiful Soup.