We can get data from web pages with Beautiful Soup.
It lets us parse the DOM and extract the data we want.
In this article, we’ll look at how to manipulate HTML documents with Beautiful Soup.
NavigableString() and .new_tag()
We can add a text node with the NavigableString constructor.
For example, we can write:
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
print(tag)
print(tag.contents)
We first call append with a plain string, then call it again with a NavigableString instance.
Both calls add a text node to the end of the tag's contents.
Therefore, we get:
<b>Hello there</b>
from the first print and:
['Hello', ' there']
from the second print.
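As a quick check of the point above, the following sketch inspects the children after both append calls. It assumes Python 3 with the bs4 package installed; the variable names kinds and text are made up for this example.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<b></b>", "html.parser")
tag = soup.b

# A plain string is wrapped in a NavigableString when it is appended.
tag.append("Hello")
# An explicit NavigableString is appended as it is.
tag.append(NavigableString(" there"))

# Both children end up with the same node type.
kinds = [type(child).__name__ for child in tag.contents]
print(kinds)  # ['NavigableString', 'NavigableString']

# get_text joins the text nodes into one string.
text = tag.get_text()
print(text)  # Hello there
```

So in most cases passing a plain string to append should be enough, and the constructor is only needed when we want to hold on to the node object first.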
To create a new tag, we can use the new_tag method:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<b></b>", 'html.parser')
# new_tag only creates the tag; it is not attached to the tree yet
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."
print(new_tag)
print(new_tag.contents)
We call new_tag on the BeautifulSoup instance to create a new tag.
The tag does not become part of the tree until we add it with a method like append.
We set the text content by assigning to the string attribute.
Then new_tag should be:
<a href="http://www.example.com">Link text.</a>
And new_tag.contents is:
['Link text.']
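To see new_tag and append working together, here is a minimal sketch that builds a sentence with a link inside an empty paragraph. The markup, the link text, and the names paragraph, link, and result are assumptions made up for this illustration.

```python
from bs4 import BeautifulSoup

# Start from an empty paragraph (hypothetical markup for this example).
soup = BeautifulSoup("<p></p>", "html.parser")
paragraph = soup.p

# Plain strings become text nodes when appended.
paragraph.append("Read the ")

# Create the link, set its text, then attach it to the paragraph.
link = soup.new_tag("a", href="http://www.example.com")
link.string = "docs"
paragraph.append(link)

paragraph.append(" first.")

result = str(paragraph)
print(result)
```

The paragraph ends up with three children: a text node, the a tag, and another text node.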
insert()
The insert method is like append, but we can insert our node wherever we like.
For example, we can write:
from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
# index 1 places the new text between the existing text and the i tag
tag.insert(1, "but did not endorse ")
print(tag)
print(tag.contents)
We call insert with the index we want to insert the node at and the node content.
The index counts the tag's direct children, so 1 means after the first child.
Then we get:
<a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
printed with the first print and:
['I linked to ', 'but did not endorse ', <i>example.com</i>]
printed with the second print.
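The sketch below tries insert at the two ends of the child list, reusing the same markup. The inserted strings and the children variable are made up for this example.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Index 0 puts the new text node before every existing child.
tag.insert(0, "Yesterday ")

# An index equal to the number of children puts the node at the end,
# which has the same effect as append.
tag.insert(len(tag.contents), " again")

# Convert each child to a string so the list is easy to compare.
children = [str(child) for child in tag.contents]
print(children)
```

This suggests that append is just the special case of insert at the last position.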
insert_before() and insert_after()
The insert_before method lets us insert a node immediately before something else in the parse tree:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<b>leave</b>", 'html.parser')
tag = soup.new_tag("i")
tag.string = "Don't"
# soup.b.string is the text node "leave"; the new tag goes right before it
soup.b.string.insert_before(tag)
print(soup.b)
print(soup.b.contents)
We call insert_before on soup.b.string, the text node we want to insert before, and pass in the new tag.
Therefore, we get:
<b><i>Don't</i>leave</b>
and:
[<i>Don't</i>, 'leave']
respectively from the 2 print calls.
Similarly, we can call insert_after on the node that we want to insert after, passing in the nodes to add:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<b>leave</b>", 'html.parser')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
div = soup.new_tag('div')
div.string = 'ever'
# insert the string " you " and then the div right after the i tag
soup.b.i.insert_after(" you ", div)
print(soup.b)
print(soup.b.contents)
We call insert_before as we did before to add the i tag.
Then we call insert_after on the i tag to add ' you ' and the div after it, in that order.
Then we get:
<b><i>Don't</i> you <div>ever</div>leave</b>
as the value of soup.b and:
[<i>Don't</i>, ' you ', <div>ever</div>, 'leave']
as the value of soup.b.contents.
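Both methods also work when the reference node is a tag rather than a string. As a minimal sketch, assuming a made-up list with one item, we can add siblings on either side of it; the names item, first, third, and result are hypothetical.

```python
from bs4 import BeautifulSoup

# A list with a single item (hypothetical markup for this example).
soup = BeautifulSoup("<ul><li>two</li></ul>", "html.parser")
item = soup.li

first = soup.new_tag("li")
first.string = "one"
third = soup.new_tag("li")
third.string = "three"

# The new nodes become siblings of item, not children of it.
item.insert_before(first)
item.insert_after(third)

result = str(soup.ul)
print(result)
```

Unlike insert, these methods are called on the neighbouring node itself, so we do not need to know its index in the parent.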
clear()
The clear method removes the contents of a tag.
For example, we can use it by writing:
from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
# remove every child of the a tag; its attributes are kept
tag.clear()
print(tag)
Then tag is:
<a href="http://example.com/"></a>
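Since clear leaves the tag itself in place, one possible use is replacing a tag's content while keeping its attributes. The sketch below assumes the same markup as above; the replacement text and the names href and result are made up.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Remove the old children; the tag and its href attribute stay.
tag.clear()
href = tag["href"]

# Fill the now empty tag with new text.
tag.append("New link text")

result = str(tag)
print(result)
```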
Conclusion
We can insert nodes with the various methods that come with Beautiful Soup, and remove a tag's contents with clear.