DOM Manipulation with Beautiful Soup — Inserting Tags

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to manipulate HTML documents with Beautiful Soup.

NavigableString() and .new_tag()

We can add a text node with the NavigableString constructor.

For example, we can write:

from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
print(tag)
print(tag.contents)

We first append a plain string, then create a NavigableString instance and pass it to append to add a second text node.

Therefore, we get:

<b>Hello there</b>

from the first print and:

['Hello', ' there']

from the second print.
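
As a quick check of the example above, the following sketch suggests that a plain Python string passed to append is stored as a NavigableString as well, so both approaches should give the same kind of node. The variable name kinds is only illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<b></b>", "html.parser")
tag = soup.b

# Append one plain str and one explicit NavigableString.
tag.append("Hello")
tag.append(NavigableString(" there"))

# Check the type of each child node that ended up in the tag.
kinds = [isinstance(child, NavigableString) for child in tag.contents]
print(kinds)           # [True, True]
print(tag.get_text())  # Hello there
```

In practice this means we only need the NavigableString constructor when we want to build the text node before deciding where it goes.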

To create a new tag, we can use the new_tag method:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<b></b>", 'html.parser')
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."
print(new_tag)
print(new_tag.contents)

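We call new_tag on the BeautifulSoup instance to create a new tag. The first argument is the tag name, and keyword arguments such as href become attributes.

We set the text content by assigning to the string attribute.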

Then new_tag should be:

<a href="http://www.example.com">Link text.</a>

And new_tag.contents is:

['Link text.']
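
Note that new_tag only creates the element; it is not part of the document until we attach it. A minimal sketch of that last step, assuming we want the link inside the existing b element, could look like this:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")

# Create the tag and give it some text, as in the example above.
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."

# The new tag has no parent until we add it to the tree.
print(new_tag.parent)  # None

# Attach it as the last child of the b element.
soup.b.append(new_tag)
print(soup.b)  # <b><a href="http://www.example.com">Link text.</a></b>
```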

insert()

The insert method is like append, but it lets us choose the position at which the node is inserted.

For example, we can write:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.insert(1, "but did not endorse ")
print(tag)
print(tag.contents)

We call insert with the position in the tag's contents list and the node to insert. Position 1 places the new string between the existing text and the i tag.

Then we get:

<a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>

printed with the first print and:

['I linked to ', 'but did not endorse ', <i>example.com</i>]

printed with the second print.
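
To see how the position argument behaves at the edges, here is a small sketch that inserts at the start and at the end of the same tag. The inserted strings are made up for illustration.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Position 0 makes the new string the first child.
tag.insert(0, "Yes, ")

# A position equal to the number of children has the same effect as append.
tag.insert(len(tag.contents), " today")

print(tag)
# <a href="http://example.com/">Yes, I linked to <i>example.com</i> today</a>
```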

insert_before() and insert_after()

The insert_before method lets us insert a node immediately before something else in the parse tree:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<b>leave</b>", 'html.parser')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
print(soup.b)
print(soup.b.contents)

We call insert_before on the string inside the b element and pass in the new i tag, so the tag is inserted immediately before the text.

Therefore, we get:

<b><i>Don't</i>leave</b>

and:

[<i>Don't</i>, 'leave']

respectively from the 2 print calls.
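
insert_before is not limited to strings; it can also be called on a tag to add a sibling in front of it. The following is a minimal sketch with made-up markup and a hypothetical variable name first:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>second</p>", "html.parser")

# Build a new paragraph that should come before the existing one.
first = soup.new_tag("p")
first.string = "first"

# Insert the new paragraph immediately before the existing p tag.
soup.p.insert_before(first)

print(soup)  # <p>first</p><p>second</p>
```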

Similarly, we can call insert_after with the node object that we want to insert after:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<b>leave</b>", 'html.parser')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
div = soup.new_tag('div')
div.string = 'ever'
soup.b.i.insert_after(" you ", div)
print(soup.b)
print(soup.b.contents)

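As before, we call insert_before to put the i tag in front of the text.

Then we call insert_after on the i tag to insert the string ' you ' and the new div right after it, in that order.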

Then we get:

<b><i>Don't</i> you <div>ever</div>leave</b>

as the value of soup.b and:

[<i>Don't</i>, ' you ', <div>ever</div>, 'leave']

as the value of soup.b.contents.
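
The same method is handy for adding an item in the middle of a list of siblings. Here is a small sketch, using made-up list markup, that also checks the sibling link after the insert:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>three</li></ul>", "html.parser")

# Build the missing list item.
two = soup.new_tag("li")
two.string = "two"

# soup.li is the first li element; the new item goes right after it.
soup.li.insert_after(two)

print(soup.ul)                    # <ul><li>one</li><li>two</li><li>three</li></ul>
print(soup.li.next_sibling is two)  # True
```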

clear()

The clear method removes the contents of a tag.

For example, we can use it by writing:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.clear()
print(tag)

Then tag is:

<a href="http://example.com/"></a>
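
Since clear only removes the children, the tag itself and its attributes stay in place, which makes it a convenient first step when replacing content. A short sketch of that pattern, with a made-up replacement string:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Remove every child node; the tag and its attributes are kept.
tag.clear()
print(tag.contents)  # []
print(tag["href"])   # http://example.com/

# Fill the now-empty tag with new text.
tag.string = "New text"
print(tag)  # <a href="http://example.com/">New text</a>
```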

Conclusion

We can create text nodes and tags with NavigableString and new_tag, insert them with append, insert, insert_before, and insert_after, and empty a tag with clear.
