
DOM Manipulation with Beautiful Soup — Removing Nodes, Wrap and Unwrap Elements, and Printing


We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to manipulate HTML documents with Beautiful Soup.

extract()

The extract method removes a node from the tree and returns it.

For example, we can write:

from bs4 import BeautifulSoup, NavigableString
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
i_tag = soup.i.extract()
print(i_tag)
print(a_tag)

Then we get:

<i>example.com</i>

as the value of i_tag and:

<a href="http://example.com/">I linked to </a>

as the value of a_tag.
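Because extract returns the removed node, that node is now fully detached from the tree. As a quick check with the same markup:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
i_tag = soup.i.extract()
# the extracted tag no longer has a parent
print(i_tag.parent)  # None
print(a_tag)  # <a href="http://example.com/">I linked to </a>
```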

decompose()

The decompose method removes a tag from the tree and completely destroys it and its contents.

So if we write:

from bs4 import BeautifulSoup, NavigableString
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
i_tag = soup.i.decompose()
print(i_tag)
print(a_tag)

Then i_tag is None and a_tag is:

<a href="http://example.com/">I linked to </a>
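A common use for decompose is stripping unwanted tags, such as script tags, from a page before processing it. For instance, with some made-up markup:

```python
from bs4 import BeautifulSoup

# hypothetical markup with a script tag we want destroyed
html = '<div><p>Hello</p><script>alert(1)</script></div>'
soup = BeautifulSoup(html, 'html.parser')
for script in soup.find_all('script'):
    script.decompose()  # removes the tag and destroys its contents
print(soup)  # <div><p>Hello</p></div>
```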

replace_with()

The replace_with method removes a node from the tree and replaces it with the tag or string of our choice.

For instance, we can use it by writing:

from bs4 import BeautifulSoup, NavigableString
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)
print(a_tag)

Then a_tag is now:

<a href="http://example.com/">I linked to <b>example.net</b></a>
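replace_with also accepts a plain string, and it returns the node that was replaced, so we can keep a reference to the old tag:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
# replacing the i tag with a string; the replaced tag is returned
old_tag = a_tag.i.replace_with("example.net")
print(a_tag)    # <a href="http://example.com/">I linked to example.net</a>
print(old_tag)  # <i>example.com</i>
```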

wrap()

The wrap method wraps an element in the tag we specify.

It returns the new wrapper.

For example, we can write:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
soup.p.string.wrap(soup.new_tag("b"))
soup.p.wrap(soup.new_tag("div"))
print(soup)

The soup is now:

<div><p><b>I wish I was bold.</b></p></div>

after we call wrap to wrap the p element's string in a b tag and the p element itself in a div.
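Since wrap returns the new wrapper, we can capture it and keep working with it:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
# wrap returns the wrapper tag it just created
wrapper = soup.p.string.wrap(soup.new_tag("b"))
print(wrapper)  # <b>I wish I was bold.</b>
print(soup)     # <p><b>I wish I was bold.</b></p>
```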

unwrap()

The unwrap method removes a wrapper tag, leaving its contents in place.

For instance, we can write:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
a_tag.i.unwrap()
print(a_tag)

Then we get:

<a href="http://example.com/">I linked to example.com</a>

as the new value of a_tag.
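Like replace_with, unwrap returns the tag that was stripped off, now detached from the tree:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
# unwrap returns the removed wrapper tag
i_tag = soup.a.i.unwrap()
print(i_tag.name)  # i
print(soup.a)      # <a href="http://example.com/">I linked to example.com</a>
```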

Output

We can pretty print our HTML with the prettify method.

For example, we can write:

from bs4 import BeautifulSoup
markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.prettify())

Then we see:

<html>
 <head>
  <body>
   <a href="http://example.com/">
    I linked to
    <i>
     example.com
    </i>
   </a>
  </body>
 </head>
</html>

displayed.
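prettify also takes an optional formatter argument. For example, passing formatter="html" converts Unicode characters to HTML entities where possible:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>El Niño</p>', 'html.parser')
# the "html" formatter turns ñ into the &ntilde; entity
print(soup.prettify(formatter="html"))
```

The ñ comes out as &ntilde; in the prettified output.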

We can also prettify child nodes.

For example, we can write:

from bs4 import BeautifulSoup
markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.a.prettify())

Then:

<a href="http://example.com/">
 I linked to
 <i>
  example.com
 </i>
</a>

is printed.

Non-pretty Printing

We can just print the node objects without prettifying.

For example, we can write:

from bs4 import BeautifulSoup
markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup)
print(soup.a)

Then we get:

<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></head></html>

from the first print call and:

<a href="http://example.com/">I linked to <i>example.com</i></a>

from the second print call.
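The same compact output is available as a value with Python's built-in str, which is handy when we want to store or pass along the markup rather than print it:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
# str() renders a tag as a plain (non-prettified) string
html = str(soup.a)
print(html)  # <a href="http://example.com/">I linked to <i>example.com</i></a>
```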

Conclusion

We can remove nodes, wrap and unwrap nodes, and print them with Beautiful Soup.
