Categories
Beautiful Soup

DOM Manipulation with Beautiful Soup — Removing Nodes, Wrap and Unwrap Elements, and Printing

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to manipulate HTML documents with Beautiful Soup.

extract()

The extract method removes a node from the tree.

For example, we can write:

from bs4 import BeautifulSoup, NavigableString
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
i_tag = soup.i.extract()
print(i_tag)
print(a_tag)

Then we get:

<i>example.com</i>

as the value of i_tag and:

<a href="http://example.com/">I linked to </a>

as the value of a_tag.
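Since extract returns the node it removed, we can hold onto it and attach it somewhere else. A minimal sketch:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
# extract() detaches the <i> tag from the tree and returns it
i_tag = soup.a.i.extract()
# the detached node can be appended anywhere, e.g. at the top level
soup.append(i_tag)
print(soup)
```

The `<i>` tag now appears after the `<a>` tag instead of inside it.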

decompose()

The decompose method removes a tag from the tree and completely destroys it and its contents.

So if we write:

from bs4 import BeautifulSoup, NavigableString
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
i_tag = soup.i.decompose()
print(i_tag)
print(a_tag)

Then i_tag is None and a_tag is:

<a href="http://example.com/">I linked to </a>

replace_with()

The replace_with method removes a node from the tree and replaces it with the tag or string of our choice.

For instance, we can use it by writing:

from bs4 import BeautifulSoup, NavigableString
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)
print(a_tag)

Then a_tag is now:

<a href="http://example.com/">I linked to <b>example.net</b></a>

wrap()

The wrap method wraps an element in the tag we specify.

It returns the new wrapper.

For example, we can write:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
soup.p.string.wrap(soup.new_tag("b"))
soup.p.wrap(soup.new_tag("div"))
print(soup)

The soup is now:

<div><p><b>I wish I was bold.</b></p></div>

after we call wrap to wrap a b element around the string and a div around our p element.

unwrap()

The unwrap method removes a tag, putting its contents in its place.

For instance, we can write:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
a_tag.i.unwrap()
print(a_tag)

Then we get:

<a href="http://example.com/">I linked to example.com</a>

as the new value of a_tag.

Output

We can pretty print our HTML with the prettify method.

For example, we can write:

from bs4 import BeautifulSoup
markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.prettify())

Then we see:

<html>
 <head>
  <body>
   <a href="http://example.com/">
    I linked to
    <i>
     example.com
    </i>
   </a>
  </body>
 </head>
</html>

displayed.

We can also prettify child nodes.

For example, we can write:

from bs4 import BeautifulSoup
markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.a.prettify())

Then:

<a href="http://example.com/">
 I linked to
 <i>
  example.com
 </i>
</a>

is printed.

Non-pretty Printing

We can just print the node objects without prettifying.

For example, we can write:

from bs4 import BeautifulSoup
markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup)
print(soup.a)

Then we get:

<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></head></html>

from the first print call and:

<a href="http://example.com/">I linked to <i>example.com</i></a>

from the second print call.

Conclusion

We can remove nodes, wrap and unwrap nodes, and print them with Beautiful Soup.

DOM Manipulation with Beautiful Soup — Inserting Tags

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to manipulate HTML documents with Beautiful Soup.

NavigableString() and .new_tag()

We can add a text node with the NavigableString constructor.

For example, we can write:

from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
print(tag)
print(tag.contents)

We call append with the plain string and with the NavigableString instance to append both as text nodes.

Therefore, we get:

<b>Hello there</b>

from the first print and:

['Hello', ' there']

from the second print.

To create a new tag, we can use the new_tag method:

from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b></b>", 'html.parser')
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."
print(new_tag)
print(new_tag.contents)

We call new_tag on the BeautifulSoup instance to create a new tag.

We set the text content by setting the string property.

Then new_tag should be:

<a href="http://www.example.com">Link text.</a>

And new_tag.contents is:

['Link text.']
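Note that new_tag only creates the tag; nothing shows up in the document until we attach it with a method like append. A quick sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", 'html.parser')
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."
# the new tag is detached until we add it to the tree
soup.b.append(new_tag)
print(soup.b)
```

The new a element now appears inside the b element.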

insert()

The insert method is like append, but we can insert our node wherever we like.

For example, we can write:

from bs4 import BeautifulSoup, NavigableString
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.insert(1, "but did not endorse ")
print(tag)
print(tag.contents)

We call insert with the index we want to insert the node at and the node content.

Then we get:

<a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>

printed with the first print and:

['I linked to ', 'but did not endorse ', <i>example.com</i>]

printed with the second print.

insert_before() and insert_after()

The insert_before method lets us insert a node immediately before something else in the parse tree:

from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b>leave</b>", 'html.parser')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
print(tag)
print(tag.contents)

We call insert_before on the b tag's string with our new tag, since we want to insert the new node before that string.

Therefore, we get:

<i>Don't</i>

and:

["Don't"]

respectively from the two print calls.

Similarly, we can call insert_after on a node with the content we want to insert after it:

from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b>leave</b>", 'html.parser')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
div = soup.new_tag('div')
div.string = 'ever'
soup.b.i.insert_after(" you ", div)
print(soup.b)
print(soup.b.contents)

We call insert_before to put the i tag before the string, as before.

Then we call insert_after on the i tag to insert the ' you ' string and the div after it.

Then we get:

<b><i>Don't</i> you <div>ever</div>leave</b>

as the value of soup.b and:

[<i>Don't</i>, ' you ', <div>ever</div>, 'leave']

as the value of soup.b.contents.

clear()

The clear method removes the contents of a tag.

For example, we can use it by writing:

from bs4 import BeautifulSoup, NavigableString
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.clear()
print(tag)

Then tag is:

<a href="http://example.com/"></a>
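By default, clear detaches each child with extract, so the children survive as detached nodes. If we don't need them again, we can pass decompose=True to destroy them instead. A small sketch:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
# destroy the children outright instead of just detaching them
tag.clear(decompose=True)
print(tag)
```

The tag itself ends up the same either way; only the fate of the removed children differs.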

Conclusion

We can insert nodes with the various methods that come with Beautiful Soup.

Web Scraping with Beautiful Soup — Siblings, CSS Selectors, and Node Manipulation

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

find_all_previous() and find_previous()

We can get all the nodes that come before a given node with the find_all_previous method.

For example, if we have:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a
print(first_link.find_all_previous('p'))

Then we see:

[<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="title"><b>The Dormouse's story</b></p>]

printed.

We get all the p elements that come before the first a element.

The find_previous method returns the first node only.
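For instance, calling find_previous on the same markup returns just the nearest preceding p element, which here is the one containing the link:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a
# only the nearest <p> before the link is returned
print(first_link.find_previous('p'))
```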

CSS Selectors

We can find elements by tags:

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.select("title"))
print(soup.select("p:nth-of-type(3)"))
print(soup.select("body a"))
print(soup.select("html head title"))
print(soup.select("head > title"))
print(soup.select("p > a"))
print(soup.select(".sister"))
print(soup.select("#link1"))

The soup.select method returns a list of the elements matching the given CSS selector.
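When we only need the first match, select_one saves us indexing into the list. A sketch with the same html_doc:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
# select_one returns the first matching element, or None if nothing matches
print(soup.select_one(".sister"))
print(soup.select_one(".no-such-class"))
```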

Modifying the Tree

We can change the text content of an element by writing:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.a
tag.string = "New link text."
print(tag)

We get the a element with soup.a .

Then we set the string property to set the text content.

Then we print the tag and see:

<a href="http://example.com/">New link text.</a>

append()

We can add to a tag’s content with the append method.

For example, we can write:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.append("Bar")

print(soup.a.contents)

Then we append 'Bar' as a child of the a element.

So soup.a.contents is:

['Foo', 'Bar']

extend()

The extend method adds every element of a list to a tag.

For instance, we can write:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.extend([' ', 'bar', ' ', 'baz'])
print(soup.a)

And we get:

<a>Foo bar baz</a>

as the result.

NavigableString() and .new_tag()

We can add navigable strings into an element.

For example, we can write:

from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
print(tag)
print(tag.contents)

And we get:

<b>Hello there</b>

for tag and:

['Hello', ' there']

for tag.contents.

Also, we can add a comment node with the Comment class:

from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
new_comment = Comment("Nice to see you.")
tag.append(new_comment)
print(tag)
print(tag.contents)

Then tag is:

<b><!--Nice to see you.--></b>

and tag.contents is:

['Nice to see you.']

Conclusion

We can get elements and add nodes to other nodes with Beautiful Soup.

Web Scraping with Beautiful Soup — Siblings and Parent Nodes

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

find_parents() and find_parent()

We can find parent elements of a given element with the find_parents method.

The find_parent method returns the first parent element only.

For example, we can write:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
a_string = soup.find(string="Lacie")
print(a_string.find_parents("a"))

And we get:

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

logged.

We get the element with the string "Lacie" .

Then we get the parents of that with the find_parents method.

If we replace find_parents with find_parent , then we get:

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

printed.

find_next_siblings() and find_next_sibling()

We can call find_next_siblings and find_next_sibling to get the sibling elements of a given element.

For instance, we can write:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a
print(first_link.find_next_siblings("a"))

And then we get the siblings that come after the first a element.

And so we see:

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

logged.

If we call find_next_sibling on first_link , then we get:

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

find_previous_siblings() and find_previous_sibling()

We can find previous siblings with the find_previous_siblings and find_previous_sibling methods.

For instance, we can write:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
last_link = soup.find("a", id="link3")
print(last_link.find_previous_siblings("a"))

Then we call find_previous_siblings to get all the previous links.

So we get:

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

displayed.

find_previous_sibling returns the first result only.
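A short sketch of the singular form, again with the same html_doc:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
last_link = soup.find("a", id="link3")
# only the nearest preceding <a> sibling is returned
print(last_link.find_previous_sibling("a"))
```

This prints the link with id "link2" only.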

find_all_next() and find_next()

We can call the find_all_next method to get all the nodes that come after a given node in the document.

For example, we can write:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a
print(first_link.find_all_next(string=True))

Then we get:

['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.', '\n', '...', '\n']

returned.

find_next returns only the first node that comes after a given node.
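For example, a small sketch of find_next with the same html_doc:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a
# the first <a> that appears after the first link
print(first_link.find_next("a"))
```

This prints the link with id "link2".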

Conclusion

We can get siblings and parent nodes with Beautiful Soup.

Web Scraping with Beautiful Soup — Searching Nodes

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

Searching Strings with Regex

We can search strings with regex.

For example, we can write:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(string=re.compile("Dormouse")))

We call re.compile to create our regex, and find_all returns every string containing a match.
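Regexes aren't limited to strings; passing one as the first argument matches against tag names instead. A sketch that finds every tag whose name starts with "b":

```python
from bs4 import BeautifulSoup
import re

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
# matches the body and b tags
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
```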

Also, we can search for strings with a function:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

def is_the_only_string_within_a_tag(s):
    return (s == s.parent.string)

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(string=is_the_only_string_within_a_tag))

We get the string from the node with s.parent.string .

s is the string node we’re searching for.

The limit Argument

We can limit the number of items returned with find_all with the limit argument.

For example, we can write:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all("a", limit=2))

And we see:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

logged.

The recursive Argument

We can set whether to search for elements recursively with the recursive argument.

For example, if we want to disable recursive search, we write:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.html.find_all("title", recursive=False))

then we get an empty list since we turned off recursive search.

This is because title isn't a direct child of html; it's nested inside head, and with recursive=False Beautiful Soup only checks direct children.
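With the default recursive=True, the same search descends through head and finds the title. A sketch contrasting the two:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
# recursive search (the default) descends through <head> to reach <title>
print(soup.html.find_all("title"))
# direct children of <html> only, so <title> isn't found
print(soup.html.find_all("title", recursive=False))
```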

find()

We can find the first element with the given selector with find :

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find('title'))

Then we get:

<title>The Dormouse's story</title>

printed.

We can chain find calls:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find("head").find("title"))
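Unlike find_all, which returns an empty list when nothing matches, find returns None, so it's worth checking the result before chaining further calls on it. A sketch with a tag that doesn't exist in the document:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
# find_all returns [] on no match, but find returns None
print(soup.find_all("footer"))
print(soup.find("footer"))
```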

Conclusion

We can search for various elements with Beautiful Soup.