Categories
Beautiful Soup

Web Scraping with Beautiful Soup — Equality, Copies, and Parsing Part of a Document

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

Comparing Objects for Equality

We can compare objects for equality.

For example, we can write:

from bs4 import BeautifulSoup
markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, 'html.parser')
first_b, second_b = soup.find_all('b')
print(first_b == second_b)
print(first_b.previous_element == second_b.previous_element)

Then the first print prints True since the first b element and the 2nd one have the same structure and content.

The 2nd print prints False because the previous element to each b element is different.

Copying Beautiful Soup Objects

We can copy Beautiful Soup objects.

We can use the copy module from the standard library to do this:

from bs4 import BeautifulSoup
import copy

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, 'html.parser')
p_copy = copy.copy(soup.p)
print(p_copy)

The copy is considered to be equal to the original.
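
The copy has the same markup as the original, so the two compare as equal, but it's a different object, and the copy is detached from the original tree. As a quick check (a minimal sketch building on the same snippet), we'd expect True, False, and None here:

from bs4 import BeautifulSoup
import copy

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, 'html.parser')
p_copy = copy.copy(soup.p)

# Same markup, so equality holds
print(p_copy == soup.p)
# But it's a different object
print(p_copy is soup.p)
# The copy is detached from the original tree, so it has no parent
print(p_copy.parent)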

Parsing Only Part of a Document

We can use the SoupStrainer class to tell Beautiful Soup which parts of a document to parse. For example, we can write:

from bs4 import BeautifulSoup, SoupStrainer

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
only_a_tags = SoupStrainer("a")
only_tags_with_id_link2 = SoupStrainer(id="link2")

def is_short_string(string):
    return string is not None and len(string) < 10
only_short_strings = SoupStrainer(string=is_short_string)

print(only_a_tags)
print(only_tags_with_id_link2)
print(only_short_strings)

We can select only the parts we want with SoupStrainer.

The selection can be done with a tag name, or we can pass in an id, or pass in a function that decides which strings to keep.

Then we see:

a|{}
None|{'id': u'link2'}
None|{'string': <function is_short_string at 0x00000000036FC908>}

printed.
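
To actually parse only the selected parts, we can pass a SoupStrainer to the BeautifulSoup constructor with the parse_only parameter. For example, a minimal sketch using the same html_doc as above:

from bs4 import BeautifulSoup, SoupStrainer

only_a_tags = SoupStrainer("a")
# Using the same html_doc as above, only the <a> tags end up in the tree
a_soup = BeautifulSoup(html_doc, 'html.parser', parse_only=only_a_tags)
print(a_soup.prettify())

This should print only the three a elements from the document.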

Conclusion

We can parse part of a document, compare parsed objects for equality, and copy objects with Beautiful Soup.

Categories
Beautiful Soup

Web Scraping with Beautiful Soup — Encoding

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

Output Formatters

We can format our output with Beautiful Soup.

For example, we can write:

from bs4 import BeautifulSoup
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, 'html.parser')
print(soup.prettify(formatter="html"))

We set the formatter we want by passing it to prettify.

Also, we can use the html5 formatter.

For example, we can write:

from bs4 import BeautifulSoup
br = BeautifulSoup("<br>", 'html.parser').br
print(br.prettify(formatter="html"))
print(br.prettify(formatter="html5"))

Then from the first print , we see:

<br/>

And from the 2nd print , we see:

<br>

Also, we can set the formatter to None :

from bs4 import BeautifulSoup
link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
print(link_soup.a.encode(formatter=None))

Then the string is printed as-is.
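
We can also pass a function as the formatter. Beautiful Soup calls it once for every string and attribute value in the document. Here's a minimal sketch with a hypothetical uppercase formatter:

from bs4 import BeautifulSoup

# A hypothetical formatter function that upper-cases every string
def uppercase(s):
    return s.upper()

soup = BeautifulSoup('<p>I want <b>pizza</b>!</p>', 'html.parser')
print(soup.prettify(formatter=uppercase))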

get_text()

We can call the get_text method to get the text from an element.

For instance, we can write:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

print(soup.get_text())

Then we see:

I linked to example.com

printed.

We can specify how the bits of text can be joined together by passing in an argument.

For example, if we write:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

print(soup.get_text('|'))

Then we see:

I linked to |example.com|
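
We can also pass strip=True to strip whitespace from the beginning and end of each bit of text before they're joined:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

print(soup.get_text("|", strip=True))

Then we should see:

I linked to|example.com

printed.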

Encodings

We can get the encoding of the markup string.

For example, we can write:

from bs4 import BeautifulSoup
markup = "<h1>Sacrxc3xa9 bleu!</h1>"
soup = BeautifulSoup(markup, 'html.parser')
print(soup.original_encoding)

Then soup.original_encoding is ‘utf-8’ .

We specify the encoding of the string with the from_encoding parameter.

For instance, we can write:

from bs4 import BeautifulSoup
markup = b"<h1>xedxe5xecxf9</h1>"
soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8")
print(soup.h1)
print(soup.original_encoding)

We set the encoding in the BeautifulSoup constructor so that the markup is decoded the way we expect.

Also, we can call encode on a parsed node to encode it with a given encoding.

For example, we can write:

from bs4 import BeautifulSoup
markup = u"<b>N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup, 'html.parser')
tag = snowman_soup.b
print(tag.encode("latin-1"))

We pass the encoding we want to the encode method. Since the snowman character can’t be represented in Latin-1, it’s converted to a numeric entity reference.

Then we see:

<b>&#9731;</b>

printed.

Unicode, Dammit

We can use the UnicodeDammit class from Beautiful Soup to convert a string with any encoding to Unicode.

For example, we can write:

from bs4 import BeautifulSoup, UnicodeDammit
dammit = UnicodeDammit("Sacrxc3xa9 bleu!")
print(dammit.unicode_markup)
print(dammit.original_encoding)

Then dammit.unicode_markup is ‘Sacré bleu!’ and dammit.original_encoding is utf-8 .
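
If we already have some idea of what the encoding might be, we can pass Unicode, Dammit a list of candidate encodings to try. For instance, with Latin-1 encoded bytes:

from bs4 import UnicodeDammit
# b"Sacr\xe9 bleu!" is "Sacré bleu!" encoded as Latin-1
dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
print(dammit.unicode_markup)
print(dammit.original_encoding)

Then dammit.unicode_markup should be ‘Sacré bleu!’ and dammit.original_encoding should be ‘latin-1’.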

Smart Quotes

We can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:

from bs4 import BeautifulSoup, UnicodeDammit
markup = b"<p>I just x93lovex94 Microsoft Wordx92s smart quotes</p>"
print(UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup)
print(UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup)

Then we get:

<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>

from the first print and:

<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>

from the 2nd print .

Conclusion

Beautiful Soup can work with strings in various encodings.

Categories
Beautiful Soup

DOM Manipulation with Beautiful Soup — Removing Nodes, Wrap and Unwrap Elements, and Printing

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to manipulate HTML documents with Beautiful Soup.

extract()

The extract method removes a node from the tree.

For example, we can write:

from bs4 import BeautifulSoup, NavigableString
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
i_tag = soup.i.extract()
print(i_tag)
print(a_tag)

Then we get:

<i>example.com</i>

as the value of i_tag and:

<a href="http://example.com/">I linked to </a>

as the value of a_tag .
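
The extracted tag is completely detached from the original tree and becomes the root of its own tree, so it no longer has a parent:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
i_tag = soup.i.extract()
# The extracted tag is the root of its own tree, so it has no parent
print(i_tag.parent)

Then i_tag.parent is None.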

decompose()

The decompose method removes a tag from the tree and completely destroys it and its contents.

So if we write:

from bs4 import BeautifulSoup, NavigableString
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
i_tag = soup.i.decompose()
print(i_tag)
print(a_tag)

Then i_tag is None and a_tag is:

<a href="http://example.com/">I linked to </a>

replace_with()

The replace_with method removes a node from the tree and replaces it with the tag or string of our choice.

For instance, we can use it by writing:

from bs4 import BeautifulSoup, NavigableString
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)
print(a_tag)

Then a_tag is now:

<a href="http://example.com/">I linked to <b>example.net</b></a>

wrap()

The wrap method wraps an element with the tag we specified.

It returns the new wrapper.

For example, we can write:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
soup.p.string.wrap(soup.new_tag("b"))
soup.p.wrap(soup.new_tag("div"))
print(soup)

The soup is now:

<div><p><b>I wish I was bold.</b></p></div>

after we call wrap to wrap a b tag around the string and a div around our p element.

unwrap()

The unwrap method removes the wrapper element from the content.

For instance, we can write:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
a_tag.i.unwrap()
print(a_tag)

Then we get:

<a href="http://example.com/">I linked to example.com</a>

as the new value of a_tag .
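
Like replace_with, unwrap returns the tag it removed, though its contents have already been moved into the parent, so the returned tag should now be empty. A minimal sketch:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
# unwrap returns the <i> tag it removed; its contents now live in a_tag
removed_i = a_tag.i.unwrap()
print(removed_i)
print(a_tag)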

Output

We can pretty print our HTML with the prettify method.

For example, we can write:

from bs4 import BeautifulSoup
markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.prettify())

Then we see:

<html>
 <head>
  <body>
   <a href="http://example.com/">
    I linked to
    <i>
     example.com
    </i>
   </a>
  </body>
 </head>
</html>

displayed.

We can also prettify child nodes.

For example, we can write:

from bs4 import BeautifulSoup
markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.a.prettify())

Then:

<a href="http://example.com/">
 I linked to
 <i>
  example.com
 </i>
</a>

is printed.

Non-pretty Printing

We can just print the node objects without prettifying.

For example, we can write:

from bs4 import BeautifulSoup
markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup)
print(soup.a)

Then we get:

<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></head></html>

from the first print call and:

<a href="http://example.com/">I linked to <i>example.com</i></a>

from the 2nd print call.

Conclusion

We can remove nodes, wrap and unwrap nodes, and print them with Beautiful Soup.

Categories
Beautiful Soup

DOM Manipulation with Beautiful Soup — Inserting Tags

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to manipulate HTML documents with Beautiful Soup.

NavigableString() and .new_tag()

We can add a text node with the NavigableString constructor.

For example, we can write:

from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
print(tag)
print(tag.contents)

Then we call append with a plain string and with the NavigableString instance to append both to the tag.

Therefore, we get:

<b>Hello there</b>

from the first print and:

[u'Hello', u' there']

from the 2nd print .

To create a new tag, we can use the new_tag method:

from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b></b>", 'html.parser')
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."
print(new_tag)
print(new_tag.contents)

We call new_tag on the BeautifulSoup instance to add a new tag.

We set the text content by setting the string property.

Then new_tag should be:

<a href="http://www.example.com">Link text.</a>

And new_tag.contents is:

[u'Link text.']
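
Once created, the new tag can be attached to the tree like any other node, for instance by appending it to an existing element:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<b></b>", 'html.parser')
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."
# Attach the new tag as a child of the <b> element
soup.b.append(new_tag)
print(soup)

Then soup should be:

<b><a href="http://www.example.com">Link text.</a></b>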

insert()

The insert method is like append , but we can insert our node wherever we like.

For example, we can write:

from bs4 import BeautifulSoup, NavigableString
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.insert(1, "but did not endorse ")
print(tag)
print(tag.contents)

We call insert with the index we want to insert the node at and the node content.

Then we get:

<a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>

printed with the first print and:

[u'I linked to ', u'but did not endorse ', <i>example.com</i>]

printed with the 2nd print .

insert_before() and insert_after()

The insert_before method lets us insert a node immediately before something else in the parse tree:

from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b>leave</b>", 'html.parser')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
print(tag)
print(tag.contents)

We call insert_before on the b tag’s string, passing in the new i tag, since we want to insert the new node right before that string.

Therefore, we get:

<i>Don't</i>

and:

[u"Don't"]

respectively from the 2 print calls.

Similarly, we can call insert_after with the node object that we want to insert after:

from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b>leave</b>", 'html.parser')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
div = soup.new_tag('div')
div.string = 'ever'
soup.b.i.insert_after(" you ", div)
print(soup.b)
print(soup.b.contents)

We call insert_before as before to insert the i tag in front of the b tag’s string.

Then we call insert_after on the i tag to add ' you ' and the div after it.

Then we get:

<b><i>Don't</i> you <div>ever</div>leave</b>

as the value of soup.b and:

[<i>Don't</i>, u' you ', <div>ever</div>, u'leave']

as the value of soup.b.contents .

clear()

The clear method removes the contents of a tag.

For example, we can use it by writing:

from bs4 import BeautifulSoup, NavigableString
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.clear()
print(tag)

Then tag is:

<a href="http://example.com/"></a>

Conclusion

We can insert nodes with various methods that come with Beautiful Soup.

Categories
Beautiful Soup

Web Scraping with Beautiful Soup — Siblings, CSS Selectors, and Node Manipulation

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

find_all_previous() and find_previous()

We can get all the nodes that come before a given node with the find_all_previous method.

For example, if we have:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a
print(first_link.find_all_previous('p'))

Then we see:

[<p class="story">Once upon a time there were three little sisters; and their names weren<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> andn<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;nand they lived at the bottom of a well.</p>, <p class="title"><b>The Dormouse's story</b></p>]

printed.

We get all the p elements that come before the first a element.

The find_previous method returns only the first matching node.
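
For example, continuing with the same soup, we can get just the nearest preceding p element:

# Using the same soup and first_link as above
print(first_link.find_previous('p'))

This should print only the ‘story’ paragraph that immediately precedes the first a element.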

CSS Selectors

We can find elements by tags:

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.select("title"))
print(soup.select("p:nth-of-type(3)"))
print(soup.select("body a"))
print(soup.select("html head title"))
print(soup.select("head > title"))
print(soup.select("p > a"))
print(soup.select(".sister"))
print(soup.select("#link1"))

Then we get the elements with the given CSS selectors with the soup.select method.
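
select returns a list of every matching element. If we only want the first match, we can use the select_one method instead:

# Using the same soup as above, select_one returns only the first match
print(soup.select_one(".sister"))

This should print just the first sister link, the one with id link1.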

Modifying the Tree

We can change the text content of an element by writing:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.a
tag.string = "New link text."
print(tag)

We get the a element with soup.a .

Then we set the string property to set the text content.

And then we print the tag and see:

<a href="http://example.com/">New link text.</a>

append()

We can add to a tag’s content with the append method.

For example, we can write:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.append("Bar")

print(soup.a.contents)

Then we add 'Bar' to the a element as a child of a.

So soup.a.contents is:

[u'Foo', u'Bar']

extend()

The extend method adds every element of a list to a tag.

For instance, we can write:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.extend([' ', 'bar', ' ', 'baz'])
print(soup.a)

And we get:

<a>Foo bar baz</a>

as the result.

NavigableString() and .new_tag()

We can add navigable strings into an element.

For example, we can write:

from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
print(tag)
print(tag.contents)

And we get:

<b>Hello there</b>

for tag and:

[u'Hello', u' there']

for tag.contents .

Also, we can add a comment node with the Comment class:

from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
new_comment = Comment("Nice to see you.")
tag.append(new_comment)
print(tag)
print(tag.contents)

Then tag is:

<b><!--Nice to see you.--></b>

and tag.contents is:

[u'Nice to see you.']

Conclusion

We can get elements and add nodes to other nodes with Beautiful Soup.