Web Scraping with Beautiful Soup — Encoding


We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how Beautiful Soup formats its output, extracts text, and handles character encodings.

Output Formatters

We can format our output with Beautiful Soup.

For example, we can write:

from bs4 import BeautifulSoup
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, 'html.parser')
print(soup.prettify(formatter="html"))

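to choose a formatter when we call prettify. With formatter="html", Beautiful Soup converts Unicode characters to HTML entities whenever possible, so é is written as &eacute;.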
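As a rough comparison, the sketch below renders the same paragraph with the default "minimal" formatter and with "html". The variable names minimal_out and html_out are only illustrative.

```python
from bs4 import BeautifulSoup

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, "html.parser")

# "minimal" (the default) escapes only what is needed for valid markup.
minimal_out = soup.p.decode(formatter="minimal")
# "html" also writes characters such as é as named entities.
html_out = soup.p.decode(formatter="html")

print(minimal_out)
print(html_out)
```

Both versions keep the angle brackets escaped; they differ only in how the accented character is written.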

We can also use the html5 formatter, which works like html but leaves out the closing slash in void tags such as br.

For example, we can write:

from bs4 import BeautifulSoup
br = BeautifulSoup("<br>", 'html.parser').br
print(br.prettify(formatter="html"))
print(br.prettify(formatter="html5"))

Then from the first print, we see:

<br/>

And from the second print, we see:

<br>

Also, we can set the formatter to None:

from bs4 import BeautifulSoup
link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
print(link_soup.a.encode(formatter=None))

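Then the string is printed as-is. With formatter=None, Beautiful Soup does not escape anything on output, so the & in the URL stays unescaped. This is the fastest option, but it can produce invalid HTML.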
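To make the difference visible, here is a small sketch that serializes the same link with the default formatter and with formatter=None. The names escaped and raw are hypothetical.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/?foo=val1&bar=val2">A link</a>'
link = BeautifulSoup(markup, "html.parser").a

# The default formatter escapes the bare ampersand in the attribute value.
escaped = link.decode()
# formatter=None leaves the attribute value untouched.
raw = link.decode(formatter=None)

print(escaped)
print(raw)
```

The unescaped version matches the input, while the default output is the one that is safe to embed in an HTML document.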

get_text()

We can call the get_text method to get all the text inside an element, including the text of its descendants.

For instance, we can write:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

print(soup.get_text())

Then we see:

I linked to example.com

printed.

We can specify the string used to join the bits of text by passing it as the first argument.

For example, if we write:

from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

print(soup.get_text('|'))

Then we see:

I linked to |example.com|
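The separator is placed between every text node, including the ones that hold only whitespace. As a sketch of one way to tidy that up, get_text also accepts strip=True, and the stripped_strings generator yields the same pieces one at a time. The names joined and pieces are illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# strip=True trims each piece of text and drops the empty ones before joining.
joined = soup.get_text("|", strip=True)
# stripped_strings gives the same trimmed pieces as separate strings.
pieces = list(soup.stripped_strings)

print(joined)  # I linked to|example.com
print(pieces)  # ['I linked to', 'example.com']
```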

Encodings

We can get the encoding that Beautiful Soup detected for the markup. Detection only happens when the markup is passed in as bytes.

For example, we can write:

from bs4 import BeautifulSoup
markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
soup = BeautifulSoup(markup, 'html.parser')
print(soup.original_encoding)

Then soup.original_encoding is 'utf-8'.
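The following sketch contrasts text input with byte input. It assumes nothing beyond the html.parser setup used above, and the variable names are made up for illustration. The detected value for the byte input is deliberately not shown, since detection is a best guess and can depend on which detection libraries are installed.

```python
from bs4 import BeautifulSoup

text_markup = "<h1>Sacré bleu!</h1>"
byte_markup = text_markup.encode("utf-8")

soup_from_text = BeautifulSoup(text_markup, "html.parser")
soup_from_bytes = BeautifulSoup(byte_markup, "html.parser")

# A str is already Unicode, so there is nothing to detect.
print(soup_from_text.original_encoding)  # None
# For bytes, this is the encoding Beautiful Soup settled on.
print(soup_from_bytes.original_encoding)
```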

If we already know the encoding, we can specify it with the from_encoding parameter.

For instance, we can write:

from bs4 import BeautifulSoup
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8")
print(soup.h1)
print(soup.original_encoding)

We pass the encoding to the BeautifulSoup constructor so that the bytes are decoded the way we expect, in this case as Hebrew text.
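One way to check that from_encoding was honoured is to decode the same bytes by hand and compare. This is only a sketch; raw and expected are hypothetical names.

```python
from bs4 import BeautifulSoup

raw = b"\xed\xe5\xec\xf9"
markup = b"<h1>" + raw + b"</h1>"

soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")

# Decode the heading bytes ourselves with the same codec for comparison.
expected = raw.decode("iso-8859-8")

print(soup.h1.string == expected)  # True
```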

Also, we can call encode on a parsed node to serialize it as bytes in the given encoding.

For example, we can write:

from bs4 import BeautifulSoup
markup = "<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup, 'html.parser')
tag = snowman_soup.b
print(tag.encode("latin-1"))

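to set the encoding. Latin-1 cannot represent the snowman character, so Beautiful Soup writes it as a numeric character reference.

Then we see:

b'<b>&#9731;</b>'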

printed.
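For comparison, this sketch encodes the same tag to UTF-8, which can represent the snowman directly, and to Latin-1, which cannot. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

snowman_soup = BeautifulSoup("<b>\N{SNOWMAN}</b>", "html.parser")
tag = snowman_soup.b

# UTF-8 has a byte sequence for the snowman, so it is written directly.
utf8_bytes = tag.encode("utf-8")
# Latin-1 does not, so a numeric character reference is used instead.
latin1_bytes = tag.encode("latin-1")

print(utf8_bytes)    # b'<b>\xe2\x98\x83</b>'
print(latin1_bytes)  # b'<b>&#9731;</b>'
```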

Unicode, Dammit

We can use the UnicodeDammit class from Beautiful Soup on its own to convert a byte string in an unknown encoding to Unicode.

For example, we can write:

from bs4 import BeautifulSoup, UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
print(dammit.original_encoding)

Then dammit.unicode_markup is 'Sacré bleu!' and dammit.original_encoding is 'utf-8'.
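When we have some idea of what the encoding might be, we can pass a list of candidates as the second argument, and Unicode, Dammit tries them first. A minimal sketch, assuming the data really is Latin-1:

```python
from bs4 import UnicodeDammit

# The same phrase encoded as Latin-1 instead of UTF-8.
data = "Sacré bleu!".encode("latin-1")

# Candidate encodings are tried in order before any guessing happens.
dammit = UnicodeDammit(data, ["latin-1", "iso-8859-1"])

print(dammit.unicode_markup)     # Sacré bleu!
print(dammit.original_encoding)  # the candidate that worked
```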

Smart Quotes

We can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:

from bs4 import BeautifulSoup, UnicodeDammit
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
print(UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup)
print(UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup)

Then we get:

<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>

from the first print and:

<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>

from the second print.
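Entities are not the only option. As a sketch, smart_quotes_to="ascii" swaps in plain ASCII quotes, and leaving the argument out converts the smart quotes to their Unicode characters. The names as_ascii and as_unicode are illustrative.

```python
from bs4 import UnicodeDammit

markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# Replace smart quotes with their plain ASCII look-alikes.
as_ascii = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
# With no smart_quotes_to, the quotes become Unicode curly quotes.
as_unicode = UnicodeDammit(markup, ["windows-1252"]).unicode_markup

print(as_ascii)    # <p>I just "love" Microsoft Word's smart quotes</p>
print(as_unicode)  # <p>I just “love” Microsoft Word’s smart quotes</p>
```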

Conclusion

Beautiful Soup can work with markup in various encodings, and it lets us control how the output is formatted and encoded.


By John Au-Yeung
