We can get data from web pages with Beautiful Soup.
It lets us parse the DOM and extract the data we want.
In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.
Output Formatters
We can format our output with Beautiful Soup.
For example, we can write:
from bs4 import BeautifulSoup
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, 'html.parser')
print(soup.prettify(formatter="html"))
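to choose the formatter when we call prettify. The html formatter converts Unicode characters to HTML entities where it can.
The html5 formatter is similar to html, but it omits the closing slash in void tags such as br.
For example, we can write: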
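To see what the html formatter changes, it can help to compare it with the default minimal formatter. The following is a small sketch: the variable names are made up for illustration, and it calls decode rather than prettify only to keep each result on one line.

```python
from bs4 import BeautifulSoup

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, "html.parser")

# "minimal" (the default) escapes only what is needed for valid markup.
minimal_out = soup.p.decode(formatter="minimal")
# "html" also converts characters such as é to named entities where it can.
html_out = soup.p.decode(formatter="html")

print(minimal_out)
print(html_out)
```

Both results keep the angle brackets escaped. Only the html formatter turns the accented character back into an entity.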
from bs4 import BeautifulSoup
br = BeautifulSoup("<br>", 'html.parser').br
print(br.prettify(formatter="html"))
print(br.prettify(formatter="html5"))
The first print outputs:
<br/>
The second print outputs:
<br>
We can also set the formatter to None:
from bs4 import BeautifulSoup
link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
print(link_soup.a.encode(formatter=None))
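With formatter=None, Beautiful Soup doesn't modify strings at all on output, so the & in the URL is printed as-is rather than escaped. This is fast, but it can produce invalid HTML.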
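A side-by-side comparison makes the effect of formatter=None easier to see. This is a minimal sketch with illustrative variable names. It uses decode, which returns a string, instead of encode, which returns bytes.

```python
from bs4 import BeautifulSoup

link_soup = BeautifulSoup(
    '<a href="http://example.com/?foo=val1&bar=val2">A link</a>', "html.parser"
)

# Default "minimal" formatter: the bare ampersand is escaped on output.
escaped = link_soup.a.decode()
# formatter=None: strings are written out untouched.
unescaped = link_soup.a.decode(formatter=None)

print(escaped)
print(unescaped)
```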
get_text()
We can call the get_text method to get all the human-readable text from an element and its descendants.
For instance, we can write:
from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.get_text())
Then we see:
I linked to example.com
printed.
We can pass a separator string as the first argument to control how the bits of text are joined together.
For example, if we write:
from bs4 import BeautifulSoup
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup.get_text('|'))
Then we see:
I linked to |example.com|
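The separator example leaves the surrounding whitespace in place. As a hedged sketch of one way to tidy that up, get_text also accepts strip=True, and the stripped_strings generator yields the same pieces one at a time. The variable names here are illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# strip=True trims whitespace from each piece of text before joining.
joined = soup.get_text("|", strip=True)
# stripped_strings yields the trimmed pieces and skips whitespace-only ones.
pieces = list(soup.stripped_strings)

print(joined)  # I linked to|example.com
print(pieces)  # ['I linked to', 'example.com']
```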
Encodings
When we pass in markup as bytes, Beautiful Soup detects its encoding and stores it in original_encoding.
For example, we can write:
from bs4 import BeautifulSoup
markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
soup = BeautifulSoup(markup, 'html.parser')
print(soup.original_encoding)
Then soup.original_encoding is 'utf-8'.
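Encoding detection only applies to bytes. As a small sketch with made-up variable names, the snippet below parses the same document once as a Python string and once as UTF-8 bytes. The exact name reported for the bytes version can depend on which detection libraries are installed, so treat the printed value as typical rather than guaranteed.

```python
from bs4 import BeautifulSoup

text = "<h1>Sacré bleu!</h1>"
raw = text.encode("utf-8")

from_str = BeautifulSoup(text, "html.parser")
from_bytes = BeautifulSoup(raw, "html.parser")

# A str is already Unicode, so there is nothing to detect.
print(from_str.original_encoding)
# Bytes have to be decoded, so the detected encoding is recorded.
print(from_bytes.original_encoding)
```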
If we already know the encoding of the markup, we can specify it with the from_encoding parameter.
For instance, we can write:
from bs4 import BeautifulSoup
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8")
print(soup.h1)
print(soup.original_encoding)
We pass the encoding to the BeautifulSoup constructor so that the bytes are decoded the way we expect.
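One way to check that from_encoding did its job is to compare the parsed heading with what Python's own codec produces for the same bytes. This is only a sketch, and the expected variable is introduced here for illustration.

```python
from bs4 import BeautifulSoup

markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")

# Decode the same four bytes directly with Python's ISO-8859-8 codec.
expected = b"\xed\xe5\xec\xf9".decode("iso-8859-8")

print(soup.h1.string == expected)
print(soup.original_encoding)
```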
Also, we can call encode on a parsed node to serialize it to bytes in the given encoding.
For example, we can write:
from bs4 import BeautifulSoup
markup = u"<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup, 'html.parser')
tag = snowman_soup.b
print(tag.encode("latin-1"))
to set the encoding.
Then we see:
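b'<b>&#9731;</b>'
printed, because the snowman character can't be represented in Latin-1 and is written as a numeric character reference instead.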
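To illustrate how the target encoding affects the result, here is a short sketch that encodes the same tag three ways. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup("<b>\N{SNOWMAN}</b>", "html.parser").b

# UTF-8 can represent the snowman directly, as three bytes.
as_utf8 = tag.encode("utf-8")
# Latin-1 and ASCII cannot, so a numeric character reference is used.
as_latin1 = tag.encode("latin-1")
as_ascii = tag.encode("ascii")

print(as_utf8)    # b'<b>\xe2\x98\x83</b>'
print(as_latin1)  # b'<b>&#9731;</b>'
print(as_ascii)   # b'<b>&#9731;</b>'
```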
Unicode, Dammit
We can use the UnicodeDammit class from Beautiful Soup on its own to convert a byte string in an unknown encoding to Unicode.
For example, we can write:
from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
print(dammit.original_encoding)
Then dammit.unicode_markup is 'Sacré bleu!' and dammit.original_encoding is 'utf-8'.
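Guesses are more reliable when we give Unicode, Dammit a hint. As a sketch, the second argument can be a list of encodings to try first. The byte string below is Latin-1 rather than UTF-8, and was chosen for this illustration.

```python
from bs4 import UnicodeDammit

# \xe9 is "é" in Latin-1; on its own it is not valid UTF-8.
dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])

print(dammit.unicode_markup)     # Sacré bleu!
print(dammit.original_encoding)  # latin-1
```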
Smart Quotes
We can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:
from bs4 import UnicodeDammit
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
print(UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup)
print(UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup)
The first print outputs:
<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>
The second print outputs:
<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>
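Entities are not the only option. As a hedged sketch with illustrative variable names, smart_quotes_to="ascii" replaces the smart quotes with plain ASCII quotes, and leaving the argument out converts them to their Unicode equivalents.

```python
from bs4 import UnicodeDammit

markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# Replace smart quotes with straight ASCII quotes.
ascii_quotes = UnicodeDammit(
    markup, ["windows-1252"], smart_quotes_to="ascii"
).unicode_markup
# Default behaviour: smart quotes become Unicode curly quotes.
unicode_quotes = UnicodeDammit(markup, ["windows-1252"]).unicode_markup

print(ascii_quotes)
print(unicode_quotes)
```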
Conclusion
Beautiful Soup can work with strings in various encodings, and it lets us control how the output is formatted.