We can get data from web pages with Beautiful Soup.
It lets us parse the DOM and extract the data we want.
In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.
Manipulating Attributes
We can manipulate attributes with Beautiful Soup.
For example, we can write:
from bs4 import BeautifulSoup
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id'] = 'verybold'
tag['another-attribute'] = 1
print(tag)
del tag['id']
del tag['another-attribute']
print(tag)
We add and remove items from the tag as if it were a dictionary to manipulate its attributes.
The first print statement prints:
<b another-attribute="1" id="verybold">bold</b>
and the second one prints:
<b>bold</b>
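Since a tag behaves like a dictionary for its attributes, the usual dictionary-style lookups should work as well. The following is a minimal sketch that assumes the same b tag as above; the title attribute is only a made-up example of an attribute that isn't there.

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b

# .attrs exposes the attributes as a plain dictionary
attrs_snapshot = dict(tag.attrs)
print(attrs_snapshot)

# .get() returns None (or a default) instead of raising KeyError
# 'title' is a hypothetical attribute that the tag does not have
missing = tag.get('title')
fallback = tag.get('title', 'no title')
print(missing, fallback)

# has_attr() checks whether an attribute exists without reading it
has_id = tag.has_attr('id')
print(has_id)
```

Using get is handy when we scrape pages where an attribute may or may not be present.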
Multi-Valued Attributes
Beautiful Soup works with attributes with multiple values.
For example, we can parse:
from bs4 import BeautifulSoup
css_soup = BeautifulSoup('<p class="body bold"></p>', 'html.parser')
print(css_soup.p['class'])
Then we get ['body', 'bold'] printed.
When we turn a tag back into a string, the values of a multi-valued attribute are joined with spaces:
from bs4 import BeautifulSoup
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
The print statement will print:
<p>Back to the <a rel="index contents">homepage</a></p>
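Multi-valued attributes are an HTML feature, so a document parsed with the 'xml' parser keeps the value as a single string. If we parse the markup with lxml's HTML parser instead, we get the same result as before: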
from bs4 import BeautifulSoup
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
print(xml_soup.p['class'])
We still get:
['body', 'strikeout']
printed.
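Not every attribute is treated as multi-valued. As a rough sketch, assuming a p tag with both an id and a class (the markup is made up for illustration), we can compare how the two are returned and use get_attribute_list when we always want a list:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="my id" class="body bold"></p>', 'html.parser')
p = soup.p

# class is multi-valued in HTML, so it is split into a list
class_value = p['class']
# id is not multi-valued, so the whitespace is kept and we get a string
id_value = p['id']
# get_attribute_list() wraps single values so we always get a list
id_list = p.get_attribute_list('id')
print(class_value, id_value, id_list)
```

This way, code that loops over attribute values doesn't need to check whether it got a string or a list.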
NavigableString
We can get text within a tag. For example, we can write:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(type(tag.string))
Then we get:
<class 'bs4.element.NavigableString'>
printed.
The tag.string property returns the navigable string inside the b tag.
We can convert it into a Python string by writing:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
unicode_string = str(tag.string)
print(unicode_string)
Then 'Extremely bold' is printed.
We can replace a navigable string with a different string.
To do that, we write:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.string)
tag.string.replace_with("No longer bold")
print(tag.string)
Then we see:
Extremely bold
No longer bold
printed.
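A navigable string is also a regular Python string underneath, and the string property only works when a tag has a single child. The sketch below assumes a small made-up paragraph with nested markup to show both points:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Very <b>bold</b> text</p>', 'html.parser')

# A NavigableString is a subclass of str, so string methods work on it
bold_text = soup.b.string
is_str = isinstance(bold_text, str)
upper = bold_text.upper()

# .string is None when a tag has more than one child
p_string = soup.p.string
# get_text() joins all the text inside the tag instead
p_text = soup.p.get_text()
print(is_str, upper, p_string, p_text)
```

So when a tag mixes text and child tags, get_text is usually the safer way to read its text.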
BeautifulSoup Object
The BeautifulSoup object represents the whole parsed document.
For example, if we have:
from bs4 import BeautifulSoup
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
doc.find(string="INSERT FOOTER HERE").replace_with(footer)
print(doc)
print(doc.name)
Then we see:
<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>
printed from the first print call.
And:
[document]
printed from the second print call.
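The same idea should also work without lxml installed. Here is a minimal sketch that assumes the built-in html.parser and a made-up div placeholder instead of the XML document above:

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup('<div>INSERT FOOTER HERE</div>', 'html.parser')
footer = BeautifulSoup('<footer>The footer</footer>', 'html.parser')

# The BeautifulSoup object has the special name '[document]'
doc_name = doc.name

# Swap the placeholder text for the second parsed document
doc.find(string='INSERT FOOTER HERE').replace_with(footer)
result = str(doc)
print(doc_name)
print(result)
```

Since html.parser treats the markup as HTML, no XML declaration is added to the output.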
Comments and Other Special Strings
Beautiful Soup can parse comments and other special strings.
For example, we can write:
from bs4 import BeautifulSoup
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(type(comment))
print(soup.b.prettify())
We get the comment from the b element with the soup.b.string property.
So the first print call prints:
<class 'bs4.element.Comment'>
And the second print call prints:
<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>
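Since comments are their own type, we can also pick them out of a document and remove them. The following is a sketch under the assumption that we want to strip every comment from a small made-up paragraph:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = '<p>Visible<!-- hidden note --> text</p>'
soup = BeautifulSoup(markup, 'html.parser')

# Collect every Comment node by testing the type of each string
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]

# extract() removes each comment from the tree
for c in comments:
    c.extract()
cleaned = str(soup)
print(comment_texts)
print(cleaned)
```

This is useful when scraped pages contain commented-out markup that we don't want in the extracted text.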
Conclusion
We can manipulate attributes and work with strings with Beautiful Soup.