Categories
Beautiful Soup

Web Scraping with Beautiful Soup — Strings

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

.string

We can get the text content from the element if there’s only one child and the child is a NavigableString .

For example, we can write:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
title_tag = soup.head.title
print(title_tag.string)

to get the title from the head tag.

Then we get:

The Dormouse's story

printed.

If the only child has another tag and that tag has a string, then the parent yag is considered to have the same string as the child.

So if we write:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
print(head_tag.string)

We also get:

The Dormouse's story

printed.

And if we replace tyhe last line with:

print(head_tag.contents)

We see:

[<title>The Dormouse's story</title>]

printed.

If there’s more than one thing inside a tag, we can use the .strings generator to look at all the contents:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
for string in soup.strings:
    print(repr(string))

Then we see:

u"The Dormouse's story"
u'n'
u'n'
u"The Dormouse's story"
u'n'
u'Once upon a time there were three little sisters; and their names weren'
u'Elsie'
u',n'
u'Lacie'
u' andn'
u'Tillie'
u';nand they lived at the bottom of a well.'
u'n'
u'...'
u'n'

printed.

To get the strings without the extra whitespace, we can use the stripped_strings generator.

For example, we can write:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
for string in soup.stripped_strings:
    print(repr(string))

And we get:

u"The Dormouse's story"
u"The Dormouse's story"
u'Once upon a time there were three little sisters; and their names were'
u'Elsie'
u','
u'Lacie'
u'and'
u'Tillie'
u';nand they lived at the bottom of a well.'
u'...'

printed.

Conclusion

We can get strings at various locations with Beautiful Soup.

Categories
Beautiful Soup

Web Scraping with Beautiful Soup — Child Nodes

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

CData

We can get the CData from a document with Beautiful Soup.

For example, wen can write:

from bs4 import BeautifulSoup, CData
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())

We replaced the comment inside the b tag with the CData block, so the print function will print:

<b>
 <![CDATA[A CDATA block]]>
</b>

Going Down

We can get tags with other tags.

For example, we can write:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.head)
print(soup.title)

The first print call gets the head element’s content.

And the 2nd print call gets the title element’s content.

So we get:

<head><title>The Dormouse's story</title></head>

and:

<title>The Dormouse's story</title>

respectively.

We can also get the b element by writing:

print(soup.body.b)

to get the first b element in body .

So we get:

<b>The Dormouse's story</b>

printed.

And:

print(soup.a)

to get the first a element.

So we tet:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

printed.

We can use the find_all method to find all elements with the given selector.

For example, we can write:

print(soup.find_all('a'))

And we get:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

printed.

.contents and .children

We can get the contents of tags with the contents property.

For exam[ple, we can write:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
print(head_tag.contents)

And we see:

[<title>The Dormouse's story</title>]

printed.

We can get the content of the title tag by writing:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
title_tag = head_tag.contents[0]
print(title_tag.contents)

We get the head element with soup.head .

And we get the content of it with head_tag.contents[0] .

And we get the title tag’s content with title_tag.contents .

So we see:

[u"The Dormouse's story"]

printed.

We can also loop through the title_tag ‘s content with a for loop:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
title_tag = head_tag.contents[0]
for child in title_tag.children:
    print(child)

Then we see ‘The Dormouse’s story’ logged.

.descendants

We can get the descendants of an elemnt with the descendants property.

For example, we can write:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
for child in head_tag.descendants:
    print(child)

Then we see:

<title>The Dormouse's story</title>
The Dormouse's story

logged.

We get the title element and the content of it, so it goes through the tree.

Conclusion

Beautiful Soup can work with CData and child nodes.

Categories
Beautiful Soup

Web Scraping with Beautiful Soup — Attributes and Strings

ps%3A%2F%2Funsplash.com%3Futm_source%3Dmedium%26utm_medium%3Dreferral)

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

Manipulating Attributes

We can manipulate attributes with Beautiful Soup.

For example, we can write:

from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id'] = 'verybold'
tag['another-attribute'] = 1
print(tag)
del tag['id']
del tag['another-attribute']
print(tag)

We just add and remove items from the tag dictionary to manipulate attributes.

Then the first print statement prints:

<b another-attribute="1" id="verybold">bold</b>

and the 2nd one prints:

<b>bold</b>

Multi-Valued Attributes

Beautiful Soup works with attributes with multiple values.

For example, we can parse:

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body bold"></p>', 'html.parser')
print(css_soup.p['class'])

Then we get [u’body’, u’bold’] printed.

All the values will be added after we turn the dictionary back to a string:

from bs4 import BeautifulSoup

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)

The print statement will print:

<p>Back to the <a rel="index contents">homepage</a></p>

If we parse a document withn XML with LXML, we get the same result:

from bs4 import BeautifulSoup

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
print(xml_soup.p['class'])

We still get:

['body', 'strikeout']

printed.

NavigableString

We can get text within a tag. For example, we can write:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.string
print(type(tag.string))

Then we get:

<class 'bs4.element.NavigableString'>

printed.

The tag.string property has a navigable string in the b tag.

We can convert it into a Python string by writing:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.string
unicode_string = str(tag.string)
print(unicode_string)

Then ‘Extremely bold’ is printed.

We can replace a navigable string with a different string.

To do that, we write:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.string)
tag.string.replace_with("No longer bold")
print(tag.string)

Then we see:

Extremely bold
No longer bold

printed.

BeautifulSoup Object

The BeautifulSoup object represents the whole parsed document.

For example, if we have:

from bs4 import BeautifulSoup

doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
doc.find(text="INSERT FOOTER HERE").replace_with(footer)
print(doc)
print(doc.name)

Then we see:

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>

printed from the first print call.

And:

[document]

printed from the 2nd print call.

Comments and Other Special Strings

Beautiful Soup can parse comments and other special strings.

For example, we can write:

from bs4 import BeautifulSoup

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(type(comment))
print(soup.b.prettify())

Then we can get the comment string from the b element with the soup.b.string property.

So the first print call prints:

<class 'bs4.element.Comment'>

And the 2nd print call prints:

<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>

Conclusion

We can manipulate attributes and work with strings with Beautiful Soup.

Categories
Beautiful Soup

Getting Started with Web Scraping with Beautiful Soup

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

Getting Started

We get started by running:

pip install beautifulsoup

Then we can write:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

to add an HTML string and parse it with the BeautifulSoup class.

Then we can print the parsed document in the last line.

Get Links and Text

We can get the links from the HTML string with the find_all method:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))

We just pass in the selector for the elements we wan to get.

Also, we can get all the text from the page with get_text():

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())

Parse an External Document

We can parse an external document by opening it with open :

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
    print(soup.prettify())

Kinds of Objects

We can get a few kinds of objects with Beautiful Soup.

They include Tag , NavigableString , BeautifulSoup , and Comment .

Tag

A Tag corresponds to an XML or HTML tag in the original docuemnt.

For example, we can write:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(type(tag))

to get the b tag from the HTML string.

Then we get:

<class 'bs4.element.Tag'>

printed from the last line.

Name

We can get the name of the tag:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.name)

Then we see b printed.

Attributes

We can get attributes from the returned dictionary:

from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
print(tag['id'])

We get the b element.

Then we get the id value from the returned dictionary.

Conclusion

We can get parse HTML and XML and get various elements, text, and attributes easily with Beautiful Soup.