Categories
Beautiful Soup

Web Scraping with Beautiful Soup — Attributes and Strings


We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

Manipulating Attributes

We can manipulate attributes with Beautiful Soup.

For example, we can write:

from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id'] = 'verybold'
tag['another-attribute'] = 1
print(tag)
del tag['id']
del tag['another-attribute']
print(tag)

A tag behaves like a dictionary of its attributes, so we add, change, and delete keys on the tag to manipulate them.

Then the first print statement prints:

<b another-attribute="1" id="verybold">bold</b>

and the 2nd one prints:

<b>bold</b>
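Reading attributes works through the same dictionary-style interface. The following is a minimal sketch, reusing the markup from above, of three ways to inspect attributes without raising an error when one is missing. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b

# .attrs exposes all attributes as a regular dict.
attrs_snapshot = dict(tag.attrs)

# tag['class'] would raise KeyError here; .get() returns None instead.
missing = tag.get('class')

# has_attr() checks for presence without reading the value.
has_id = tag.has_attr('id')

print(attrs_snapshot, missing, has_id)
```

Using get is usually the safer choice when scraping pages where an attribute may or may not be present.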

Multi-Valued Attributes

Beautiful Soup works with attributes with multiple values.

For example, we can parse:

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body bold"></p>', 'html.parser')
print(css_soup.p['class'])

Then we get ['body', 'bold'] printed.

When we assign a list to a multi-valued attribute, the values are joined with spaces once the tag is turned back into a string:

from bs4 import BeautifulSoup

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)

The print statement will print:

<p>Back to the <a rel="index contents">homepage</a></p>

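If we parse the document as HTML with the lxml parser, we get the same result. (When a document is parsed as XML instead, attributes are not treated as multi-valued and the value stays a single string.)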

from bs4 import BeautifulSoup

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
print(xml_soup.p['class'])

We still get:

['body', 'strikeout']

printed.
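Only attributes that HTML defines as multi-valued, such as class and rel, are split into lists. The sketch below, which assumes a reasonably recent Beautiful Soup 4 release, shows that id stays a single string, that get_attribute_list always returns a list, and that the multi_valued_attributes=None constructor argument turns the splitting off.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', 'html.parser')

# class is multi-valued, so it comes back as a list.
classes = soup.p['class']

# id is not multi-valued, so the space is kept and we get one string.
element_id = soup.p['id']

# get_attribute_list() returns a list whether or not the attribute is multi-valued.
id_list = soup.p.get_attribute_list('id')

# Passing multi_valued_attributes=None disables the splitting entirely.
plain = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser',
                      multi_valued_attributes=None)
class_string = plain.p['class']

print(classes, element_id, id_list, class_string)
```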

NavigableString

We can get text within a tag. For example, we can write:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.string
print(type(tag.string))

Then we get:

<class 'bs4.element.NavigableString'>

printed.

The tag.string property holds the NavigableString inside the b tag.

We can convert it into a Python string by writing:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.string
unicode_string = str(tag.string)
print(unicode_string)

Then ‘Extremely bold’ is printed.

We can replace a navigable string with a different string.

To do that, we write:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.string)
tag.string.replace_with("No longer bold")
print(tag.string)

Then we see:

Extremely bold
No longer bold

printed.
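To make the difference between a NavigableString and a plain Python string more concrete, here is a small sketch using the same markup. It relies on the fact that NavigableString subclasses str but also keeps a link back into the parse tree.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
navigable = soup.b.string

# A NavigableString is a str subclass, so string methods work on it directly.
is_navigable = isinstance(navigable, NavigableString)
is_also_str = isinstance(navigable, str)

# It also knows where it lives in the tree.
parent_name = navigable.parent.name

# str() produces a plain string with no reference to the tree.
plain = str(navigable)
plain_type = type(plain)

print(is_navigable, is_also_str, parent_name, plain_type)
```

Converting to a plain string is worthwhile when the text is kept long after parsing, because a NavigableString holds a reference to the whole parse tree.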

BeautifulSoup Object

The BeautifulSoup object represents the whole parsed document.

For example, if we have:

from bs4 import BeautifulSoup

doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
doc.find(string="INSERT FOOTER HERE").replace_with(footer)
print(doc)
print(doc.name)

Then we see:

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>

printed from the first print call.

And:

[document]

printed from the 2nd print call.

Comments and Other Special Strings

Beautiful Soup can parse comments and other special strings.

For example, we can write:

from bs4 import BeautifulSoup

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(type(comment))
print(soup.b.prettify())

Then we can get the comment string from the b element with the soup.b.string property.

So the first print call prints:

<class 'bs4.element.Comment'>

And the 2nd print call prints:

<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>
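A common follow-up is to collect or strip every comment in a document. The sketch below is one way to do it, assuming made-up sample markup: it filters strings by type with find_all and then removes each comment with extract.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup with two comments around visible content.
markup = "<div><!--first note--><p>Visible text</p><!--second note--></div>"
soup = BeautifulSoup(markup, 'html.parser')

# The string filter receives every text node; keep only the Comment ones.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]

# extract() detaches each comment from the tree.
for c in comments:
    c.extract()
cleaned = str(soup)

print(comment_texts)
print(cleaned)
```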

Conclusion

We can manipulate attributes and work with strings with Beautiful Soup.

Categories
Beautiful Soup

Getting Started with Web Scraping with Beautiful Soup

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

Getting Started

We get started by running:

pip install beautifulsoup4

Then we can write:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

to add an HTML string and parse it with the BeautifulSoup class.

Then we can print the parsed document in the last line.

Get Links and Text

We can get the links from the HTML string with the find_all method:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))

We just pass in the tag name of the elements we want to get.
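find_all accepts more than a tag name. As a rough sketch on a shortened version of the sample document (the third link and its URL are made up for illustration), we can filter by class with the class_ keyword, look up a single element by id with find, or use a CSS selector with select.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/about" class="nav" id="about">About</a>
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Every <a> tag, regardless of class.
all_links = [a['href'] for a in soup.find_all('a')]

# class_ (with the trailing underscore) filters on the CSS class.
sister_names = [a.get_text() for a in soup.find_all('a', class_='sister')]

# find() returns the first match only, or None when nothing matches.
second = soup.find(id='link2')

# select() takes a CSS selector instead of a tag name.
selected = [a['id'] for a in soup.select('p.story a.sister')]

print(all_links, sister_names, second['href'], selected)
```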

Also, we can get all the text from the page with get_text():

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())
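The raw output of get_text keeps all the whitespace from the source, which is often noisy. A minimal sketch of two ways to tidy it, using a small made-up snippet: pass a separator together with strip=True, or iterate over stripped_strings.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with indentation and padding inside the tags.
soup = BeautifulSoup('<div>\n  <h1> Title </h1>\n  <p> Body text </p>\n</div>',
                     'html.parser')

# Strip each text fragment, drop the empty ones, and join with a space.
joined = soup.get_text(' ', strip=True)

# stripped_strings yields the same cleaned fragments one at a time.
pieces = list(soup.stripped_strings)

print(joined)
print(pieces)
```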

Parse an External Document

We can parse an external document by opening it with open:

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
    print(soup.prettify())
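The snippet above expects an index.html file in the working directory. For a version that can be run anywhere, the sketch below first writes a small made-up page to a temporary directory and then parses it. Passing encoding="utf-8" to open is an assumption about how the file was saved.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical page content, written to disk only for this demonstration.
html = "<html><head><title>Sample page</title></head><body><p>café</p></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(html)

    # The document is read in full by the constructor,
    # so the soup stays usable after the file is closed.
    with open(path, encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, "html.parser")

title = soup.title.string
paragraph = soup.p.get_text()
print(title, paragraph)
```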

Kinds of Objects

We can get a few kinds of objects with Beautiful Soup.

They include Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A Tag corresponds to an XML or HTML tag in the original document.

For example, we can write:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(type(tag))

to get the b tag from the HTML string.

Then we get:

<class 'bs4.element.Tag'>

printed from the last line.

Name

We can get the name of the tag:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.name)

Then we see b printed.
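The name is writable as well. As a short sketch, assigning a new value to tag.name renames the element in the generated markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

# Renaming the tag changes both the opening and the closing tag.
tag.name = 'blockquote'
renamed = str(tag)
print(renamed)
```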

Attributes

We can read a tag's attributes by treating the tag like a dictionary:

from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
print(tag['id'])

We get the b element first.

Then we read its id value by indexing the tag with the attribute name, which prints boldest.

Conclusion

We can parse HTML and XML and get various elements, text, and attributes easily with Beautiful Soup.

Categories
Flask

Python Web Development with Flask — Favicon, Background Tasks, and HTTP Method Overrides

Flask is a simple web framework written in Python.

In this article, we’ll look at how to develop simple Python web apps with Flask.

Favicon

We can add a favicon by putting it in the static folder and then referencing it.

For example, we can write:

app.py

from flask import send_from_directory, Flask, render_template
import os

app = Flask(__name__)

@app.route('/favicon.ico')
def favicon():
    return send_from_directory(os.path.join(app.root_path, 'static'),
                               'favicon.ico', mimetype='image/vnd.microsoft.icon')

@app.route('/')
def hello_world():
    return render_template('index.html')

templates/index.html

<link rel="shortcut icon"
    href="{{ url_for('static', filename='favicon.ico') }}">
<p>hello world</p>

Then we put our favicon.ico file into the static folder.

Now we should see the favicon displayed in our browser’s tab.

Deferred Request Callbacks

We can add callbacks that are called before or after the current request.

For example, we can write:

from flask import Flask

app = Flask(__name__)

@app.before_request
def before_request():
    print('before called')

@app.after_request
def after_request(response):
    print('after called')
    return response

@app.route('/')
def hello_world():
    return 'hello world'

We use the @app.before_request decorator to register a callback that runs before each request is handled.

The @app.after_request decorator registers a callback that runs after each request is handled.

Its response parameter is the response object produced by the route, and the callback must return a response.
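One typical use of this pair is to measure how long a request takes. The following is a sketch rather than a recommended implementation: it stores a start time on flask.g before the request and adds a response header afterwards. The X-Elapsed-Ms header name is made up for this example, and Flask's built-in test client is used so no server has to be started.

```python
import time

from flask import Flask, g

app = Flask(__name__)

@app.before_request
def start_timer():
    # g is scoped to the current request, so it is safe to stash data here.
    g.start = time.perf_counter()

@app.after_request
def add_timing_header(response):
    elapsed_ms = (time.perf_counter() - g.start) * 1000
    # Hypothetical header name, chosen only for this illustration.
    response.headers['X-Elapsed-Ms'] = f'{elapsed_ms:.2f}'
    return response

@app.route('/')
def hello_world():
    return 'hello world'

# Exercise the app in-process with the test client.
with app.test_client() as client:
    response = client.get('/')
    body = response.get_data(as_text=True)
    timing_header = response.headers.get('X-Elapsed-Ms')

print(body, timing_header)
```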

HTTP Method Overrides

We can add HTTP method overrides with our own class.

For example, we can write:

from flask import Flask

class HTTPMethodOverrideMiddleware(object):
    allowed_methods = frozenset([
        'GET',
        'HEAD',
        'POST',
        'DELETE',
        'PUT',
        'PATCH',
        'OPTIONS'
    ])
    bodyless_methods = frozenset(['GET', 'HEAD', 'OPTIONS', 'DELETE'])

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        method = environ.get('HTTP_X_HTTP_METHOD_OVERRIDE', '').upper()
        if method in self.allowed_methods:
            environ['REQUEST_METHOD'] = method
        if method in self.bodyless_methods:
            environ['CONTENT_LENGTH'] = '0'
        return self.app(environ, start_response)

app = Flask(__name__)
app.wsgi_app = HTTPMethodOverrideMiddleware(app.wsgi_app)

@app.route('/')
def hello_world():
    return 'hello world'

to add the HTTPMethodOverrideMiddleware class.

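Its allowed_methods set lists the HTTP methods that an override is allowed to switch to.

bodyless_methods lists the HTTP methods that don't carry a request body.

The __call__ method reads the X-HTTP-Method-Override request header, which WSGI exposes as HTTP_X_HTTP_METHOD_OVERRIDE. If the value is an allowed method, it replaces REQUEST_METHOD in the WSGI environ, and for bodyless methods it sets CONTENT_LENGTH to 0. It then passes the modified environ on to the wrapped app.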

Then we add the override with:

app.wsgi_app = HTTPMethodOverrideMiddleware(app.wsgi_app)
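Because the middleware is plain WSGI, its behavior can be checked without Flask at all. The sketch below uses a condensed copy of the class (the bodyless handling is left out for brevity), a tiny WSGI app that echoes the request method, and a hypothetical call helper built on the standard library's wsgiref.

```python
from wsgiref.util import setup_testing_defaults

class MethodOverride:
    """Condensed version of the middleware above, for demonstration only."""
    allowed_methods = frozenset(
        ['GET', 'HEAD', 'POST', 'DELETE', 'PUT', 'PATCH', 'OPTIONS'])

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        method = environ.get('HTTP_X_HTTP_METHOD_OVERRIDE', '').upper()
        if method in self.allowed_methods:
            environ['REQUEST_METHOD'] = method
        return self.app(environ, start_response)

def echo_method(environ, start_response):
    # Minimal WSGI app that reports the method it was called with.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [environ['REQUEST_METHOD'].encode()]

def call(app, method, override=None):
    # Hypothetical helper: build a fake environ and return the body as text.
    environ = {}
    setup_testing_defaults(environ)
    environ['REQUEST_METHOD'] = method
    if override is not None:
        environ['HTTP_X_HTTP_METHOD_OVERRIDE'] = override
    return b''.join(app(environ, lambda status, headers: None)).decode()

wrapped = MethodOverride(echo_method)
print(call(wrapped, 'POST'))            # no override header
print(call(wrapped, 'POST', 'delete'))  # override is case-insensitive
print(call(wrapped, 'POST', 'BOGUS'))   # unknown methods are ignored
```

A client would trigger the override by sending a POST request with the X-HTTP-Method-Override header set to the desired method.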

Celery Background Tasks

We can add background tasks in our app with Celery.

To use it, we run:

pip install celery redis

Then we can use it by writing:

from flask import Flask
from celery import Celery

def make_celery(app):
    celery = Celery(
        app.import_name,
        backend=app.config['CELERY_RESULT_BACKEND'],
        broker=app.config['CELERY_BROKER_URL']
    )

    class ContextTask(celery.Task):
        def __call__(self, *args, **kwargs):
            with app.app_context():
                return self.run(*args, **kwargs)

    celery.Task = ContextTask
    return celery

app = Flask(__name__)
app.config.update(
    CELERY_BROKER_URL='redis://localhost:6379',
    CELERY_RESULT_BACKEND='redis://localhost:6379'
)
celery = make_celery(app)

@celery.task()
def add_together(a, b):
    return a + b

@app.route('/')
def hello_world():
    return 'hello world'

We have the make_celery function that creates the Celery instance to let us connect to Redis.

Then we set the config with app.config.update.

And then we call make_celery to create the Celery object.

Then we can use the celery object to run our worker and create a Celery task with the @celery.task decorator.

Once we have done that and both Redis and a Celery worker are running, we can run:

result = add_together.delay(23, 42)
print(result.wait())

to run the task.

Conclusion

We can add favicons, request callbacks, and background tasks with Flask.

Categories
Flask

Python Web Development with Flask — JSON Requests, Error Pages, and MongoDB

Flask is a simple web framework written in Python.

In this article, we’ll look at how to develop simple Python web apps with Flask.

Accepting JSON Requests

Flask can accept JSON request bodies out of the box.

For example, we can write:

from flask import Flask, jsonify, render_template, request
app = Flask(__name__)

@app.route('/add_numbers', methods=['POST'])
def add_numbers():
    content = request.json
    a = content['a']
    b = content['b']
    return jsonify(result=a + b)

to get the request JSON body with the request.json property.
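To try the route without starting a server, we can post JSON with Flask's test client. The sketch below also swaps request.json for request.get_json(silent=True), an alternative that returns None instead of raising an error when the body is not valid JSON. The error message and the 400 status are choices made for this example.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/add_numbers', methods=['POST'])
def add_numbers():
    # silent=True yields None for a missing or malformed JSON body.
    content = request.get_json(silent=True)
    if not isinstance(content, dict) or 'a' not in content or 'b' not in content:
        return jsonify(error='expected a JSON object with "a" and "b"'), 400
    return jsonify(result=content['a'] + content['b'])

with app.test_client() as client:
    # The json= argument serializes the dict and sets the Content-Type header.
    ok = client.post('/add_numbers', json={'a': 1, 'b': 2})
    bad = client.post('/add_numbers', data='not json', content_type='text/plain')
    ok_payload = ok.get_json()
    bad_status = bad.status_code

print(ok_payload, bad_status)
```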

Get Query Parameters

To get query parameters in our route function, we can use the request.args.get method:

from flask import Flask, jsonify, render_template, request
app = Flask(__name__)

@app.route('/add_numbers')
def add_numbers():
    a = request.args.get('a', 0, type=int)
    b = request.args.get('b', 0, type=int)
    return jsonify(result=a + b)

We get the values of the a and b query parameters.

The 2nd argument is the default value of each.

And the type parameter has the data type to return.

So if we go to http://localhost:5000/add_numbers?a=1&b=2, we get:

{
  "result": 3
}

as the response.
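The default value also covers the case where a parameter is present but cannot be converted. A short sketch with the test client, reusing the same route, shows what happens when b is missing and when a is not a number:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/add_numbers')
def add_numbers():
    a = request.args.get('a', 0, type=int)
    b = request.args.get('b', 0, type=int)
    return jsonify(result=a + b)

with app.test_client() as client:
    both = client.get('/add_numbers?a=1&b=2').get_json()
    # b is absent, so it falls back to the default 0.
    missing_b = client.get('/add_numbers?a=5').get_json()
    # int('abc') fails, so a falls back to the default 0 as well.
    invalid_a = client.get('/add_numbers?a=abc&b=2').get_json()

print(both, missing_b, invalid_a)
```

Because a failed conversion silently falls back to the default, this pattern suits optional parameters better than values that must be validated.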

Custom Error Pages

We can add custom error pages for various kinds of errors.

Common error codes that we run into include:

  • 404: not found
  • 403: accessing a disallowed resource
  • 410: accessing a deleted item
  • 500: internal server error

We can add error handlers for them by using the @app.errorhandler decorator.

For example, we can write:

app.py

from flask import Flask, render_template
app = Flask(__name__)

@app.errorhandler(404)
def page_not_found(e):
    return render_template('404.html'), 404

templates/404.html

<p>not found</p>

We pass in 404 into the @app.errorhandler decorator to register a handler function for 404 errors.

We just render the template whenever we encounter a 404 error.

Therefore, we’ll see ‘not found’ when we go to a URL that’s not mapped to a route handler.

Returning API Errors as JSON

Also, we can return API errors as JSON.

To do that, we use the jsonify function.

For instance, we can write:

from flask import Flask, jsonify, abort
app = Flask(__name__)

@app.errorhandler(404)
def resource_not_found(e):
    return jsonify(error=str(e)), 404

@app.route("/cheese")
def get_one_cheese():
    resource = None

    if resource is None:
        abort(404, description="Resource not found")

    return jsonify(resource)

We have the get_one_cheese function that returns a 404 response if resource is None.

Since resource is always None here, the request is aborted and we see the JSON error response.

The JSON response is created in the resource_not_found function, which is the handler for 404 errors.

We call jsonify in there with the error in the response.

abort will pass the error into the e parameter of resource_not_found.

So, we get:

{
  "error": "404 Not Found: Resource not found"
}

returned in the response body when we go to http://localhost:5000/cheese.
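Instead of registering one handler per status code, a single handler can cover every HTTP error. The sketch below is one possible approach: it registers a handler for werkzeug's HTTPException base class and builds the JSON from the exception's code, name, and description attributes. The /secret route is made up for the example.

```python
from flask import Flask, jsonify, abort
from werkzeug.exceptions import HTTPException

app = Flask(__name__)

@app.errorhandler(HTTPException)
def http_error_as_json(e):
    # Every HTTP error raised by abort() or by routing ends up here.
    return jsonify(code=e.code, name=e.name, description=e.description), e.code

@app.route('/cheese')
def get_one_cheese():
    abort(404, description='Resource not found')

@app.route('/secret')  # hypothetical route that always denies access
def secret():
    abort(403)

with app.test_client() as client:
    not_found = client.get('/cheese')
    forbidden = client.get('/secret')
    not_found_body = not_found.get_json()
    forbidden_status = forbidden.status_code

print(not_found_body, forbidden_status)
```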

MongoDB with MongoEngine

We can manipulate MongoDB databases easily with MongoEngine.

To use it, we install:

pip install flask_mongoengine mongoengine

to install the required libraries.

Then we write:

from flask import Flask, jsonify
from flask_mongoengine import MongoEngine
import mongoengine as me

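# Define the embedded document first so Movie can reference it.
class Imdb(me.EmbeddedDocument):
    imdb_id = me.StringField()
    rating = me.DecimalField()
    votes = me.IntField()

class Movie(me.Document):
    title = me.StringField(required=True)
    year = me.IntField()
    rated = me.StringField()
    director = me.StringField()
    actors = me.ListField()
    # Without this field, the imdb value assigned in the route is not stored.
    imdb = me.EmbeddedDocumentField(Imdb)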

app = Flask(__name__)
app.config['MONGODB_SETTINGS'] = {
    "db": "myapp",
}
db = MongoEngine(app)

@app.route('/')
def hello_world():
    bttf = Movie(title="Back To The Future", year=1985)
    bttf.actors = [
        "Michael J. Fox",
        "Christopher Lloyd"
    ]
    bttf.imdb = Imdb(imdb_id="tt0088763", rating=8.5)
    bttf.save()
    return 'movie saved'

@app.route('/query')
def query():
    bttf = Movie.objects(title="Back To The Future")
    return jsonify(bttf)

to connect to the MongoDB database and do the database queries in our route.

Before we add our routes, we create the classes for the MongoDB document schemas.

Movie is a subclass of me.Document, and Imdb is a subclass of me.EmbeddedDocument, which is meant to be nested inside another document.

me.StringField creates a string field.

me.ListField creates a list field.

me.DecimalField creates a fixed-point decimal number field.

And me.IntField creates an integer field.

Then we create our Flask app with the Flask class.

And then we add the database connection settings to the MONGODB_SETTINGS config.

Then we invoke the MongoEngine class with the app argument to let our routes connect to the database.

Then in hello_world we create a Movie document and call save to save it.

In the query route, we get the Movie documents that were saved with the title set to 'Back To The Future'.

Then we call jsonify on the result to return it as the JSON response.

Conclusion

We can accept a JSON body and query parameters in our requests.

Also, we can create custom error pages and use MongoDB databases with Flask.

Categories
Flask

Python Web Development with Flask — Flash Messages

Flask is a simple web framework written in Python.

In this article, we’ll look at how to develop simple Python web apps with Flask.

Message Flashing

We can send messages to a template that can be accessed with the next request.

For example, we can write:

app.py

from flask import Flask, render_template, redirect, url_for, flash, request
app = Flask(__name__)
app.secret_key = b'secret'

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/login', methods=['GET', 'POST'])
def login():
    error = None
    if request.method == 'POST':
        flash('You were successfully logged in')
        return redirect(url_for('index'))
    return render_template('login.html', error=error)

templates/login.html

{% extends "layout.html" %}
{% block body %}
  <h1>Login</h1>
  {% if error %}
    <p class=error><strong>Error:</strong> {{ error }}
  {% endif %}
  <form method=post>
    <dl>
      <dt>Username:
      <dd><input type=text name=username value="{{
          request.form.username }}">
      <dt>Password:
      <dd><input type=password name=password>
    </dl>
    <p><input type=submit value=Login>
  </form>
{% endblock %}

templates/layout.html

<!doctype html>
<title>My Application</title>
{% with messages = get_flashed_messages() %}
{% if messages %}
<ul class=flashes>
  {% for message in messages %}
  <li>{{ message }}</li>
  {% endfor %}
</ul>
{% endif %}
{% endwith %}
{% block body %}{% endblock %}

templates/index.html

{% extends "layout.html" %}
{% block body %}
  <h1>Overview</h1>
  <p>Do you want to <a href="{{ url_for('login') }}">log in?</a>
{% endblock %}

We create the flash message with the flash function in app.py.

Then when login is successful, we show the message in the template with the get_flashed_messages function.

Flashing With Categories

We can call flash with a second argument.

The 2nd argument is the category name.

To add the category and render it, we can write:

app.py

from flask import Flask, render_template, redirect, url_for, flash, request
app = Flask(__name__)
app.secret_key = b'secret'

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/login', methods=['GET', 'POST'])
def login():
    error = None
    if request.method == 'POST':
        flash('You were successfully logged in', 'info')
        return redirect(url_for('index'))
    return render_template('login.html', error=error)

templates/login.html

{% extends "layout.html" %}
{% block body %}
  <h1>Login</h1>
  {% if error %}
    <p class=error><strong>Error:</strong> {{ error }}
  {% endif %}
  <form method=post>
    <dl>
      <dt>Username:
      <dd><input type=text name=username value="{{
          request.form.username }}">
      <dt>Password:
      <dd><input type=password name=password>
    </dl>
    <p><input type=submit value=Login>
  </form>
{% endblock %}

templates/layout.html

<!doctype html>
<title>My Application</title>
{% with messages = get_flashed_messages(with_categories=true) %}
{% if messages %}
<ul class=flashes>
  {% for category, message in messages %}
  <li class="{{ category }}">{{ message }}</li>
  {% endfor %}
</ul>
{% endif %}
{% endwith %}
{% block body %}{% endblock %}

templates/index.html

{% extends "layout.html" %}
{% block body %}
  <h1>Overview</h1>
  <p>Do you want to <a href="{{ url_for('login') }}">log in?</a>
{% endblock %}

In app.py, we have:

flash('You were successfully logged in', 'info')

to add the 'info' category.

Then in layout.html, we call get_flashed_messages with the with_categories parameter set to true so that each message is returned together with its category.

Then in the for loop, we get both the category and message and render them.
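The whole round trip can be checked in a single file. The sketch below is a condensed variant of the app above that uses render_template_string with an inline template (the PAGE constant is made up for this example) so that no template files are needed. The test client keeps the session cookie between requests.

```python
from flask import Flask, flash, redirect, render_template_string, url_for

app = Flask(__name__)
app.secret_key = b'secret'  # placeholder; use a real secret in practice

# Inline stand-in for layout.html, reduced to the flash message list.
PAGE = """
{% for category, message in get_flashed_messages(with_categories=true) %}
<li class="{{ category }}">{{ message }}</li>
{% endfor %}
"""

@app.route('/')
def index():
    return render_template_string(PAGE)

@app.route('/login', methods=['POST'])
def login():
    flash('You were successfully logged in', 'info')
    return redirect(url_for('index'))

with app.test_client() as client:
    # The message is stored in the session and rendered after the redirect.
    first = client.post('/login', follow_redirects=True).get_data(as_text=True)
    # Rendering consumed the message, so a second visit shows nothing.
    second = client.get('/').get_data(as_text=True)

print(first.strip())
```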

Filtering Flash Messages

We can also filter flash messages in the template.

For example, in templates/layout.html, we write:

<!doctype html>
<title>My Application</title>
{% with messages = get_flashed_messages(category_filter=["info"]) %}
{% if messages %}
<ul class=flashes>
  {% for message in messages %}
  <li>{{ message }}</li>
  {% endfor %}
</ul>
{% endif %}
{% endwith %}
{% block body %}{% endblock %}

to add the category_filter argument to only display flash messages with category 'info'.
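get_flashed_messages can also be called from Python code with the same arguments. As a rough sketch, with made-up routes and messages, the view below returns the filtered and unfiltered lists as JSON so the effect of category_filter is easy to see.

```python
from flask import Flask, flash, get_flashed_messages, jsonify

app = Flask(__name__)
app.secret_key = b'secret'  # placeholder; use a real secret in practice

@app.route('/notify')  # hypothetical route that queues two messages
def notify():
    flash('Profile saved', 'info')
    flash('Disk almost full', 'warning')
    return 'queued'

@app.route('/messages')  # hypothetical route that reports them
def messages():
    return jsonify(
        # Only messages flashed with the 'info' category.
        info=get_flashed_messages(category_filter=['info']),
        # All messages as (category, message) pairs.
        everything=get_flashed_messages(with_categories=True),
    )

with app.test_client() as client:
    client.get('/notify')
    payload = client.get('/messages').get_json()

print(payload)
```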

Conclusion

We can add flash messages that are displayed in the next request into our templates.