How to grab visible webpage text with Python BeautifulSoup?

Sometimes, we want to grab visible webpage text with Python BeautifulSoup.

In this article, we’ll look at how to grab visible webpage text with Python BeautifulSoup.

To grab visible webpage text with Python BeautifulSoup, we can call findAll with the text argument set to True to collect every text node, and then filter out the nodes that a browser doesn't render.

For instance, we write

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

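def tag_visible(element):
    # Skip text whose parent tag is not rendered in the page body.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Skip HTML comments.
    if isinstance(element, Comment):
        return False
    return True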

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

# Pass the URL of the page to fetch to urlopen.
html = urllib.request.urlopen('').read()
print(text_from_html(html))

to call urlopen to make a GET request to a URL and read the response body.

Then we call text_from_html to parse the HTML returned.

In text_from_html, we create a BeautifulSoup object.

And then we call findAll on the BeautifulSoup object with text set to True to get every text node in the document, including comments and the text inside script and style tags.
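
As a rough illustration of why filtering is needed, the sketch below lists every text node found in a small HTML string. The markup in SAMPLE_HTML is a made-up example and not part of the original code. It uses find_all(string=True), which, as far as I know, is the newer spelling of findAll(text=True) in recent Beautiful Soup 4 releases.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = (
    "<html><head><title>Demo</title></head>"
    "<body><!-- note --><p>Hello</p><script>var x = 1;</script></body></html>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect every text node, whether or not a browser would display it.
nodes = soup.find_all(string=True)

for node in nodes:
    # Show each string together with the name of the tag that contains it.
    print(repr(str(node)), "parent:", node.parent.name)
```

The title text, the comment, and the script source should all appear next to the paragraph text, which is why a filtering step has to follow.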

Next, we call filter with tag_visible to keep only the text nodes that are not comments and whose parent tags are rendered on the page.

Finally, we strip the whitespace around each remaining string and call join to combine them into a single space-separated string.
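
To try the approach without a network request, here is a minimal sketch that runs the same steps on an inline HTML string. The names is_visible, visible_text, HIDDEN_PARENTS and SAMPLE_HTML are hypothetical and chosen for this example only. Unlike the code above, this variant also drops strings that are empty after stripping, so that whitespace between tags does not produce repeated spaces.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parent tags whose text a browser does not show in the page body.
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}

def is_visible(node):
    # Reject text inside non-rendered tags, then reject HTML comments.
    if node.parent.name in HIDDEN_PARENTS:
        return False
    return not isinstance(node, Comment)

def visible_text(markup):
    soup = BeautifulSoup(markup, "html.parser")
    nodes = soup.find_all(string=True)
    parts = (node.strip() for node in nodes if is_visible(node))
    # Assumption for this sketch: skip whitespace-only strings.
    return " ".join(part for part in parts if part)

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = """<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body><!-- hidden note --><h1>Heading</h1><p>First paragraph.</p><script>var x = 1;</script></body></html>"""

result = visible_text(SAMPLE_HTML)
print(result)  # Heading First paragraph.
```

This only filters by tag name and node type. Text hidden with CSS, such as display: none, would still be returned.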


