How to grab visible webpage text with Python BeautifulSoup?

Sometimes, we want to grab visible webpage text with Python BeautifulSoup.

In this article, we’ll look at how to grab visible webpage text with Python BeautifulSoup.

How to grab visible webpage text with Python BeautifulSoup?

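To grab visible webpage text with Python BeautifulSoup, we can call findAll with the text argument set to True to get every text node, and then filter out the nodes that are not rendered, such as script and style contents and HTML comments.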

For instance, we write

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


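def tag_visible(element):
    # Text inside these tags is not rendered in the page body.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # HTML comments are text nodes too, but they are never displayed.
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # text=True matches every text node; newer Beautiful Soup versions
    # spell the same call find_all(string=True).
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)


# Download the raw HTML bytes and print the visible text.
html = urllib.request.urlopen('http://www.example.com').read()
print(text_from_html(html))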

This calls urlopen to make a GET request to the URL and read to get the response body as bytes.
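The download step can fail or hang, so it may help to wrap it. Below is a minimal sketch, assuming a hypothetical helper named fetch_html (not part of the original code) and an arbitrary 10 second timeout. It only catches URLError, which also covers HTTPError; other failures are not handled here.

```python
import urllib.request
from urllib.error import URLError


def fetch_html(url, timeout=10):
    # Hypothetical helper: returns the raw response bytes,
    # or None if urlopen raises URLError (HTTPError is a subclass).
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except URLError as err:
        print(f"Could not fetch {url}: {err}")
        return None
```

The returned bytes can be passed straight to text_from_html, since BeautifulSoup accepts both bytes and str.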

Then we call text_from_html to parse the HTML that was returned.

In text_from_html, we create a BeautifulSoup object.

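And then we call findAll on the BeautifulSoup object with text set to True to get every text node in the document. At this point the list still includes text that is not visible, such as script and style contents and HTML comments.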
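To see why a filtering step is needed, here is a small sketch that lists every text node in a made-up snippet. The sample markup is an assumption for illustration, and find_all(string=True) is used as the newer spelling of findAll(text=True).

```python
from bs4 import BeautifulSoup

# Made-up sample markup, used only for this illustration.
sample_html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello</p><script>var x = 1;</script><!-- a note --></body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# string=True is the newer spelling of text=True; both match every text node.
nodes = soup.find_all(string=True)

for node in nodes:
    # Show the enclosing tag, the node class, and the raw text.
    print(node.parent.name, type(node).__name__, repr(str(node)))

parent_names = [node.parent.name for node in nodes]
```

The title text, the script source, and the comment all show up next to the paragraph text, which is what tag_visible is there to remove.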

Next, we call filter with tag_visible to keep only the text nodes whose parent tag is rendered on the page and that are not comments.

And then we strip each remaining text node and call join on a space string to combine them into one string of visible text.
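One side effect of joining every stripped node is that whitespace-only nodes, which come from the indentation in the HTML source, turn into empty strings and leave runs of spaces in the result. A possible variation, sketched here with a hypothetical function name text_from_html_compact and a made-up sample page, skips those empty strings before joining.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}


def tag_visible(element):
    # Same rule as in the article: drop text inside non-rendered tags and comments.
    if element.parent.name in HIDDEN_PARENTS:
        return False
    return not isinstance(element, Comment)


def text_from_html_compact(body):
    # Hypothetical variant of text_from_html that skips whitespace-only nodes.
    soup = BeautifulSoup(body, "html.parser")
    stripped = (t.strip() for t in soup.find_all(string=True) if tag_visible(t))
    return " ".join(s for s in stripped if s)


# Made-up sample page; its indentation produces whitespace-only text nodes.
sample_html = """
<html>
  <head><title>Demo</title><style>p { color: red; }</style></head>
  <body>
    <h1>Heading</h1>
    <p>Hello <b>world</b></p>
    <script>var x = 1;</script>
    <!-- hidden note -->
  </body>
</html>
"""

result = text_from_html_compact(sample_html)
print(result)
```

Running it on an inline string like this is also a quick way to check the filter without making a network request.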

Conclusion

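To grab visible webpage text with Python BeautifulSoup, we can call findAll with the text argument set to True and then filter out the text nodes that are not rendered, such as script and style contents and HTML comments.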

By John Au-Yeung