How to grab visible webpage text with BeautifulSoup?

Sometimes, we want to grab visible webpage text with BeautifulSoup.

In this article, we’ll look at how to grab visible webpage text with BeautifulSoup.

How to grab visible webpage text with BeautifulSoup?

To grab visible webpage text with BeautifulSoup, we can collect every text node in the page and then use Python’s built-in filter function to drop the nodes that browsers don’t render.

For instance, we write:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    # Text inside these tags is never rendered by the browser.
    if element.parent.name in [
            'style', 'script', 'head', 'title', 'meta', '[document]'
    ]:
        return False
    # HTML comments are text nodes too, but they aren't displayed.
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # string=True matches every text node (older code used text=True).
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)


html = urllib.request.urlopen('https://yahoo.com').read()
print(text_from_html(html))

The tag_visible function decides whether a text node is rendered. It checks element.parent.name against the tags whose content browsers don’t display, and it also rejects HTML comments.

We return True for visible text nodes and False otherwise.
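To see the filter in isolation, here is a minimal sketch that runs tag_visible on a small inline document instead of a live page. The sample_html string and the HIDDEN_PARENTS name are made up for illustration, and the sketch uses find_all(string=True), which is the current spelling of findAll(text=True).

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Same tag list as in the article, stored in a set (the name is illustrative).
HIDDEN_PARENTS = {'style', 'script', 'head', 'title', 'meta', '[document]'}


def tag_visible(element):
    if element.parent.name in HIDDEN_PARENTS:
        return False
    if isinstance(element, Comment):
        return False
    return True


# Made-up sample page with a title, a stylesheet, a comment and a script.
sample_html = """<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body><p>Hello</p><!-- a note --><script>var x = 1;</script><p>World</p></body></html>"""

soup = BeautifulSoup(sample_html, 'html.parser')
kept = [str(t).strip() for t in soup.find_all(string=True) if tag_visible(t)]
kept = [t for t in kept if t]  # drop whitespace-only nodes
print(kept)
```

Only the text of the two p elements should survive. The title, the stylesheet, the script and the comment are all rejected.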

Then we define the text_from_html function to grab the text.

We pass body to the BeautifulSoup constructor along with 'html.parser' to parse the HTML.

Then we ask soup for every text node in the document. At this point, the list still includes comments and the contents of script and style tags.

Next, we call the built-in filter with tag_visible and texts to keep only the visible nodes.

Finally, we strip each node and call join to combine the results into a single space-separated string.
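One side effect of joining stripped fragments is that whitespace-only nodes become empty strings, so the result can contain runs of extra spaces. The sketch below shows one possible way to avoid that. The join_visible helper and the fragments list are hypothetical and not part of the article’s code.

```python
def join_visible(texts):
    """Join text fragments with single spaces, skipping whitespace-only ones."""
    stripped = (t.strip() for t in texts)
    return " ".join(t for t in stripped if t)


# Made-up fragments resembling what a parsed page yields.
fragments = ["\n", "Hello", "   ", " World \n"]

plain = " ".join(t.strip() for t in fragments)  # approach used in the article
clean = join_visible(fragments)

print(repr(plain))
print(repr(clean))
```

Whether the extra spaces matter depends on what we do with the text afterwards. For search indexing they are usually harmless, but for display the cleaner version may be preferable.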

Lastly, we download the page with urllib.request.urlopen, read the raw bytes of the response, and pass them to text_from_html.
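A bare urlopen call has no timeout, and some servers reject requests that use Python’s default user agent. The following is a hedged sketch of a slightly more defensive download step. The fetch_html name, the User-Agent value, and the 10 second timeout are assumptions for illustration only.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        # Assumed header value; some servers reject the default Python agent.
        headers={'User-Agent': 'Mozilla/5.0 (compatible; example-scraper)'},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server doesn't declare a charset.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')
```

With a helper like this, the last step could be written as print(text_from_html(fetch_html('https://yahoo.com'))), since BeautifulSoup accepts decoded strings as well as bytes.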

Conclusion

To grab visible webpage text with BeautifulSoup, we can collect every text node in the page and then use Python’s built-in filter function to drop the nodes that browsers don’t render.

By John Au-Yeung