Sometimes, we want to grab visible webpage text with Python BeautifulSoup.
In this article, we’ll look at how to grab visible webpage text with Python BeautifulSoup.
How to grab visible webpage text with Python BeautifulSoup?
To grab visible webpage text with Python BeautifulSoup, we can call findAll with the text argument set to True to get every text node, and then filter out the ones that aren't rendered on the page.
For instance, we write
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    # Skip text whose parent tag is never rendered on the page.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Skip HTML comments.
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # text=True matches every text node in the document.
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)


html = urllib.request.urlopen('http://www.example.com').read()
print(text_from_html(html))
to call urlopen to make a GET request to the URL, and read to get the response body.
Then we call text_from_html to parse the HTML returned.
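The fetch step can also be wrapped in a small helper. This is only a sketch: fetch_html and its timeout default of 10 seconds are hypothetical and not part of the code above. It closes the response with a with block and passes a timeout so a slow server can't hang the script.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the raw response body for url as bytes (hypothetical helper)."""
    # The with block closes the response even if read() raises.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# Usage (needs network access):
# html = fetch_html('http://www.example.com')
```

The bytes it returns can be passed straight to BeautifulSoup, which accepts both bytes and strings.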
In text_from_html, we create a BeautifulSoup object.
Then we call findAll on the BeautifulSoup object with text set to True to get all the text nodes in the document, including ones that aren't visible, such as script contents and comments.
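To see why a filtering step is needed, here is a minimal sketch that lists what the text search returns for a small page. The SAMPLE_HTML string is made up for illustration, and the sketch uses find_all with string=True, which is the current spelling of findAll with text=True in Beautiful Soup 4.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = (
    "<html><head><title>Demo page</title></head>"
    "<body><!-- a hidden note -->"
    "<p>Hello, <b>world</b>!</p>"
    "<script>console.log('hidden');</script>"
    "</body></html>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
nodes = soup.find_all(string=True)

# Pair each non-blank text node with the name of its parent tag.
pairs = [(node.parent.name, node.strip()) for node in nodes if node.strip()]
for parent_name, text in pairs:
    print(parent_name, repr(text))

# Comments come back from the same search as ordinary text nodes.
has_comment = any(isinstance(node, Comment) for node in nodes)
print("contains a comment:", has_comment)
```

The title text, the script source, and the comment all show up next to the paragraph text, so the list has to be filtered before it can be treated as visible text.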
Next, we call filter with tag_visible to keep only the text nodes that aren't comments and whose parent tag isn't one of style, script, head, title, meta, or [document].
Then we call join on the stripped strings to return the visible text as a single string.
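The filter and join steps can be tried without a network request by running them on an inline string. The sketch below is a variant and not the exact code above: visible_text, HIDDEN_PARENTS, and SAMPLE_HTML are hypothetical names, it uses find_all with string=True (the current spelling of findAll with text=True), and it collapses runs of whitespace so that blank text nodes don't leave extra spaces in the result.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parent tags whose text is not rendered (same names as in tag_visible).
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}


def tag_visible(element):
    if element.parent.name in HIDDEN_PARENTS:
        return False
    return not isinstance(element, Comment)


def visible_text(html):
    soup = BeautifulSoup(html, "html.parser")
    kept = filter(tag_visible, soup.find_all(string=True))
    # split() with no arguments drops empty pieces and collapses whitespace.
    return " ".join(" ".join(kept).split())


# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = (
    "<html><head><title>Demo page</title>"
    "<style>p { color: red; }</style></head>"
    "<body><!-- a hidden note --><h1>Welcome</h1>"
    "<p>First paragraph.</p>"
    "<script>console.log('hidden');</script>"
    "<p>Second paragraph.</p></body></html>"
)

result = visible_text(SAMPLE_HTML)
print(result)  # Welcome First paragraph. Second paragraph.
```

Note that this only filters by tag name and node type. Text hidden with CSS, such as display: none, still counts as visible here.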
Conclusion
To grab visible webpage text with Python BeautifulSoup, we can call findAll with the text argument set to True, and then filter out comments and the text inside non-rendered tags like script and style.