Sometimes, we want to extract text from an HTML file using Python.
In this article, we’ll look at how to extract text from an HTML file using Python.
How to extract text from an HTML file using Python?
To extract text from an HTML file using Python, we can use BeautifulSoup.
To install it, we run:
pip install beautifulsoup4
Then we write:
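from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
# Download the page; read() returns the raw response body as bytes.
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")
# Remove script and style elements so their contents don't end up in the text.
for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text()
# Strip leading and trailing whitespace from each line.
lines = (line.strip() for line in text.splitlines())
# Split lines on double spaces so that each phrase gets its own line.
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# Drop empty chunks and join the rest with newlines.
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)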
We call urllib.request.urlopen with the url we want to get the HTML text from.
Then we call read to read the response body, which is returned as bytes.
Next, we pass html to the BeautifulSoup constructor, together with the html.parser parser name, to parse it.
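If the HTML is already saved on disk rather than served from a URL, the same constructor can be given the file's contents. The following is a minimal sketch under that assumption: the helper name soup_from_file, the UTF-8 default encoding, and the throwaway page.html file are all made up for illustration.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def soup_from_file(path, encoding="utf-8"):
    """Parse a local HTML file into a BeautifulSoup object (hypothetical helper)."""
    # The encoding is an assumption; adjust it to match the actual file.
    html = Path(path).read_text(encoding=encoding)
    return BeautifulSoup(html, features="html.parser")


# Write a tiny throwaway page so the example can run anywhere.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text("<html><body><p>Hello, file!</p></body></html>", encoding="utf-8")
    file_text = soup_from_file(page).get_text()

print(file_text)  # Hello, file!
```

Everything after the parsing step (removing tags, calling get_text, tidying whitespace) works the same way regardless of where the HTML came from.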
Then we loop through the script and style tags in the HTML string and remove them with:
for script in soup(["script", "style"]):
    script.extract()
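To see the removal step in isolation, here is a small sketch that runs it on an inline HTML string instead of a downloaded page. The sample markup and the helper name strip_scripts_and_styles are invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo</title>
    <style>body { color: red; }</style>
    <script>console.log("hidden");</script>
  </head>
  <body><p>Visible text</p></body>
</html>
"""


def strip_scripts_and_styles(html):
    """Return a soup with every <script> and <style> element removed."""
    soup = BeautifulSoup(html, features="html.parser")
    # Calling soup([...]) is shorthand for soup.find_all([...]).
    for tag in soup(["script", "style"]):
        # extract() detaches the tag (and its contents) from the tree.
        tag.extract()
    return soup


soup = strip_scripts_and_styles(SAMPLE_HTML)
text = soup.get_text()
print("console.log" in text)   # False
print("Visible text" in text)  # True
```

Since the removed tags are not used afterwards, tag.decompose() should work equally well here; extract() is kept to match the code in this article.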
Then we get the text chunks and join them together with:
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)
We call splitlines to split the text into lines.
And we call strip on each line and phrase to remove any leading and trailing whitespace.
Finally, we call join to join the non-empty substrings together into one string with newlines in between them.
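The clean-up step does not depend on BeautifulSoup, so it can be tried on any string. Below is a rough sketch that wraps it in a function; the name tidy_text and the sample input are hypothetical, and it assumes phrases on the same line are separated by two spaces.

```python
def tidy_text(text):
    """Collapse raw get_text() output into one non-empty chunk per line."""
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split each line on two spaces (assumed phrase separator) and strip again.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and join the rest with newlines.
    return "\n".join(chunk for chunk in chunks if chunk)


# Made-up input imitating the blank lines and padding get_text() tends to return.
raw = "\n\n  Health news  \n\n\nFirst headline  Second headline\n   \n"
print(tidy_text(raw))
# Health news
# First headline
# Second headline
```

Because lines and chunks are generator expressions, nothing is computed until join consumes them, so no intermediate lists are built.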
Conclusion
To extract text from an HTML file using Python, we can use BeautifulSoup to parse the HTML, remove the script and style elements, and call get_text on the result.