Sometimes, we want to extract text from an HTML file using Python.
In this article, we’ll look at how to extract text from an HTML file using Python.
How to extract text from an HTML file using Python?
To extract text from an HTML file using Python, we can use the BeautifulSoup library.
To install it, we run:
pip install beautifulsoup4
Then we write:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")
for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
We call urllib.request.urlopen with the URL of the page we want to get the HTML from.
Then we call read to read the response body, which is returned as bytes.
Next, we pass html to the BeautifulSoup constructor and set features to "html.parser" to select Python's built-in HTML parser.
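The parsing step works the same way for HTML that is saved on disk rather than downloaded. Here is a minimal sketch, assuming the file is UTF-8 encoded. The helper name soup_from_file and the sample page are made up for illustration and are not part of BeautifulSoup:

```python
import os
import tempfile

from bs4 import BeautifulSoup


def soup_from_file(path, encoding="utf-8"):
    """Parse a local HTML file (hypothetical helper, not part of bs4)."""
    # Assumption: the file is UTF-8 encoded unless another encoding is given.
    with open(path, encoding=encoding) as f:
        # BeautifulSoup accepts an open file handle as well as a string.
        return BeautifulSoup(f, features="html.parser")


# Write a small made-up page to a temporary file so the sketch is self-contained.
sample_html = "<html><body><h1>Hello</h1><p>World</p></body></html>"
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample_html)

soup = soup_from_file(tmp.name)
heading = soup.h1.get_text()
paragraph = soup.p.get_text()
os.remove(tmp.name)  # clean up the temporary file

print(heading, paragraph)
```

The rest of the steps in this article apply unchanged to the soup object returned by the helper.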
Then we find the script and style tags in the parsed document and remove each one with extract:
for script in soup(["script", "style"]):
    script.extract()
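To see the removal step in isolation, here is a small sketch that runs it on an inline HTML string. The markup is made up for illustration, and the loop variable is named tag instead of script since it also visits style tags:

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
html = """
<html>
  <head>
    <style>p { color: red; }</style>
    <script>alert("hi");</script>
  </head>
  <body><p>Visible text</p></body>
</html>
"""

soup = BeautifulSoup(html, features="html.parser")

# Calling soup([...]) is shorthand for soup.find_all([...]).
count_before = len(soup(["script", "style"]))

for tag in soup(["script", "style"]):
    tag.extract()  # detach the tag and everything inside it from the tree

count_after = len(soup(["script", "style"]))
text = soup.get_text()

print(count_before, count_after)
print(text.strip())
```

After the loop, the tree no longer contains any script or style elements, so their contents cannot show up in the extracted text.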
Then we get the text chunks and join them together with:
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)
We call splitlines to split the text into lines.
Then we call strip on each line, and again on each phrase we get from split, to remove any leading and trailing whitespace.
Finally, we drop the empty chunks and call join to combine the remaining ones into one string with a newline between them.
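The separator passed to split decides how finely each line is broken up. The sketch below wraps the same pipeline in a hypothetical helper called clean_text, with the separator as a parameter, and runs it on a made-up string standing in for the output of soup.get_text():

```python
def clean_text(text, phrase_sep):
    """Strip each line, split it on phrase_sep, and drop empty chunks.

    Hypothetical helper that wraps the generator pipeline for illustration.
    """
    lines = (line.strip() for line in text.splitlines())
    chunks = (
        phrase.strip() for line in lines for phrase in line.split(phrase_sep)
    )
    return "\n".join(chunk for chunk in chunks if chunk)


# Made-up sample of what soup.get_text() might return.
raw_text = "\n  Health news  \n\n   First item    Second item \n"

by_word = clean_text(raw_text, phrase_sep=" ")     # one space
by_phrase = clean_text(raw_text, phrase_sep="  ")  # two spaces

print(by_word)
print("---")
print(by_phrase)
```

Splitting on a single space puts every word on its own line, while splitting on two spaces only breaks a line where the words are separated by a wider gap, which keeps each phrase together.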
Conclusion
To extract text from an HTML file using Python, we can use BeautifulSoup to parse the HTML, remove the script and style tags, and clean up the text returned by get_text.