We can extract text from an HTML file in Python with libraries such as BeautifulSoup or lxml.
Here’s how you can do it using BeautifulSoup, one of the most popular HTML parsing libraries:
First, make sure you have BeautifulSoup installed. We can install it via pip:
pip install beautifulsoup4
Then, you can use BeautifulSoup to extract text from an HTML file:
from bs4 import BeautifulSoup
# Read the HTML file (assumes the file is UTF-8 encoded)
with open("example.html", "r", encoding="utf-8") as file:
    html_content = file.read()
# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, "html.parser")
# Extract text from the HTML
text = soup.get_text()
# Print the extracted text
print(text)
In this code, we open the HTML file (“example.html” in this case) and read its content.
We create a BeautifulSoup object, soup, from the HTML content, using "html.parser", the parser built into Python.
We use the get_text() method of the BeautifulSoup object to extract the text from the HTML, stripping out the HTML tags.
Finally, we print the extracted text.
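By default, get_text() joins the text pieces exactly as they appear in the markup, so the result often contains stray whitespace. The sketch below shows one way to get tidier output. The function name extract_visible_text and the sample HTML are made up for illustration. It removes script and style elements explicitly, then uses the separator and strip arguments of get_text().

```python
from bs4 import BeautifulSoup


def extract_visible_text(html: str) -> str:
    """Return the text of an HTML string, one text fragment per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove <script> and <style> elements so their contents are not returned.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator="\n" puts a newline between text fragments;
    # strip=True trims each fragment and drops the empty ones.
    return soup.get_text(separator="\n", strip=True)


# Hypothetical sample document, used only for this illustration.
sample = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>"
)
print(extract_visible_text(sample))
```

Note that the separator is inserted between every text fragment, including those split by inline tags such as b, so "Hello", "world" and "!" end up on separate lines here. If that is not what you want, a separator of " " may be a better fit.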
We can then manipulate, process, or save this extracted text as needed.
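As a rough illustration of that last step, here is a minimal sketch that tidies the extracted text and writes it to a file. The helper names normalize_whitespace and save_text, the sample string and the output file name are assumptions made for this example. Only the standard library is used.

```python
from pathlib import Path


def normalize_whitespace(text: str) -> str:
    """Strip each line and drop the lines that end up empty."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def save_text(text: str, path: str) -> int:
    """Write text to path as UTF-8 and return the number of characters written."""
    return Path(path).write_text(text, encoding="utf-8")


# Stand-in for the string returned by soup.get_text().
raw = "\n\n  Title  \n\n\nHello world!\n   \n"
cleaned = normalize_whitespace(raw)
print(cleaned)

# To save the result, call for example:
# save_text(cleaned, "example.txt")
```

Keeping the cleanup separate from the parsing step means the same helpers work on text from any source, not only on BeautifulSoup output.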