
How to extract text from an HTML file using Python?


We can extract text from an HTML file in Python with an HTML parsing library such as BeautifulSoup or lxml.

Here’s how you can do it using BeautifulSoup, one of the most popular HTML parsing libraries:

First, make sure BeautifulSoup is installed. We can install it with pip:

pip install beautifulsoup4

Then, we can use BeautifulSoup to extract text from an HTML file:

from bs4 import BeautifulSoup

# Read the HTML file (assumes the file is UTF-8 encoded)
with open("example.html", "r", encoding="utf-8") as file:
    html_content = file.read()

# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, "html.parser")

# Extract text from the HTML
text = soup.get_text()

# Print the extracted text
print(text)

In this code, we open the HTML file (“example.html” in this case) and read its content into a string.

We create a BeautifulSoup object, soup, from the HTML content, using Python’s built-in html.parser as the parser.

We call the get_text() method of the BeautifulSoup object, which returns the text of the document as a single string with the HTML tags removed.

Finally, we print the extracted text.
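The text returned by get_text() keeps the whitespace of the original markup, so it often contains stray blank lines and indentation. As a minimal sketch, get_text() also accepts separator and strip arguments that can tidy this up. The sample markup and the extract_clean_text name below are made up for illustration and are not part of the example above:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of reading example.html
html_content = """
<html>
  <body>
    <h1>Title</h1>
    <p>First <b>bold</b> paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""


def extract_clean_text(html):
    """Return the text of an HTML string with surrounding whitespace removed."""
    soup = BeautifulSoup(html, "html.parser")
    # strip=True trims each text fragment and skips whitespace-only ones;
    # separator is the string placed between the remaining fragments
    return soup.get_text(separator=" ", strip=True)


text = extract_clean_text(html_content)
print(text)
# Title First bold paragraph. Second paragraph.
```

A space is used as the separator here because inline tags such as b split a sentence into several fragments. A newline separator would put each of those fragments on its own line.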

We can then manipulate, process, or save this extracted text as needed.
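For instance, one common kind of processing is to drop script and style elements before extracting the text, remove empty lines, and save the result to a file. The following is only a sketch under those assumptions. The html_to_text helper, the sample markup, and the output file name are hypothetical:

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the readable text of an HTML string, one fragment per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are code or styling, not readable text
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Put each text fragment on its own line, then discard empty lines
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical sample markup, used here instead of reading example.html
sample = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>"
)

text = html_to_text(sample)
print(text)

# Save the result; a temporary directory is used so the sketch runs anywhere
output_path = Path(tempfile.gettempdir()) / "example.txt"
output_path.write_text(text, encoding="utf-8")
```

Calling decompose() removes the element and its contents from the tree, so whatever was inside the script and style tags cannot appear in the extracted text.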

By John Au-Yeung

Web developer specializing in React, Vue, and front end development.
