Sometimes, we want to strip HTML from strings in Python.
In this article, we’ll look at how to strip HTML from strings in Python.
How to strip HTML from strings in Python?
To strip HTML from strings in Python, we can create a subclass of the HTMLParser class from the standard library's html.parser module.
For instance, we write
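from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        # super().__init__() already calls reset(), so no extra reset is needed.
        # convert_charrefs=True turns references such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, d):
        # Called for each run of text between tags.
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()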
to create the MLStripper class.
In it, the handle_data method receives each run of text between tags and writes it to the self.text buffer, and the get_data method calls self.text.getvalue to return the content without the tags.
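To see how the parser drives handle_data, here is a minimal sketch. The ChunkCollector class name and the sample string are hypothetical and used only for illustration; the sketch stores the chunks in a list instead of a StringIO buffer so each call is visible.

```python
from html.parser import HTMLParser


class ChunkCollector(HTMLParser):
    """Hypothetical helper: records every text chunk the parser reports."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of text between tags.
        self.chunks.append(data)


collector = ChunkCollector()
collector.feed("<p>Hello <b>world</b>!</p>")
collector.close()  # flush any text still buffered by the parser
print(collector.chunks)  # ['Hello ', 'world', '!']
```

Joining the chunks gives the text without the tags, which is what the StringIO buffer accumulates.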
And then in the strip_tags function, we create an MLStripper object.
We pass the html string to the object with feed to parse it.
And then we call s.get_data to return the text without the tags.
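As a quick check, strip_tags can be called on a small string. The sketch below repeats a compact version of the class so that it runs on its own; the sample inputs are made up for illustration.

```python
from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, d):
        # Collect only the text; tags are never written to the buffer.
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()


# Sample input, invented for this example.
print(strip_tags("<p>Fish &amp; <b>chips</b></p>"))  # Fish & chips
print(strip_tags("plain text"))                      # plain text
```

Note that this approach only drops the tags themselves. The text inside script and style elements is still passed to handle_data, so it stays in the result, and the output should not be treated as sanitized HTML.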
Conclusion
To strip HTML from strings in Python, we can create a subclass of the HTMLParser class that collects only the text passed to handle_data.