Sometimes, we want to strip HTML from strings in Python.
In this article, we’ll look at how to strip HTML from strings in Python.
How to strip HTML from strings in Python?
To strip HTML from strings in Python, we can use the StringIO class from the io module and the HTMLParser class from the html.parser module.
For instance, we write:
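from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; before handle_data sees them
        super().__init__(convert_charrefs=True)
        # In-memory buffer that collects the text found between tags
        self.text = StringIO()

    def handle_data(self, d):
        # Called by the parser for each run of text outside of tags
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    # Flush any text the parser is still buffering
    s.close()
    return s.get_data()


print(strip_tags('<p>hello world</p>'))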
We create the MLStripper class, a subclass of HTMLParser, with a constructor that sets the options for parsing HTML.
convert_charrefs makes the parser convert character references such as &amp; into the corresponding Unicode characters before the text reaches handle_data.
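To see what convert_charrefs changes, the following sketch feeds the same string to a parser with the option on and then off. DataCollector and collect are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: records every chunk passed to handle_data."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect(html, convert_charrefs):
    parser = DataCollector(convert_charrefs)
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = '<p>fish &amp; chips</p>'
decoded = collect(sample, True)    # the reference is decoded inside the text
skipped = collect(sample, False)   # the reference goes to handle_entityref instead
print(decoded)
print(skipped)
```

With the option off, the reference is reported through handle_entityref, which this helper does not override, so the ampersand is missing from the collected text.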
text is the StringIO buffer that collects the extracted text.
In the handle_data method, we write each piece of text that the parser finds between tags to the buffer with self.text.write.
And we return the accumulated text in get_data.
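As a rough illustration of why overriding only handle_data is enough, the sketch below logs the callbacks HTMLParser fires for a small input. EventLogger is a hypothetical class written for this example; only the data events carry text, so joining them gives the stripped string.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser callbacks in the order they fire."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


logger = EventLogger()
logger.feed('<p>hello <b>world</b></p>')
logger.close()

for event in logger.events:
    print(event)

# Keep only the text carried by the data events
text_only = ''.join(value for kind, value in logger.events if kind == 'data')
print(text_only)  # hello world
```

The tag events are simply ignored by a class that does not override handle_starttag or handle_endtag, which is what MLStripper relies on.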
Next, we create the strip_tags function, which creates a new MLStripper instance.
Then we call s.feed with html to parse the string, which passes the text between the tags to handle_data.
And then we return the stripped string that we retrieve from get_data.
Therefore, the print function should print 'hello world'.
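One caveat worth checking: HTMLParser also passes the contents of script and style elements to handle_data, so a stripper like the one above keeps that code in its output. The sketch below is one possible variant, not part of the original approach, that ignores text inside those elements. VisibleTextStripper and strip_visible are hypothetical names.

```python
from html.parser import HTMLParser


class VisibleTextStripper(HTMLParser):
    """Hypothetical variant: drops text found inside <script> and <style>."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_visible(html):
    parser = VisibleTextStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


page = '<style>p{color:red}</style><p>hi</p><script>alert(1)</script>'
print(strip_visible(page))  # hi
```

A list joined at the end is used here instead of StringIO only to keep the sketch short; either buffer works for this purpose.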
Conclusion
To strip HTML from strings in Python, we can use the StringIO class from the io module and the HTMLParser class from the html.parser module.