How to strip HTML from strings in Python?

Sometimes, we want to strip HTML from strings in Python.

In this article, we’ll look at how to strip HTML from strings in Python.

How to strip HTML from strings in Python?

To strip HTML from strings in Python, we can use the StringIO class from the io module and the HTMLParser class from the html.parser module.

For instance, we write:

from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; before handle_data runs
        super().__init__(convert_charrefs=True)
        # Buffer that accumulates the text found between tags
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


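def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    # close() flushes any text still buffered, such as text ending in an unfinished '&' reference
    s.close()
    return s.get_data()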


print(strip_tags('<p>hello world</p>'))

We create the MLStripper class, which subclasses HTMLParser. Its constructor sets the parsing options and creates the buffer that holds the extracted text.

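Setting convert_charrefs to True makes the parser convert character references such as &amp; and &#62; to the corresponding Unicode characters before the text reaches handle_data. This has been the default since Python 3.5.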
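To see what convert_charrefs changes, here is a minimal sketch that parses the same string with the option on and off. The TextCollector class and collect_text function are hypothetical names made up for this illustration, not part of the standard library.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that collects every text chunk the parser reports."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect_text(html, convert_charrefs):
    parser = TextCollector(convert_charrefs)
    parser.feed(html)
    parser.close()  # flush anything still buffered
    return "".join(parser.chunks)


sample = "<p>Fish &amp; chips &#62; salad</p>"
decoded = collect_text(sample, True)   # references are decoded into text
raw = collect_text(sample, False)      # references go to handle_entityref/handle_charref instead
print(decoded)  # Fish & chips > salad
print(raw)
```

With the option off, the references are routed to handle_entityref and handle_charref, which do nothing by default, so they are missing from the collected text unless we override those methods.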

text is a StringIO buffer that accumulates the extracted text.

The parser calls the handle_data method for each run of text it finds between tags, and in that method we append the text to the buffer with self.text.write.
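To make the role of handle_data more concrete, the following sketch records each call separately instead of writing to a buffer. ChunkRecorder is a hypothetical class used only for this demonstration.

```python
from html.parser import HTMLParser


class ChunkRecorder(HTMLParser):
    """Hypothetical helper that records each handle_data call separately."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once per run of text between tags
        self.chunks.append(data)


recorder = ChunkRecorder()
recorder.feed("<p>hello <b>big</b> world</p>")
recorder.close()
print(recorder.chunks)  # ['hello ', 'big', ' world']
```

Since the text arrives in pieces, we need something like a StringIO buffer or a list to put the pieces back together.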

And we return the result in get_data.

Next, we create the strip_tags function that creates a new MLStripper instance.

Then we call s.feed with html to strip the tags off the html string.

And then we return the stripped string that we retrieved from get_data.

Therefore, the print call outputs hello world.
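One thing to watch for with this approach is that HTMLParser also passes the contents of script and style elements to handle_data, so they end up in the result. If we only want the visible text, one possible variant is sketched below. VisibleTextStripper and strip_visible_text are hypothetical names, and the sketch assumes that skipping only these two tags is enough for our input.

```python
from html.parser import HTMLParser


class VisibleTextStripper(HTMLParser):
    """Hypothetical variant that ignores text inside <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_visible_text(html):
    parser = VisibleTextStripper()
    parser.feed(html)
    parser.close()  # flush anything still buffered
    return "".join(parser.parts)


result = strip_visible_text(
    "<style>p {color: red}</style><p>hello <b>world</b></p><script>alert(1)</script>"
)
print(result)  # hello world
```

Note that this only extracts text. It is not a sanitizer, so it should not be relied on to make untrusted HTML safe to render.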

Conclusion

To strip HTML from strings in Python, we can use the StringIO class from the io module and the HTMLParser class from the html.parser module.

By John Au-Yeung