
How to extract text from MS Word files in Python?

Sometimes, we want to extract text from MS Word files in Python.

In this article, we'll look at how to extract text from MS Word files in Python.

How to extract text from MS Word files in Python?

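To extract text from MS Word files in Python, we can use the zipfile module from the standard library. This works for .docx files, which are ZIP archives of XML documents. It does not work for the older binary .doc format.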

For instance, we write

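import re
import zipfile

# A .docx file is a ZIP archive; the body text lives in word/document.xml
with zipfile.ZipFile('/path/to/file/mydocument.docx') as docx:
    content = docx.read('word/document.xml').decode('utf-8')

# Strip every XML tag, leaving only the text between the tags
cleaned = re.sub(r'<[^>]+>', '', content)
print(cleaned)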

to create a ZipFile object from the path string to the Word file.

Then we call read with 'word/document.xml' to read the main document part of the Word file, which holds the body text as XML. read returns the content as bytes.
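To check that word/document.xml is present before reading it, we can list the members of the archive. The following is a minimal sketch. The list_parts helper name is our own and is not part of zipfile.

```python
import zipfile

def list_parts(path):
    """Return the names of all members in a .docx (ZIP) archive."""
    with zipfile.ZipFile(path) as docx:
        return docx.namelist()
```

For a typical Word file, the returned list should include names such as '[Content_Types].xml' and 'word/document.xml'.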

And we call decode with 'utf-8' to convert the bytes into a Unicode string.

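Next, we call re.sub to replace the XML tags with empty strings, which leaves only the text. Note that the text of adjacent paragraphs runs together, and XML entities such as &amp; are left undecoded.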
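Stripping tags with a regular expression joins all paragraphs into one run of text. If we want to keep paragraph breaks, one alternative is to parse the XML with xml.etree.ElementTree and collect the text of each w:p (paragraph) element from its w:t (text run) elements. The following is a minimal sketch under that assumption. The function names paragraphs_from_xml and extract_text are our own, and the sketch ignores headers, footers, and footnotes, which live in other parts of the archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def paragraphs_from_xml(xml_bytes):
    """Return the text of each w:p paragraph found in a document.xml payload."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # w:t elements hold the visible text runs of a paragraph
        text = "".join(t.text or "" for t in p.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs

def extract_text(path):
    """Read word/document.xml from a .docx file and join paragraphs with newlines."""
    with zipfile.ZipFile(path) as docx:
        xml_bytes = docx.read("word/document.xml")
    return "\n".join(paragraphs_from_xml(xml_bytes))
```

We can then call print(extract_text('/path/to/file/mydocument.docx')) to print one paragraph per line. Because the XML parser decodes entities, characters such as & come out as plain text.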

Conclusion

To extract text from MS Word files in Python, we can use the zipfile module to read word/document.xml from the .docx archive and then strip the XML tags with re.sub.

By John Au-Yeung

Web developer specializing in React, Vue, and front end development.
