Sometimes, we want to extract text from MS Word files in Python.
In this article, we’ll look at how to extract text from MS Word files in Python.
How to extract text from MS Word files in Python?
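To extract text from MS Word files in Python, we can use the zipfile
module from the standard library. This works for .docx files, which are ZIP archives of XML files, but not for the older binary .doc format.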
For instance, we write
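import re
import zipfile

# The main body of a .docx file is stored in word/document.xml
with zipfile.ZipFile('/path/to/file/mydocument.docx') as docx:
    content = docx.read('word/document.xml').decode('utf-8')

# Strip every XML tag, leaving only the text between the tags
cleaned = re.sub(r'<[^>]+>', '', content)
print(cleaned)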
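to open the Word file as a ZipFile object.
Then we call read with 'word/document.xml' to read the main document part from the archive as bytes.
And we call decode to convert the UTF-8 bytes to a string.
Next, we call re.sub to replace the XML tags with empty strings, which leaves only the text.
Note that this approach joins all paragraphs together without separators and leaves XML entities such as &amp; unescaped.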
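If paragraph boundaries matter, one alternative is to parse the XML instead of stripping tags with a regular expression. The following is a minimal sketch, assuming the text we want lives in w:t elements inside w:p paragraph elements of word/document.xml. The helper name extract_paragraphs is hypothetical and not part of any library, and the sketch ignores headers, footers, footnotes, and nested content such as text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def extract_paragraphs(docx_file):
    """Return the text of each paragraph as a list (hypothetical helper)."""
    with zipfile.ZipFile(docx_file) as docx:
        # ET.fromstring accepts bytes and decodes XML entities for us
        root = ET.fromstring(docx.read('word/document.xml'))
    paragraphs = []
    for para in root.iter(W_NS + 'p'):
        # Each w:t element holds one run of text within the paragraph
        text = ''.join(node.text or '' for node in para.iter(W_NS + 't'))
        paragraphs.append(text)
    return paragraphs
```

We could then call print('\n'.join(extract_paragraphs('/path/to/file/mydocument.docx'))) to print one paragraph per line. Because the XML parser decodes entities, &amp; in the file comes out as & in the result.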
Conclusion
To extract text from MS Word files in Python, we can use the zipfile
module to read word/document.xml from the .docx archive and then remove the XML tags.