Jakarta POI is an Apache Software Foundation project for reading and writing OLE2 Compound Documents (the container format used by Microsoft Office files) from Java.
Homepage: http://jakarta.apache.org/poi/
News site: http://nagoya.apache.org/poi/news/
Terrier uses POI to parse Microsoft Word, Excel and PowerPoint documents. See the classes MSWordDocument, MSExcelDocument and MSPowerpointDocument in the package uk.ac.gla.terrier.indexing.
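
The binary Word, Excel and PowerPoint formats are all OLE2 Compound Documents, so they begin with the same 8-byte signature. The sketch below shows one way to check for that signature before handing a file to a POI-based parser. It uses only the Java standard library. The class name Ole2Sniffer and its methods are hypothetical: they are not part of POI or Terrier, and this is not necessarily how Terrier selects a parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of POI or Terrier.
public class Ole2Sniffer {

    // The 8-byte signature found at the start of an OLE2 Compound Document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // Returns true if the given bytes start with the OLE2 signature.
    public static boolean hasOle2Signature(byte[] header) {
        if (header == null || header.length < OLE2_MAGIC.length) {
            return false;
        }
        byte[] prefix = Arrays.copyOf(header, OLE2_MAGIC.length);
        return Arrays.equals(prefix, OLE2_MAGIC);
    }

    // Reads the first 8 bytes of a file and checks them against the signature.
    public static boolean isOle2File(Path file) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < header.length) {
                int n = in.read(header, filled, header.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        return filled == header.length && hasOle2Signature(header);
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + ": " + isOle2File(Paths.get(arg)));
        }
    }
}
```

The signature only identifies the container format. It does not say whether the file is a Word, Excel or PowerPoint document, so the choice between the three Terrier classes still has to be made some other way, for example by file extension.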