Friday, October 16, 2009

Putting POI on a diet

The Apache POI team is doing an amazing job of making Microsoft Office file formats more accessible to the open source Java world. One of the projects that benefits from their work is Apache Tika, which uses POI to extract text content and metadata from all sorts of Office documents.

However, there's one problem with POI that I'd like to see fixed: It's too big.

More specifically, the ooxml-schemas jar, which POI uses for the pre-generated XMLBeans bindings for the Office Open XML schemas, takes up over 50% of the 25MB size of the current Tika application. The pie chart below illustrates the relative sizes of the different parser library dependencies of Tika:

Relative sizes of Tika parser dependencies

Both PDF and the Microsoft Office formats are pretty big and complex, so one can expect the relevant parser libraries to be large. But the 14MB size of the ooxml-schemas jar seems excessive, especially since the standard OOXML schema package from which the ooxml-schemas jar is built is only 220KB in size.
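
One way to start investigating is to check which kinds of entries account for most of a jar's size. The following is a minimal sketch, not something taken from POI or Tika: the class name JarSizeReport and its helper methods are hypothetical, and it uses only java.util.zip from the JDK. It sums the compressed size of the entries per file extension and prints each extension's share of the total.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration; not part of POI or Tika.
class JarSizeReport {

    // Lower-cased extension of an entry name, or "(none)" if the
    // last path segment has no extension.
    static String extensionOf(String entryName) {
        int slash = entryName.lastIndexOf('/');
        int dot = entryName.lastIndexOf('.');
        if (dot > slash && dot < entryName.length() - 1) {
            return entryName.substring(dot + 1).toLowerCase();
        }
        return "(none)";
    }

    // Sums the compressed size in bytes of all non-directory entries,
    // grouped by file extension.
    static Map<String, Long> compressedSizeByExtension(String jarPath) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (ZipFile zip = new ZipFile(jarPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                long size = entry.getCompressedSize(); // -1 if unknown
                if (entry.isDirectory() || size < 0) {
                    continue;
                }
                totals.merge(extensionOf(entry.getName()), size, Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JarSizeReport <path-to-jar>");
            return;
        }
        Map<String, Long> totals = compressedSizeByExtension(args[0]);
        long total = 0;
        for (long size : totals.values()) {
            total += size;
        }
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            double share = total == 0 ? 0.0 : 100.0 * e.getValue() / total;
            System.out.printf("%-12s %12d bytes %6.1f%%%n", e.getKey(), e.getValue(), share);
        }
    }
}
```

Running it with the path to a jar as the only argument prints one line per extension. The totals cover entry data only, so they will come out somewhat smaller than the jar file itself, which also contains zip headers and the central directory.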

Does anyone have good ideas on how to best trim down this OOXML dependency?

3 comments:

  1. Why do you even need the schemas at runtime? Couldn't you autogenerate the parser from the schema?

    Also, does POI have multiple XML parsers in use? Why not cut down to a single one?

  2. Good questions. I've yet to look deeper under the hood, but I'm sure something like what you propose could be done to streamline things.

    I'm following up on the dev@poi mailing list; see http://markmail.org/message/u5csdmq4t2wvjgtd for more details.

  3. guillaume cottenceau, October 28, 2009 at 2:27 AM

    Hi,

    Would you mind posting any concrete developments here if you hear of any? I've seen that there's not much on the mailing list for the time being. This 1.4MB => 15MB change when upgrading from POI 3.2 to 3.5 is a problem for us too :/

    Thanks!
