Jukka Zitting: October 2009

Tuesday, October 27, 2009

NoSQL interests

We're organizing a NoSQL meetup in Oakland on Monday next week. In addition to helping set the meetup agenda, the "Topics you are interested in" question in the sign-up form provides some interesting insight into the current interests of the NoSQL community. Here's a quick breakdown of the key terms distilled from the 88 signups we've received so far.

Note that the data is biased towards Apache projects due to the meetup being organized at ApacheCon US 2009.

Projects

The following open source projects were mentioned. The list is in alphabetical order, as the data set is too small to support any reasonable ordering by popularity.

Cassandra

CouchDB

Hadoop

HBase

HDFS

Jackrabbit

Lucene

Mahout

memcached

MongoDB

Redis

Riak

Scalaris

Sling

Tokyo Cabinet

Voldemort

Topics

Many responses were about the "big data" aspect of the NoSQL movement. Some frequent keywords: distributed storage, large transactional data, consistency, failover, availability, reliability, stability, failure detection, failed node replacement, (petabyte) scalability, consistency levels, storage technology, performance, benchmarks, optimization, backup and recovery, map/reduce.

Another common theme was the various database types and the NoSQL "development model". Keywords: document stores, key/value stores, consistent hashing, graph databases, object databases, persistent queues, content modeling, migration from the relational model, social graphs, streaming, software as a service, offline applications, full text search, natural language processing.

Beyond the above big themes, I found it interesting that the following technologies were specifically named: Erlang, Java, WebSimpleDB, WebDAV

In addition to specific topics, many people asked for case studies or "lessons learned"-type presentations.

Friday, October 16, 2009

Putting POI on a diet

The Apache POI team is doing an amazing job at making Microsoft Office file formats more accessible to the open source Java world. One of the projects that benefits from their work is Apache Tika, which uses POI to extract text content and metadata from all sorts of Office documents.

However, there's one problem with POI that I'd like to see fixed: It's too big.

More specifically, the ooxml-schemas jar used by POI for the pre-generated XMLBeans bindings for the Office Open XML schemas is taking up over 50% of the 25MB size of the current Tika application. The pie chart below illustrates the relative sizes of the different parser library dependencies of Tika:

Relative sizes of Tika parser dependencies

Both PDF and the Microsoft Office formats are pretty big and complex, so one can expect the relevant parser libraries to be large. But the 14MB size of the ooxml-schemas jar seems excessive, especially since the standard OOXML schema package from which the ooxml-schemas jar is built is only 220KB in size.
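For anyone who wants to check where the bytes in a jar like this go, here is a rough sketch. It is not how the chart above was produced. It simply assumes the jar can be read as an ordinary zip archive and sums the compressed entry sizes per package prefix. The names `size_by_package` and `share`, and the `depth` parameter, are made up for this illustration.

```python
import sys
import zipfile
from collections import Counter


def size_by_package(jar_path, depth=3):
    """Sum compressed entry sizes in a jar, grouped by the first
    `depth` path segments of each entry (hypothetical helper)."""
    totals = Counter()
    with zipfile.ZipFile(jar_path) as jar:
        for info in jar.infolist():
            if info.is_dir():
                continue
            # Drop the file name and keep only the leading package segments.
            parts = info.filename.split("/")[:-1]
            key = "/".join(parts[:depth]) or "(root)"
            totals[key] += info.compress_size
    return totals


def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


if __name__ == "__main__":
    # Usage: python jar_sizes.py path/to/some.jar
    totals = size_by_package(sys.argv[1])
    grand_total = sum(totals.values())
    for package, size in totals.most_common(10):
        print(f"{size:>12} bytes  {share(size, grand_total):5.1f}%  {package}")
```

With the figures quoted above, `share(14, 25)` comes out at 56, which is where the "over 50%" comes from. Running the script against the schemas jar should show which packages dominate, which may be a useful starting point for deciding what could be trimmed.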

Does anyone have good ideas on how to best trim down this OOXML dependency?