Thursday, May 27, 2010

Forking a JVM

The thread model of Java is pretty good and works well for many use cases, but every now and then you need a separate process for better isolation of certain computations. For example, in Apache Tika we're looking for a way to avoid OutOfMemoryErrors or JVM crashes caused by faulty libraries or troublesome input data.

In C and many other programming languages the straightforward way to achieve this is to fork separate processes for such tasks. Unfortunately Java doesn't support the concept of a fork (i.e. creating a copy of a running process). Instead, all you can do is start up a completely new process. To create a mirror copy of your current process you'd need to start a new JVM instance with a recreated classpath and make sure that the new process reaches a state where you can get useful results from it. This is quite complicated and typically depends on prior knowledge of what your classpath looks like. Certainly not something for a simple library to do when deployed somewhere inside a complex application server.
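To make the difficulty concrete, here is a minimal sketch of the naive approach: build a command line for a new JVM from the `java.home` and `java.class.path` system properties. The class name `NaiveJvmLauncher` and its methods are made up for this illustration. The sketch assumes that `java.class.path` actually describes the classes you need, which is often not true inside an application server where classes come from container-managed class loaders.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika.
public class NaiveJvmLauncher {

    /** Builds the command line for a new JVM that reuses this JVM's classpath. */
    public static List<String> buildCommand(String mainClass, String... args) {
        // Path to the java executable of the currently running JVM
        String java = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";

        List<String> command = new ArrayList<>();
        command.add(java);
        command.add("-cp");
        // Assumption: the system classpath is all the child needs.
        // Classes loaded by custom class loaders are NOT covered by this.
        command.add(System.getProperty("java.class.path"));
        command.add(mainClass);
        for (String arg : args) {
            command.add(arg);
        }
        return command;
    }

    /** Starts the new JVM; its stdin/stdout/stderr are shared with the parent. */
    public static Process launch(String mainClass, String... args)
            throws IOException {
        return new ProcessBuilder(buildCommand(mainClass, args))
                .inheritIO()
                .start();
    }
}
```

Note that the child starts from scratch at `mainClass`; nothing of the parent's in-memory state is carried over, which is exactly the gap a real fork would close.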

But there's another way! The latest Tika trunk now contains an early version of a fork feature that allows you to start a new JVM for running computations with the classes and data that you have in your current JVM instance. This is achieved by copying a few supporting class files to a temporary directory and starting the "child JVM" with only those classes. Once started, the supporting code in the child JVM establishes a simple communication protocol with the parent JVM using the standard input and output streams. You can then send serialized data and processing agents to the child JVM, where they will be deserialized using a special class loader that uses the communication link to access classes and other resources from the parent JVM.
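The deserialization step can be sketched roughly as follows. This is not the actual Tika code, just an illustration of the general technique: an `ObjectInputStream` subclass that resolves classes through a supplied class loader. All names here (`ForkProtocolSketch`, `WordCount`, `LoaderObjectInputStream`) are invented for the example. In the real setup the loader in the child JVM would fetch class bytes from the parent over the communication link; in this sketch everything runs in one JVM and an ordinary local class loader stands in for it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.util.concurrent.Callable;

// Hypothetical sketch for illustration only; not the Tika implementation.
public class ForkProtocolSketch {

    /** A serializable "processing agent" that could be shipped to a child JVM. */
    public static class WordCount implements Callable<Integer>, Serializable {
        private static final long serialVersionUID = 1L;
        private final String text;

        public WordCount(String text) {
            this.text = text;
        }

        @Override
        public Integer call() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    /** Object stream that looks up classes through a caller-supplied loader. */
    public static class LoaderObjectInputStream extends ObjectInputStream {
        private final ClassLoader loader;

        public LoaderObjectInputStream(InputStream in, ClassLoader loader)
                throws IOException {
            super(in);
            this.loader = loader;
        }

        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc)
                throws IOException, ClassNotFoundException {
            try {
                return Class.forName(desc.getName(), false, loader);
            } catch (ClassNotFoundException e) {
                // Fall back to the default lookup (handles primitive types)
                return super.resolveClass(desc);
            }
        }
    }

    /** What the parent would write to the child's standard input. */
    public static byte[] serialize(Serializable object) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(object);
        }
        return buffer.toByteArray();
    }

    /** What the child would do with the bytes it reads. */
    public static Object deserialize(byte[] data, ClassLoader loader)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new LoaderObjectInputStream(
                new ByteArrayInputStream(data), loader)) {
            return in.readObject();
        }
    }

    /** Serializes an agent, deserializes it via the loader, and runs it. */
    @SuppressWarnings("unchecked")
    public static int roundTrip(String text) {
        try {
            byte[] wire = serialize(new WordCount(text));
            Callable<Integer> agent = (Callable<Integer>) deserialize(
                    wire, ForkProtocolSketch.class.getClassLoader());
            return agent.call();
        } catch (Exception e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }
}
```

The interesting design point is `resolveClass`: because every class lookup during deserialization goes through the supplied loader, the child never needs the parent's classpath on disk, only a loader that knows how to ask the parent for class bytes.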

My code is still far from production-ready, but I believe I've already solved all the tricky parts and everything seems to work as expected. Perhaps this code should go into an Apache Commons component, since it seems like it would also be useful to other projects beyond Tika. Initial searching didn't bring up other implementations of the same idea, but I wouldn't be surprised if there are some out there. Pointers welcome.

Tuesday, May 25, 2010

Apache meritocracy vs. architects

Ceki Gülcü recently wrote an interesting post on the Apache community model and its vulnerability in cases where consensus cannot be reached with reasonable effort. The discussion in the comments is also interesting.

Ceki's done some amazing work, especially on Java logging libraries, and his design vision shines through the code he's written. He's clearly at the high end of the talent curve even among a community of highly qualified open source developers, which is why I'm not surprised that he dislikes the conservative nature of the consensus-based development model used at Apache. And the log4j history certainly is a sorry example of conservative forces more or less killing active development. In hindsight Ceki's decision to start the slf4j and logback projects may have been the best way out of the deadlock.

Software development is a complex task where best results are achieved when a clear sense of architecture and design is combined with hard work and attention to details. A consensus-based development model is great for the latter parts, but can easily suffer from the design-by-committee syndrome when dealing with architectural changes or other design issues. From this perspective it's no surprise that the Apache Software Foundation is considered a great place for maintaining stable projects. Even the Apache Incubator is geared towards established codebases.

Even fairly simple refactorings like the one I'm currently proposing for Apache Jackrabbit can require quite a bit of time-consuming consensus-building, which can easily frustrate people who are proposing such changes. In Jackrabbit I'm surrounded by highly talented people so I treat the consensus-building time as a chance to learn more and to challenge my own assumptions, but I can easily envision cases where this would just seem like extra effort and delay.

More extensive design work is almost always best performed mainly by a single person based on reviews and comments by other community members. Most successful open and closed source projects can trace their core architectures back to the work of a single person or a small tightly-knit team of like-minded developers. This is why many projects recognize such a "benevolent dictator" as the person with the final word on matters of project architecture.

The Apache practices for resolving vetoes and other conflicts work well when dealing with localized changes where it's possible to objectively review two or more competing solutions to a problem, but in my experience they don't scale that well to larger design issues. The best documented practice for such cases that I've seen is the "Rules for revolutionaries" post, but it doesn't cover the case where there are multiple competing visions for the future. Any ideas on how such situations should best be handled in Apache communities?

Friday, May 14, 2010

Buzzword conference in June

Like the Lucene conference I mentioned earlier, Berlin Buzzwords 2010 is a new conference that fills in the space left by the decision not to organize an ApacheCon in Europe this year. Going beyond the Apache scope, Berlin Buzzwords is a conference for all things related to scalability, storage and search. Some of the key projects in this space are Hadoop, CouchDB and Lucene.

I'll be there to make a case for hierarchical databases (including JCR and Jackrabbit) and to present the Apache Tika project. The abstracts of my talks are:

The return of the hierarchical model

After its introduction the relational model quickly replaced the network and hierarchical models used by many early databases, but the hierarchical model has lived on in file systems, directory services, XML and many other domains. There are many cases where the features of the hierarchical model fit the needs of modern use cases and distributed deployments better than the relational model, so it's a good time to reconsider the idea of a general-purpose hierarchical database.

The first part of this presentation explores the features that differentiate hierarchical databases from relational databases and NoSQL alternatives like document databases and distributed key-value stores. Existing hierarchical database products like XML databases, LDAP servers and advanced filesystems are reviewed and compared.

The second part of the presentation introduces the Content Repository for Java Technology API (JCR) standard as a modern take on standardizing generic hierarchical databases. We also look at Apache Jackrabbit, the open source JCR reference implementation, and how it implements the hierarchical model.


Text and metadata extraction with Apache Tika

Apache Tika is a toolkit for extracting text and metadata from digital documents. It's the perfect companion to search engines and any other applications where it's useful to know more than just the name and size of a file. Powered by parser libraries like Apache POI and PDFBox, Tika offers a simple and unified way to access content in dozens of document formats.

This presentation introduces Apache Tika and shows how it's being used in projects like Apache Solr and Apache Jackrabbit. You will learn how to integrate Tika with your application and how to configure and extend Tika to best suit your needs. The presentation also summarizes the key characteristics of the more widely used file formats and metadata standards, and shows how Tika can help deal with that complexity.

I hear there are still some early bird tickets available. See you in Berlin!

Commit early, commit often!

A huge commit was made in a log4j branch yesterday. The followup discussion:


"I haven't had a chance to review the rest of the commit, but it seems like a substantial amount of work that was done in isolation. While things are still fresh, can you walk through the whats in this thing and the decisions that you made."


"I didn't want to commit code until I had the core of something that actually functioned. I struggled for a couple of weeks over how to attack XMLConfiguration. [...] See below for what I came up with."

Followed by ten bullet points about the changes made. Unfortunately the only thing our version control system now knows about these changes is "First version".