Friday, November 16, 2007

Presenting Apache Tika

Yesterday, during the Fast Feather Track at ApacheCon US, I presented the incubating Apache Tika project. See below for the slides:

[slideshare id=168085&doc=apache-tika-1195158817320413-4&w=425]


I was pleasantly surprised by the level of attendance and also by the interest in Tika during the Search Roundtable BOF later in the evening. Even though the project is only just getting started, it's already generating lots of interest, and I really look forward to getting the first releases out.

Digital Media at Apache

There's an emerging cluster of digital media projects at the Apache Software Foundation. The Tika and Sanselan projects are currently incubating, and PDFBox is likely to follow soon. Together with existing projects like Batik, FOP, POI, and the many HTML generators at Apache, they form a nice set of tools for consuming, manipulating, and producing various types of digital media.

[Figure: asf-media.png]


It would also be nice to see some audio or video processing software enter the ASF...

The Apache Cloud

The heavy concentration of Apache developers here at ApacheCon US in Atlanta got me thinking about how the various Apache projects are related, and whether we could come up with some ways to visualize the existing (and emerging) community patterns.

As a quick first step I took the committer lists of all Apache projects (excluding meta-projects like Jakarta or Incubator) and ran them through an ad-hoc Perl script that identified all pairs of projects that have five or more committers in common. Running those relationships through Graphviz produced the following diagram:


Interesting stuff... Of course, the committer lists are not a very accurate source of information, since many committers are no longer active in the projects they once contributed to. I should perhaps be looking at svn commit logs instead, but as a first approximation the above diagram is already quite nice.
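
For anyone who wants to try something similar, the pairing step described above can be sketched roughly as follows. This is not the original Perl script, just a minimal Python illustration of the same idea. The function names, the input shape (a dict mapping each project name to a set of committer ids), and the placeholder project names in the usage part are all my own assumptions.

```python
from itertools import combinations


def shared_committer_edges(committers, threshold=5):
    """Return (project_a, project_b, shared_count) for every pair of
    projects that has at least `threshold` committers in common.

    `committers` is assumed to map a project name to a set of committer ids.
    """
    edges = []
    # Sorting the names makes the pair order (and the output) deterministic.
    for a, b in combinations(sorted(committers), 2):
        shared = len(committers[a] & committers[b])
        if shared >= threshold:
            edges.append((a, b, shared))
    return edges


def to_dot(edges):
    """Render the edges as an undirected graph in Graphviz DOT syntax."""
    lines = ["graph apache {"]
    for a, b, shared in edges:
        lines.append(f'    "{a}" -- "{b}" [label="{shared}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical placeholder data, not real committer lists.
    sample = {
        "alpha": {"u1", "u2", "u3", "u4", "u5", "u6"},
        "beta": {"u2", "u3", "u4", "u5", "u6", "u7"},
        "gamma": {"u8", "u9"},
    }
    print(to_dot(shared_committer_edges(sample)))
```

Saving the printed text to a file and running it through one of the Graphviz layout tools (for example `dot -Tpng`) should give a diagram of the same general kind. Swapping the input for per-project sets of recent committers taken from commit logs would only change how the dict is built, not the pairing logic.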