Friday, November 16, 2007

Presenting Apache Tika

Yesterday, during the Fast Feather Track at ApacheCon US, I presented the incubating Apache Tika project. The slides are below:

[slideshare id=168085&doc=apache-tika-1195158817320413-4&w=425]


I was pleasantly surprised by the level of attendance and by the interest in Tika during the Search Roundtable BOF later in the evening. Even though the project is only just getting started, it's already generating lots of interest, and I really look forward to getting the first releases out.

Digital Media at Apache

There's an emerging cluster of digital media projects at the Apache Software Foundation. The Tika and Sanselan projects are currently incubating, and PDFBox is likely to follow soon. Together with existing projects like Batik, FOP, POI, and the many HTML generators at Apache, they form a nice set of tools for consuming, manipulating, and producing various types of digital media.

[image: asf-media.png]


It would also be nice to see some audio or video processing software entering the ASF...

The Apache Cloud

The heavy concentration of Apache developers here at ApacheCon US in Atlanta got me thinking about how the various Apache projects are related and whether we could come up with some ways to visualize the existing (and emerging) community patterns.

As a quick first step I took the committer lists of all Apache projects (excluding meta-projects like Jakarta or Incubator) and ran them through an ad-hoc Perl script that identified all pairs of projects that have five or more committers in common. Running those relationships through Graphviz produced the following diagram:


Interesting stuff... Of course, the committer lists are not a very accurate source of information, as many committers are no longer active in the projects they once contributed to. I should perhaps be looking at svn commit logs instead, but as a first approximation the above diagram is already quite nice.
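
For anyone who wants to try something similar, here is a rough sketch of the idea in Python rather than Perl. It is not the script I actually ran: the function names, the sample data, and the edge labels are all made up for illustration. It just finds every pair of projects that share at least five committers and prints the result in Graphviz DOT syntax.

```python
from itertools import combinations


def shared_committer_edges(committers, threshold=5):
    """Return (project_a, project_b, shared_count) for every pair of
    projects that have at least `threshold` committers in common.

    `committers` maps a project name to a collection of committer ids.
    """
    edges = []
    # Sorting the project names keeps the output order stable.
    for a, b in combinations(sorted(committers), 2):
        shared = len(set(committers[a]) & set(committers[b]))
        if shared >= threshold:
            edges.append((a, b, shared))
    return edges


def to_dot(edges):
    """Render the edges as an undirected graph in Graphviz DOT syntax."""
    lines = ["graph apache {"]
    for a, b, shared in edges:
        lines.append(f'    "{a}" -- "{b}" [label="{shared}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical sample data, not real committer lists.
sample = {
    "alpha": {"u1", "u2", "u3", "u4", "u5", "u6"},
    "beta": {"u2", "u3", "u4", "u5", "u6", "u7"},
    "gamma": {"u1", "u9"},
}

dot_source = to_dot(shared_committer_edges(sample))
print(dot_source)
```

Saving the printed text to a file and running it through one of the Graphviz layout tools (for example `dot -Tpng`) should give a picture of the clusters. Which layout looks best will depend on the size of the graph.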

Thursday, October 4, 2007

Concurrency and River

If you're interested in concurrency, distributed systems, and ways to best use the manycore processors we're being promised, then check out the Concurrency and River thread on the development mailing list of the incubating Apache River project. The thread is about concurrency and the ways the River project (a continuation of Jini from Sun) and related technologies like JavaSpaces could be used to parallelize many computing tasks. There are also some nice comparisons to Erlang and Scala, and a discussion of how the actor model they use relates to the Jini network model.

Friday, September 7, 2007

In Hong Kong

I arrived in Hong Kong for the first time in my life two days ago. So far I've only seen the airport, the hotel, and the customer office (yeah, it's been a busy two days) but the weekend is just starting and I'm planning to do some sightseeing in and around the city.

Today I'll do some hiking on the Lantau Trail and try to get a glimpse of Hong Kong Central in the afternoon/evening. Pictures to come...

Tuesday, July 3, 2007

Content Management with Apache Jackrabbit

Last week I was at the Jazoon conference and gave a presentation on Content Management with Apache Jackrabbit. It was a nice conference and I think my message was well received.

[slideshare id=73263&doc=content-management-with-apache-jackrabbit2120&w=425]

Lately I've been focusing more and more on JCR and content repositories as an "extended file system" instead of comparisons with relational databases or similar technologies. I find that if you see your content repository as a transactional, searchable, and observable file system with cool stuff like fine-grained content modeling capabilities, you end up with much better content models than if you think of the repository as a database with hierarchy as an added feature.

Friday, June 22, 2007

False positives

GMail has typically been good at keeping spam out of my inbox, and even the false positive rate has been amazingly good. So good, in fact, that I've mostly stopped scanning my spam folder for false positives.

Recently, though, I started wondering why I was seeing responses to messages I had never seen, and whether something was wrong with commit and other notifications that occasionally seemed to get lost. Digging deeper, I realized that GMail has started to flag a considerable portion of my legitimate email as spam. The false positive rate seems to be something like 8%. And it's not just random commit notifications that get lost. For example, these messages got marked as spam!

I've now explicitly marked all false positives as "Not spam", and I really hope that the GMail spam algorithm learns something from this.