Friday, November 16, 2007

Presenting Apache Tika

Yesterday, during the Fast Feather Track at the ApacheCon US, I presented the incubating Apache Tika project. See below for the slides:

[slideshare id=168085&doc=apache-tika-1195158817320413-4&w=425]


I was pleasantly surprised by the level of attendance and also by the interest in Tika during the Search Roundtable BOF later in the evening. Even though the project is just getting started, it's already generating lots of interest and I really look forward to getting the first releases out.

Digital Media at Apache

There's an emerging cluster of digital media projects at the Apache Software Foundation. The Tika and Sanselan projects are currently incubating, and PDFBox is likely to follow soon. Together with existing projects like Batik, FOP, POI and the many HTML generators at Apache, they form a nice set of tools for consuming, manipulating, and producing various types of digital media.

asf-media.png


It would also be nice to see some audio or video processing software entering the ASF...

The Apache Cloud

The heavy concentration of Apache developers here at the ApacheCon US in Atlanta got me thinking about how the various Apache projects are related and whether we could come up with some ways to visualize the existing (and emerging) community patterns.

As a quick first step I just took the committer lists of all Apache projects (excluding meta-projects like Jakarta or Incubator) and ran them through an ad-hoc Perl script that identified any pairs of projects that have five or more committers in common. Running those relationships through Graphviz produced the following diagram:


Interesting stuff... Of course the committer lists are not a very accurate source of information as many committers are no longer active in the projects they once contributed to, so I perhaps should be looking at svn commit logs instead, but as a first approximation the above diagram is already quite nice.
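
The pair-finding step described above is easy to sketch. The original was an ad-hoc Perl script that is not shown here, so the following is only a rough Java equivalent under my own assumptions: the class name, the method name and the toy committer data are all made up for illustration, and the toy threshold is 2 instead of 5 to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class CommitterGraph {

    // Hypothetical helper: returns Graphviz source with one undirected edge
    // for each pair of projects sharing at least `threshold` committers.
    static String toDot(Map<String, Set<String>> committers, int threshold) {
        // Sort project names so that the output is deterministic
        List<String> projects = new ArrayList<>(new TreeSet<>(committers.keySet()));
        StringBuilder dot = new StringBuilder("graph apache {\n");
        for (int i = 0; i < projects.size(); i++) {
            for (int j = i + 1; j < projects.size(); j++) {
                // Copy the first set so that the input map is not modified
                Set<String> shared = new TreeSet<>(committers.get(projects.get(i)));
                shared.retainAll(committers.get(projects.get(j)));
                if (shared.size() >= threshold) {
                    dot.append("  \"").append(projects.get(i))
                       .append("\" -- \"").append(projects.get(j))
                       .append("\";\n");
                }
            }
        }
        return dot.append("}\n").toString();
    }

    public static void main(String[] args) {
        // Made-up data, not real Apache committer lists
        Map<String, Set<String>> committers = new TreeMap<>();
        committers.put("alpha", new TreeSet<>(Arrays.asList("ann", "bob", "cid")));
        committers.put("beta", new TreeSet<>(Arrays.asList("bob", "cid", "dan")));
        committers.put("gamma", new TreeSet<>(Arrays.asList("dan", "eve")));
        System.out.print(toDot(committers, 2));
    }
}
```

The output can be saved to a file and rendered with one of the Graphviz layout tools, for example `neato -Tpng`. Comparing every pair of projects is quadratic in the number of projects, which should be fine for a few hundred committer lists.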

Thursday, October 4, 2007

Concurrency and River

If you're interested in concurrency, distributed systems, and ways to best use the manycore processors we're being promised, then check out the Concurrency and River thread on the development mailing list of the incubating Apache River project. The thread is about concurrency and the ways the River project (a continuation of Jini from Sun) and related technologies like JavaSpaces could be used to parallelize many computing tasks. There are also some nice comparisons to Erlang and Scala, and how the actor model used by them is related to the Jini network model.

Friday, September 7, 2007

In Hong Kong

I arrived in Hong Kong for the first time in my life two days ago. So far I've only seen the airport, the hotel, and the customer office (yeah, it's been a busy two days) but the weekend is just starting and I'm planning to do some sightseeing in and around the city.

Today I'll do some hiking on the Lantau Trail and try to get a glimpse of Hong Kong Central in the afternoon/evening. Pictures to come...

Tuesday, July 3, 2007

Content Management with Apache Jackrabbit

Last week I was at the Jazoon conference and gave a presentation on Content Management with Apache Jackrabbit. It was a nice conference and I think my message was well received.

[slideshare id=73263&doc=content-management-with-apache-jackrabbit2120&w=425]

Lately I've been focusing more and more on JCR and content repositories as an "extended file system" instead of comparisons with relational databases or similar technologies. I find that if you see your content repository as a transactional, searchable, and observable file system with cool stuff like fine-grained content modeling capabilities, you end up with much better content models than if you think of the repository as a database with hierarchy as an added feature.

Friday, June 22, 2007

False positives

GMail has typically been good at keeping spam out of my inbox, and even the false positive rate has been amazingly good. So good, in fact, that I've mostly stopped scanning my spam folder for false positives.

Recently, though, I started wondering why I was seeing responses to messages I had never seen, and whether something was wrong given that commit and other notifications were occasionally getting lost. Digging deeper, I realized that GMail has started to flag a considerable portion of my legitimate email as spam. The false positive rate seems to be something like 8%. And it's not just random commit notifications that get lost. For example, these messages got marked as spam!

I've now explicitly marked all false positives as "Not spam", and I really hope that the GMail spam algorithm learns something from this.

Friday, May 25, 2007

Visiting New York

I spent the last few days in New Jersey preparing a pretty cool presales demo. The on-site schedule ended yesterday, and I decided to take today off to spend some quality time in New York. It's my fourth time in the city, but I still feel amazed by the overwhelming sense of life around here.

It seems I couldn't have picked a better day to stay here: the weather is just perfect and people are getting ready for the long Memorial Day weekend. Even the navy is visiting the city for Fleet Week.

I left the hotel just south of Central Park this morning for a day of sightseeing. I walked down the East River and stopped for a while to admire the old ships at the South Street Seaport. Then I continued to Battery Park at the south end of Manhattan, from where I took a boat trip to see Ellis Island and the Statue of Liberty. After the trip I had a light lunch before heading up by the Hudson River all the way to Pier 90, where the warship USS Wasp was open for visitors. Then an early dinner at a nice Italian place and a return to the hotel.

My feet are killing me, but the tour was definitely worth it. (I have over 300 pictures and about 20 minutes of video to process...) Too bad I'm already leaving the city tomorrow. Next time I'll bring my girlfriend along and we'll spend a whole week here.

I guess I won't go clubhopping tonight with these feet...

Thursday, May 17, 2007

The cause of an IOException

I just had to follow a stack trace through a complex codebase with multiple layers. The exception chaining mechanism introduced in Java 1.4 made the task easy up to the point where the last exception in the chain was an IOException thrown by code like this:

try {
    ....
} catch (Exception e) {
    // the original exception e is dropped here
    throw new IOException("...");
}


What a dead end! The problem is that the IOException constructors that allow exception chaining were only added in Java 6. Here's a workaround that would have saved me a lot of extra effort:

try {
    ....
} catch (Exception e) {
    IOException ioe = new IOException("...");
    ioe.initCause(e);
    throw ioe;
}
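
If the same pattern shows up in many places, the workaround can be pulled into a small helper. Here's a minimal, self-contained sketch of that idea. The `wrap` helper, the class name and the `readConfig` method are my own hypothetical names and not part of the JDK; the only API the helper relies on is `Throwable.initCause()`.

```java
import java.io.IOException;

public class ChainedIOException {

    // Hypothetical helper (not part of the JDK): creates an IOException
    // that keeps the original exception as its cause, using only
    // Throwable.initCause(), which is available since Java 1.4.
    static IOException wrap(String message, Throwable cause) {
        IOException ioe = new IOException(message);
        ioe.initCause(cause);
        return ioe;
    }

    // Stand-in for a lower layer that fails with some other exception
    static void readConfig() throws IOException {
        try {
            throw new IllegalStateException("backend not initialized");
        } catch (Exception e) {
            throw wrap("Failed to read configuration", e);
        }
    }

    public static void main(String[] args) {
        try {
            readConfig();
        } catch (IOException e) {
            // The printed stack trace now contains a "Caused by:" section
            e.printStackTrace();
        }
    }
}
```

On Java 6 and later the helper is unnecessary, since `new IOException("...", e)` does the same thing in one step.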

Monday, April 30, 2007

Applied number theory

Mark Dominus writes great geek stuff in The Universe of Discourse. Yesterday, based on feedback from Mauro Persano, he wrote about a brilliant solution to finding simple fractions, a topic he had already started on earlier. Clever, simple, and fast!

Stern-Brocot tree of fractions

I wonder if I could apply the same idea in some of the hierarchical storage models I've been thinking about lately...
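
For reference, here's my own rough sketch of the basic idea of walking the Stern-Brocot tree to find the simplest fraction inside an interval. It's not necessarily the exact method from the posts mentioned above, and the class and method names are made up. It assumes 0 < lo <= hi and uses plain double comparisons, so it's only meant for illustration.

```java
public class SimpleFraction {

    // Returns {numerator, denominator} of the first Stern-Brocot tree node
    // that falls inside [lo, hi]. That node is the fraction with the
    // smallest denominator in the interval. Assumes 0 < lo <= hi.
    static long[] simplest(double lo, double hi) {
        long leftNum = 0, leftDen = 1;    // left bound 0/1
        long rightNum = 1, rightDen = 0;  // right bound 1/0, i.e. "infinity"
        while (true) {
            // The mediant of the two bounds is the next node in the tree
            long num = leftNum + rightNum;
            long den = leftDen + rightDen;
            double value = (double) num / den;
            if (value < lo) {
                // Too small: descend to the right subtree
                leftNum = num;
                leftDen = den;
            } else if (value > hi) {
                // Too large: descend to the left subtree
                rightNum = num;
                rightDen = den;
            } else {
                return new long[] { num, den };
            }
        }
    }

    public static void main(String[] args) {
        long[] f = simplest(3.14, 3.15);
        System.out.println(f[0] + "/" + f[1]);
    }
}
```

For the interval from 3.14 to 3.15 the walk ends at 22/7. Note that this naive version takes one step per tree level, so it can be slow for values very close to zero or for very large values.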

Saturday, January 13, 2007

ApacheCon proposals

ApacheCon US 2006 was the first ApacheCon I attended, and I went there mostly to look around and get a feel for the event. Encouraged by the good reception of my ad-hoc presentations there, I wanted to step up and propose some real sessions for the next ApacheCon. Thus, my proposals for ApacheCon Europe 2007 are:

  • Up to Speed with Java Content Repository API and Jackrabbit
    Joint session with Alexandru Popescu. Targeted for people interested in content management with JCR and Jackrabbit.

  • Structure and Implementation of Apache Jackrabbit
    A walkthrough of the Jackrabbit internals. Not just for Jackrabbit developers but for anyone who is interested in seeing a reasonably complex codebase explained using various analysis and diagramming methods (like DSM).


I also proposed a half-day tutorial on JCR content management, and we'll probably arrange an informal Jackrabbit BOF during the event.