Jukka Zitting

Wednesday, November 24, 2010

The case for the digital Babel fish

"Just like Arthur Dent, who after inserting a Babel fish in his ear could understand Vogon poetry, a computer program that uses Tika can understand Microsoft Word documents." This is how Tika in Action, our book on Apache Tika, introduces its subject. Download the freely available first chapter to read the full introduction.

Chris Mattmann and I started writing the Tika in Action book for Manning at the beginning of this year, and we're now well past the halfway point. If we keep up this pace, the book should be out in print by next summer! And thanks to the Manning Early Access Program (MEAP), you can already pre-order the book and read an early access edition at the Tika in Action MEAP page.

If you're interested, use the "tika50" code to get a 50% early access discount when purchasing the MEAP book. You'll still receive updates on all new chapters and of course the full book when it's finished. Note that this discount code is valid only until December 17th, 2010.

We're also very interested in all comments and other feedback you may have about the book. Use the online forum or contact us directly, and we'll do our best to make the book more useful to you!

Update: The book is out in print now! Use the "tika37com" code for a 37% discount on the final book.

Thursday, November 11, 2010

Open sourcing made easy

Open sourcing a closed codebase can be difficult. The typical approach is to decide that you'll go open source, make big news about it and then try to figure out how to proceed. It's no wonder many open source transitions end up being more painful than expected and fail to generate as much community interest and involvement as hoped. How can you do better?

0. Start small

Even though your marketing people will be eager to use a good story, you should avoid the temptation to make a big deal about your shiny new open source project. Instead, start with small, reversible steps that allow you to get comfortable with the new way of developing software before making public commitments. In other words, learn to walk before you try to run. The next sections outline how to do this.

1. Clean up the codebase

Do you really know what's inside your existing codebase? Do you have rights to use and redistribute all the included intellectual property? Are there trade secrets or other bits in the codebase that you'd rather not show everyone? Do you wish to keep parts of the codebase closed so you can keep selling them as add-on components on top of the open source offering?

Answering these questions should be your first task. You'll need to spend some time auditing and possibly refactoring your code to prepare it for the public eye. Depending on the codebase this could be anything from a trivial exercise to a significant project. The nice thing is that the increased understanding and potential modularity you gain from this work will be quite valuable even if you never take the next step.

2. Open up your tools

Now that your codebase is clean and ready for the public view, you can (and should!) start using public tools to develop the code. You can either make your existing version control, issue tracking and other tools public, or migrate to a new set of public tools. There are plenty of excellent free hosting services for open source projects, so you have a good opportunity to both lower your maintenance costs and improve your productivity through better tooling!

There's no need yet to worry about external users or contributors. In fact the fewer people you attract at this stage, the better! The main purpose of this step is to make your developers comfortable with the idea that anyone could come and see all their code and all the mistakes they are making. This is a big cultural change for many developers, and you'll want to start small to give them time to adapt in peace.

3. Engage the community

If you've followed the steps so far, you've actually already open sourced your codebase. Are you and your developers comfortable with the situation? It's still possible to switch back to closed source with minimal disruption and no lost reputation if you're having second thoughts. But if you are willing to move forward, now is the time to start enjoying the benefits of open development!

Call in your marketing people to do their magic. Tell the world about the code you're sharing, and invite everyone to participate! If your product is in any way useful to someone, you'll start seeing people come in, ask questions, submit bug reports and perhaps even contribute fixes. At this point it is useful to have a few people ready to help such new users and contributors, but it's surprising how quickly the community can become self-sufficient. More on that in a later post...

Sunday, November 7, 2010

Models of corporate open source

There are many different ways and reasons for companies to develop their software as open source. Here's some brief commentary on the main approaches you'll encounter in practice.

0. Closed source

Well, closed source is obviously not open, but I should mention it as not all software can or should be open. The main benefit of closed source software is that you can sell it. If you are working for profit, then you should only consider open sourcing your software if the benefits of doing so outweigh the lost license revenue.

1. Open releases

Also known as code drops. You develop the software internally, but you make your releases available as open source to everyone who's interested. This allows you to play the "open source" card in marketing, and makes for a great loss leader for a "pro" or "enterprise" version with a higher price tag. No changes are needed to more traditional closed source development processes. Unfortunately your users don't have much of an incentive to get involved in the development unless they decide to fork your codebase, which usually isn't what you'd want.

2. Open development

Making it easy for your users to get truly involved in your project requires changes in the way you approach development. You'll need to open up your source repositories, issue trackers and other tools, and make it easy for people to interact directly with your developers instead of going through levels of support personnel. Do that, and you'll start receiving all sorts of contributions like bug reports, patches, new ideas, documentation, support, advocacy and sales leads for free. You can even allow trusted contributors to commit their changes directly to your codebase without losing control of the project.

3. Open community

Control, or the illusion of it, is a double-edged sword. If you're the "owner" of the project, why should others invest heavily in developing or supporting "your" code? To avoid this inherent limitation and to unlock the full potential of the open source community, you'll need to let go of the idea of the project being yours. Instead you're just as much a user and a contributor to the project as everyone else, with no special privileges. The more you contribute, the more you get to influence the direction of the project. This is the secret sauce of most truly successful and sustainable open source projects, and it's also a key ingredient of the Apache Way.

So what's the right way?

There's no single best way to do open (or closed) source, and the right model for your project depends on many factors like your business strategy and environment. The right model can even vary between different codebases within the same company. For example in the "open core" model you increase the level of innovation in and adoption of your core technologies by open sourcing them (or using existing open source components), but you make money and maintain your competitive edge through closed source add-ons or full layers on top of the open core. This is the model we've been using quite successfully at Day (now a part of Adobe).

If you've decided to go open source and you don't have a strong need to maintain absolute control over your codebase (like I suppose Oracle now has over the OpenJDK!), I would recommend going all the way to the open community model. It can be a tough cultural change and often requires changes in your existing development processes and practices, but the payback can be huge. In military terms the community can act as a force multiplier not just for your developers, but also for the QA and support personnel and often even your sales and marketing teams!

If you're interested in pursuing the open community model as described above, the Apache Incubator is a great place to start!

Monday, November 1, 2010

Chongqing on the rise

"The largest city you've never heard about." That's how the Foreign Policy magazine labeled Chongqing in a recent story about the city. Today the Finnish television showed an interesting documentary that centered on the same city, and I recall seeing it mentioned also in the Economist recently. A sign of things to come?

I find it interesting that many of the above stories give the impression of Chongqing as a megacity of 30+ million people, when in fact (or at least according to Wikipedia) the urban population is "just" 5+ million people and a majority of the rest are farmers living in the surrounding areas that are administratively part of the city.

Thursday, August 26, 2010

Age discrimination with Clojure

Michael Dürig, a colleague of mine and big fan of Scala, wrote a nice post about the relative complexity of Scala and Java.

Such comparisons are of course highly debatable, as seen in the comments that Michi's post sparked, but for the fun of it I wanted to see what the equivalent code would look like in Clojure, my favourite post-Java language.

[sourcecode language="clojure"]
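;; separate returns [matching non-matching] for a predicate and a collection
(use '[clojure.contrib.seq :only (separate)])

;; A struct map with :name and :age keys
(defstruct person :name :age)

(def persons
  [(struct person "Boris" 40)
   (struct person "Betty" 32)
   (struct person "Bambi" 17)])

;; Split the persons into those aged 18 or under and everyone else
(let [[minors majors] (separate #(<= (% :age) 18) persons)]
  (println minors)
  (println majors))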
[/sourcecode]

The output is:

[sourcecode language="clojure"]
({:name Bambi, :age 17})
({:name Boris, :age 40} {:name Betty, :age 32})
[/sourcecode]

I guess the consensus among post-Java languages is that features like JavaBean-style structures and functional collection algorithms should either be a built-in part of the language or at least trivially implementable in supporting libraries.

Wednesday, July 28, 2010

Open Source at Adobe?

The news is just in about Adobe being set to acquire Day Software (see also the FAQ). Assuming the deal goes through, it looks like I'll be working for Adobe by the end of this year. I'm an open source developer, so I'm looking forward to finding out how committed Adobe is to supporting the open development model we're using for many parts of Day products.

The first comments from Erik Larson, a senior director of product management and strategy at Adobe, seem promising and he also asked what the deal should mean for open source. This is my response from the perspective of the open source projects I'm involved in.

First and foremost I'm looking forward to continuing the open and standards-based development of our key technologies like Apache Jackrabbit and Apache Sling. There's no way we'd be able to maintain the current level of innovation and productivity in these key parts of our product infrastructure without our symbiotic relationship with the open source community.

Second, I'm hoping that our experience and involvement with open source projects will help Adobe better interact with the various open source efforts that leverage Adobe standards and technologies like XMP, PDF and Flash. The Apache Software Foundation is a home to a growing collection of digital media projects like PDFBox, FOP, Tika, Batik and Sanselan, all of which are in one way or another related to Adobe's business. For example as a committer and release manager of the Apache PDFBox project I'd much appreciate better access to Adobe's deep technical PDF know-how. Similarly, in Apache Tika we're considering using XMP as our metadata standard, and better access to and co-operation with the people behind Adobe's XMP toolkit SDK (see more below) would be highly valuable.

It would be great to see Adobe becoming more proactive in reaching out to and supporting such grass-roots efforts that leverage their technologies. I've dealt with Adobe lawyers on such cases before with good results, but it did take some time before I found the correct people to contact. Another area of improvement would be to make freely redistributable Adobe IP more easily accessible to external developers by pushing it out to central repositories like Maven Central, RubyGems or CPAN, as I did when making PDF core font information available on Maven Central.

Finally, it would be great to see Adobe going further in embracing an open development model for some of their codebases like the XMP toolkit SDK that they already release under open source licenses. I'd love to champion or mentor the effort, should Adobe be willing to bring the XMP toolkit to the Apache Incubator!

Thursday, May 27, 2010

Forking a JVM

The thread model of Java is pretty good and works well for many use cases, but every now and then you need a separate process for better isolation of certain computations. For example in Apache Tika we're looking for a way to avoid OutOfMemoryErrors or JVM crashes caused by faulty libraries or troublesome input data.

In C and many other programming languages the straightforward way to achieve this is to fork separate processes for such tasks. Unfortunately Java doesn't support the concept of a fork (i.e. creating a copy of a running process). Instead, all you can do is to start up a completely new process. To create a mirror copy of your current process you'd need to start a new JVM instance with a recreated classpath and make sure that the new process reaches a state where you can get useful results from it. This is quite complicated and typically depends on predefined knowledge of what your classpath looks like. Certainly not something for a simple library to do when deployed somewhere inside a complex application server.
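To make the limitation concrete, here is a minimal sketch of the naive approach: build a command line for a new JVM from the `java.home` and `java.class.path` system properties. The class name `NaiveJvmLauncher` and the main class `com.example.Worker` are made up for illustration. Note that `java.class.path` only reflects the classpath the JVM was started with, so classes loaded by an application server's own class loaders would not be visible to the new process.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class NaiveJvmLauncher {

    /** Builds a command line that starts a new JVM with the startup classpath. */
    public static List<String> buildCommand(String mainClass) {
        // Path to the java executable of the JVM we are currently running in
        String java = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";

        List<String> command = new ArrayList<>();
        command.add(java);
        command.add("-cp");
        // Only the startup classpath; classes that come from container or
        // custom class loaders are not included here
        command.add(System.getProperty("java.class.path"));
        command.add(mainClass);
        return command;
    }

    public static void main(String[] args) {
        // "com.example.Worker" is a hypothetical main class
        List<String> command = buildCommand("com.example.Worker");
        System.out.println(String.join(" ", command));
        // To actually start the child process:
        // Process child = new ProcessBuilder(command).start();
    }
}
```

Even when this works, the new process starts from scratch, so the caller still has to get it into a useful state before asking it to do anything.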

But there's another way! The latest Tika trunk now contains an early version of a fork feature that allows you to start a new JVM for running computations with the classes and data that you have in your current JVM instance. This is achieved by copying a few supporting class files to a temporary directory and starting the "child JVM" with only those classes. Once started, the supporting code in the child JVM establishes a simple communication protocol with the parent JVM using the standard input and output streams. You can then send serialized data and processing agents to the child JVM, where they will be deserialized using a special class loader that uses the communication link to access classes and other resources from the parent JVM.
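The deserialization step can be sketched roughly as follows. This is not the Tika code, just an illustration under simplifying assumptions: a byte array stands in for the standard input and output link between the two JVMs, and an ordinary class loader stands in for the special one that would fetch classes from the parent. All names here (`ForkSketch`, `LoaderObjectInputStream`, `WordCount`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.util.concurrent.Callable;

public class ForkSketch {

    /** ObjectInputStream that resolves classes through a given class loader. */
    public static class LoaderObjectInputStream extends ObjectInputStream {
        private final ClassLoader loader;

        public LoaderObjectInputStream(InputStream in, ClassLoader loader)
                throws IOException {
            super(in);
            this.loader = loader;
        }

        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc)
                throws IOException, ClassNotFoundException {
            try {
                // In a child JVM this loader would ask the parent for class bytes
                return Class.forName(desc.getName(), false, loader);
            } catch (ClassNotFoundException e) {
                // Fall back to the default lookup (covers primitive types)
                return super.resolveClass(desc);
            }
        }
    }

    /** Hypothetical processing agent: a serializable task with its data. */
    public static class WordCount implements Callable<Integer>, Serializable {
        private static final long serialVersionUID = 1L;
        private final String text;

        public WordCount(String text) {
            this.text = text;
        }

        @Override
        public Integer call() {
            return text.trim().split("\\s+").length;
        }
    }

    /** Parent side: serialize the agent into bytes for the link. */
    public static byte[] send(Serializable agent) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(agent);
        }
        return buffer.toByteArray();
    }

    /** Child side: deserialize the agent with the given loader and run it. */
    public static Object receiveAndRun(byte[] data, ClassLoader loader)
            throws Exception {
        try (ObjectInputStream in = new LoaderObjectInputStream(
                new ByteArrayInputStream(data), loader)) {
            Callable<?> agent = (Callable<?>) in.readObject();
            return agent.call();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] wire = send(new WordCount("so long and thanks for all the fish"));
        // Prints 8
        System.out.println(receiveAndRun(wire, ForkSketch.class.getClassLoader()));
    }
}
```

In a real setup the bytes would travel over the child process's standard input, and the class loader passed to `receiveAndRun` would be the piece that loads classes and resources over the communication link instead of from a local classpath.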

My code is still far from production-ready, but I believe I've already solved all the tricky parts and everything seems to work as expected. Perhaps this code should go into an Apache Commons component, since it seems like it would be useful also to other projects beyond Tika. Initial searching didn't bring up other implementations of the same idea, but I wouldn't be surprised if there are some out there. Pointers welcome.