Wednesday, February 15, 2006

The road to Jackrabbit 0.9

I recently volunteered to work as the release manager for the 0.9 and 1.0 releases of the incubating Apache Jackrabbit project. The 0.9 part of this task was finished yesterday when I officially announced the release of Apache Jackrabbit version 0.9. The process leading up to the release announcement has been an interesting and educational one.

I first got involved with the Jackrabbit project in November 2004 when I announced an initial version of the JCR-RMI layer I had been working on. Based on positive feedback I got more and more active on the mailing list and continued working on JCR-RMI. My work was rewarded with committer status almost exactly a year ago, in February 2005. Since then I've been working on various parts of Jackrabbit, although I have yet to touch the deep internals of the Jackrabbit core. I got another reward for my work when I was invited to join the expert group of JSR 283: Content Repository for Java™ Technology API, Version 2.0.

Talk of making a Jackrabbit 1.0 release started sometime in summer 2005, after JSR 170 had stabilized the JCR API. There was, however, not much activity towards a release, so I decided to start pushing things in late September. This sparked some discussion and concerns about some of the open issues. The Jira roadmap for 1.0 grew bigger and bigger as more issues were filed. I finally solved this problem in November by setting up a separate 1.1 goal for the more complex issues.

Despite these efforts and a lot of ongoing development work, there still wasn't much happening toward an initial release. This changed in January when Roy suggested that a 1.0 release plan might be a good thing to have. I volunteered to act as the release manager, and based on the resulting discussions I published the Jackrabbit 0.9 release plan in late January. The idea (as suggested by Roy) was to first make a 0.9 release to streamline the release process for a 1.0 release soon after.

After working for a while on making the build environment better fit the release needs, I announced a 0.9 release candidate. Making the release candidate had me reading through a number of release guidelines on the various Apache websites and refreshing my GnuPG skills. All Apache releases are digitally signed, so I needed to set up my own key. I also got my key signed by Roy, so I expect it to show up in the Apache web of trust in a few days.

Only one issue got fixed between the first release candidate and the version I posted for the official release vote. The vote passed with no objections, but then Roy suggested a few packaging changes before endorsing the release as required by the Incubator policy. I made the changes, and since no source files were modified there was no need to restart the release vote. Roy then proceeded to request and receive approval for the release from the Incubator PMC.

After the approval we placed the release under cvs.apache.org, where incubating projects are allowed to publish their releases. I then updated the Jackrabbit web site and sent the release announcement. The next step is the more challenging 1.0 release, on which I'll start focusing tomorrow.

Sunday, February 12, 2006

The joy of troubleshooting

How do you troubleshoot a problem in a computer system? There are a number of standard troubleshooting methodologies, like trial-and-error, bottom-up, top-down, divide-and-conquer, and even root cause analysis, but none of them is a silver bullet for all possible problems. Often the most effective way is to dynamically combine these approaches based on experience and an understanding of the underlying technologies. This is the story of a particularly hard problem with an unexpected cause that I troubleshot recently.

The problem I was facing was related to a Midgard staging/live setup using the Exorcist tool. The staging site was working just fine, but the live site was having weird problems on some pages. It appeared as if the MidCOM components of the troublesome pages were losing some of their configuration parameters.

Step 1. Because everything was working fine on the staging site, I figured that the problem was most likely related to the staging/live replication step performed using the Exorcist. Were the parameters not being replicated? None of the common causes of such situations (unapproved content, parameters in a different sitegroup than the main object, etc.) seemed to apply, however, so I inspected whether the replication dump actually contained the missing parameters. It did!

Step 2. So the parameters were being replicated, but was there some trouble in how they were being imported into the live database? There was nothing strange in the import log file, so I opened a MySQL prompt to inspect the live database. The troublesome parameters seemed to exist just where they should, and they even had all the correct GUID mappings and other details that had previously caused problems with the Exorcist.
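
The check was along these lines (Midgard stores parameters in the record_extension table, as seen later in this story; the oid and sitegroup values here are illustrative, not the real ones):

-- Do the parameters of the object exist in the live database?
SELECT id,domain,name,value,sitegroup FROM record_extension WHERE tablename='topic' AND oid=27;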

Step 3. It seemed that the live database was in perfect condition, so the problem must be caused by something else. The problem was affecting just some MidCOM components, so I figured that there might be some difference in the component versions used on the staging and live hosts. No, the MidCOM source trees on both the staging and live hosts were identical.

Step 4. If the problem wasn't in MidCOM, then it had to be in the Midgard framework. By adding debug prints to one of the troublesome MidCOM components I was able to narrow the problem down to a $object->listparameters($component) call in the MidCOM configuration class. The configuration class seemed to be working just fine everywhere else, so the problem had to be in the listparameters method! For some object/component combinations the method just returned nothing and didn't log any errors, even though I double- and triple-checked that the parameters existed in the database.

Step 5. I added a set of debug prints to the Midgard-PHP listparameters method, compiled and installed the modified module, and reloaded the troublesome pages. They still didn't work properly, but now I got a detailed trace of what was happening under the hood. The method actually did execute the correct SQL query but received no results from the database. This was weird indeed, as I was quite sure that the parameters in question did exist.

Step 6. Could there be something wrong with the MySQL client library used by the Midgard core to send queries to and receive results from the database? The normal MySQL command-line tools use the same library, so I started the MySQL prompt and entered the SQL statement copied from the Midgard log file. No results! The SQL statement was:
SELECT id,domain,name,value FROM record_extension WHERE (tablename='topic' AND oid=27 AND domain='de.linkm.sitemap' AND lang=0) AND (record_extension.sitegroup in (0,2)) ORDER BY name

Step 7. There are some autogeneration artifacts in the query, but no real reason why it shouldn't work. I removed the domain='de.linkm.sitemap' constraint and got a list of parameters, including ones with the domain column set to de.linkm.sitemap! Even worse, the exact same query with domain='midcom' or domain='midcom.helper.nav' returned results; for some totally unknown reason MySQL seemed not to like the domain='de.linkm.sitemap' constraint. I also tried some other variations of the query and found out that even just removing the ORDER BY clause made the query return the correct rows. This should ring heavy warning bells for anyone who knows SQL: an ORDER BY clause should only order the returned rows, never change which rows are returned.
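
To illustrate, the variations behaved roughly like this:

-- No results, even though matching rows exist:
SELECT id,domain,name,value FROM record_extension WHERE (tablename='topic' AND oid=27 AND domain='de.linkm.sitemap' AND lang=0) AND (record_extension.sitegroup in (0,2)) ORDER BY name;
-- Results, including rows with domain='de.linkm.sitemap':
SELECT id,domain,name,value FROM record_extension WHERE (tablename='topic' AND oid=27 AND lang=0) AND (record_extension.sitegroup in (0,2)) ORDER BY name;
-- The correct rows, just by dropping the ORDER BY:
SELECT id,domain,name,value FROM record_extension WHERE (tablename='topic' AND oid=27 AND domain='de.linkm.sitemap' AND lang=0) AND (record_extension.sitegroup in (0,2));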

Step 8. Could this be some weird character set and collation issue, like the ones we've been facing recently? No, the query behaved the same way (rows returned or not depending on whether the ORDER BY clause was used) regardless of the character encoding being used.
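
For anyone wanting to rule this out themselves, the relevant settings can be inspected with standard MySQL statements like these:

-- Server and connection character set settings:
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
-- Per-table and per-column charsets and collations:
SHOW CREATE TABLE record_extension;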

Step 9. Had the database been corrupted? I backed up and restored the entire database, but the query was still misbehaving. I then copied the correctly behaving database from the staging host to the live host. The same error appeared in that database as well!

Step 10. The staging host was running MySQL version 4.1.12 while the live host had MySQL 4.1.7, both stable and tested releases installed from the standard RHEL packages. This brought back ugly memories of the MySQL 3.x days, when weird database errors seemed to be much more common.
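
For reference, the server version is easy to check right at the MySQL prompt:

SELECT VERSION();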

Step 11. OK, so the problem seemed to be some MySQL bug that got fixed somewhere between versions 4.1.7 and 4.1.12. I didn't want to risk messing with the database installation, so I figured I'd better find a workaround rather than try to upgrade the database.

Step 12. I used the EXPLAIN statement to find out what could be triggering the MySQL bug. The output suggested that the query planner was using some of the indexes on the record_extension table. The indexes actually looked quite weird:
KEY record_extension_oid_idx(oid),
KEY record_extension_tablename_idx(tablename(10)),
KEY record_domain_tablename_idx(domain(10)),
KEY record_extension_sitegroup_idx(sitegroup),
KEY record_extension_name_idx(name(10))

The indexes don't seem reasonable. Why the arbitrary ten-character prefix limit, and why use only a single column per index? This seemed anomalous enough to be triggering an obscure bug in MySQL, so I just dropped the record_domain_tablename_idx index. And it worked!
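
For the record, the diagnosis and the workaround boil down to two statements (the exact EXPLAIN output columns vary between MySQL versions):

-- See which indexes the planner picks for the troublesome query:
EXPLAIN SELECT id,domain,name,value FROM record_extension WHERE (tablename='topic' AND oid=27 AND domain='de.linkm.sitemap' AND lang=0) AND (record_extension.sitegroup in (0,2)) ORDER BY name;
-- The workaround: drop the suspect index.
ALTER TABLE record_extension DROP INDEX record_domain_tablename_idx;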

The entire troubleshooting session took about three hours with a couple of breaks in between.