Saturday, January 7, 2006

Analyzing the Jackrabbit architecture with Lattix LDM

Tim Bray pointed to David Berlind who pointed to the Lattix company. Lattix makes a tool called Lattix LDM that uses a Dependency Structure Matrix to work with software architecture. I watched the nice Lattix demo and decided to try the software out.

After receiving my Community license and struggling for a while to get the software running on Linux (need to include both the jars and the Lattix directory in the Java classpath!) I loaded the latest Jackrabbit jar file for analysis. The dependency matrix of the top-level packages after an initial partitioning is shown below:

Top-level packages of Jackrabbit

The matrix contains all the package dependencies. A number in a cell of the matrix tells how many dependencies the package on the vertical column has on the package on the horizontal row. You can tell how widely a package is used by reading the cells on the package row. The package column identifies the other packages that the selected package uses or depends on. In general a healthy architecture only contains dependencies located below the diagonal.
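As a rough illustration of the row and column convention described above, here is a minimal Python sketch that builds such a matrix from a list of (user, used) package pairs and lists the cells above the diagonal. The function names `build_dsm` and `above_diagonal` are hypothetical, and the dependency counts are invented for the example; none of this is Lattix LDM's API or real Jackrabbit data.

```python
from collections import Counter


def build_dsm(packages, dependencies):
    """Return matrix[row][col]: how many dependencies the package in
    column `col` has on the package in row `row`."""
    index = {name: i for i, name in enumerate(packages)}
    size = len(packages)
    matrix = [[0] * size for _ in range(size)]
    for (user, used), count in Counter(dependencies).items():
        matrix[index[used]][index[user]] += count
    return matrix


def above_diagonal(packages, matrix):
    """List (user, used, count) for every non-empty cell above the diagonal."""
    found = []
    for row, used in enumerate(packages):
        for col in range(row + 1, len(packages)):
            if matrix[row][col]:
                found.append((packages[col], used, matrix[row][col]))
    return found


# Invented example data: higher layers are listed first, utilities last.
packages = ["jndi", "core", "util"]
dependencies = [
    ("jndi", "core"),
    ("jndi", "core"),
    ("core", "util"),
    ("util", "core"),  # a lower layer reaching upwards
]

dsm = build_dsm(packages, dependencies)
violations = above_diagonal(packages, dsm)
```

With this ordering, the only cell above the diagonal is the made-up dependency from util back to core, which is exactly the kind of entry that stands out in the screenshots.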

The packages 2-6 form the general purpose Jackrabbit commons module, while the more essential parts of the Jackrabbit architecture are found within the core module. I grouped the commons packages and expanded the core module to get a more complete view of the Jackrabbit internals:

Jackrabbit core and commons modules

There was no immediate structure appearing, so I used the automatic DSM partitioning tool on the core module to sort out the package dependencies:

Jackrabbit core after initial partitioning
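How Lattix LDM partitions a matrix internally is not described here, so the following is only a sketch of one well-known way to get a similar result: Tarjan's strongly connected components algorithm. It groups packages that depend on each other in a cycle and orders the groups so that each group only uses groups listed after it. The `partition` function and the sample dependencies are assumptions made for illustration.

```python
def partition(uses):
    """Group mutually dependent packages and order the groups.

    `uses` maps a package name to the set of packages it depends on.
    Each returned group depends only on groups later in the list.
    """
    nodes = sorted(set(uses) | {d for deps in uses.values() for d in deps})
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    groups = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for dep in sorted(uses.get(node, ())):
            if dep not in index:
                visit(dep)
                lowlink[node] = min(lowlink[node], lowlink[dep])
            elif dep in on_stack:
                lowlink[node] = min(lowlink[node], index[dep])
        if lowlink[node] == index[node]:
            # `node` is the root of a cycle group: pop its members.
            group = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                group.append(member)
                if member == node:
                    break
            groups.append(sorted(group))

    for node in nodes:
        if node not in index:
            visit(node)

    # Tarjan emits the most depended-upon groups first; reverse the list
    # so that users come first and utilities last.
    groups.reverse()
    return groups


# Invented example: core and state depend on each other.
example = {
    "jndi": {"core"},
    "core": {"state", "util"},
    "state": {"core", "util"},
    "util": set(),
}
layers = partition(example)
```

A group with more than one member corresponds to an interdependent block in the matrix, and single-member groups can all be placed so that their dependencies fall below the diagonal.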

The value, config, fs, and util packages form a lower utility layer and the jndi package a higher deployment tool layer. The most interesting part lies between those layers, in the large interdependent section in the middle. The key to the architecture seems to be the main core package that both uses and is used by other packages in this section. I opened a separate view for examining the contents of the main core package:

Partitioning classes within the main core package

The partitioning suggests that it might make sense to split the package in two parts. Without concern for semantic grouping, I just grouped the classes in the upper half as core.A and the classes in the lower half as core.B. This seems useful as the core.B package seems to be a bit better in terms of external dependencies:

Jackrabbit core after splitting the main core package

Running the package partitioning again, I got somewhat more balanced results, although the main section is still heavily interdependent:

Jackrabbit core partitioning after splitting the main core package

Looking at the vertical columns, it seems that the main culprits for the interdependencies are the nodetype, state, version, and virtual core.A packages. Both the nodetype and state packages contain subpackages, so I wanted to see if the dependencies could be localized to just a part of those packages:

Contents of the Jackrabbit state and nodetype packages

This is interesting: the interdependencies of the state package come from the main state package, while the nodetype interdependencies only affect the nodetype.virtual subpackage. I split both packages along those dependency relations, and partitioned the core module again:

Jackrabbit core partitioning after splitting the state and nodetype packages

The persistence managers in the state subpackages are now outside the main section just like the non-virtual nodetype classes. After a short while of further research on the dependencies I found that the partitioning of the main state package would suggest that the item state managers be split to a separate package:

Contents of the main state package

After creating a new statemanager package for containing the item state managers, the partitioning of the core module starts to look better. The only remaining circular dependencies are for the virtual core.A and core.B packages:

Jackrabbit core partitioning after moving the state managers into a new statemanager package

Looking at the virtual core.B package we find that only the NodeId, PropertyId, and ItemId classes depend on the state package:

Contents of the core.B package

In fact it seems that it might make sense to move those three classes to the state package. After doing that the core module partitioning looks even better:

Jackrabbit core partitioning after moving the ItemId classes to the state package

The only remaining source of cyclic dependencies is the virtual core.A package, into which I won't be going any deeper at this moment. Even now the analysis seems to have provided a number of suggestions for reducing the amount of cyclic dependencies and thus improving the design quality of the Jackrabbit core:

  • Split the main core package into subpackages

  • Move the nodetype.virtual package to a higher level

  • Move the state subpackages to a separate package

  • Make a separate package for the item state managers

  • Move the NodeId, PropertyId, and ItemId classes to the state package

Note that these suggestions are just initial ideas based on a quick walkthrough of the Jackrabbit architecture using a Dependency Structure Matrix as a tool. As such the approach only gives structural insight into the architecture, and for this short analysis I made little use of knowledge about the semantic roles and relationships of the Jackrabbit classes and packages.

Monday, January 2, 2006

Implementing mRFC 0024

Today I wrote the mRFC 0024: Full text indexing in Midgard proposal for adding full text and content tree support to the Midgard Query Builder. Like Torben did for the MidCOM indexer, I'm planning to use Apache Lucene as the underlying full text engine. The search indexer process shall be based on the Lucene Java library, but I haven't yet decided what I should use for the query part. On the surface the best option would seem to be either the Lucene4C or the CLucene library, but both options have drawbacks. Lucene4C seems like the best match for the midgard-core environment, but it doesn't seem to be very actively developed and there's even been talk of abandoning it for a gcj-compiled version of Lucene Java. The CLucene library is more mature, but it's written in C++ and might therefore cause some unexpected build issues for midgard-core. One option would of course be to actually try linking midgard-core with a gcj-compiled Lucene Java! I'll prototype with all these options tomorrow while the mRFC 0024 vote takes place.

Another interesting issue in mRFC 0024 is the introduction of the parent cache, or actually a global content tree structure. Currently Midgard supports a sort of tree model for all content, but it is mostly accessible only through limited views such as the topic, page and snippet trees. Functions like is_in_tree or list_..._all have also required major scope limitations or other performance hacks to be useful. This is a bit troublesome for many use cases like searching and access control. The proposed parent cache would greatly simplify such content tree operations.
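To make the idea concrete, here is a minimal Python sketch of how a single child-to-parent mapping could answer both kinds of tree question. The dictionary `parent_of` stands in for the proposed parent cache, and the helpers `is_descendant` and `list_descendants` are hypothetical names only loosely modelled on the functions mentioned above; they are not Midgard's API, and the GUIDs are invented.

```python
def is_descendant(parent_of, root_guid, guid):
    """Return True if `guid` is `root_guid` or lies somewhere below it."""
    seen = set()
    current = guid
    while current is not None and current not in seen:
        if current == root_guid:
            return True
        seen.add(current)  # guards against accidental parent loops
        current = parent_of.get(current)
    return False


def list_descendants(parent_of, root_guid):
    """Return the GUIDs of all records below `root_guid`."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    result = []
    pending = [root_guid]
    while pending:
        node = pending.pop()
        for child in sorted(children.get(node, [])):
            result.append(child)
            pending.append(child)
    return result


# Invented example: one content tree mixing different record types.
parent_of = {
    "topic-a": "root",
    "article-1": "topic-a",
    "event-1": "root",
}

below_root = list_descendants(parent_of, "root")
```

Because every record type shares the same mapping, the lookups need no per-type special cases, which is presumably where the simplification would come from.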

If the proposed content tree model catches on, then a natural migration path for Midgard 2.0 would be to make the proposed parent_guid field the official parent link in all Midgard records. This would both simplify the object model and allow for much flexibility in organizing the content tree. It would for example be possible to create an event calendar topic that has all the event objects as direct descendants instead of having to use an explicit link to a separate root event. The only problem with this approach is that it is a major backwards compatibility issue...

Sunday, January 1, 2006

Network is the computer

I am sick and tired of doing backups, synchronizing settings, and having trouble accessing information. These are all symptoms of keeping your data locally on multiple computers. As a new year's resolution I have decided to get rid of all these problems.

So far I've tried to solve these issues by maintaining my own mail and web servers and keeping my data mostly on the servers. The problem with this approach is that I've never had (and probably never will have) the time to set up and maintain all the network services I'd need.

Thus I've decided to fully embrace the famous "Network is the computer" slogan by moving to fully external network applications for most of my daily information management.

As of today my Internet toolset will consist of the following:

The main reason why I haven't done this before is the question of data access and ownership. There is always the chance of one of the service providers going down and taking the service with them. The external service interface also limits the things you can do with your data. Luckily, with the recent emergence of various programmable APIs (REST, SOAP, etc.), these problems have become much less pressing. I can now write my own tools to import, export, and manipulate the externally stored data as easily as (or even more easily than!) local data. This, I believe, is one of the cornerstones of the network as a computer.