The thread model of Java is pretty good and works well for many use cases, but every now and then you need a separate process for better isolation of certain computations. For example, in Apache Tika we're looking for a way to avoid OutOfMemoryErrors or JVM crashes caused by faulty libraries or troublesome input data.
In C and many other programming languages the straightforward way to achieve this is to fork separate processes for such tasks. Unfortunately Java doesn't support the concept of a fork (i.e. creating a copy of a running process). Instead, all you can do is to start up a completely new process. To create a mirror copy of your current process you'd need to start a new JVM instance with a recreated classpath and make sure that the new process reaches a state where you can get useful results from it. This is quite complicated and typically depends on predefined knowledge of what your classpath looks like. Certainly not something for a simple library to do when deployed somewhere inside a complex application server.
But there's another way! The latest Tika trunk now contains an early version of a fork feature that allows you to start a new JVM for running computations with the classes and data that you have in your current JVM instance. This is achieved by copying a few supporting class files to a temporary directory and starting the "child JVM" with only those classes. Once started, the supporting code in the child JVM establishes a simple communication protocol with the parent JVM using the standard input and output streams. You can then send serialized data and processing agents to the child JVM, where they will be deserialized using a special class loader that uses the communication link to access classes and other resources from the parent JVM.
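To make the deserialization step a bit more concrete, here is a minimal sketch of the general technique: an ObjectInputStream subclass that resolves classes through a caller-supplied class loader. This is not the actual Tika code. The names ForkSketch and LoaderObjectInputStream are hypothetical, and the byte array stands in for the stdin/stdout link between the two JVMs. In the real feature the loader would be the special one that fetches classes from the parent over that link.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;

/** Hypothetical sketch; none of these names are part of Tika. */
class ForkSketch {

    /** ObjectInputStream that resolves classes through a given class loader. */
    static class LoaderObjectInputStream extends ObjectInputStream {
        private final ClassLoader loader;

        LoaderObjectInputStream(InputStream in, ClassLoader loader) throws IOException {
            super(in);
            this.loader = loader;
        }

        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc)
                throws IOException, ClassNotFoundException {
            try {
                // Ask the supplied loader first instead of the default one.
                return Class.forName(desc.getName(), false, loader);
            } catch (ClassNotFoundException e) {
                // Fall back to the default lookup (covers primitive types).
                return super.resolveClass(desc);
            }
        }
    }

    /** Parent side: turn an object into bytes that could be sent to the child. */
    static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }

    /** Child side: rebuild the object, loading its classes via the given loader. */
    static Object deserialize(byte[] data, ClassLoader loader)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new LoaderObjectInputStream(new ByteArrayInputStream(data), loader)) {
            return in.readObject();
        }
    }
}
```

In a real parent/child setup the bytes would travel over the child's standard input rather than an in-memory array, and the loader passed to deserialize would decide where the class files come from.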
My code is still far from production-ready, but I believe I've already solved all the tricky parts and everything seems to work as expected. Perhaps this code should go into an Apache Commons component, since it seems like it would also be useful to other projects beyond Tika. Initial searching didn't bring up other implementations of the same idea, but I wouldn't be surprised if there are some out there. Pointers welcome.
Very cool trick. The contents of the classpath variable as well as the JAVA_HOME can be extracted from system properties, which I have done in the past, but it sounds like this can do that & more.
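For reference, the system-property approach mentioned in this comment can be sketched roughly as follows. The class name ChildJvmCommand and the main class passed to it are hypothetical; java.home and java.class.path are standard system properties. Note that java.class.path only reflects the classpath the JVM was launched with, so inside a container it will usually not cover classes loaded by other class loaders.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles a command line for a child JVM. */
class ChildJvmCommand {

    static List<String> build(String mainClass, String... args) {
        // Location of the running JVM and the classpath it was started with.
        String javaHome = System.getProperty("java.home");
        String classpath = System.getProperty("java.class.path");

        String executable = javaHome + File.separator + "bin" + File.separator + "java";

        List<String> command = new ArrayList<>();
        command.add(executable);
        command.add("-cp");
        command.add(classpath);
        command.add(mainClass);
        for (String arg : args) {
            command.add(arg);
        }
        return command;
    }
}
```

The resulting list can be handed to `new ProcessBuilder(command).start()` to launch the child process.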
That's a clever idea. So far we have encapsulated problematic 3rd party libraries within services and used socket communication for data exchange. If your idea enables controlling the separate library from within Java and maybe speeds up communication, this would be great. I'm quite interested in this project and could offer some help.
Reminds me of "mobile code" in JINI, you might even "simply" fire up a JavaSpace and write the thing you want to get done as an entry, getting notified when done. This may be too much, I don't know.
You may want to take a look at the River project in ASF's Incubator.
That is good news!
I complained about crashing JVMs due to errors during content extraction in another post of yours, and I'd love to think that correlation is causation in this case. Please don't destroy my illusion. :)
Btw., I found that re-building the classpath is non-trivial if you plan to include all classes of a servlet container in the classpath, not just the WEB-INF/lib directory. I did that once to isolate a Lucene-based web crawler.
Oh yes, please push this into Commons.
This is very similar to Hadoop's TaskTracker implementation. The TaskTracker copies classes or JARs to a working directory and then forks a child JVM. The parent JVM can then communicate with the child JVM using Hadoop's own IPC.
Can we fork a JVM to solve lack-of-heap-memory issues?
A very interesting post, but can you please also describe how to fork a separate child JVM?
ReplyDeleteHi,
Clever framework but might I strongly suggest repaying the stderr output in a separate thread to avoid all sorts of nasty deadlocks? You can ensure that it does when the server does, and is run as a daemon, etc...
Rgds
Damon
Sorry, "relaying" not "repaying", and "it [the thread] exits" not "it does"! Fingers a long way behind my brain today... %-P
Rgds
Damon
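The suggestion above, relaying the child's stderr in a separate daemon thread, could look roughly like this. It is a minimal sketch, not the Tika implementation, and the class name StreamRelay is hypothetical. The idea is that the child can block once its stderr pipe buffer fills up, so something in the parent has to keep draining it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper that copies one stream to another in a daemon thread. */
class StreamRelay {

    static Thread start(InputStream from, OutputStream to) {
        Thread thread = new Thread(() -> {
            byte[] buffer = new byte[4096];
            try {
                int n;
                // Copy until end of stream, i.e. until the child closes its stderr.
                while ((n = from.read(buffer)) != -1) {
                    to.write(buffer, 0, n);
                    to.flush();
                }
            } catch (IOException e) {
                // The stream was closed underneath us; nothing more to relay.
            }
        }, "stderr-relay");
        // A daemon thread does not keep the parent JVM alive on its own.
        thread.setDaemon(true);
        thread.start();
        return thread;
    }
}
```

With a child started through ProcessBuilder, a call such as `StreamRelay.start(process.getErrorStream(), System.err)` would do the relaying. On Java 7 and later, `ProcessBuilder.redirectErrorStream(true)` or `redirectError(ProcessBuilder.Redirect.INHERIT)` are alternatives that avoid the extra thread.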
Hi,
With your patch, the memory footprint looks to be smaller. I saw that when the JVM spawns a new process, the Java heap space is doubled, which could be avoided by communicating with a daemon that's already running.
I had memory issues with HBase on my dev system with only 1G of RAM (which is not recommended). See http://techvineyard.blogspot.com/2011/01/trying-nutch-20-hbase-storage.html for more details.
Caused by: java.io.IOException: Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
It would be interesting to integrate your approach in the org.apache.hadoop.fs.RawLocalFileSystem.execCommand method, which performs some file manipulations (chmod for instance). This has been reported here:
https://issues.apache.org/jira/browse/HADOOP-5059
Thanks for the Hadoop reference! We were just discussing a similar issue in https://issues.apache.org/jira/browse/JCR-2864 and I filed https://issues.apache.org/jira/browse/TIKA-591 to address this in Tika. I also commented on HADOOP-5059. Perhaps this is something that all these projects could work together on?
Thank you very much!
Your article really inspired me yesterday when I was trying to find a solution that would provide a safe, resource-limited environment for plugins and hosted applications.
There were several ways to approach this task, but your idea offers a comfortable and easy way of interacting between the host and client sides.
If you're interested, here is the source code of the solution: https://github.com/nikelin/Redshape-AS/tree/master/forker