The first one is Apache Lucene EuroCon, a dedicated Lucene and Solr user conference on 18-21 May in Prague. That's the place to be if you're in Europe and interested in Lucene-based search technology (or want to stop by for the beer festival). I'll be there presenting Apache Tika, and the abstract of my presentation is:
Apache Tika is a toolkit for extracting text and metadata from digital documents. It's the perfect companion to search engines and any other applications where it's useful to know more than just the name and size of a file. Powered by parser libraries like Apache POI and PDFBox, Tika offers a simple and unified way to access content in dozens of document formats.
This presentation introduces Apache Tika and shows how it's being used in projects like Apache Solr and Apache Jackrabbit. You will learn how to integrate Tika with your application and how to configure and extend Tika to best suit your needs. The presentation also summarizes the key characteristics of the more widely used file formats and metadata standards, and shows how Tika can help deal with that complexity.
The rest of the conference program is also now available. See you there!
As much as I love Lucene, PDFBox sneaks up on me time and again to stab me in the back by killing the server with infinite loops and gigabytes of heap, at least with 0.8.
Is this something the project owners are aware of?
I can only recommend running text extraction for Lucene in another JVM, or else you should be prepared for some unpleasant surprises.
There's already a PDFBox version 1.1.0 out (see http://pdfbox.apache.org/download.html), and we're actively working on various improvements there.
[...] Tagged: berlin, berlin buzzwords, conference, Jackrabbit, jcr, tika Like the Lucene conference I mentioned earlier, Berlin Buzzwords 2010 is a new conference that fills in the space left by the decision [...]