Integrating Java in Python with JTool

MERESCO combines components written in various programming languages. It uses Python to tie these components together. It integrates Java using JTool.

It began with Lucene
Lucene is a well-known Java library for full-text search. MERESCO used PyLucene, which compiled Lucene to native machine code. PyLucene was unstable and did not cover all of Lucene. In 2008 PyLucene changed strategy and performance dropped significantly. We decided to try a completely different approach, and that turned out to work very well.

What did we try?
We quickly discovered that compiling Lucene with GCJ was easy and that it resulted in robust, fast and reliable programs. Then we created a Python extension called JTool which mirrors the complete Java API in Python.

How to use?
Here is how you use it in Python:

$ python
>>> import jtool
>>> jtool.load('')  # compiled lucene-core.jar
>>> from org.apache.lucene.index import IndexReader
>>> reader = IndexReader.open("/indexdir")

This is how all of Lucene is accessed in MERESCO. It runs fast and reliably, with a low memory footprint. The code base of JTool is only 1500 lines; there is no code generation and it is completely generic. So the next question is:

Will JTool work for other Java libraries?
In February 2010 we started looking for a more scalable Triple Store for MERESCO. Our choice was OWLIM, which is written in Java. While Lucene is quite a large library, OWLIM is even larger. The latter depends on 22 other Java projects, including the Sesame RDF Framework.

Compilation of OWLIM took a bit more effort as we needed to gather all needed jar files and make sure some factories did not get duplicated in the final library. Then we tried to load this library in Python using JTool:

>>> import jtool
>>> jtool.load("")
>>> from org.openrdf.repository.sail import SailRepository
>>> from org.openrdf.query import QueryLanguage
>>> ...

This enabled us to insert RDF and execute SPARQL queries on the triple store. Yes, it works!

Future of JTool
JTool cannot yet call methods with null parameters or Java 5 varargs. It also does not support callbacks in Python yet. We have solutions for these omissions, which we will implement this year. Meanwhile, it is easy enough to create a Java wrapper and use it via JTool. So JTool allows us to quickly integrate any Java library in MERESCO.

Sources for JTool up to version 4 are available at JTool Sources.
JTool version 5 and up are available in binary form at JTool Binaries.

What makes Meresco different from Solr?

Solr focuses on getting the most out of one index type: Lucene. Meresco supports a number of different index types, each specialized for a specific task. Queries are split, each part is processed by the most appropriate index, and the results are integrated. This ensures that all types of queries are processed within tens or at most hundreds of milliseconds.
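
The split-and-integrate idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not Meresco's actual code: the two toy indexes, the `search` function and the shape of the query are all hypothetical.

```python
# Hypothetical sketch: each query part goes to the index best suited to it,
# and the per-index results (sets of record ids) are intersected.

FULLTEXT = {            # word -> record ids (stand-in for a full-text index)
    "python": {1, 2, 4},
    "java": {2, 3},
}
DATES = {1: 20090315, 2: 20100101, 3: 20080701, 4: 20110101}  # record id -> date


def fulltext_lookup(word):
    return FULLTEXT.get(word, set())


def range_lookup(low, high):
    # Matches low < date <= high.
    return {rid for rid, date in DATES.items() if low < date <= high}


def search(query):
    """query is a list of (kind, argument) parts; results are intersected."""
    handlers = {
        "word": fulltext_lookup,
        "date": lambda bounds: range_lookup(*bounds),
    }
    result = None
    for kind, argument in query:
        ids = handlers[kind](argument)
        result = ids if result is None else result & ids
    return sorted(result or [])


hits = search([("word", "python"), ("date", (20090101, 20101231))])
print(hits)
```

Because every part returns a plain set of record ids, integrating results stays cheap regardless of which index produced them.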

Each type of index has distinct and unique properties such as specific query algorithms, optimized access patterns and scalability. We will introduce each index type below together with a short characterization.

Fulltext Index
This index is optimized for queries for a combination of words, literal phrases, words near other words, etc. It is implemented with Lucene, which is known to scale very well. Meresco helps it scale by keeping it small; see the post Storage versus Index.

Facet Index
The facet index specializes in drilldown (faceting) queries, dynamic clustering and tag clouds. It produces exact results even on large data sets, which is one of Meresco's unique selling points. Meresco uses custom data structures and algorithms which scale to billions of postings on a single node.
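
To make "drilldown" concrete, here is a naive sketch of exact facet counting over a result set. The records and the `drilldown` function are hypothetical, and a plain `Counter` is used only for illustration; it says nothing about Meresco's custom data structures.

```python
from collections import Counter

# Hypothetical records: record id -> facet field values.
RECORDS = {
    1: {"language": "en", "type": "book"},
    2: {"language": "nl", "type": "article"},
    3: {"language": "en", "type": "article"},
    4: {"language": "en", "type": "book"},
}


def drilldown(result_ids, field):
    """Count how often each value of `field` occurs within a result set."""
    counts = Counter(
        RECORDS[rid][field] for rid in result_ids if field in RECORDS[rid]
    )
    return counts.most_common()


facets = drilldown({1, 3, 4}, "type")
print(facets)
```

The counts are exact because every record in the result set is visited, which is the property the facet index preserves at scale.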

Dictionary Index
This index supports fast lookup of arbitrary textual information related to keys. It supports simple lookup ‘queries’ only. It is implemented using Berkeley DB, which is known for its good scalability and performance. It is being used to scale up set and metadataPrefix queries in OAI-PMH to tens of millions of records. This post describes the process: Dependable OAI Repositories.

Sorted Dictionary Index
This index supports extremely fast lookup of simple numeric information attached to alphabetically ordered terms. It supports prefix queries such as those needed for auto-complete. It is implemented using a Burst Trie.
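
A prefix query over ordered terms can be sketched as below. Note the assumption: a sorted list with binary search stands in for the Burst Trie, which is not implemented here, and the terms, values and `prefix_lookup` name are hypothetical.

```python
import bisect

# Hypothetical data: each term carries one numeric value
# (for example a document frequency).
VALUES = {
    "merchant": 12,
    "mercury": 7,
    "meresco": 42,
    "merge": 3,
    "meridian": 5,
    "python": 99,
}
TERMS = sorted(VALUES)  # alphabetical order is what makes prefix queries cheap


def prefix_lookup(prefix):
    """Return (term, value) pairs for all terms starting with `prefix`."""
    start = bisect.bisect_left(TERMS, prefix)
    # "\uffff" sorts after any ordinary character, so this bounds the prefix range.
    end = bisect.bisect_left(TERMS, prefix + "\uffff")
    return [(term, VALUES[term]) for term in TERMS[start:end]]


completions = prefix_lookup("merc")
print(completions)
```

All matches for a prefix are contiguous in the ordered terms, so an auto-complete request needs only two binary searches and a slice.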

Triple Store
This index supports queries about arbitrary relationships between objects (graph-inference) typically through SPARQL or extensions to CQL. It is implemented with rdflib and OWLIM, the former being simple, the latter being one of the most scalable and fast triple stores around. An application is relating traditional records to social metadata such as tagging, ratings and reviews. A lot is going to happen around here.
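
The kind of relationship query meant here can be illustrated with a tiny in-memory pattern matcher. This sketch uses neither rdflib nor OWLIM and does not implement SPARQL; the triples and the `match` function are hypothetical.

```python
# Hypothetical triples: (subject, predicate, object).
TRIPLES = {
    ("record:1", "dc:title", "Moby Dick"),
    ("review:7", "reviews", "record:1"),
    ("review:7", "rating", "5"),
    ("tag:sea", "tags", "record:1"),
}


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for s, p, o in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


# Everything that points at record:1, i.e. its social metadata.
related = match(obj="record:1")
print(related)
```

A real triple store answers such patterns from indexes rather than by scanning, and combines several patterns into one graph query.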

Range Index
This index supports ultra-fast retrieval of data contained in numerical ranges. It supports range queries such as 20090101 < date <= 20101231. Meresco has its own optimized implementation. This index is so small that it scales to billions of documents even on a single node.
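
The example query 20090101 < date <= 20101231 can be answered with two binary searches over sorted values. This is a minimal sketch under assumptions, not Meresco's optimized implementation; the entries and the `date_range` function are hypothetical.

```python
import bisect

# Hypothetical data: (date, record id) pairs kept sorted by date.
ENTRIES = sorted([
    (20081231, "a"),
    (20090101, "b"),
    (20090615, "c"),
    (20101231, "d"),
    (20110101, "e"),
])
DATES = [date for date, _ in ENTRIES]


def date_range(low, high):
    """Return record ids with low < date <= high."""
    start = bisect.bisect_right(DATES, low)   # first date strictly above low
    end = bisect.bisect_right(DATES, high)    # one past the last date <= high
    return [rid for _, rid in ENTRIES[start:end]]


ids = date_range(20090101, 20101231)
print(ids)
```

Using `bisect_right` for both bounds gives exactly the exclusive lower and inclusive upper bound of the query.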

N-gram Index
The n-gram index is capable of performing approximate matches and is hence used for suggestions in ‘Did you mean?’-like solutions. More generally, it allows for language-neutral queries. This index is layered on top of the Lucene index, but is slated to be replaced by a faster and more specialized one in 2010.
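
Approximate matching with n-grams can be sketched as follows. This is an illustration only: it compares character trigrams with Jaccard similarity, which is an assumption and not necessarily how Meresco scores candidates, and the `trigrams` and `suggest` functions, the vocabulary and the threshold are all hypothetical.

```python
def trigrams(word):
    """Return the set of character trigrams of a padded, lowercased word."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def suggest(word, vocabulary, threshold=0.3):
    """Return vocabulary terms whose trigram overlap with `word` is high enough."""
    grams = trigrams(word)
    scored = []
    for candidate in vocabulary:
        other = trigrams(candidate)
        score = len(grams & other) / len(grams | other)  # Jaccard similarity
        if score >= threshold:
            scored.append((score, candidate))
    # Best score first; ties are broken alphabetically.
    return [c for _, c in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


VOCABULARY = ["meresco", "mercury", "lucene", "python"]
suggestions = suggest("merseco", VOCABULARY)
print(suggestions)
```

Because the comparison works on character sequences rather than on words or stems, the same approach applies to any language.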

Meresco can maintain these index types in sync both during batch and real-time updates. Together, these indexes deliver fast results to queries, even if those queries are complicated and demanding such as tag clouds, auto-complete, clustering, term suggestions, did-you-mean and relationship queries.