Multi-threaded expert search with Lucene

Lucene’s high-level search API parallelizes search automatically, achieving superb performance.

The low-level expert API does not do that.  Here is a solution to parallelize custom search. It is Open Source.

Introduction

The Open Source search engine Lucene has evolved into a very smart and efficient search tool.  It offers both high-level search interfaces and low-level (expert) interfaces. If you pass it an Executor, the high-level interface (API) automatically puts all the CPUs in a server to work on your queries as fast as possible.  The results are downright fantastic: for quite large data sets and quite complicated queries, you will get results in milliseconds.

Concurrency and the expert interface

However, the low-level API does not put all your cores to work automatically.  That means that as soon as queries become more complicated and you have to use the low-level API, everything runs slower.  That is not a problem as long as you limit the number of documents to, say, 10,000,000: most functionality, like sorting and facets, still runs in a few hundred milliseconds. But when you have many more documents and must support many concurrent queries, the help of more CPUs would be very welcome.

Before I introduce our solution, a quick word about the so-called expert API and the problem with it.

How the low-level API works: Collector

The main work of any search engine consists of intersecting, intersecting and yet more intersecting. Lucene’s low-level expert API gives a programmer access to the inner loop that performs the intersecting. Lucene has a hook that, regardless of what intersection algorithm it uses (zip, gallop, bisect, skip lists, and so on, look for “integer intersection algorithms”), it calls your code as soon as it determines a document is in the set defined by your query.

The hook consists of an interface called Collector, which has a method collect().  When you pass an object that implements this interface, Lucene calls collect() for each result.  This method must be small and fast, since Lucene will call it many, many times.
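
To make this concrete, here is a minimal hit-counting collector against the Lucene 4.x Collector API (a sketch for illustration only; the class is not part of Lucene or of the code discussed here):

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class CountingCollector extends Collector {
    private int count = 0;

    public void setScorer(Scorer scorer) throws IOException {
        // Scores are not used here, so the scorer is ignored.
    }

    public void setNextReader(AtomicReaderContext context) throws IOException {
        // A new segment starts; a real collector would remember this context.
    }

    public void collect(int doc) throws IOException {
        this.count++;  // called once per matching document, so keep it cheap
    }

    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    public int getCount() {
        return this.count;
    }
}

You would pass an instance of it to searcher.search(query, collector) and read getCount() afterwards.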

The problem is that the search methods that accept a Collector do not use threads and leave all but one of your CPUs idle.  There is a good reason for that.

The problem with Collectors

The Collector interface is inherently thread-unsafe.  The reason lies in the way the setNextReader() method on Collector works. Internally, Lucene maintains an index composed of tens of segments, each containing a subset of the complete index.  During a search, Lucene tells the collector that a new segment starts and that the following calls to collect() are in the context of this segment.  All but the most trivial Collectors need to remember this context, and this is why a Collector is unsuitable for multi-threading: what if two threads call setNextReader() with different arguments?

Using Lucene’s internal segmenting to divide the work is quite natural. Indeed, the high-level API divides the work based on segments.  What if custom Collectors could also use this?

SuperCollector and SubCollector

We designed two new interfaces that are much like the Collector, but just different enough to support multi-threaded search for any operation.  The first is SuperCollector.  It basically replaces Collector in the call to search().  So instead of (pseudocode):

searcher.search(query, filter, collector)

we use:

searcher.search(query, filter, supercollector)

That’s all.

The SuperCollector creates subCollectors that are thread-safe:

subcollector = supercollector.subCollector(...)

The new search() method calls subCollector() for as many threads as it wants to create. SubCollector is almost identical to Collector.  In fact, it is a subclass of Collector: it supports the setNextReader() and setScorer() methods and of course collect(), and it adds only one more method:

subcollector.complete()

The thread executing the search calls complete() after it is finished. Some collectors do most of their work during collect() calls, but others just collect data for post-processing later.  Such collectors can put time-consuming post-processing in the complete() method so that it executes in parallel as well.
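
In outline, the two interfaces look roughly like this (a sketch based on the description above; the actual classes live in the Meresco Lucene repository and differ in detail):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.Collector;

abstract class SuperCollector<T extends SubCollector> {
    protected final List<T> subs = new ArrayList<T>();

    // Called by search() once for every thread it wants to create.
    public T subCollector() {
        T sub = this.createSubCollector();
        this.subs.add(sub);
        return sub;
    }

    // Each subclass creates its own kind of SubCollector.
    protected abstract T createSubCollector();
}

abstract class SubCollector extends Collector {
    // Called by the searching thread when it has finished its share of the work;
    // time-consuming post-processing can be done here, in parallel.
    public abstract void complete() throws IOException;
}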

Encapsulating existing code

The nasty thing about the Collector API is that once you start using it, the simple parts of your query, which used to be parallelized automatically by Lucene, become single-threaded as well. Therefore we also need to create Super/SubCollectors for simple queries.

Simple queries are queries with results sorted on score or on a field value. The parallelization is spread across the Lucene code. We found it and encapsulated it as follows:

  1. Some code in the search method at lines 450-456 and lines 535-547 creates the threads.  We replace this code with one generic piece in search() (see the sketch after this list).
  2. Code at line 535 creates a TopFieldCollector for sorting on a field, while the case of sorting on score is delegated to SearcherCallableNoSort, which delegates further until a TopScoreDocCollector is finally created at line 490.  We encapsulate this in specific subCollector() methods.
  3. Code at lines 460-471 (sort on score) and lines 550-559 (sort on field) merges the results.  We encapsulate this in SuperCollector.
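
In outline, such a generic search() could look like this (a sketch under the Lucene 4.x API, not the actual Meresco code; the executor, the reader and the per-segment searchSegment() helper are assumed to exist in the surrounding searcher class):

public <T extends SubCollector> void search(final Query query, final Filter filter,
        SuperCollector<T> superCollector) throws Exception {
    List<Future<?>> futures = new ArrayList<Future<?>>();
    // One task per segment; every task gets its own SubCollector.
    for (final AtomicReaderContext segment : this.reader.leaves()) {
        final T subCollector = superCollector.subCollector();
        futures.add(this.executor.submit(new Runnable() {
            public void run() {
                try {
                    // Search only this segment, then post-process in parallel.
                    searchSegment(segment, query, filter, subCollector);
                    subCollector.complete();
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        }));
    }
    for (Future<?> future : futures) {
        future.get();  // wait for all tasks and propagate any exception
    }
}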

Example: TopFieldSuperCollector

Here are the results of applying the encapsulation outlined above to Lucene’s default TopFieldCollector.  These are the important steps:

First, the TopFieldSuperCollector creates the proper SubCollector when search() asks for it:

TopFieldSubCollector createSubCollector() {
    return new TopFieldSubCollector(...);
}

The search then continues as normal.  Lucene calls collect() for each result found, but now this happens in a worker thread.  The TopFieldSubCollector simply delegates to a standard Lucene TopFieldCollector.

The search signals completion of the segment by calling complete(), which simply delegates again:

void complete() {
    this.topdocs = this.delegate.topDocs();
}

Finally, the merge happens when the application asks for the result by calling topDocs():

TopDocs topDocs(int start) {
    for (TopFieldSubCollector sub : this.subs) {
        // merge the results collected by each subcollector
    }
    return topDocs;
}
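
The merge loop need not be written entirely by hand: Lucene’s TopDocs.merge() can do the heavy lifting. A sketch (sort and numHits are assumed fields here, and the handling of start is simplified):

TopDocs topDocs(int start) {
    TopDocs[] perThread = new TopDocs[this.subs.size()];
    for (int i = 0; i < this.subs.size(); i++) {
        perThread[i] = this.subs.get(i).topdocs;  // prepared in complete()
    }
    // Merge the per-thread results into one ranked list; for sorting on
    // score, sort may be null.
    return TopDocs.merge(this.sort, start + this.numHits, perThread);
}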

What next?

The concept above has been implemented for simple search and shows a very good speed-up.  We now have to create Super- and SubCollectors for the other functions we use:

  1. for sorting by relevance: TopDocsCollector
  2. for calculating facets: FacetCollector
  3. for creating trees of collectors: MultiCollector
  4. for de-duplicating results: DeDupCollector
  5. for query-time join: KeyCollector and ScoreCollector

For each of them we are working out the design and then we will add them. I will post the results when we are done.

Meanwhile, you can follow the code here: Meresco Lucene.

Multi-threading and optimizing speed-up (Infographic)

Below is an infographic about how a multi-threaded application behaves in general, including text balloons with suggestions about where to look for improvements.

I created it because I wanted to explain to a colleague what I did to improve Lucene’s faceting performance using 16 CPU cores.

I hope it speaks for itself.

Infographic: Analyzing and Optimizing Multi-threading (and why you don’t see linear speed-up in most practical cases)

About scalability, efficiency, pairs and time

At Seecr we continuously scale our systems both up and out, but we also continuously improve their efficiency.  Here is why and how we do it.

Scalability versus Efficiency

Quite often, people think that scalability is everything. But scaling an inefficient system, if at all possible, is going to be expensive and might even stop you completely. It certainly looks nice when Amazon adds 100 extra machines to your system in an instant, but it might just as well be a gross waste of resources. And as long as money is not a problem, the problems related to inefficient systems can remain hidden for a long time.

Why is Inefficiency a Problem

We are now living in an era in which more and more people need no explanation as to why inefficiency is a bad thing. Much to my delight, the mere idea of wasting something is increasingly sufficient to make people act. So I won’t go in that direction. There is another problem, however.

Inefficiency is caused by something. And when the time comes that you do want to improve it, you need to address the causes. And then it might turn out that you are a little too late….

Here are two significant causes we have observed frequently:

  1. Programming negligence
  2. Wrong algorithm

1. Programming negligence.

Programming is quite a difficult task and each problem has different aspects that need careful attention. There are the matters of primary design, testability, consequences for other features, refactoring, selection of libraries, code style, readability and intention, integration with other parts, packaging, upgrading and data conversion and on goes the list, endlessly. That’s the nature of software.

Efficiency is definitely somewhere on that list of aspects. It is all too natural that, once code functionally works, everyone is glad and moves on to the next task. But at that time, some aspects might not have received the attention they require, and efficiency is often among them.  If this goes on for some time, you’ll end up with many places needing small fixes.

But you can’t fix it later.  The reason for that is as profound as it is simple: there are just too many small things, each of which contributes only a little to the problem. It is the power of the many. Addressing each of these problems requires you to take time to delve into them again, with little reward: each single problem solved improves efficiency only a little. You’ll have to work through them all to see results.

If only you had made different choices in the first place, when you were initially writing it…

2. Wrong algorithm. 

A problem can often be solved with very different solutions. Naturally, you first pick a solution based on your experience and understanding of the problem, and then work it out. Often it becomes clear during the process that another solution is a better fit. This is a direct result of an increased understanding of the problem while working on it. A deep understanding of the dynamics that arise when the problem and the solution interact might also arrive later, for example when you run tests with fully populated data sets and encounter unforeseen usage patterns that never appear in testing environments. It then turns out that you need a (completely) different algorithm to solve the problem.  Back to the drawing board. That’s the nature of software too.

Dead in your tracks

Both problems, many small inefficiencies and a wrong algorithm, are not just a pair of sub-optimalities in your system.  Both can place the required throughput and responsiveness beyond your capabilities and budget, because both require a complete rethinking of your system: going over all the details again and improving them, or going over the main design again and changing it. This costs a lot of time and, most importantly, it takes the time of the most specialized people.  If only you could have made other decisions when the system was first created…

What are solutions?

Let me get this straight first: getting better programmers, smarter architects or more elaborate processes with a lot of quality assurance does not solve the problem. While some people do learn faster or can keep more things in their minds, and while some elaborate processes do catch more problems, they only ameliorate the situation marginally.  They do not address the limitations fundamentally.

So what is the fundamental problem then?

The fundamental problem is that:

  1. people are given too many things to worry about at the same time.
  2. people are given too little time to learn and understand well.

In short, they suffer from continuous partial attention. Now the solution becomes almost evident: use more people and give them more time.

In Shock?

You would probably say: “but I can’t spend more people and more time, you know that budgets are tight and your solution is not realistic.”  Well, if you really think that, stop here.

If you think you are in control and you can change things for the better: continue.

First: Pair Programming

The most profound impact on software quality (and that is what we are talking about) I have ever seen comes from working in pairs.  Pairs have double the capacity to deal with all the things-to-worry-about.  But that’s only the beginning.  The dynamics at play within a pair are really what pays off.  Each person has his or her own set of aspects of interest, and stimulates the other’s thinking.

Pairs also have the advantage that they mix easily.  Depending on the task at hand, on personal preferences, even on tiredness, one person might switch with another.  The new combination will pay attention to other aspects with a fresh mind.  Knowledge exchange also happens automatically.  Purposely shuffling pairs is a strong tool in the hands of a team: it allows you to increase the number of aspects that get the attention they deserve.

Second: Time

But having more brains on the task is only half the solution.  The other half is giving those brains enough time to learn and understand the problem.  So if one pair is functionally done, prepare to let a third person replace one of them.  There will be points left over for him or her to think about. While the first pair focused on getting it functionally right, the second pair looks at other aspects, such as efficiency.  In fact, it is a simple application of Make-It-Work-Make-It-Right.

Conclusion

It all comes down to the careful allocation of people and time. Look at their skills and interests, and allocate enough time. I can’t stress this enough: proper allocation of time means saving time. Full stop.  When you rush, you are going to regret it later, and someone is going to pay for it.  It is therefore an act of will to allocate time at the beginning, right when you are solving the problem.

The only way to build scalable systems is by first making efficient systems. And for that you need to allocate enough time and people before you scale up.

 

Sensing and Acting (about systems failure)

This is a story about the failure of a large computer system the day before it went live.  I am telling this story to give others the opportunity to learn from it.  It neither names nor blames people or organizations. In fact, all people involved are highly skilled professionals who learn every day from events like these. My duty is to share.

Response Times

Since the system in question is not an end-user system but an API used to build mash-ups that serve end-users, performance is very important.  That is because you have twice the internet latency: first between the user and the mash-up, then between the mash-up and the API.

Gut feeling

As we were nearing the end of the project, performance tests became a more regular item on our backlog. Both the overall tests and our component tests showed figures within the desired range.  We aimed for 80% within 400 ms and 99% within 5 s, as those were the official targets.  Although we were well within range, I felt that the system could do better: the 95th percentile was about 2 s, while I felt it could be close to 500 ms. It was just a gut feeling, based on previous experience with similar systems, and I could not figure out whether there was a reason why this time we did not achieve the superb response times we were used to.

Never one problem

But, as said, it was well within range. People even complimented me personally and, most importantly, there were other things on my mind: I couldn’t log in to the VPN anymore.  This proved to be a serious problem, as all machines making up the system were only reachable over the VPN. This needed fixing first.

The VPN connection wasn’t really stable; apart from the first few days after it was delivered, it never had been. Connecting was often slow, and it requested my password twice although I was sure I had typed it correctly the first time.  Logging in via SSH was also very slow, and public key authentication worked only intermittently. But now it didn’t connect at all.  But heck, other people still had access, so let’s use their accounts.  After all, they were the ones who needed it most, and they were OK.

After a week I was able to connect again. It was still slow but it worked. So let’s forget this misery and go on. There is a system that must go live!

Almost live

Friday night all machines in the system ran the latest version and all databases were properly populated. All looked fine for the release the next Monday.  Only the DNS entries had to be switched and the system would be live.

The gates open

Saturday morning I got a text message from the project manager.  The system was working, but horribly slowly.  Requests took seconds, tens of seconds, up to a minute. Consistently. Adrenaline! I ran to my computer, called my colleague and started investigating.  Within minutes we knew that something was badly wrong and we called all hands to battle stations.  Software developers, system maintainers, the customer’s support center: about ten people from five organizations were on it.

We went deep, deep down into the system’s guts and after three hours we found a problem.  The internal DNS system was responding slowly; too slowly.  Requests took one second to complete, on average, and this caused the socket library to block while waiting for an answer every time it wanted to communicate with another system. This, it turned out, slowed down almost everything.

The software engineers knew what to do immediately: at the places where DNS was time-critical, replace all host names with IP addresses.  And that is what we did.  While we worked, the VPN connection kept failing, but we fixed the problem.  It was eight o’clock in the evening.  I had a drink and went to sleep.

Hell breaks loose

The next morning, on a beautiful Sunday, the project manager sent me another text message: the system was spitting out incomplete data. Again I launched myself towards my laptop and gazed at our monitoring system: nine of the 13 services were gone. Dead. Red flags all over.  And not a single message on my phone. Oh dear, oh dear.

My first reflex was to try to log in to one of the systems that hosted one or more of these dead services.  But the VPN did not connect. Trying… connecting… no luck.

Again I called all hands on deck.  But this time I couldn’t do anything else; I had to wait for the VPN access to be fixed first. After three hours the problem was found and fixed: the LDAP service had failed, and hence authentication could not proceed properly. Restarting the LDAP service brought everything online again.

Saved

Everything?  Yes, everything.  The nine dead services were also running happily.  There was no need for me to log in or even connect to the VPN.  Even the response times were better: the previously unsatisfying 95th percentile was now down to 350 ms. Much more like what I expected.

What happened?

LDAP can back a DNS service, and so it did here. For a still unknown reason it became slower and slower until it eventually collapsed completely.  Since DNS is a truly fundamental system component, its malfunctioning affects almost all other services.  The VPN went down.  Nine of our services went down.  The only four remaining were the ones into which we had poked IP addresses the day before.  Did you wonder why we did not get notifications before the project manager texted me?  Because the e-mail and messaging services were also down. There is almost nothing that works without DNS.

What do I take away from this story?  First of all, there is the obvious single point of failure: the DNS service.  While it is not within my power to fix that, we did fix the dependency on it.  However, this has its limitations: using IP addresses instead of host names limits load balancing and name-based routing. So in the end, every system needs a reliable DNS.  But this is not the most powerful lesson to take away.

What did I learn?

The most fundamental and most potent lesson I have learned, or in fact re-learned, is: if you see something strange, investigate! It is clear from this story that there were early warnings that could have prevented disaster. First, the VPN connections were slow and failed often. Second, we felt that the performance could be better. We ignored both signals for a considerable time.  Both signals clearly hinted at the same problem.

The most powerful lesson

Now, this is not a new lesson.  In fact, almost everyone with some experience in systems maintenance already knows it. So what I am actually learning is how difficult it can be to apply what you know to be good.  Circumstances, context, people dynamics, pressure and even personal well-being and fitness are strong forces that influence people’s ability to sense the often small signals and act on them. You can’t ignore these forces by stating that people must act ‘professionally’.  Both sensing and acting are crucial human activities that only happen when the circumstances are right. No matter how obvious and well-known a rule is, applying it consistently is a whole different ball game.

A Faster Join for Solr/Lucene

Reducing Index Maintenance Costs with Join.

The previous post introduced the reasons why we want blazingly fast join functionality in Lucene: reduction of index maintenance costs.

This post details how we improved the speed of Lucene’s Query-Time Join a thousandfold.  We achieved this by looking at usage patterns instead of at the technology. Lucene’s default Join is a truly Lucene way of performing joins between arbitrary terms in arbitrary fields in two indexes. Our Join more or less turns a Lucene index into a relational table and provides join functionality as you would expect from a relational database.  That might sound restrictive to Lucene adepts, but it offers unprecedented possibilities.

Keys, keys, keys

Lucene joins indexes based on terms. Our first observation is that these terms in fact play the role of keys; in database lingo you would call them primary or foreign keys. Nowadays, most people use either UUIDs or URIs to identify things, but these are hard to deal with: they occupy a lot of space, are expensive to compare and do not compress well.  Data management software internally always translates long, text-based identifiers into small integers. (As a matter of fact, Lucene also uses them, calling them docid and ord.) Our Join implementation is based on such small, monotonically increasing numbers, and we call them keys.

Multiple Values

The second observation is that Lucene supports joining on fields that contain multiple values for one document.  But key cells in databases always contain exactly one value for each row.  If you need multiple values, you create a separate table containing multiple entries with the same (foreign) key and then… join it.  Since join is a database concept after all, we might as well be consistent and use single-valued fields exclusively: if we need multiple values, we create a separate index and then… join it!  So in our Join implementation, every document in a Lucene index gets one and only one (though not necessarily unique) key.  This does not restrict you, as long as you are prepared to adapt your data model.  Read on.

Performance

The performance gain of using small integers over strings is tremendous.  Small integers use significantly less memory and can be fetched, stored and compared with a single machine instruction. Having only single values (one key) per field per document means they can be organized efficiently in one-dimensional arrays or bitsets.  Much research in the domain of Information Retrieval deals with fast storage and intersection of such arrays and bitsets, and much of that research is available in Lucene! It works marvelously with our keys.

Translate Identifiers to Single Valued Keys

So how do we go about translating string identifiers to numeric keys?  The implementation is downright easy in Lucene if you use the new taxonomy functionality.  The taxonomy is a proper dictionary, mapping every term onto a small number: the key.  Lucene can store this key very efficiently using the relatively new NumericDocValuesField feature. During indexing, we use it like this to store a key in the field key (pseudo-code):

TaxonomyWriter keyDict = new DirectoryTaxonomyWriter(...);
Document doc = new Document();
int key = keyDict.addCategory(new CategoryPath(identifier));  // identifier: e.g. the document's URI
doc.add(new NumericDocValuesField("key", key));
indexWriter_A.addDocument(doc);

It is essential that one uses the same TaxonomyWriter for every index so that identifiers in all indexes get mapped onto the same keys.
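
For example, when the related documents go into a second index, reusing the same keyDict guarantees that the same identifier yields the same key again (a sketch continuing the pseudo-code above; indexWriter_B and the field name otherKey are illustrative):

Document other = new Document();
int sameKey = keyDict.addCategory(new CategoryPath("<uri_1>"));  // same identifier, so the same ordinal as in index A
other.add(new NumericDocValuesField("otherKey", sameKey));
indexWriter_B.addDocument(other);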

(A completely different way of creating keys is using another database’s key generation mechanism.  Virtuoso for example exposes its IRI_TO_ID and ID_TO_IRI functions.  Using these to obtain keys gives the opportunity to Join between Lucene and Virtuoso.  Expect a blog about that!)

Create Single Valued Keys

Having single-valued keys may require changes to the data model. In our case we had the denormalized form in one index:

id:<uri_1>  title:"The Hobbit"  location:<loc_1>,<loc_2>, ...

We split this into two indexes:

id:<uri_1>  title:"The Hobbit"

and:

id:<uri_1>  location:<loc_1>
id:<uri_1>  location:<loc_2>

Querying

Now we have two or more indexes in which the terms we wish to use for joining are replaced by keys.  We can join these indexes at query time by specifying which fields to use for joining. To keep things simple, we assume that one index A simply filters the results from another index B.  In practice things are (much) more complicated; too complicated for a blog.

Expressing Join

For join to really work well, we need a query language that supports it. The query language built into Lucene offers no support for this, but luckily we can do without: there is another language that does support joins, and that is the language of Java classes centered around Query, Filter and Collector.

Collecting Terms… eh Keys

The first step is to collect the keys for the terms we want to match between indexes. This step is straightforward: if you leave out caching and all the devious Java markup, it is just two lines of useful code.  We created a Collector that, for each hit, looks up the key and stores it in a bitset.  It needs to know the name of the field containing the key. Use it as follows:

KeyCollector keyCollector = new KeyCollector("key");
indexSearcher_A.search(<query_a>, keyCollector);
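
Stripped of caching, such a collector might look roughly like this (a sketch against the Lucene 4.x API; the real KeyCollector in the Meresco Lucene repository differs in detail):

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.OpenBitSet;

public class KeyCollector extends Collector {
    private final String keyFieldName;
    protected final OpenBitSet keys = new OpenBitSet();
    private NumericDocValues keyValues;

    public KeyCollector(String keyFieldName) {
        this.keyFieldName = keyFieldName;
    }

    public void setNextReader(AtomicReaderContext context) throws IOException {
        // Per segment: direct access to the numeric keys stored at indexing time.
        this.keyValues = context.reader().getNumericDocValues(this.keyFieldName);
    }

    public void collect(int doc) throws IOException {
        if (this.keyValues != null) {
            this.keys.set(this.keyValues.get(doc));  // remember the key of this hit
        }
    }

    public void setScorer(Scorer scorer) throws IOException {
        // Scores are not needed for collecting keys.
    }

    public boolean acceptsDocsOutOfOrder() {
        return true;
    }
}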

Filtering Terms… eh Keys

The second step is to ask the KeyCollector for a KeyFilter to be used on another index. The filter is also not too complicated if you leave out caching and Java markup. It needs to know the name of the field containing the key.  Use it as follows:

KeyFilter keyFilter = keyCollector.getFilter("otherKey");
TopDocs results = indexSearcher_B.search(<query_b>, keyFilter);
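
The filter, again with caching left out, might look roughly like this (a sketch; deleted documents and acceptDocs are ignored for brevity, and the real code differs):

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.OpenBitSet;

public class KeyFilter extends Filter {
    private final OpenBitSet keys;
    private final String keyFieldName;

    public KeyFilter(OpenBitSet keys, String keyFieldName) {
        this.keys = keys;  // the keys collected on the other index
        this.keyFieldName = keyFieldName;
    }

    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
        NumericDocValues keyValues = context.reader().getNumericDocValues(this.keyFieldName);
        OpenBitSet docs = new OpenBitSet(context.reader().maxDoc());
        if (keyValues != null) {
            for (int docId = 0; docId < context.reader().maxDoc(); docId++) {
                if (this.keys.get(keyValues.get(docId))) {
                    docs.set(docId);  // keep documents whose key was collected
                }
            }
        }
        return docs;  // OpenBitSet is itself a DocIdSet
    }
}

The KeyCollector's getFilter("otherKey") then simply returns new KeyFilter(this.keys, "otherKey").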

Done. Now the results in index B are filtered based on a query on index A.  And it is fast, blazingly fast, thanks to Lucene’s excellent DocValues and Collector APIs.

But it needs to be faster, even more

Although this duo of KeyCollector and KeyFilter improves the speed of Lucene’s built-in Join by a factor of 50 or so, it is not fast enough.  It would get the raw processing time for our 8-second query down to 160 ms, but that is still too much. You will need to add all kinds of post-processing to this 160 ms and then you will end up way too close to 1 s. Any raw processing time not substantially under 100 ms makes me personally very nervous; it would require many machines to deal with the loads we have.

Caching

With caching, we can get the processing time down by another factor of 20, leaving a mere 8 ms. That yields a total 1000-fold speed-up compared to where we started.

How that is achieved might be the subject of the next blog. But you might peek at Meresco’s Lucene code at GitHub.

Conclusion

By making some observations about the nature of joins, by making a firm decision to follow the relational interpretation of join, and by assuming that anything remains possible by adapting your data model, we managed to speed up joins 1000-fold.

The result is completely orthogonal to other Lucene functionality such as faceting, sorting, scoring, filtering, etc. All intermediate steps and results are in the hands of the programmer.

The results are about to go into production (March) on a Dutch national library search engine, joining 12,000,000+ titles with 25,000,000+ holdings. The contribution of join to the query response times is deep down on the list of bottlenecks: a few ms.

Reducing Index Maintenance Costs with Query-Time Join for Solr/Lucene

Index Maintenance: Cheaper and Better

Last year Seecr has been on a quest to make maintenance of large integrated search engines easier, cheaper and at the same time better.

We achieved that goal by reorganizing the indexing process and applying a technique called Late Integration.  We created an exciting extension for the Solr/Lucene search engine to make that possible.

In a series of blogs, we will present the problem, the solution and the code. This completes our previous series of blogs about the Open Index: Late Integration, Query Resolving and how to find relevant indexes.

Relations and Denormalization

Traditionally (that is, during the last decade), integrated search meant collecting data from various sources and integrating it in a central search engine. This approach works well but runs into problems when the data sets become larger and more diverse. And what finally kills this approach is the presence of 1:n relations between data sets.

Traditional databases and triple stores deal with relations explicitly and provide query languages (SQL and SPARQL) that incorporate the notion of relations. With Lucene, the predominant approach is to de-normalize the data in order to create one big index. This denormalization is what becomes the bottleneck when the index grows.

Denormalizing: The Problem

Suppose you have a corporate catalogue of product descriptions, and each local branch maintains a database linking to this central catalogue, adding information about the local stock, prices, promotions, reservations, et cetera.  Providing Solr/Lucene-based search functionality spanning both the central catalogue and the local databases would require both data sources to be denormalized.  Doing so runs into a few problems:

  1. The 1:n nature of the relationship yields an explosion of data. That’s the nature of denormalization. The most prominent example I know of is the travel agency that maintained one index with over 10⁹ (a billion) trips, resulting from heavily denormalizing destinations, hotels, dates, lengths and features like dog admittance and smoking rooms.
  2. Both sources have their own update cycle (life cycle). Some data might be constantly updated, for example room availability, while other data is rather stable, such as room descriptions. Feeding both into the same update cycle often limits the update rate of more volatile data.
  3. Both sources require their own specific expertise. The extraction of the data out of its original environment poses a problem: away from the experts that know it by heart and into the centrally managed domain of the generalists.

It would be so much nicer to have separate indexes: one for the corporate catalogue and one for the local stock.  Each update to the local stock would require just one simple update in the stock index, just as an item update in the catalogue requires only one update in the catalogue index. So a catalogue item update need not go through the denormalization process, avoiding the forced update of all local stock related to that item.

Having separate indexes makes it almost trivial to differentiate the calculation of fields (tokenization, ranking, normalization etc) per index. Also, the update rate can vary easily between indexes, simply because they are independent.

If we only could join the indexes during query time…

Relations in Lucene: Join

If we could bring the notion of relations to Lucene, and could efficiently query along those relations, we could avoid denormalization, making the indexing process easier while at the same time creating opportunities for more specialized indexing when needed.

What is Join?

The idea of joins stems from relational databases (RDBMS) and the structured query language (SQL).  The main idea is to normalize the data into separate tables, and to re-combine these tables during queries. Queries express relations between those tables and data is recombined based on the relations.  This is called joining.

The same idea is present in RDF triple stores. The SPARQL query language allows queries to span multiple relations between concepts. The actual join, however, is much more natural there: it happens implicitly because of the notion that everything is a relation, even individual properties.

Lucene Join

In recent years, Solr/Lucene has seen the addition of functionality for joining. This allows one to indicate a specific field in one index and a specific field in another index, and Lucene will then join documents from both indexes if and when they contain the same value in the indicated fields. But since it is not fast enough, we had to rethink and redesign it to make it work for us. We needed it to be as fast as any other query in Lucene. So we made it that way.

Speed?

In Solr/Lucene a special utility helps an application programmer create a query that, under the hood, collects terms in one index and filters on these terms in another. Using this utility (JoinUtil.createJoinQuery) we ran a typical query on our indexes: one containing 12,000,000+ book titles and one containing 25,000,000+ individual books held by libraries in the Netherlands.  The question is: for a certain selection, which books are present at a certain library or group of libraries?
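
A query of this shape is built with Lucene's join utility roughly as follows (a sketch: the field names, index handles and selection query are made up for illustration):

Query joinQuery = JoinUtil.createJoinQuery(
    "titleId",                                      // "from" field in the holdings index
    false,                                          // each holding refers to exactly one title
    "id",                                           // "to" field in the titles index
    new TermQuery(new Term("library", "NL-0100")),  // which holdings: a certain library
    holdingsSearcher,                               // searcher over the 25,000,000+ holdings
    ScoreMode.None);
// The resulting query is then run against the titles index:
TopDocs hits = titlesSearcher.search(joinQuery, 10);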

This query took 8679 ms to execute.

Regardless of hardware, this is too much. Way too much, as each and every query will come with such a filter.  We need it down to 8 ms. That is our target.

A 1000 times faster

The reason Lucene’s Join is not fast enough lies in the generic nature of its approach. It works on terms: strings of arbitrary length. It builds sets of terms and makes intersections or unions of them.  However, for computers, dealing with strings is expensive: they take a lot of memory, and comparison is slow, especially if almost every string is a URI with a long common prefix. Secondly, the intersections and unions are made by scanning the terms of an index, which is a slow process.

But Lucene is open source, is rather well thought-out, and has healthy discussions between experts.  So we started reading, and we improved the speed of joining by doing the following:

  1. Translate terms (strings) into numbers. This speeds up the creation of sets as well as taking their intersections.
  2. Closely follow Lucene’s segmenting architecture to exploit caching per segment at the lowest possible level.

We set out to do so and learned a lot about Lucene and joining algorithms!

Results

We were very surprised by the results: joining has become so fast, it is now a non-issue for us. This means we can use it whenever we decide that maintaining a separate index is beneficial, without worrying about the performance loss of queries.

Exactly how we did it will be the subject of a following, more technical, blog post.

In the next post, we will outline our approach, with real working code, and show how join can be combined orthogonally with other Lucene functionality. At the end, we will contribute this code to the marvellous Lucene-community.

How to scale up Meresco

Recently Kennisnet asked me how to scale up Edurep with regard to:
– queries per second
– record updates per second
– total number of records

I suspect that this is of broader interest, so below are two approaches for scaling CPUs, memory or bandwidth.

Queries per second
A single-machine Meresco system runs between 10 and 100 queries per second. Scaling this up requires adding more machines so the load can be distributed over CPUs and networks. There are two approaches.

Approach A
Replicate the entire server process and feed updates to them simultaneously.

Approach B
Extract the most demanding components from the server’s configuration and put these on separate machines. Reconnect them using the Inbox component.

(Diagram: the configuration before and after.)

Both approaches are based on standard Meresco functionality and therefore easily configured.

Record updates per second
Meresco is able to process 1 to 10 updates per second, concurrently with querying. Scaling this up requires adding machines that share the load of processing the records, using approach B. These machines can feed into one or more query-processing machines, effectively enabling scaling along both axes.

The main idea is to decompose a system into subsystems which can be distributed and replicated. This analysis must be done before a system can scale up using cloud-like environments. How Meresco’s configuration supports this will be outlined in a future blog.

Total number of records
Meresco can host 10 – 100 million records on one machine, mostly limited by what its indexes can do. Scaling up requires a closer look at these indexes to see how additional resources must be allocated. In this area Lucene, BerkeleyDB and OWLIM have earned great reputations. Meresco’s architecture helps to get the most out of these.

Meresco’s homegrown Facet Index and Sorted Dictionary Index (used for auto-complete) can be scaled following approach B. However, with a single-node limit of roughly one billion records most applications would not need more than one node.

Conclusion
I realize that I only scratched the surface of how to scale Meresco. There are many details to discuss and you probably wonder how your situation could be dealt with. I’d love to hear your responses!