About Erik J. Groeneveld

I love beautiful software.

More choice for those who read aloud!

Fathers and mothers often look for books that match their child's reading level. After the next release of the NBC+ Search Platform, they will be offered more suggestions.

Has the library bought more books for this audience, then? Perhaps that too, but something else has happened as well: we have put a practical application of Linked Data to work. This is how we did it.

In the world of children's books there is a standard for indicating the difficulty level of a book. Until now, the NBC only 'understood' a limited number of (lower) AVI levels.


Facet for selecting AVI levels

Now that the source data has been extended with higher AVI levels, that was reason for us to improve their findability on the Search Platform. We therefore extended the ontology description in RDF, so that the higher AVI levels from the source data are now mapped as well.

From code to URI: the world of Linked Data

Typical advantages of translating a code from a database into a URI that refers to the ontology (in RDF) are that (a small sketch follows the list):

  1. the different coding systems can now be told apart. There are many more kinds of codes than AVI, and this keeps them clearly separated.
  2. relations can be established between different codes (alignment), so that when someone else uses a slightly different system, with different codes but with some overlap, the two systems can be linked.
  3. the systematic ordering of both the codes and their meaning in an RDF file makes it possible to generate facets automatically, which help users while searching.
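
To make this concrete, here is a minimal sketch of such a code-to-URI translation in Java. The namespace and class name are invented purely for illustration; the actual NBC+ ontology uses its own URIs and is expressed in RDF rather than in code.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch: turning bare database codes into URIs, so that an AVI
// code can no longer collide with a code from another coding system.
// The namespace below is invented for illustration only.
public class AviLevelMapper {
    private static final String AVI_NS = "http://example.org/ontology/avi#";
    private static final Map<String, String> CODE_TO_URI = new HashMap<String, String>();
    static {
        CODE_TO_URI.put("AVI-M5", AVI_NS + "M5");
        CODE_TO_URI.put("AVI-E5", AVI_NS + "E5");
        CODE_TO_URI.put("AVI-PLUS", AVI_NS + "Plus");
    }

    // returns null for codes the mapping does not know (yet)
    public static String toUri(String code) {
        return CODE_TO_URI.get(code);
    }

    public static void main(String[] args) {
        // "M5" as an AVI level is now unambiguously different from an "M5"
        // that might occur in some other coding system.
        System.out.println(toUri("AVI-M5"));
    }
}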

Coinciding classification systems

When points 1, 2 and 3 are combined, a situation arises in which someone can search very specifically according to his or her favourite ontology (distinction). On the other hand, as a software developer you can also reuse parts of ontologies in a new ontology, without those parts getting in each other's way.

Another ontology, the codes for genres, is a story in itself. More about that another time.

 

NWO strategy and Seecr

Today NWO's multi-year strategy through 2018 is being presented to State Secretary Sander Dekker.  I received an invitation to attend, this afternoon in the Gemeentemuseum Den Haag.

Science and SMEs

Over the past years I have met several interesting people at NWO: people who, in a very positive way, were looking for the connection between science and industry.  Underlying this is the idea that the application of science largely takes place in companies.

Seecr's participation

There was also a welcome preference for SMEs.  As a small company, Seecr is of course very happy with that.  We learned and did a lot in the NWO Catchplus programme, which among other things led to a very fine standard: http://www.openannotation.org.  We now apply it in the newest aggregators of the Digitale Collectie.

In 2013 we took part in a research week with “ICT with Industry“.  Together with top scientists, we spent a week working on a case contributed by Seecr. This was extraordinarily inspiring!  Special thanks also to Rosemarie van der Veen-Oei for her enthusiastic matchmaking!

Perspectief

In 2014 STW asked me to serve as a committee member assessing programme proposals in the 2013/2014 Perspectief round on High Tech Systems and Materials.  This gave me a good picture of how such a process works and of how NWO and STW carry out their strategy. The drive to let public money contribute to that same public in the most concrete way possible is great to see!

That is why I am very curious about this afternoon, and about the coming three years.

Open Source Heritage Software: Meresco

In Schudden voor gebruik – Nationale strategie voor landelijke voorzieningen voor digitaal erfgoed ('Shake before use – national strategy for nationwide digital heritage facilities'; see the Openbare Review Nationale Strategie Digitaal Erfgoed) it is stated:

“Custom ICT work has meant a great deal for institutions and their customers in past years, but it now also leads to limitations in the development of a well-functioning information network. More attention must therefore now go to standardised and accessible basic facilities, so that simpler and cheaper custom applications can be built on top of them.”

80/20
Yes! In the heritage sector, 80% of the software can be shared. You do as much as possible together, without being limited in your freedom to build your own thing. The 20% of custom work on top of that is the interesting part: it is what sets you apart and what justifies spending the budget, because it delivers added value.

Available today
In that respect there is good news. In recent years Seecr has already developed a lot of shared software for publishing large amounts of data and making them accessible. This software is:

  • available under an Open Source licence;
  • copyrighted by heritage institutions;
  • forged into a coherent set of reusable components;
  • published as source on GitHub;
  • installable as binary packages on Debian or RedHat;
  • proven technology: production-ready and deployable at scale.

Proven technology
Today, large national networks run on this software, in the domains of education, heritage, and public and academic libraries. They contain a great deal of combined data and serve a large audience of institutional users and end users every day.

source: Elsbeth Kwant
In the image above by Elsbeth Kwant, the orange colour in the trunk and roots indicates where this software is present. This data is already used and exchanged among the parties every day.

Improving the components
You do not think up reusable components; they grow. They arise from reuse: the first person builds something, the next adapts it, and by the third you can start to speak of a reusable component. When that goes on for several years, an enormous collection of increasingly generic and ever better components emerges. This is exactly what has happened to this software over the past years, and what still happens daily.

Copyright and licences
An open source licence is nice, but it does not by itself give you control over what happens to the software; that control lies with the copyright owner. The copyright of this software is held by major players in the heritage sector, including Stichting Beeld en Geluid, the Koninklijke Bibliotheek, the Koninklijke Nederlandse Academie van Wetenschappen (KNAW) and Stichting Kennisnet.

Publication and packaging
In recent years, Seecr has done the publication and packaging free of charge. The binary packages can be installed in a few minutes. In no time you have your own network on which you can build your own application, without having to worry about the shared parts. A good example of this is Narcis, which is maintained by DANS.

Does this software already have a name? Yes! In 2007, together with the four parties that started this, we came up with the name Meresco. Under that name it has been published and maintained ever since, up to the present day: see github.com/seecr?query=meresco

What is possible with this software? In a next blog post I will go into that in more depth.

Multi-threaded expert search with Lucene

Lucene’s high-level search API parallelizes search automatically, achieving superb performance.

The low-level expert API does not do that.  Here is a solution to parallelize custom search. It is Open Source.

Introduction

The Open Source search engine Lucene has evolved into a very smart and efficient search tool.  It offers both high-level search interfaces and low-level (expert) interfaces. If you pass it an Executor, the high-level interface (API) automatically puts all the CPUs in a server to work on your queries as fast as possible.  The results are downright fantastic.  For quite large data sets and quite complicated queries, you will get results in milliseconds.
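
As a minimal illustration of that high-level path, passing an ExecutorService when constructing the IndexSearcher is all it takes. This sketch assumes the Lucene 4.x-era API the post appears to describe; the index path and pool size are placeholders.

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class ParallelSearchExample {
    public static void main(String[] args) throws Exception {
        // open an existing index; the path is a placeholder
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));

        // hand the searcher a thread pool: high-level search() calls will
        // now fan out over the index segments automatically
        ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        IndexSearcher searcher = new IndexSearcher(reader, executor);

        // ... run queries with searcher.search(query, n) as usual ...

        executor.shutdown();
        reader.close();
    }
}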

Concurrency and the expert interface

However, the low-level API does not put all your cores to work automatically.  That means that as soon as queries become more complicated and you have to use the low-level API, searches run slower.  That is not a problem as long as you limit the number of documents to, say, 10,000,000. Most functionality, like sorting and facets, still runs in a few hundred milliseconds. But when you have many more documents and must support many concurrent queries, the help of more CPUs would be very welcome.

Before I introduce our solution, a quick word about the so-called expert API and the problem with it.

How the low-level API works: Collector

The main work of any search engine consists of intersecting, intersecting and yet more intersecting. Lucene's low-level expert API gives a programmer access to the inner loop that performs the intersecting. Regardless of which intersection algorithm Lucene uses (zip, gallop, bisect, skip lists, and so on; look up “integer intersection algorithms”), there is a hook that calls your code as soon as it determines that a document is in the set defined by your query.

The hook consists of an interface called Collector, which has a method collect().  When you pass an object that supports this interface, Lucene will call collect() for each result.  This method must be very short, small and fast, since Lucene will call it many, many times.
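
For reference, a bare-bones Collector that merely counts hits could look like this. It assumes the Lucene 4.x-era API with setNextReader() and setScorer(); in that code base Collector is technically an abstract base class, so the example extends it.

import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Counts matching documents; every method stays as cheap as possible
// because Lucene calls collect() once per hit.
public class CountingCollector extends Collector {
    private int count = 0;
    private int docBase = 0;  // offset of the current segment

    @Override
    public void setScorer(Scorer scorer) throws IOException {
        // scores are not needed for counting
    }

    @Override
    public void collect(int doc) throws IOException {
        // 'doc' is relative to the current segment; the absolute
        // document id would be docBase + doc
        this.count++;
    }

    @Override
    public void setNextReader(AtomicReaderContext context) throws IOException {
        // a new segment starts; remember its context
        this.docBase = context.docBase;
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    public int getCount() {
        return this.count;
    }
}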

The problem is that the search methods that accept a Collector do not enable threads and leave all but one of your CPUs idle.  There is a good reason for that.

The problem with Collectors

The Collector interface is inherently thread-unsafe.  The reason for this is the way the setNextReader() method on Collector works. Internally, Lucene maintains an index composed of tens of segments, each containing a subset of the complete index.  During a search, Lucene tells the collector that a new segment starts and that the following calls to collect() are in the context of this segment.  All but the most trivial Collectors need to remember this context, and this is why the Collector is unsuitable for multi-threading.  What if two threads call setNextReader() with different arguments?

Using Lucene’s internal segmenting to divide the work is quite natural. Indeed, the high-level API divides the work based on segments.  What if custom Collectors could also use this?

SuperCollector and SubCollector

We designed two new interfaces that are much like the Collector, but just different enough to support multi-threaded search for any operation.  The first is SuperCollector.  It basically replaces Collector in the call to search().  So instead of (pseudocode):

searcher.search(query, filter, collector)

we use:

searcher.search(query, filter, supercollector)

That’s all.

The SuperCollector creates subCollectors that are thread-safe:

subcollector = supercollector.subCollector(...)

The new search() method calls the method subCollector() for as many threads as it wants to create. SubCollector is almost identical to Collector.  In fact, it is a subclass of Collector. It supports the setNextReader() and setScorer() methods and of course collect(), and it adds only one more method:

subcollector.complete()

The thread executing the search calls complete() after it is finished. Some collectors do most of their work during collect() calls, but others just collect data for post-processing later.  Such collectors can put time-consuming post-processing in the complete() method so that it executes in parallel as well.
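
The actual classes live in Meresco Lucene (linked at the end of this post). As a rough, simplified sketch of their shape, with illustrative code rather than the real implementation, they could look something like this:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.lucene.search.Collector;

// Sketch only: a SuperCollector hands out one SubCollector per search
// thread and keeps them so it can merge their partial results later.
abstract class SuperCollector<T extends SubCollector> {
    protected final List<T> subs = Collections.synchronizedList(new ArrayList<T>());

    // called by the new search() method once for every thread it creates
    public T subCollector() throws IOException {
        T sub = this.createSubCollector();
        this.subs.add(sub);
        return sub;
    }

    protected abstract T createSubCollector() throws IOException;
}

// A SubCollector is an ordinary Collector used by one thread only, plus a
// complete() hook for post-processing after that thread finishes collecting.
abstract class SubCollector extends Collector {
    public abstract void complete() throws IOException;
}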

Encapsulating existing code

The nasty thing about the Collector API is that once you start using it, the simple parts of your query, which used to be automatically parallelized by Lucene, will now be single-threaded as well. Therefore we also need to create Super/SubCollectors for simple queries.

Simple queries are queries with results sorted on score or on a field value. The parallelization is spread across the Lucene code. We found it and encapsulated it as follows:

  1. Some code in the search method at lines 450-456 and lines 535-547 creates the threads.  We replace this code with one generic piece in search().
  2. Code for creating the TopFieldCollector at line 535 for sorting on a field, while the case of sorting on score is delegated to SearcherCallableNoSort, which delegates further until a TopScoreDocCollector is finally created at line 490. We encapsulate this in specific subCollector() methods.
  3. Code for merging the results at lines 460-471 (sort on score) and at lines 550-559 (sort on field). We encapsulate this in SuperCollector.

Example: TopFieldSuperCollector

Here are the results of applying the encapsulation outlined above to the way Lucene's default TopFieldCollector works.  These are the important steps:

First, the TopFieldSuperCollector creates the proper SubCollector when search() asks for it:

TopFieldSubCollector createSubCollector() {
    return new TopFieldSubCollector(...);
}

The search continues as normal.  Lucene calls collect() for each result found, but now this happens in a worker thread.  The TopFieldSubCollector simply delegates to a standard Lucene TopFieldCollector.

The search signals completion of the segment by calling complete(), which simply delegates again:

void complete() {
    this.topdocs = this.delegate.topDocs();
}

Finally, the merge happens when the application asks for the result by calling topDocs():

TopDocs topDocs(int start) {
    TopDocs mergedTopDocs = null;
    for (TopFieldSubCollector sub : super.subs) {
        // merge sub.topdocs from each subcollector into mergedTopDocs
    }
    return mergedTopDocs;
}

What next?

The concept above has been implemented for simple search and shows a substantial speed-up.  We now have to create Super- and SubCollectors for the other functions we use:

  1. for sorting by relevance: TopDocsCollector
  2. for calculating facets: FacetCollector
  3. for creating trees of collectors: MultiCollector
  4. for de-duplicating results: DeDupCollector
  5. for query-time join: KeyCollector and ScoreCollector

For each of them we are working out the design and then we will add them. I will post the results when we are done.

Meanwhile, you can follow the code here: Meresco Lucene.

Multi-threading and optimizing speed-up (Infographic)

Below is an #infographic about how a multi-threaded application behaves in general, including text balloons with suggestions about where to look for improvements.

I created it because I wanted to explain to a colleague what I did to improve Lucene’s faceting performance using 16 CPU cores.

I hope it speaks for itself.


Infographic about Analyzing and Optimizing Multi-threading (and why you don’t see linear speed-up in most practical cases)
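
For readers who prefer a formula to a picture: the usual way to quantify why speed-up is not linear is Amdahl's law; this is my framing, not necessarily the exact content of the infographic. With a parallelizable fraction p of the work and N cores, the speed-up is

S(N) = \frac{1}{(1 - p) + \frac{p}{N}}

For example, with p = 0.95 and N = 16, S(16) = 1 / (0.05 + 0.95/16) ≈ 9.1, well short of 16. The serial fraction, plus the synchronization and contention overhead the text balloons point at, is what eats the difference.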

The NBC+ Search Platform

As many people wonder what the National Library Catalogue (NBC) actually is, I will try to explain it here from a technical perspective.

National Catalogue

The majority of Dutch public libraries use a central catalogue of publications and only register locally which items they hold. These registrations refer to the central catalogue and add only the limited information that is unique to that library.

But many libraries also offer extra publications not present in the national catalogue. For example music albums, newspapers, consumer test reports, event guides, special interest publications and so on. This is where the Search Platform comes in.

The Search Platform

The Search Platform makes all these publications from all these sources available through a unified Application Programming Interface (API). An API means: not for humans, but for computers.  So it is possible to use the vast amount of library-related data in any conceivable way to create end-user applications.

Here is a short list of what the Search Platform makes accessible:

  1. Concise and unified metadata descriptions of all publications.
  2. Detailed information about organizations (libraries, publishers, museums, etc.).
  3. A unified typology of all things inside the Platform: music, books, e-books, people, video, software, games, articles, and so on.
  4. Details from leading author thesauri, classifications, etc.
  5. Both unified and raw (meta)data for everything.

Here is a short list of what functionality the API has:

  1. Integrated topic search with autocomplete and term suggestions.
  2. Static and dynamic ranking.
  3. Object resolving.
  4. Structured queries.
  5. Harvesting.
  6. Icons and thumbnails.
  7. Get-It services for: loan, download, reserve, etc.

Semantic Data

The Search Platform works with Semantic Data.  Instead of touting all the hyped technical details of RDF and LOD, we simply list what it actually achieves for API users:

  1. Uniform data representation regardless of how you access it.
  2. Clear and unambiguous relations between objects.
  3. Open and detailed data directly linking to the source without information loss.
  4. Multi-structured: pick your favorites from many ontologies.
  5. Easy integration with other tools and techniques.

Innovation

The Search Platform features two key innovations:

  1. Late Integration. It keeps separate indexes and integrates results on the fly.  This allows for easier and more specific maintenance of the indexes while integration happens in milliseconds. This required a technical innovation. Read more about it in “Reducing Index Maintenance Costs…” and in the more technical post here.
  2. It crosses the chasm between statistical information retrieval and linked data by employing both technologies and combining them in a clever way in the API.  As for the why and how, please bear with me: the next post will be about this exact topic.

Status

The Search Platform is now in production. The Public Library of Amsterdam uses it for all its branches.  It combines the National Catalogue with, among others, the music collection of Muziekweb.nl and local events from Uitburo.nl.  Late Integration makes sure maintaining the indexes is very easy.

Other features (ready or under development) are:

  1. Dynamic static rank: a separate ranking query reweighs results according to static ranks maintained in a separate index.  At this moment such ranks include age, holdings, sources and types.
  2. Uploading and using more ontologies so that more content becomes navigable through them.
  3. Extensive availability services providing detailed information on how to get each object: where, how and under what conditions.

The last point especially is an interesting added value of the Search Platform.  No matter what one finds, one always wants to click through to see more. In the library and cultural heritage domain, that almost always involves more than just providing a link.  The platform uses a generalized as well as specialized implementation of the Availability Information working draft (DAIA).  A next blog post will offer more details on the architecture and application of DAIA.

 

About scalability, efficiency, pairs and time

At Seecr we continuously scale our systems both up and out, but we also keep improving their efficiency.  Here is why and how we do it.

Scalability versus Efficiency

Quite often, people think that scalability is everything. But scaling an inefficient system, if at all possible, is going to be expensive and might even stop you completely. It certainly looks nice when Amazon adds 100 extra machines to your system in an instant, but it might just as well be a gross waste of resources. And as long as money is not a problem, the problems related to inefficient systems can remain hidden for a long time.

Why is Inefficiency a Problem

We are now living in an era in which more and more people need no explanation as to why inefficiency is a bad thing. Much to my delight, the mere idea of wasting something is increasingly enough to make people act. So I will not go in that direction. There is another problem, however.

Inefficiency is caused by something. And when the time comes that you do want to improve it, you need to address the causes. And then it might turn out that you are a little too late….

Here are two significant causes we have observed frequently:

  1. Programming negligence
  2. Wrong algorithm

1. Programming negligence.

Programming is quite a difficult task and each problem has different aspects that need careful attention. There are the matters of primary design, testability, consequences for other features, refactoring, selection of libraries, code style, readability and intention, integration with other parts, packaging, upgrading and data conversion and on goes the list, endlessly. That’s the nature of software.

Efficiency is definitely somewhere on that list of aspects. It is all too natural that, once code functionally works, everyone is glad and moves on to the next task. But at that time, some aspects might not have received the attention they require, and efficiency is often among them.  If this goes on for some time, you’ll end up with many places needing small fixes.

But you can’t fix it later.  The reason for that is as profound as it is simple: there are just too many small things, each of which contributes only a little to the problem. It is the power of the many. Addressing each of these problems requires you to take time to delve into them again, with only a small reward: each problem solved improves efficiency only a little. You’ll have to work through them all to see results.

If only you would have made different choices in the first place, when you were initially writing it…

2. Wrong algorithm. 

A problem can often be solved with very different solutions. Naturally, you first pick a solution based on your experience and understanding of the problem, then work it out. Often it becomes clear during the process that another solution is a better fit. This is a direct result of an increased understanding of the problem while working on it. A deep understanding of the dynamics that arise when the problem and the solution interact might also arrive later, for example when the system runs with fully populated data sets and unforeseen usage patterns that never appear in testing environments. It may then turn out that you need a (completely) different algorithm to solve the problem.  Back to the drawing board. That’s the nature of software too.

Dead in your tracks

Both problems, the many small inefficiencies and the wrong algorithm, are not just a pair of sub-optimalities in your system.  Either one can place the required throughput and responsiveness beyond your capabilities and budget, because both require a complete rethink of your system: go over all the details again and improve them, or go over the main design again and change it. This costs a lot of time and, most importantly, it takes the time of the most specialized people.  If only you could have made other decisions when the system was first created…

What are solutions?

Let me get this straight first: getting better programmers, smarter architects or more elaborate processes including a lot of quality assurance does not solve the problem. While some people do learn faster or can have more things on their minds, and while some elaborate processes do catch more problems, they basically only ameliorate the situation marginally.  They do not address the limitations fundamentally.

So what is the fundamental problem then?

The fundamental problem is that:

  1. people are given too many things to worry about at the same time.
  2. people are given too little time to learn and understand well.

In short, they suffer from continuous partial attention. Now the solution becomes almost evident: use more people and give them more time.

In Shock?

You would probably say: “but I can’t spend more people and more time, you know that budgets are tight and your solution is not realistic.”  Well, if you really think that, stop here.

If you think you are in control and you can change things for the better: continue.

First: Pair Programming

The most profound impact on software quality (and that is what we are talking about) I have ever seen comes from working in pairs.  Pairs have double the capacity to deal with all the things-to-worry-about.  But that’s only the beginning.  The dynamics at play within a pair are really what pays off.  Each partner has his or her own set of aspects of interest, and stimulates the other’s thinking process.

Pairs have the advantage that they mix easily.  Depending on the task at hand, on personal preferences, even on tiredness, one person might switch with another.  This new combination will pay attention to other aspects with a fresh mind.  Also, knowledge exchange happens automatically.  Purposely shuffling pairs is a strong tool in the hands of a team.  It allows you to increase the number of aspects that receive attention.

Second: Time

But having more brains on the task is only half the solution.  The other half is giving these brains enough time to learn and understand the problem.  So when one pair is functionally done, prepare to let a third person replace one of them.  There will be points left over for him or her to think about. While the first pair focused on getting it functionally right, the second pair looks at other aspects such as efficiency.  In fact, it is a simple application of Make-It-Work-Make-It-Right.

Conclusion

It all comes down to the careful allocation of people and time. Look at their skills and interests, and allocate enough time. I can’t stress this enough: proper allocation of time means saving time. Full stop.  When you rush, you are going to regret it later, and someone is going to pay for it.  It is therefore an act of will to allocate time at the beginning, right when you are solving the problem.

The only way to build scalable systems is by first making efficient systems. And for that you need to allocate enough time and people before you scale up.