Management of the NBC+ Search Platform

As of today, Seecr’s Continubeheer handles the full management, from application to hardware, of the Search Platform of the Nationale Bibliotheekcatalogus (NBC+) of Bibliotheek.nl.

The NBC+ Search Platform consists of nine services, four of which run as multiple instances for fail-over and sufficient capacity. The services are delivered to all public libraries in the Netherlands and comprise:

  1. Gathering data, then combining, cleaning, enriching and keeping it up to date daily.
  2. Access to the catalogue and other sources (18,000,000 titles) through several very fast search functions.
  3. Searching the holdings of a library's own and other libraries and branches (39,000,000 copies).
  4. Building and maintaining a knowledge base that records relations between titles.

Process improvement

Agile working is increasingly the norm, and operations processes have to keep up with it. What good is very fast delivery of new features if it takes months to put them live? Continubeheer is Agile, but for operations: switching quickly, without fuss. That takes a well-oiled operations organisation that extends to its suppliers. Only close cooperation based on trust leads to fast action without errors. It also saves costs for all parties involved.

Together with other large-scale national services that are hosted under Seecr's Continubeheer, such as the Nationale Aggregator of the Digitale Collectie and Kennisnet's Educatieve Contentketen (Edurep), this shows that we are able to process and serve large volumes of data efficiently every day. We are proud of that!

The NBC+ Search Platform from a technical perspective

What exactly is the Nationale Bibliotheek Catalogus (NBC)? This article describes, from a technical perspective, what has been developed for it so far.

National catalogue

The majority of Dutch public libraries use a central catalogue of publications and register only their local stock. These registrations refer to the central catalogue and add only the limited information that is unique to the library in question.

Many libraries also hold extra publications that are not included in the national catalogue, for example music albums, newspapers and consumer test reports. The Search Platform changes this.

The Search Platform

The Search Platform makes the publications from all these sources available through an Application Programming Interface (API). This makes it possible to use the vast amount of library-related data in any conceivable way and to build end-user applications.

The Search Platform makes the following accessible:

  • Concise and uniform metadata descriptions of all publications.
  • Detailed information about organisations (libraries, publishers, museums, etc.).
  • A uniform typology of all items within the Platform: music, books, e-books, people, video, software, games, articles, etc.
  • Details from leading author thesauri, classifications, etc.
  • Both the uniform data and the source (meta)data.

The API offers the following functionality:

  • Integrated search with autocomplete and search suggestions.
  • Static and dynamic ranking.
  • Object resolving.
  • Structured queries.
  • Harvesting of data.
  • Icons and thumbnails.
  • Get-It services for borrowing, downloading, reserving, and so on.

Semantic data

The Search Platform works with semantic data. Rather than going deep into all the technical details of RDF and LOD, here is a list of what it actually achieves for API users:

  • Uniform data representation regardless of how it is accessed.
  • Clear and unambiguous relations between objects.
  • Open and detailed data that links directly to the source without loss of information.
  • Multi-structured: pick your favourites from many ontologies.
  • Easy integration with other tools and techniques.

Innovation

The Search Platform features two key innovations:

  1. “Late Integration”. With this method, several indexes are maintained separately and the search results are integrated at delivery time. Index maintenance becomes faster and more specific, while the integration happens in milliseconds. This required a technical innovation. You can read more about it in the article “Reducing Index Maintenance Costs”.
  2. It bridges the gap between statistical “information retrieval” and “linked data” by combining these technologies in a clever way in the API.

Status

The Search Platform is now in use at the Public Library of Amsterdam. The national catalogue is combined with, among others, the music collection of Muziekweb.nl and local events from Uitburo.nl. Thanks to Late Integration, the index is easy to maintain.

Other features (ready or still under development) are:

  • Static and dynamic ranking: for every search, a separate ranking query is run that reweighs the search results using static ranking data such as age, holdings, sources and types. The static ranking data is kept in a separate index.
  • Uploading ontologies makes it possible to navigate the data in a different way.
  • Extensive availability services provide detailed information on how, where and under what conditions an object can be obtained.

The last point in particular is an interesting added value of the Search Platform. In the library and cultural heritage sector, offering just a link is too limited. Users often want more information, such as availability.

The platform uses both a general and a specialised implementation of DAIA (Document Availability Information API). A following article will go deeper into the architecture and application of DAIA.

The NBC+ Search Platform

Many people wonder what the National Library Catalogue (NBC) actually is, so I will try to explain it here from a technical perspective.

National Catalogue

The majority of the Dutch public libraries use a central catalogue of publications and only register what they have in stock locally. These registrations refer to the central catalogue and only add limited information which is unique for that library.

But many libraries also offer extra publications that are not present in the national catalogue, for example music albums, newspapers, consumer test reports, event guides and special interest publications. This is where the Search Platform comes in.

The Search Platform

The Search Platform makes all these publications from all these sources available through a unified Application Programming Interface (API). An API is an interface meant for computers rather than for humans, so the vast amount of library-related data can be used in any conceivable way to create end-user applications.

Here is a short list of what the Search Platform makes accessible:

  1. Concise and unified metadata description of all publications.
  2. Detailed information about organizations (libraries, publishers, museums, etc.).
  3. Unified typology of all things inside the Platform: music, books, e-books, people, video, software, games, articles, and so on.
  4. Details from leading author thesauri, classifications, etc.
  5. Both unified and raw (meta)data of everything.

Here is a short list of what functionality the API has:

  1. Integrated topic search with autocomplete and term suggestions.
  2. Static and dynamic ranking.
  3. Object resolving.
  4. Structured queries.
  5. Harvesting.
  6. Icons and thumbnails.
  7. Get-It services for: loan, download, reserve, etc.

Semantic Data

The Search Platform works with Semantic Data.  Instead of boasting about all the hyped technical details of RDF and LOD, we just list what it actually achieves for API users:

  1. Uniform data representation regardless of how you access it.
  2. Clear and unambiguous relations between objects.
  3. Open and detailed data directly linking to the source without information loss.
  4. Multi-structured: pick your favorites from many ontologies.
  5. Easy integration with other tools and techniques.

Innovation

The Search Platform features two key innovations:

  1. Late Integration. It keeps separate indexes and integrates results on the fly.  This allows for easier and more specific maintenance of the indexes while integration happens in milliseconds. This required a technical innovation. Read more about it in “Reducing Index Maintenance Costs…” and in the more technical post here.
  2. It crosses the chasm between statistical information retrieval and linked data by employing both technologies and combining them in a clever way in the API.  As for the why and the how, please bear with me: the next post will be about exactly this topic.
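
To make the idea of Late Integration a little more tangible, here is a minimal sketch in Python. It is not the platform's implementation: the two dictionaries are hypothetical in-memory stand-ins for separately maintained indexes, the keys, titles and function names are invented for illustration, and joining by intersecting key sets is an assumption about how query-time integration could look.

```python
# Hypothetical stand-ins for two separately maintained indexes.
# All names and data here are illustrative assumptions, not the NBC+ API.

# Index 1: catalogue descriptions, keyed by title id.
catalogue_index = {
    "title:1": "dutch cooking basics",
    "title:2": "advanced dutch grammar",
    "title:3": "cooking for students",
}

# Index 2: holdings, mapping a library to the title ids it has in stock.
holdings_index = {
    "library:a": {"title:1", "title:3"},
    "library:b": {"title:2", "title:3"},
}


def search_catalogue(term):
    """Return the ids of all titles whose description contains the term."""
    return {key for key, text in catalogue_index.items() if term in text}


def search_with_holdings(term, library):
    """Integrate both indexes at query time by intersecting their key sets."""
    matching_titles = search_catalogue(term)
    held_titles = holdings_index.get(library, set())
    return sorted(matching_titles & held_titles)


def update_holdings(library, title_ids):
    """Update the holdings index only; the catalogue index is not touched."""
    holdings_index[library] = set(title_ids)
```

In this sketch a change in a library's stock only touches `holdings_index`, which is the maintenance benefit the text describes; the combination of both sources is computed when `search_with_holdings` is called.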

Status

The Search Platform is now in production. The Public Library of Amsterdam uses it for all its branches.  It combines the National Catalogue with, among others, the music collection of Muziekweb.nl and local events from Uitburo.nl.  Late Integration makes sure maintaining the indexes is very easy.

Other features (ready or under development) are:

  1. Dynamic static rank: a separate ranking query reweighs results according to static ranks maintained in a separate index.  At this moment such ranks include age, holdings, sources and types.
  2. Uploading and using more ontologies so that more content becomes navigable through them.
  3. Extensive availability services providing detailed information on how, where and under what conditions each object can be obtained.
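
The first item, reweighing results with static ranks kept in a separate index, could look roughly like the sketch below. Everything in it is an assumption made for illustration: the signal names `age_years` and `holdings`, the formula in `static_rank`, the `weight` parameter and the sample numbers are invented and do not describe the platform's actual ranking.

```python
# Hypothetical static rank data, kept apart from the main search index.
# Signal names, formula and numbers are illustrative assumptions.
static_rank_index = {
    "title:1": {"age_years": 1, "holdings": 200},
    "title:2": {"age_years": 9, "holdings": 10},
}


def static_rank(key):
    """Combine the static signals of one object into a single number."""
    signals = static_rank_index.get(key)
    if signals is None:
        return 0.0  # objects without static data get no boost
    freshness = 1.0 / (1.0 + signals["age_years"])
    popularity = signals["holdings"] / 100.0
    return freshness + popularity


def rerank(hits, weight=0.5):
    """Reweigh (key, dynamic_score) hits with their static rank.

    The dynamic score comes from the search itself; the static part is
    looked up separately and added with the given weight.
    """
    rescored = [(key, score + weight * static_rank(key)) for key, score in hits]
    return sorted(rescored, key=lambda hit: hit[1], reverse=True)
```

Called as `rerank([("title:2", 1.1), ("title:1", 1.0)])`, the sketch moves `title:1` to the top: its query score is slightly lower, but it is newer and held by more libraries. Because the static data lives in its own structure, it can be refreshed without rebuilding the search index.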

Especially the last point is an interesting added value of the Search Platform.  No matter what one finds, one always wants to click through to see more. In the library and cultural heritage domain, that almost always involves more than just providing a link.  The platform uses both a generalized and a specialized implementation of the Document Availability Information API (DAIA) working draft.  A future blog post will offer more details on the architecture and application of DAIA.

About scalability, efficiency, pairs and time

At Seecr we continuously scale our systems both up and out, but we also keep improving their efficiency.  Here is why and how we do it.

Scalability versus Efficiency

Quite often, people think that scalability is everything. But scaling an inefficient system, if at all possible, is going to be expensive and might even stop you completely. It certainly looks nice when Amazon adds 100 extra machines to your system in an instant, but it might just as well be a gross waste of resources. And as long as money is not a problem, the problems related to inefficient systems can remain hidden for a long time.

Why is Inefficiency a Problem?

We are now living in an era where more and more people need no explanation as to why inefficiency is a bad thing. Much to my delight, the mere idea of wasting something is increasingly enough to make people act. So I won’t go in that direction. There is another problem, however.

Inefficiency is caused by something. And when the time comes that you do want to improve it, you need to address the causes. And then it might turn out that you are a little too late….

Here are two significant causes we have observed frequently:

  1. Programming negligence
  2. Wrong algorithm

1. Programming negligence.

Programming is quite a difficult task and each problem has different aspects that need careful attention. There are the matters of primary design, testability, consequences for other features, refactoring, selection of libraries, code style, readability and intention, integration with other parts, packaging, upgrading and data conversion and on goes the list, endlessly. That’s the nature of software.

Efficiency is definitely somewhere on that list of aspects. It is all too natural that, once code functionally works, everyone is glad and moves on to the next task. But at that time, some aspects might not have received the attention they require, and efficiency is often among them.  If this goes on for some time, you’ll end up with many places needing small fixes.

But you can’t fix it later.  The reason for that is as profound as it is simple: there are just too many small things, each of which contributes only little to the problem. It is the power of the many. Addressing each of these problems requires you take time to delve into them again, with only little reward: each single problem solved improves efficiency only a little. You’ll have to work through them all to see results.
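
A little arithmetic shows why this is so discouraging. The sketch below assumes, purely for illustration, 40 places in the code that each add 2% overhead, and that these overheads compound multiplicatively; neither the numbers nor the model come from a real system.

```python
def combined_slowdown(overheads):
    """Multiply the individual overheads into one overall slowdown factor."""
    factor = 1.0
    for overhead in overheads:
        factor *= 1.0 + overhead
    return factor


# Assumed numbers for illustration: 40 places, each adding 2% overhead.
overheads = [0.02] * 40

total = combined_slowdown(overheads)               # all inefficiencies present
after_one_fix = combined_slowdown(overheads[1:])   # one of them fixed

# Fraction of run time saved by fixing a single place.
gain_from_one_fix = 1.0 - after_one_fix / total

print(f"overall slowdown: {total:.2f}x")
print(f"gain from fixing one place: {gain_from_one_fix:.1%}")
```

Under these assumptions the system as a whole is more than twice as slow as it could be, yet each individual fix wins back only about 2%. No single fix looks worth the effort of delving into the code again, and only working through them all makes a visible difference.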

If only you had made different choices in the first place, when you were first writing it…

2. Wrong algorithm. 

A problem can often be solved in very different ways. Naturally, you first pick a solution based on your experience and understanding of the problem, then work it out. Often it becomes clear during the process that another solution is a better fit. This is a direct result of an increased understanding of the problem while working on it. Deep understanding of the dynamics that arise when the problem and the solution interact might also arrive later, for example when you run the system with fully populated data sets and meet unforeseen usage patterns that do not appear in testing environments. It then turns out that you need a (completely) different algorithm to solve the problem.  Back to the drawing board. That’s the nature of software too.

Dead in your tracks

Both problems, many small inefficiencies and a wrong algorithm, are not just a pair of imperfections in your system.  Both can simply place the required throughput and responsiveness beyond your capabilities and budget, because both require a complete rethinking of your system: either go over all the details again and improve them, or go over the main design again and change it. This costs a lot of time and, most importantly, it takes the time of the most specialized people.  If you could only have made other decisions when the system was first created…

What are solutions?

Let me get this straight first: getting better programmers, smarter architects or more elaborate processes including a lot of quality assurance does not solve the problem. While some people do learn faster or can have more things on their minds, and while some elaborate processes do catch more problems, they basically only ameliorate the situation marginally.  They do not address the limitations fundamentally.

So what is the fundamental problem then?

The fundamental problem is that:

  1. people are given too many things to worry about at the same time.
  2. people are given too little time to learn and understand well.

In short, they suffer from continuous partial attention.

Now the solution becomes almost evident: use more people and give them more time.

In Shock?

You would probably say: “but I can’t spend more people and more time, you know that budgets are tight and your solution is not realistic.”  Well, if you really think that, stop here.

If you think you are in control and you can change things for the better: continue.

First: Pair Programming

The most profound impact on software quality (that’s what we’re talking about) I have ever seen comes from working in pairs.  Pairs have double the capacity to deal with all the things-to-worry-about.  But that’s only the beginning.  The dynamics within a pair are what really pay off.  Each person has his or her own set of aspects of interest, and stimulates the other's thinking.

Pairs have the advantage that they mix easily.  Depending on the task at hand, on personal preferences, on tiredness even, one person might switch with another.  This new combination will pay attention to other aspects with a fresh mind.  Also, knowledge is exchanged automatically.  Deliberately shuffling pairs is a strong tool in the hands of a team.  It allows you to increase the number of aspects that receive attention.

Second: Time

But having more brains at the task is only half the solution.  The other half is giving these brains enough time to learn and understand the problem.  So if one pair is functionally done, prepare to let a third person replace one of them.  There will be points left over for him or her to think about. While the first pair focussed on getting it functionally right, the second pair looks at other aspects such as efficiency.  In fact, it is a simple application of Make-It-Work-Make-It-Right.

Conclusion

It all comes down to the careful allocation of people and time. Look at people's skills and interests, and allocate enough time. I can’t stress this enough: proper allocation of time means saving time. Full stop.  When you rush, you are going to regret it later, and someone is going to pay for it.  It is therefore an act of will to allocate time in the beginning, right when you are solving the problem.

The only way to build scalable systems is by first making efficient systems. And for that you need to allocate enough time and people before you scale up.