A peek behind the curtain at: Bibliographic Indexing in Evergreen

In an attempt to force myself to post more often, I’m starting a new series tentatively called “A peek behind the curtain at.” I’m not sure how often these will happen or exactly where it will lead, but if you want more (or less) of this, just comment below. And now, on to the main event!

We’re often asked how indexing of bibliographic material works in Evergreen. That’s a hard question to answer because, firstly, indexing is a hard problem to solve in practice. It’s also a hard question to answer because our solution is, well, mutable.

In Evergreen you can control what data is indexed, how it is presented to the indexer, how it is normalized for indexing — both as a raw value and as an index vector that is used internally — and the weights and ranking bonuses that any field can have when contributing to a result set.
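To make those knobs a little more concrete, here is a minimal Python sketch of the general idea: normalize a field's raw value, turn it into a simple term vector, and let a per-field weight influence ranking. This is not Evergreen's code or schema. The names `FieldDef`, `normalize`, `to_vector`, and `score` are hypothetical, and the bag-of-words vector is only a stand-in for whatever index vector the real system builds internally.

```python
import re
import unicodedata
from collections import Counter
from dataclasses import dataclass


@dataclass
class FieldDef:
    """Hypothetical description of one indexed field (not Evergreen's schema)."""
    name: str      # illustrative label, e.g. "title"
    weight: float  # illustrative ranking multiplier for matches in this field


def normalize(value: str) -> str:
    """Lowercase, strip diacritics and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r"[^\w\s]", " ", no_marks.lower())
    return " ".join(no_punct.split())


def to_vector(normalized: str) -> Counter:
    """Build a toy term vector: each term mapped to its occurrence count."""
    return Counter(normalized.split())


def score(query: str, record: dict, fields: list) -> float:
    """Sum weighted term matches across every configured field."""
    terms = normalize(query).split()
    total = 0.0
    for field in fields:
        vector = to_vector(normalize(record.get(field.name, "")))
        total += field.weight * sum(vector[t] for t in terms)
    return total


# Assumed example configuration: title matches count double.
fields = [FieldDef("title", 2.0), FieldDef("subject", 1.0)]
record = {"title": "Les Misérables", "subject": "French fiction"}
title_hit = score("miserables", record, fields)
subject_hit = score("french", record, fields)
```

With this assumed configuration, a query that matches the title outranks one that matches only the subject, which is the kind of effect per-field weights are meant to produce.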

All of that means indexing and searching are complex problems with complex solutions. So, as a start to explaining the process, I have created two pieces of documentation that I hope will help others understand what is going on under the covers, and perhaps spur discussion of how to improve and extend the Evergreen indexing and search infrastructure in the future.

To see how data is indexed, head over here on the Evergreen wiki. This covers the state of the art in trunk, what will become Evergreen 2.0 in the future.

Likewise, a discussion of searching is available here on the wiki. This is also targeted at trunk/2.0.

Please comment here, or start a thread on the open-ils-dev mailing list, if you’d like to take this discussion further. I look forward to any feedback!

In the next installment of the “behind the curtain” series I plan to highlight documentation describing what you can do with a search result for display purposes.


UPDATE: For some reason, WordPress decided that every sentence but the first in each paragraph should start with a question mark.  I’ve disabused it of that notion.  Also, it’s “peek”, not “peak” …