This is part #3 in a (never ending?) series of articles on Indexing and Searching the ISFDB.org data using Solr.
When we left off on Jan 28, I had configs that enabled me to build a very basic index of all the title+author pairs in the ISFDB using the DataImportHandler, indexing into meaningful fields with useful data types. The goal last Friday (when I managed to squeeze in a bit of work on this, though I didn’t get a chance to blog about it until today) was to improve the “Document Modeling” of the index, so that queries could be used to answer meaningful questions about “Titles”.
(If you are interested in following along at home, you can check out the code from github. I’m starting at the blog_2 tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the blog_3 tag containing the end result of this article.)
What does “Document Modeling” mean?
“Document Modeling” is just my fancy way of describing “Data Modeling” from a search perspective – I use the term “Document Modeling” because unlike traditional RDBMS or OO Data Modeling, when using an IR Engine like Solr, you need to think “flat”. Complete relationships need to be flattened into individual “Documents” that form the basis of all actions.
At the end of last week’s blog, I mentioned some queries we could do with the data we had…
The key part of these query descriptions to notice is the term “record”. What I meant by that is that because of how the data was being indexed, each Document in our index corresponded to a “record” of a title+author pair, which is fairly arbitrary. This isn’t the type of information most people are looking for: people want to search for “Books” or “People”, not “Instances where a Person was an author of a Book”.
So today we’re going to tweak our index so that each Document models a “Title” and contains info about all of the Authors that collaborated on it.
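To make the “flattening” idea concrete, here is a minimal Python sketch of collapsing title+author pair records into one Document per Title. It is only an illustration of the modeling change, not how the index is actually built (that happens in the DataImportHandler configs). The function name `flatten_pairs`, the field names `id`, `title` and `authors`, and the sample rows are all made up for this example.

```python
def flatten_pairs(pairs):
    """Collapse (title_id, title, author) pair records into one dict per title.

    All names here are hypothetical; they only illustrate the shape change
    from "one record per title+author pair" to "one Document per Title".
    """
    docs = {}  # dicts keep insertion order, so titles stay in first-seen order
    for title_id, title, author in pairs:
        doc = docs.setdefault(
            title_id, {"id": title_id, "title": title, "authors": []}
        )
        if author not in doc["authors"]:  # skip duplicate pair records
            doc["authors"].append(author)
    return list(docs.values())


# Placeholder rows, not real ISFDB data.
pairs = [
    (1, "Title A", "Author X"),
    (1, "Title A", "Author Y"),
    (2, "Title B", "Author X"),
]
title_docs = flatten_pairs(pairs)
```

With documents shaped like this, a search for an author matches the Title itself (with a multi-valued list of authors) rather than one arbitrary pair record per collaborator.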
Getting Started
To start things off, I took a look at the ISFDB.org DB Schema documentation. (It says a lot about Solr that this is really the first time I had to look at any documentation on the tables I was indexing.) This led me to a few pieces of information that I took advantage of…
- constraining ca_status=1 in our DB query is how we eliminate reviews and just get real authors of titles
- verified that title_ctl is a bogus field
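As a rough illustration of the ca_status constraint, the sketch below runs a filtered query against a throwaway in-memory SQLite table. Only the `ca_status = 1` condition comes from the schema notes above. The table name `canonical_author`, the `title_id` and `author_id` columns, the value `3` standing in for a non-author row, and the sample rows are assumptions made for this example, so check them against the real schema before reusing any of it.

```python
import sqlite3

# Throwaway in-memory table; names and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE canonical_author (
        title_id  INTEGER,
        author_id INTEGER,
        ca_status INTEGER
    );
    INSERT INTO canonical_author VALUES
        (1, 10, 1),  -- real author of title 1
        (1, 11, 1),  -- second real author of title 1
        (1, 12, 3),  -- stand-in for a non-author (e.g. review) row
        (2, 10, 1);  -- real author of title 2
    """
)


def author_ids_by_title(connection):
    """Return {title_id: [author_id, ...]} keeping only ca_status = 1 rows."""
    rows = connection.execute(
        "SELECT title_id, author_id FROM canonical_author "
        "WHERE ca_status = 1 ORDER BY title_id, author_id"
    )
    result = {}
    for title_id, author_id in rows:
        result.setdefault(title_id, []).append(author_id)
    return result


authors = author_ids_by_title(conn)
```

The row with the other status value never reaches the result, which is the effect we want from the constraint when the real import query runs.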