Solr Powered ISFDB – Part #3: Document Modeling

This is part #3 in a (never ending?) series of articles on Indexing and Searching the ISFDB.org data using Solr.

When we left off on Jan 28, I had configs that enabled me to build a very basic index of all the title+author pairs in the ISFDB using the DataImportHandler, indexing into meaningful fields with useful data types. The goal last Friday (when I managed to squeeze in a bit of work on this, though I didn’t get a chance to blog about it until today) was to improve the “Document Modeling” of the index, so that queries could be used to answer meaningful questions about “Titles”.

(If you are interested in following along at home, you can check out the code from GitHub. I’m starting at the blog_2 tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the blog_3 tag containing the end result of this article.)

What does “Document Modeling” mean?

“Document Modeling” is just my fancy way of describing “Data Modeling” from a search perspective. I use the term “Document Modeling” because unlike traditional RDBMS or OO Data Modeling, when using an IR Engine like Solr, you need to think “flat”. Complex relationships need to be flattened into individual “Documents” that form the basis of all actions.

At the end of last week’s blog, I mentioned some queries we could do with the data we had…

All records where the author’s last name is smith
All records for Novels by authors born in Paris
Records about titles with “space” in their name, with facet counts based on the type of work
A breakdown showing how many novels contain “space” vs “robot” vs “robots” in their titles

The key part of these query descriptions to notice is the term “record”. What I meant by that is that because of how the data was being indexed, each Document in our index corresponded to a “record” of a title+author pair, which is fairly arbitrary. This isn’t the type of information most people are looking for: people want to search for “Books” or “People”, not “Instances where a Person was an author of a Book”.

So today we’re going to tweak our index so that each Document models a “Title” and contains info about all of the Authors that collaborated on it.
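To make the “flattening” idea concrete, here is a minimal Python sketch of collapsing title+author pair records into one Document per Title. It is not taken from the actual configs: the sample records, the field names (title_id, title, author, authors) and the function name are all made up for illustration.

```python
# Hypothetical title+author pair records, shaped like the Documents in the old index.
pair_records = [
    {"title_id": 1, "title": "Space Novel", "author": "A. Smith"},
    {"title_id": 1, "title": "Space Novel", "author": "B. Jones"},
    {"title_id": 2, "title": "Robot Story", "author": "A. Smith"},
]


def flatten_to_title_docs(records):
    """Collapse one-record-per-pair rows into one document per title."""
    docs = {}
    for rec in records:
        # Create the title document the first time we see this title_id.
        doc = docs.setdefault(
            rec["title_id"],
            {"id": rec["title_id"], "title": rec["title"], "authors": []},
        )
        # "authors" plays the role of a multivalued field.
        doc["authors"].append(rec["author"])
    return list(docs.values())


title_docs = flatten_to_title_docs(pair_records)
for doc in title_docs:
    print(doc)
```

Three pair records become two title-centric documents, with the co-authors of the first title held together in a single multivalued field.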

Getting Started

To start things off, I took a look at the ISFDB.org DB Schema documentation. (It says a lot about Solr that this is really the first time I had to look at any documentation on the tables I was indexing.) This led me to a few pieces of information that I took advantage of…

Constraining ca_status=1 in our DB query is how we eliminate reviews and get only the real authors of titles
title_ctl is a bogus field (verified against the schema documentation)

With those tweaks in mind, the first step in making progress on the configs was to spell out explicitly the fields I wanted in our SQL, so I could keep track of what was coming from where and completely skip the fields I was already ignoring.

Nested Entities

Next up was to switch to making our top level “Entity” in the DIH config be “title” and add a “nested entity” that corresponds to the “author” data. Using nested entities like this allows for one query that fetches the main details about each row (“title”), with a sub query that returns multiple values for multivalued fields (the “author” related fields).

Using nested entities is a lot slower (~6.5 minutes on my laptop vs ~2 minutes in the old version), but that’s just because we need to execute a lot more SQL queries (600k+ SQL queries instead of 1). It’s the nature of the DB data model. If the data model had the notion of a “primary” author, then we could include that in our main select for the “title” entity, and then most of the queries for the “title_author” sub-entity would be no-ops. (I’m not sure if it would actually be much faster though.)

The upside is that we now have a single document per “title” with multivalued fields containing the author details. Since these documents are title-centric, I removed some of the author fields, since they don’t really add any value. (I’ll add them back when we also include author-centric documents in our index.)
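The query pattern behind a nested entity can be sketched outside of DIH. The Python below uses an in-memory SQLite database as a stand-in for the real DB; the table and column names (titles, title_author, author_name) are simplified assumptions, not the real ISFDB schema. It shows why the query count grows with the number of titles: one outer query, plus one sub query per title row.

```python
import sqlite3

# Toy stand-in for the real database; table and column names are hypothetical.
nested_conn = sqlite3.connect(":memory:")
nested_conn.executescript("""
    CREATE TABLE titles (title_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE title_author (title_id INTEGER, author_name TEXT);
    INSERT INTO titles VALUES (1, 'Space Novel'), (2, 'Robot Story');
    INSERT INTO title_author VALUES (1, 'A. Smith'), (1, 'B. Jones'), (2, 'A. Smith');
""")


def build_docs_nested(conn):
    """Build title documents with one sub query per title; return (docs, query_count)."""
    titles = conn.execute(
        "SELECT title_id, title FROM titles ORDER BY title_id"
    ).fetchall()
    query_count = 1  # the outer "title" query
    docs = []
    for title_id, title in titles:
        # The nested "author" query, executed once for every title row.
        rows = conn.execute(
            "SELECT author_name FROM title_author WHERE title_id = ? ORDER BY rowid",
            (title_id,),
        ).fetchall()
        query_count += 1
        docs.append({"id": title_id, "title": title, "authors": [r[0] for r in rows]})
    return docs, query_count


nested_docs, nested_queries = build_docs_nested(nested_conn)
print(nested_queries, nested_docs)
```

With two titles this issues three queries; with hundreds of thousands of titles, the same pattern issues hundreds of thousands of sub queries, which is where the extra indexing time goes.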

Gotcha: DIH and Multiple Values for Single Valued Fields

At this point, while doing some testing, I discovered something interesting about DIH that I didn’t know before. If you configure it to produce multiple values for a field (like all the author_* fields in my current isfdb-dih.xml) but some of those fields are multiValued="false" in your schema (I forgot to set it to true on author_canonical), DIH will silently throw away the extra values for you and just give you one value. I would have expected it to error (if it had, I would have caught my mistake before my last commit), but it doesn’t. Live and learn (and fix my schema.xml mistake).
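Since DIH won’t complain about this, one way to catch it is a sanity check of your own before trusting the index. The Python below is a hypothetical sketch of such a check, not something DIH or Solr provides: the schema dict, the sample document and the author_lastname field name are assumptions for illustration (author_canonical is the field I actually got wrong).

```python
# Hypothetical schema summary: field name -> whether the field is multiValued.
schema_multivalued = {
    "title": False,
    "author_canonical": False,  # the mistake: this should have been True
    "author_lastname": True,
}


def find_multivalue_conflicts(doc, schema):
    """Return the single-valued fields that were handed more than one value."""
    conflicts = []
    for field, value in doc.items():
        has_many_values = isinstance(value, list) and len(value) > 1
        # Fields missing from the schema summary are treated as single-valued.
        if has_many_values and not schema.get(field, False):
            conflicts.append(field)
    return sorted(conflicts)


sample_doc = {
    "title": "Space Novel",
    "author_canonical": ["A. Smith", "B. Jones"],
    "author_lastname": ["Smith", "Jones"],
}

conflicts = find_multivalue_conflicts(sample_doc, schema_multivalued)
print(conflicts)
```

Any field name reported here is one where values would be lost on the way into the index.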

Speed Things Up?

At this point, I tried adding the CachedSqlEntityProcessor to my sub-entity, but it didn’t help speed things up. At first I put this down to the number of authors, but then I realized that because I was using canonical_authors in the SQL for the nested entity, of course every select would be unique, so the cache never got a hit.

Then I noticed the option on CachedSqlEntityProcessor to load all rows from an entity into memory as part of a single upfront query. This is a feature I wasn’t aware of, and it sounded like it could be great for dealing with authors (assuming you have the RAM). The trick was to move canonical_authors out of the entity (so we only have one row per author, not one per author_title), which leads to a triple nested entity requiring only 3 DB SQL calls to get all the data.

This got the indexing time down to only ~70 seconds, way better than my expectations (I figured it might be close to as fast as the single select, but I didn’t expect it to be faster). Using CachedSqlEntityProcessor won’t help in all situations (it really depends on the volume of data you’re trying to cache, and how much RAM you have to work with), but in this case it turned out awesome.
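The idea behind the cached approach can be sketched in plain Python: run a fixed number of upfront queries, hold the results in memory, and do the joins with dictionary lookups instead of per-row SQL. This is only an illustration of the general technique, not how CachedSqlEntityProcessor is implemented, and the in-memory SQLite tables and column names are simplified assumptions rather than the real ISFDB schema.

```python
import sqlite3

# Toy stand-in for the real database; table and column names are hypothetical.
cached_conn = sqlite3.connect(":memory:")
cached_conn.executescript("""
    CREATE TABLE titles (title_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, author_name TEXT);
    CREATE TABLE title_author (title_id INTEGER, author_id INTEGER);
    INSERT INTO titles VALUES (1, 'Space Novel'), (2, 'Robot Story');
    INSERT INTO authors VALUES (10, 'A. Smith'), (20, 'B. Jones');
    INSERT INTO title_author VALUES (1, 10), (1, 20), (2, 10);
""")


def build_docs_cached(conn):
    """Build title documents from three upfront queries and in-memory lookups."""
    # Query 1: every author, one row per author, keyed by author_id.
    authors = dict(conn.execute("SELECT author_id, author_name FROM authors"))
    # Query 2: every title/author link, grouped by title_id in memory.
    links = {}
    for title_id, author_id in conn.execute(
        "SELECT title_id, author_id FROM title_author ORDER BY rowid"
    ):
        links.setdefault(title_id, []).append(author_id)
    # Query 3: every title; the "joins" below are dictionary lookups, not SQL.
    docs = []
    for title_id, title in conn.execute(
        "SELECT title_id, title FROM titles ORDER BY title_id"
    ):
        names = [authors[a] for a in links.get(title_id, [])]
        docs.append({"id": title_id, "title": title, "authors": names})
    return docs


cached_docs = build_docs_cached(cached_conn)
print(cached_docs)
```

The number of SQL calls stays at three no matter how many titles there are; the cost moves into RAM, which is why this only pays off when the cached tables fit comfortably in memory.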

Conclusion (For Now)

And that wraps up this latest installment with the blog_3 tag. Now queries like the ones I mentioned last week all give us results that actually mean something interesting about “Titles” in the ISFDB…

Check back at the end of this week, when I plan to add “Author Centric” Documents to our Document Model, letting users search for Titles or Authors and still get meaningful results.

