Evergreen DokuWiki

This is a collection of proposed changes to searching. None of it is guaranteed to happen and it is all subject to revision.

Database changes

Table Changes

Create metabib.combined_*_field_entry tables:

record - FK to biblio.record_entry
metabib_field - FK to config.metabib_field, nullable
index_vector - Combined stripped index_vector

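The key would be record + metabib_field or similar. Since metabib_field is nullable, this may need to be a unique index rather than a formal PK. There is no need for an ID column, as these tables would be maintained entirely by the DB and thus need no IDL entry.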

These tables would be populated with the combined index_vector for all entries for a given record/field combination, as well as for a given record/all fields combination when metabib_field is NULL.

Problem solved: Atom negation not negating properly when multiple rows exist (say, subject atom negation). There is also a possible speed increase, since fewer rows need to be checked for record matching.

Note that the index_vector would be stripped. This is because we would be ranking on individual rows, not the combined one.
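
The negation problem can be illustrated with a small toy model. This is only a sketch: each index_vector is reduced to a plain set of lexemes (no positions, weights, or stemming), and the row contents and the `matches` helper are invented for illustration rather than taken from Evergreen.

```python
# Toy model: an index_vector is just a set of lexemes.
def matches(vector, required, negated):
    """True if the vector has every required lexeme and none of the negated ones."""
    return required <= vector and not (negated & vector)

# Two hypothetical subject field_entry rows belonging to the same record.
rows = [
    {"civil", "rights"},            # row 1: does not mention "king"
    {"king", "martin", "luther"},   # row 2: mentions "king"
]

required, negated = {"civil"}, {"king"}

# Per-row matching: row 1 satisfies "civil & !king" by itself,
# so the record is returned even though row 2 contains "king".
per_row_hit = any(matches(row, required, negated) for row in rows)

# Combined matching: the union of all rows contains "king",
# so the record is correctly excluded.
combined = set().union(*rows)
combined_hit = matches(combined, required, negated)

print(per_row_hit, combined_hit)
```

With per-row matching the record is a hit; against the combined vector it is not, which is the behavior the combined tables are meant to provide.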

Create config.metabib_class_ts_map table:

id - PK
field_class - FK to config.metabib_class
ts_config - TEXT
active - bool
index_weight - character(1) - Restricted to A,B,C,D. Which index_vector weight class this belongs to. Default D.
index_lang - TEXT, nullable. Regex to match record languages against for index-time parsing. Intended for turning on/off stemming or replacements. NULL means "always use"
search_lang - TEXT, nullable. Same as index_lang, but for search-time.
always - bool

This would set the default text search configs used when indexing config.metabib_field values. If "always" is set, the rule is applied in addition to field-specific values; otherwise the rule is applied only in lieu of field-specific values.

GOALS INCLUDE: Stemmed and non-stemmed variants indexed, multi-language stemming, etc.

RECOMMENDED: One entry for each class that does 'simple' on weight A with always set and no language restrictions. One entry for each class for your preferred language stemming (default: english) on weight C with always set and no language restrictions.

Create config.metabib_field_ts_map table:

id - PK
metabib_field - FK to config.metabib_field
ts_config - TEXT
active - bool
index_weight - same as class map
index_lang - same as class map
search_lang - same as class map

The same as the class_ts_map table, but for individual fields. This would allow a keyword search to more significantly bump, say, titles or authors by adding another keyword index.
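
How the class and field maps might interact at index time can be sketched as follows. This is one reading of the rules above, not a settled design: the `TsMap` class and `applicable` function are hypothetical names, the language patterns are treated as unanchored regexes, and "field-specific values" is assumed to mean field rows that are active and match the record language.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class TsMap:
    ts_config: str
    index_weight: str = "D"
    index_lang: Optional[str] = None  # regex; None means "always use"
    active: bool = True
    always: bool = False              # only meaningful on class-level rows

def lang_ok(pattern, record_lang):
    """A NULL (None) pattern matches every language."""
    return pattern is None or re.search(pattern, record_lang) is not None

def applicable(class_maps, field_maps, record_lang):
    """Pick the ts_config rows to index a field with (assumed semantics)."""
    def usable(m):
        return m.active and lang_ok(m.index_lang, record_lang)

    field_rows = [m for m in field_maps if usable(m)]
    class_rows = [m for m in class_maps if usable(m)]
    if field_rows:
        # Field-specific rows win; class rows join in only when flagged "always".
        return field_rows + [m for m in class_rows if m.always]
    # No field-specific rows: all usable class rows act as the defaults.
    return class_rows

# Hypothetical setup: 'simple' always applies, 'english' is a class default,
# and one field has a French-only override.
class_maps = [TsMap("simple", "A", always=True), TsMap("english", "C")]
field_maps = [TsMap("french", "C", index_lang="^fre$")]

print([m.ts_config for m in applicable(class_maps, field_maps, "fre")])
print([m.ts_config for m in applicable(class_maps, field_maps, "eng")])
```

For a French record the field override is used alongside the "always" class row; for an English record the field override does not apply, so the class defaults are used.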

ALTER config.metabib_class table:

ADD COLUMNS:
- a_weight
- b_weight
- c_weight
- d_weight

For ranking values. I don't see a reason to make these any more specific than the entire class. Likely defaults may be the stock text search defaults.

Weight Classes

Recommended (but not required) uses for weight classes:

A - Non-stemmed, 'simple'
B - Lang-specific Thesaurus Replacements
C - Lang-specific Stemming
D - Lang-specific Thesaurus Replacements + Stemming

DB Triggers

Creating or updating a metabib.*_field_entry row would cause it to load up all applicable class and field text search mappings. For each weight in order from A to D:

Build individual tsvectors based on the ts_config entries
Convert those to strings
Concat the strings with spaces between
Convert back to a unified tsvector
Use tsvector concat (||) to combine with previous groups

This ensures that the first weight used from A to D has accurate position info, but the following ones may not.
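
The position drift can be seen in a rough model of the steps above. This sketch assumes a tsvector can be modeled as a mapping from lexeme to positions, that the text round trip simply merges position lists, and that tsvector concatenation (||) shifts the right-hand positions by the largest position on the left, as PostgreSQL documents. The function names and sample lexemes are invented for illustration.

```python
def merge_same_weight(vectors):
    """Model of concatenating tsvectors as text and casting back: positions kept as-is."""
    merged = {}
    for vec in vectors:
        for lexeme, positions in vec.items():
            merged.setdefault(lexeme, set()).update(positions)
    return merged

def concat(left, right):
    """Model of tsvector || tsvector: right-hand positions shift past the left maximum."""
    offset = max((pos for entries in left.values() for pos, _ in entries), default=0)
    out = {lexeme: set(entries) for lexeme, entries in left.items()}
    for lexeme, entries in right.items():
        out.setdefault(lexeme, set()).update((pos + offset, w) for pos, w in entries)
    return out

def build_index_vector(groups):
    """groups maps a weight letter to a list of vectors (lexeme -> set of positions)."""
    result = {}
    for weight in "ABCD":
        vectors = groups.get(weight, [])
        if not vectors:
            continue
        merged = merge_same_weight(vectors)
        weighted = {lex: {(pos, weight) for pos in ps} for lex, ps in merged.items()}
        result = concat(result, weighted)
    return result

# Hypothetical title "the assistant": an unstemmed config on weight A and a
# stemming config on weight C that drops the stop word and stems "assistant".
groups = {
    "A": [{"the": {1}, "assistant": {2}}],
    "C": [{"assist": {2}}],
}
vector = build_index_vector(groups)
print(vector)
```

In this model the weight A lexemes keep positions 1 and 2, while "assist" on weight C lands at position 4 instead of 2, which is the inaccuracy noted above for every group after the first.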

Also, update the "populate metabib.*_field_entry" ingest trigger to fire off a new "populate combined" function once it is done with a record, so that the combined entry is not updated on every field_entry change.

QP Changes

Search Construction

Non-phrase searches

If doing a non-specific class search, load up all config.metabib_class_ts_map rows that match search_lang or have a NULL search_lang. Each atom gets a to_tsquery for each ts_config, ||ed together; the per-atom results are then &&ed together to form the query.

If doing a specific field set, load up the configs for the specific field(s), if any match search_lang, then add in the "always" ones for that class. Repeat as needed.

Sanity default would probably be "simple" and "<class>" if none were found.

Regardless, apply the query to the "combined" table. If using specific classes, match against string_agg(index_vector::text, ' ')::tsvector, and also apply an all-OR (||) tsquery with no negation (!) to the individual combined index_vectors as a fast prefilter. Use the resulting list of IDs to load the individual field rows for ranking (with weights as appropriate).
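
The query construction described in this section might look roughly like the following. It is a sketch only: `atom_expr` and `build_query` are hypothetical helpers, the config names are examples, and negated atoms are assumed to be wrapped with the PostgreSQL `!!` tsquery operator, which the text above does not spell out. Real code would pass atoms as bind parameters rather than interpolating them into SQL.

```python
def atom_expr(atom, ts_configs):
    """OR together one to_tsquery call per ts_config for a single atom."""
    parts = [f"to_tsquery('{cfg}', '{atom}')" for cfg in ts_configs]
    return "(" + " || ".join(parts) + ")"

def build_query(atoms, ts_configs, negated=()):
    """AND the per-atom groups together; negated atoms get a !! prefix (assumption)."""
    groups = [atom_expr(a, ts_configs) for a in atoms]
    groups += ["!!" + atom_expr(a, ts_configs) for a in negated]
    return " && ".join(groups)

# Fall back to the "sanity default" style of config list if none were found.
configs = ["simple", "english"]
query = build_query(["martin", "luther"], configs, negated=["king"])
print(query)
```

Each atom is tried under every applicable config, so a record indexed with either the unstemmed or the stemmed form can match, while the && between groups still requires every atom to be present.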

Phrase searches

Always use the individual field_entry rows like we currently do, but look to see if "simple" is indexed on the field(s). If it is, use *only* the "simple" ts_config as an override. Otherwise build atoms like above.

If the phrase has no * as the first or last character then word-boundary the start and end. Otherwise skip the word boundary where there is a *.
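
The word-boundary rule can be sketched as a small pattern builder. This is illustrative only: `phrase_pattern` is a hypothetical helper, and Python's `\b` stands in for whatever boundary syntax the database regex engine uses (the two boundary markers are parameters for that reason). The escaping produced by `re.escape` is Python-specific and is an assumption here.

```python
import re

def phrase_pattern(phrase, start_boundary=r"\b", end_boundary=r"\b"):
    """Build a regex for a phrase, skipping the boundary wherever a * is present."""
    leading_star = phrase.startswith("*")
    trailing_star = phrase.endswith("*")
    pattern = re.escape(phrase.strip("*"))
    if not leading_star:
        pattern = start_boundary + pattern
    if not trailing_star:
        pattern = pattern + end_boundary
    return pattern

exact = phrase_pattern("the assist")
truncated = phrase_pattern("the assist*")

print(bool(re.search(exact, "the assistant")))        # boundary blocks "assistant"
print(bool(re.search(exact, "read the assist now")))  # whole-word phrase matches
print(bool(re.search(truncated, "the assistant")))    # trailing * allows it
```

Keeping the boundary markers as parameters means the same logic could emit a database-side pattern by passing the appropriate markers instead of `\b`.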

Negated phrase searches

Treat like a non-negated phrase search, but prepend "NOT". This excludes records that contain the phrase in any index, and lets the index_vectors quickly separate records that might contain the phrase from those that definitely do not.

Ranking

Using the ?_weight values on config.metabib_class build a float array for the first input to the ranking functions.

All non-negated atoms would be joined with || for ranking, instead of using &&. This allows records that match across multiple individual field_entry rows to get non-zero values when running ts_rank_cd checks.
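
A minimal sketch of the two ranking inputs follows. The helper names are hypothetical. One detail worth noting is that PostgreSQL's ts_rank and ts_rank_cd take the weights array in {D, C, B, A} order, so the columns would need to be reversed when building it; the sample values are the stock text search defaults.

```python
def rank_weights(a_weight, b_weight, c_weight, d_weight):
    """Order the class weights the way ts_rank/ts_rank_cd expect: {D, C, B, A}."""
    return [d_weight, c_weight, b_weight, a_weight]

def rank_query(atoms):
    """OR together the non-negated atoms; atoms are (text, negated) pairs."""
    return " | ".join(text for text, negated in atoms if not negated)

# Stock PostgreSQL defaults: A=1.0, B=0.4, C=0.2, D=0.1.
weights = rank_weights(a_weight=1.0, b_weight=0.4, c_weight=0.2, d_weight=0.1)
query = rank_query([("martin", False), ("luther", False), ("king", True)])

print(weights)
print(query)
```

Because the ranking query uses OR, a field_entry row containing only "martin" still contributes a non-zero rank for a record that matched "martin & luther" across several rows.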

I am UNSURE whether or not individual rows that *don't* contain the atoms should be included for a zero ranking. They are not currently. If not then do the OR search against each row. If so then just look up by record.

Rel Bumps

Move to an all-in-one plperlu function (maybe eventually a C func?) that does the various rel bumps based on array inputs. These may be less needed with the new method of doing things.

Example searches

Some fairly simplified examples without full query output for explaining some of the above.

NOTE: The examples below assume the current behavior of the - modifier.

NOTE 2: The tsquery examples below are lazy examples and would actually be built with each atom passed through to_tsquery, which would stem them as appropriate.

keyword: martin luther -king

Current

search metabib.keyword_field_entry for index_vector with tsquery "martin & luther & !king", rank with the same tsquery.

Test Issue: If there are multiple keyword indexes with different weights and even one contains martin and luther but not king then the record will be returned.

Proposed

search metabib.combined_keyword_field_entry for index_vector with tsquery "martin & luther & !king" and metabib_field set to NULL. Join to metabib.keyword_field_entry based on the record ID for ranking using tsquery "martin | luther".

Issues solved: The combined table has all of the atoms within it, so matching on it is a one-shot. We then go back to the non-combined for ranking, which I think should be faster in the long run as we only load the records we previously identified.

title: the assist

Current

search metabib.title_field_entry for index_vector with tsquery "the & assist", rank with the same tsquery.

Test Issue: 'assistant' stems to 'assist', and tends to come up more often.

Proposed

search metabib.combined_title_field_entry for index_vector with tsquery "the & assist" and metabib_field set to NULL. Join to metabib.title_field_entry based on the record ID for ranking using tsquery "the | assist".

Issues solved: Note that the new stock config will likely have 'A' weight atoms for 'assist' where that was present exactly, but only 'C' or 'D' weight atoms for ones where 'assistant' was present, so 'assist' should weight higher by default.

title: "the assist"

Current

search metabib.title_field_entry for index_vector with tsquery "the & assist" and a where regex clause looking for "the assist", rank with the tsquery.

Test Issue: "the assistant" contains "the assist"

Proposed

search metabib.title_field_entry for index_vector with tsquery "the & assist" and a where regex clause looking for "[[:<:]]the assist[[:>:]]" (or equivalent), rank with tsquery "the | assist".

Issues solved: The word boundaries will ensure that "the assist" is not followed by "ant".

title|general|eng: the assist

Current

search metabib.title_field_entry for index_vector with tsquery "the & assist" and field set to the ids of the general or eng title fields, rank with the tsquery.

Test Issue: Generally the same as the standard title search, but limited to those two indexes.

Proposed

search metabib.combined_title_field_entry for a combined index_vector (string_agg(index_vector::text, ' ')::tsvector) with tsquery "the & assist" and metabib_field set to the ids of the general or eng title fields. Join to metabib.title_field_entry based on the record and field IDs for ranking using tsquery "the | assist".

Issues solved: Generally the same as the standard title search, but limited to those two indexes.
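
The field-limited search can be modeled in the same toy style. Everything here is an assumption made for illustration: vectors are plain sets of lexemes, the field ids are invented, and the OR prefilter is read as a way to find candidate records, after which all rows for the chosen fields are aggregated per record so that negation still sees every row.

```python
# (record, metabib_field) -> set of lexemes in the combined index_vector.
combined_rows = {
    (1, 6): {"the"},
    (1, 7): {"assist"},
    (2, 6): {"the", "helper"},
    (3, 8): {"the", "assist"},   # field 8 is not part of the search
}

def search(rows, field_ids, required, negated=frozenset()):
    """Return record ids whose aggregated vector for field_ids matches."""
    in_scope = {key: vec for key, vec in rows.items() if key[1] in field_ids}

    # Prefilter: any in-scope row containing at least one required atom
    # (all-OR, no negation) makes its record a candidate.
    candidates = {rec for (rec, _), vec in in_scope.items() if vec & required}

    hits = []
    for rec in sorted(candidates):
        # Aggregate every in-scope row for the record (the string_agg step).
        aggregate = set().union(*(v for (r, _), v in in_scope.items() if r == rec))
        if required <= aggregate and not (negated & aggregate):
            hits.append(rec)
    return hits

print(search(combined_rows, {6, 7}, {"the", "assist"}))
```

Record 1 matches only because its two field rows are aggregated; neither row satisfies "the & assist" alone. Record 3 has both atoms but in a field outside the requested set, so it is not returned.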
