This is a collection of proposed changes to searching. None of it is guaranteed to happen and it is all subject to revision.
Create metabib.combined_*_field_entry tables:
PK would be record + metabib_field or similar. No ID-based key is needed, since these tables would be maintained entirely by the DB and therefore need no IDL entry.
These tables would be populated with the combined index_vector for all entries for a given record/field combination, as well as for a given record/all fields combination when metabib_field is NULL.
Problem solved: Atom negation not negating properly when multiple rows exist (say, subject atom negation). Possible speed increase, since fewer rows need to be checked for record matching.
Note that the index_vector would be stripped. This is because we would be ranking on individual rows, not the combined one.
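The negation problem can be illustrated with a small simulation. This is only a sketch: each row is modeled as a plain set of lexemes rather than a real tsvector, and the function names are made up for the illustration.

```python
# Each set models the lexemes of one metabib.*_field_entry row for a single record.
def matches(lexemes, required, negated):
    """True when every required lexeme is present and no negated lexeme is."""
    return required <= lexemes and not (negated & lexemes)


def per_row_match(rows, required, negated):
    # Row-by-row matching: the record is returned if any single row matches.
    return any(matches(row, required, negated) for row in rows)


def combined_match(rows, required, negated):
    # Combined matching: test once against the union of all rows for the record.
    combined = set().union(*rows)
    return matches(combined, required, negated)


rows = [
    {"martin", "luther"},          # one subject entry
    {"martin", "luther", "king"},  # another subject entry on the same record
]
required, negated = {"martin", "luther"}, {"king"}

print(per_row_match(rows, required, negated))   # the first row alone satisfies the query
print(combined_match(rows, required, negated))  # the combined row contains "king"
```

With row-by-row matching the record slips through because one row happens to lack the negated atom; the combined row rejects it.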
Create config.metabib_class_ts_map table:
This would set the defaults for text search indexing of config.metabib_field values. If "always" is set, the rule is applied in addition to field-specific values; otherwise the rule is applied only in lieu of field-specific values.
GOALS INCLUDE: Stemmed and non-stemmed variants indexed, multi-language stemming, etc.
RECOMMENDED: One entry for each class that does 'simple' on weight A with always set and no language restrictions. One entry for each class for your preferred language stemming (default: english) on weight C with always set and no language restrictions.
Create config.metabib_field_ts_map table:
The same as the class_ts_map table, but for individual fields. This would allow a keyword search to more significantly bump, say, titles or authors by adding another keyword index.
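One possible reading of how class-level and field-level mappings combine is sketched below. The TsMap tuple and its attribute names (ts_config, weight, always) are assumptions made for the illustration, not a proposed schema, and "english" is just an example config name.

```python
from typing import NamedTuple


class TsMap(NamedTuple):
    # Hypothetical stand-in for a row of a *_ts_map table.
    ts_config: str
    weight: str
    always: bool = False


def applicable_maps(class_maps, field_maps):
    """Pick the mappings to use when indexing one field.

    Class rules flagged "always" apply in addition to field-specific rules;
    the other class rules apply only when the field has no rules of its own.
    """
    if field_maps:
        return [m for m in class_maps if m.always] + list(field_maps)
    return list(class_maps)


class_maps = [TsMap("simple", "A", always=True), TsMap("english", "C")]
field_maps = [TsMap("french", "C")]

print(applicable_maps(class_maps, field_maps))  # "simple" plus the field's own rule
print(applicable_maps(class_maps, []))          # both class rules
```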
ALTER config.metabib_class table:
For ranking values. I don't see a reason to make these any more specific than the entire class. Likely defaults may be the stock text search defaults.
Recommended (but not required) uses for weight classes:
Creating or updating a metabib.*_field_entry row would cause it to load up all applicable class and field text search mappings. For each weight in order from A to D:
This ensures that the first weight used from A to D has accurate position info, but the following ones may not.
Also, update the "populate metabib.*_field_entry" ingest trigger to fire off a new "populate combined" function once it is done with a record, so that nothing tries to update the combined entry on every field_entry change.
If doing a non-specific class search, load up all config.metabib_class_ts_map rows that match search_lang or have a NULL search_lang. Each atom gets a to_tsquery for each ts_config, ||ed together, then those are &&ed together to form the query.
If searching a specific set of fields, load up the configs for the specific field(s), if any match search_lang, then add in the "always" ones for that class. Repeat as needed.
Sanity default would probably be "simple" and "<class>" if none were found.
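The atom combination described above (one to_tsquery per ts_config ||ed together per atom, then the per-atom groups &&ed) might be assembled roughly like this. The sketch only builds SQL text plus a parameter list; the %s placeholder style is an assumption borrowed from common Python database drivers, negated atoms are not handled, and "english" is just an example config.

```python
def build_match_query(atoms, ts_configs):
    """Return (sql_fragment, params) for a tsquery expression.

    Each atom becomes a group of to_tsquery() calls, one per ts_config,
    joined with ||. The groups are then joined with &&.
    """
    groups, params = [], []
    for atom in atoms:
        variants = []
        for cfg in ts_configs:
            variants.append("to_tsquery(%s::regconfig, %s)")
            params.extend([cfg, atom])
        groups.append("(" + " || ".join(variants) + ")")
    return " && ".join(groups), params


sql, params = build_match_query(["martin", "luther"], ["simple", "english"])
print(sql)
print(params)
```

Passing the config and atom as parameters rather than splicing them into the SQL keeps user input out of the query text.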
Regardless, apply the query to the "combined" table. If using specific classes, do a string_agg(index_vector::text, ' ')::tsvector, but also apply an all-||, no-! tsquery to the individual combined index_vectors for speed. Use the resulting list of IDs to load up individual field rows for ranking (with weights as appropriate).
Always use the individual field_entry rows like we currently do, but check whether "simple" is indexed on the field(s). If it is, use *only* the "simple" ts_config as an override. Otherwise, build atoms as above.
If the phrase does not have * as its first or last character, word-boundary both the start and the end. Otherwise, skip the word boundary at whichever end has the *.
Treat like a non-negated phrase search, but prepend "NOT". This allows us to exclude records that contain the phrase in any index, and lets us use the index_vectors to quickly separate records that might contain the phrase from those that definitely don't.
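A sketch of the boundary rule, written with Python's re module so it can be run directly. Python spells a word boundary as \b; a PostgreSQL regex would use its own boundary syntax instead. The function name is made up for the illustration.

```python
import re


def phrase_pattern(phrase):
    """Build a regex for a phrase, honoring a leading or trailing *.

    An end without * gets a word boundary; an end with * is left open.
    """
    open_start = phrase.startswith("*")
    open_end = phrase.endswith("*")
    core = phrase.strip("*")
    prefix = "" if open_start else r"\b"
    suffix = "" if open_end else r"\b"
    return prefix + re.escape(core) + suffix


print(bool(re.search(phrase_pattern("the assist"), "the assistant")))   # bounded at the end
print(bool(re.search(phrase_pattern("the assist*"), "the assistant")))  # open at the end
```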
Using the ?_weight values on config.metabib_class build a float array for the first input to the ranking functions.
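A sketch of building that array. PostgreSQL's ranking functions take the weights in D, C, B, A order, which is easy to get backwards. The a_weight through d_weight keys are an assumed expansion of "?_weight", and the sample values are only for illustration.

```python
def rank_weights(metabib_class):
    """Return the weights in the order the ranking functions expect: D, C, B, A."""
    return [
        metabib_class["d_weight"],
        metabib_class["c_weight"],
        metabib_class["b_weight"],
        metabib_class["a_weight"],
    ]


def as_pg_array(weights):
    # Render as a PostgreSQL array literal, e.g. for casting to float4[].
    return "{" + ",".join(str(w) for w in weights) + "}"


keyword_class = {"a_weight": 1.0, "b_weight": 0.4, "c_weight": 0.2, "d_weight": 0.1}
print(as_pg_array(rank_weights(keyword_class)))
```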
All non-negated atoms would be joined with || for ranking, instead of using &&. This allows things that match across multiple individual field_entry rows to get non-zero values when running ts_rank_cd checks.
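The split between the matching query and the ranking query could look like the sketch below. It uses the "lazy" tsquery text form rather than passing each atom through to_tsquery, and it assumes a leading - marks a negated atom; the function name is hypothetical.

```python
def split_queries(atoms):
    """Return (match_query, rank_query) in lazy tsquery text form.

    Matching ANDs every atom, negating those prefixed with "-".
    Ranking ORs only the non-negated atoms.
    """
    match_parts, rank_parts = [], []
    for atom in atoms:
        if atom.startswith("-"):
            match_parts.append("!" + atom[1:])
        else:
            match_parts.append(atom)
            rank_parts.append(atom)
    return " & ".join(match_parts), " | ".join(rank_parts)


match_query, rank_query = split_queries(["martin", "luther", "-king"])
print(match_query)
print(rank_query)
```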
I am UNSURE whether or not individual rows that *don't* contain the atoms should be included for a zero ranking. They are not currently. If not then do the OR search against each row. If so then just look up by record.
Move to an all-in-one plperlu function (maybe eventually a C func?) that does the various rel bumps based on array inputs. These may be less needed with the new method of doing things.
Some fairly simplified examples without full query output for explaining some of the above.
NOTE: The examples below assume the current behavior of the - modifier.
NOTE 2: The tsquery examples below are lazy examples and would actually be built with each atom passed through to_tsquery, which would stem them as appropriate.
search metabib.keyword_field_entry for index_vector with tsquery "martin & luther & !king", rank with the same tsquery.
Test Issue: If there are multiple keyword indexes with different weights and even one contains martin and luther but not king then the record will be returned.
search metabib.combined_keyword_field_entry for index_vector with tsquery "martin & luther & !king" and metabib_field set to NULL. Join to metabib.keyword_field_entry based on the record ID for ranking using tsquery "martin | luther".
Issues solved: The combined table has all of the atoms within it, so matching on it is a one-shot. We then go back to the non-combined for ranking, which I think should be faster in the long run as we only load the records we previously identified.
search metabib.title_field_entry for index_vector with tsquery "the & assist", rank with the same tsquery.
Test Issue: 'assistant' stems to 'assist', and tends to come up more often.
search metabib.combined_title_field_entry for index_vector with tsquery "the & assist" and metabib_field set to NULL. Join to metabib.title_field_entry based on the record ID for ranking using tsquery "the | assist".
Issues solved: Note that the new stock config will likely have 'A' weight atoms for 'assist' where that was present exactly, but only 'C' or 'D' weight atoms for ones where 'assistant' was present, so 'assist' should weight higher by default.
search metabib.title_field_entry for index_vector with tsquery "the & assist" and a where regex clause looking for "the assist", rank with the tsquery.
Test Issue: "the assistant" contains "the assist"
search metabib.title_field_entry for index_vector with tsquery "the & assist" and a where regex clause looking for "[[:<:]]the assist[[:>:]]" (or equiv), rank with tsquery "the | assist".
Issues solved: The word boundaries will ensure that "the assist" is not followed by "ant".
search metabib.title_field_entry for index_vector with tsquery "the & assist" and field set to the ids of the general or eng title fields, rank with the tsquery.
Test Issue: Generally the same as the standard title search, but limited to those two indexes.
search metabib.combined_title_field_entry for a combined index_vector (string_agg(index_vector::text, ' ')::tsvector) with tsquery "the & assist" and metabib_field set to the ids of the general or eng title fields. Join to metabib.title_field_entry based on the record and field IDs for ranking using tsquery "the | assist".
Issues solved: Generally the same as the standard title search, but limited to those two indexes.