Ways to Brush Up Your Search

Apostrophe Searching

The problem: Evergreen uses a modified NACO normalization scheme to better handle apostrophe searching for French records. This modified scheme doesn't work as well for English records as pure NACO normalization does. It also sometimes relies on stemming to find the correct search term, which causes problems for sites that disable stemming.

Who should consider this adjustment: Evergreen sites whose databases contain mostly English-language records and that can accept search behavior where a search for a French word like 'histoire' will not retrieve records with l'histoire. There may be other multi-lingual implications for making this tweak. Sites with mostly English records that have also disabled stemming should strongly consider this tweak.

How To:

Remap to NACO Normalize every index that is currently mapped to the default Search Normalize function.

UPDATE config.metabib_field_index_norm_map
SET norm = 1      -- NACO Normalize
WHERE norm = 17;  -- Search Normalize

A full reingest is required after making this adjustment.
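
Because the update depends on the normalizer ids 1 and 17, it may be worth confirming what those ids refer to in your own database before changing anything. The following is a minimal sketch, assuming the stock config.index_normalizer table (with id and name columns) that the norm column points to:

```sql
-- Confirm which normalizers ids 1 and 17 refer to in this database.
SELECT id, name
FROM config.index_normalizer
WHERE id IN (1, 17);

-- Count the index mappings that the update would change.
SELECT count(*)
FROM config.metabib_field_index_norm_map
WHERE norm = 17;
```

If the names returned are not NACO Normalize and Search Normalize, adjust the ids in the update accordingly.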

Synonym Dictionary

The problem: Although stemming can allow users to find records with some variations of their search terms, it will only find variations that share the same stems. There are other common word variations (e.g. color/colour, 19th/nineteenth) that libraries may want to make interchangeable in a synonym dictionary.

Who should consider this adjustment: Libraries that want to improve the recall of retrieved search results. In particular, libraries that have disabled stemming for some search classes may want to use a synonym dictionary as an alternate method of retrieving records with word variations. However, since many word variations will not match the stem of the entered search term, a synonym dictionary can also be useful for sites that use stemming in their searches.

About Postgres dictionaries:

See http://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html

Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme. Aside from improving search quality, normalization and removal of stop words reduce the size of the tsvector representation of a document, thereby improving performance.

PostgreSQL provides predefined dictionaries for many languages. There are also several predefined templates that can be used to create new dictionaries with custom parameters.

The synonym dictionary template is used to create dictionaries that replace a word with a synonym. Phrases are not supported.

Note: after creating a new dictionary or adding to an existing dictionary, a reingest of all (or just the affected) records will be required.

How to:

Below are the steps used at the North of Boston Library Exchange (NOBLE) when creating a synonym dictionary. You can replace the use of the word 'noble' in various places with a name representing your own institution.

1. Create our own synonym dictionary on disk

Copy the sample dictionary or create a new file:

cd /usr/share/postgresql/9.x/tsearch_data
sudo cp synonym_sample.syn synonym_noble.syn

The NOBLE test file looks like this:

color		colour
colour		color
19th		nineteenth
nineteenth	19th
20th		twentieth
twentieth	20th
indices		index*

An asterisk (*) can be placed at the end of a synonym in the configuration file. This indicates that the synonym is a prefix.

2. Create a dictionary in the public schema

psql -U evergreen -h localhost
CREATE TEXT SEARCH DICTIONARY public.synonym_noble (template=pg_catalog.synonym, synonyms='synonym_noble');

This command creates a dictionary based on the template ‘pg_catalog.synonym’. The dictionary will be created in the public schema of the Evergreen database. The parameter ‘synonyms’ names the file in /usr/share/postgresql/9.x/tsearch_data, without the ‘.syn’ file suffix. This is the file created in step 1 above.

A synonym dictionary replaces one word with another word. Phrases are not supported. If the original word is also desired, put both forms in the dictionary, for example:

color	colour
colour	color
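
If the synonym file is edited after the dictionary has been created, database sessions that have already loaded the dictionary may keep using the old contents. PostgreSQL documents a "dummy" ALTER TEXT SEARCH DICTIONARY as a way to force a reload; a sketch using the dictionary and file names from this page:

```sql
-- Re-state the existing parameter so that sessions re-read
-- synonym_noble.syn from the tsearch_data directory.
ALTER TEXT SEARCH DICTIONARY public.synonym_noble (synonyms = 'synonym_noble');
```

This only refreshes the dictionary itself. Records that were indexed with the old file contents still need to be reingested.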

To see all dictionaries in the Evergreen database:

SELECT * FROM pg_ts_dict;
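
The raw pg_ts_dict rows show schemas and templates only as numeric OIDs. As a more readable alternative, the catalog can be joined to pg_namespace and pg_ts_template. This is only a convenience sketch over the standard PostgreSQL system catalogs:

```sql
-- List each dictionary with its schema, template, and init options.
SELECT n.nspname        AS schema,
       d.dictname       AS dictionary,
       t.tmplname       AS template,
       d.dictinitoption AS options
FROM pg_ts_dict d
JOIN pg_namespace n   ON n.oid = d.dictnamespace
JOIN pg_ts_template t ON t.oid = d.dicttemplate
ORDER BY n.nspname, d.dictname;
```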

There are more dictionaries in the pg_catalog schema which you can see from the psql shell invoked as the evergreen user.

Note: the postgres user’s password is disabled and needs to remain that way.

% psql -U evergreen -h localhost
evergreen=# \dFd   (list dictionaries)
evergreen=# \dF    (list configurations)
evergreen=# \q

Test a dictionary by passing a term to a dictionary:

SELECT ts_lexize('synonym_noble', 'colour');
 ts_lexize
-----------
 {color}
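
The same check can be repeated for other entries in the file. The sketch below assumes the NOBLE test file from step 1 is the one loaded. The expected results in the comments follow from that file and from the documented behavior of the synonym template:

```sql
-- A plain synonym entry: 'nineteenth' is replaced by '19th'.
SELECT ts_lexize('synonym_noble', 'nineteenth');   -- {19th}

-- A prefix entry (index*): ts_lexize shows the replacement
-- without the asterisk.
SELECT ts_lexize('synonym_noble', 'indices');      -- {index}

-- A word with no entry in the file is not recognized by
-- this dictionary, so the result is NULL.
SELECT ts_lexize('synonym_noble', 'library');      -- NULL
```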

3. Create a Configuration in the public schema

A text search configuration binds a parser together with a set of dictionaries to process the parser's output tokens. For each token type that the parser can return (for example, asciiword or numword), a separate list of dictionaries is specified by the configuration.

Create the new configuration using the default configuration as a template:

psql -U evergreen -h localhost
CREATE TEXT SEARCH CONFIGURATION public.synonym_noble (copy=DEFAULT);

The copy parameter (copy=default) names the existing configuration that is copied to create this new configuration.

To see all configurations in the Evergreen database:

SELECT * FROM pg_ts_config;

After copying the default configuration, alter the new configuration to point to our new dictionary. We are pointing three token types to our new dictionary: asciiword, asciihword, and hword_asciipart. These were the three token types that were pointing to the public.english_nostop dictionary.

ALTER TEXT SEARCH CONFIGURATION public.synonym_noble 
  ALTER MAPPING FOR asciiword 
  WITH synonym_noble;
 
ALTER TEXT SEARCH CONFIGURATION public.synonym_noble 
  ALTER MAPPING FOR asciihword 
  WITH synonym_noble;
 
ALTER TEXT SEARCH CONFIGURATION public.synonym_noble 
  ALTER MAPPING FOR hword_asciipart 
  WITH synonym_noble;

You can change all the mappings with one command:

ALTER TEXT SEARCH CONFIGURATION public.synonym_noble 
  ALTER MAPPING FOR asciiword, asciihword, hword_asciipart 
  WITH synonym_noble;

To see the new configuration in the Evergreen database go into the psql shell:

% psql -U evergreen -h localhost
evergreen=# \dF+ synonym_noble
 
 
Text search configuration "public.synonym_noble"
Parser: "pg_catalog.default"
      Token      | Dictionaries
-----------------+---------------
 asciihword      | synonym_noble
 asciiword       | synonym_noble
 email           | simple
 file            | simple
 float           | simple
 host            | simple
 hword           | simple
 hword_asciipart | synonym_noble
 hword_numpart   | simple
 hword_part      | simple
 int             | simple
 numhword        | simple
 numword         | simple
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | simple
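
Once the mappings are in place, the configuration can be exercised end to end. The sketch below again assumes the NOBLE test file from step 1. It shows how a synonym is applied when text is indexed, and how the prefix entry (index*) takes effect when a query is parsed:

```sql
-- Indexing side: 'colour' is stored as its synonym.
SELECT to_tsvector('public.synonym_noble', 'colour');   -- 'color':1

-- Query side: the prefix entry turns 'indices' into a prefix match.
SELECT to_tsquery('public.synonym_noble', 'indices');   -- 'index':*
```

The asterisk is only meaningful in to_tsquery output. When indexing with to_tsvector it is ignored, and 'indices' is stored simply as 'index'.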

4. Add the configuration to config.ts_config_list

This table in Evergreen’s config schema lists each full-text configuration that will be referenced in config.metabib_class_ts_map. The first value is the configuration name (‘synonym_noble’) and the second is a description of the configuration.

INSERT INTO config.ts_config_list (id, name) VALUES ('synonym_noble', 'NOBLE Synonym List');

Verify that the configuration was added to the list:

SELECT * FROM config.ts_config_list;
 
       id          |    name
-------------------+--------------------
 simple            | Non-Stemmed Simple
 danish_nostop     | Danish Stemmed
 dutch_nostop      | Dutch Stemmed
 english_nostop    | English Stemmed
 finnish_nostop    | Finnish Stemmed
 french_nostop     | French Stemmed
 german_nostop     | German Stemmed
 hungarian_nostop  | Hungarian Stemmed
 italian_nostop    | Italian Stemmed
 norwegian_nostop  | Norwegian Stemmed
 portuguese_nostop | Portuguese Stemmed
 romanian_nostop   | Romanian Stemmed
 russian_nostop    | Russian Stemmed
 spanish_nostop    | Spanish Stemmed
 swedish_nostop    | Swedish Stemmed
 turkish_nostop    | Turkish Stemmed
 synonym_noble     | NOBLE Synonym List

5. Create a mapping in config.metabib_class_ts_map

This mapping table relates each metabib class (keyword, title etc.) with the configuration that sends the indexing through the specified dictionary. We are mapping keyword, title, and subject to our new synonym dictionary.

INSERT INTO config.metabib_class_ts_map (field_class, ts_config, index_weight) VALUES 
('keyword', 'synonym_noble', 'C'),
('title', 'synonym_noble', 'C'),
('subject', 'synonym_noble', 'C');
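
As with step 4, the new rows can be checked after the insert. A minimal sketch using only the columns named in the insert above:

```sql
-- Expect one row each for keyword, title, and subject.
SELECT field_class, ts_config, index_weight
FROM config.metabib_class_ts_map
WHERE ts_config = 'synonym_noble'
ORDER BY field_class;
```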

6. Perform a full reingest on the records in your database
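
This page does not give a reingest command, and the right approach depends on your Evergreen release and database size. One possible sketch is shown below. It assumes the metabib.reingest_metabib_field_entries function is available and can be called with a single bib id, and that biblio.record_entry has id and deleted columns. Check the function signature in your release before running it.

```sql
-- Assumption: calling the function with only a bib id reingests all
-- field entries for that record; verify against your release.
SELECT metabib.reingest_metabib_field_entries(id)
FROM biblio.record_entry
WHERE NOT deleted
  AND id > 0;
```

On a large database this can run for a long time, so it may be preferable to process records in batches by id range.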

See sample Synonym Dictionary from NOBLE

scratchpad/brush_up_search.txt · Last modified: 2022/02/10 13:34 by 127.0.0.1

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International
