scratchpad:brush_up_search

Differences

This shows you the differences between two versions of the page.

scratchpad:brush_up_search [2016/05/13 14:21] – [Apostrophe Searching] klussier
scratchpad:brush_up_search [2022/02/10 13:34] (current) – external edit 127.0.0.1
Line 3:
 =====Apostrophe Searching=====
 
-**The problem:** Evergreen uses a modified NACO normalization scheme to better handle apostrophe searching for French records. The problem is that this normalization scheme doesn't work as well for English records as pure NACO normalization does. It also sometimes relies on stemming to find the correct search term, which causes problems for sites that disabled stemming
+**The problem:** Evergreen uses a modified NACO normalization scheme to better handle apostrophe searching for French records. The problem is that this normalization scheme doesn't work as well for English records as pure NACO normalization does. It also sometimes relies on stemming to find the correct search term, which causes problems for sites that disable stemming
 
-**Who should consider this tweak:** Evergreen sites with a database the mostly contain English-language records and that are okay with search behavior where a search for a French word like 'histoire' will not retrieve records with l'histoire. There may be other multi-lingual implications for making this tweak. Sites with mostly English records that have also disabled stemming should strongly consider this tweak.
+**Who should consider this adjustment:** Evergreen sites with a database that mostly contains English-language records and that are okay with search behavior where a search for a French word like 'histoire' will not retrieve records with l'histoire. There may be other multilingual implications for making this tweak. Sites with mostly English records that have also disabled stemming should strongly consider this tweak.
 
 **How To:**
Line 14:
 UPDATE config.metabib_field_index_norm_map a
 SET norm = 1
-FROM (SELECT norm FROM config.metabib_field_index_norm_map) AS subquery
+FROM (SELECT id,norm FROM config.metabib_field_index_norm_map) AS subquery
 WHERE subquery.norm = 17 and a.id = subquery.id;
 </code>
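Before running the update, it can help to see which index fields currently use each normalizer. The query below is a minimal sketch: the normalizer ids 1 and 17 come from the UPDATE statement above, while the columns `field` and `pos` on config.metabib_field_index_norm_map and `name` on config.index_normalizer are assumptions based on the stock Evergreen schema and are not shown on this page.

```sql
-- Read-only preview: list the mappings that use normalizer 1 or 17.
-- Column names field, pos and name are assumed from the stock Evergreen schema.
SELECT m.id,
       m.field,
       m.norm,
       n.name AS normalizer_name
FROM config.metabib_field_index_norm_map m
JOIN config.index_normalizer n ON n.id = m.norm
WHERE m.norm IN (1, 17)
ORDER BY m.field, m.pos;
```

Running the same query again after the UPDATE should show no remaining rows with norm = 17.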
Line 20:
 A [[scratchpad:random_magic_spells#reingesting_bib_and_authority_records_20_and_later|full reingest]] is required after making this adjustment.
  
 +=====Synonym Dictionary=====
 +
 +**The problem:** Although stemming can allow users to find records with some variations of their search terms, it will only find variations that share the same stems. There are other common word variations (e.g. color/colour, 19th/nineteenth) that libraries may want to make interchangeable in a synonym dictionary.
 +
 +**Who should consider this adjustment:** Libraries that want to improve the recall of retrieved search results. In particular, libraries that have disabled stemming for some search classes may want to use a synonym dictionary as an alternate method of retrieving records with word variations. However, since many word variations do not share a stem with the entered search term, a synonym dictionary can also be useful for sites that do use stemming in their searches.
 +
 +**About Postgres dictionaries:** 
 +
 +See [[http://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html]]
 +
 +Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme. Aside from improving search quality, normalization and removal of stop words reduce the size of the tsvector representation of a document, thereby improving performance. 
 +
 +PostgreSQL provides predefined dictionaries for many languages. There are also several predefined templates that can be used to create new dictionaries with custom parameters.
 +
 +The synonym dictionary template is used to create dictionaries that replace a word with a synonym. Phrases are not supported.
 +
 +Note: after creating a new dictionary or adding to an existing dictionary, [[scratchpad:random_magic_spells#reingesting_bib_and_authority_records_20_and_later|a reingest of all (or just the affected) records]] will be required.
 +
 +
 +**How to:**
 +
 +Below are the steps used at the North of Boston Library Exchange (NOBLE) when creating a synonym dictionary. You can replace the use of the word 'noble' in various places with a name representing your own institution. 
 +
 +**1. Create our own synonym dictionary on disk**
 +
 +Copy the sample dictionary or create a new file:
 +
 +  cd /usr/share/postgresql/9.x/tsearch_data
 +  sudo cp synonym_sample.syn synonym_noble.syn
 +
 +The NOBLE test file looks like this:
 +
 +<code>
 +color colour
 +colour color
 +19th nineteenth
 +nineteenth 19th
 +20th twentieth
 +twentieth 20th
 +indices index*
 +</code>
 +
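 +An asterisk (*) can be placed at the end of a synonym in the configuration file. This indicates that the synonym is a prefix: the asterisk is ignored when the entry is used for indexing (to_tsvector), but in a query (to_tsquery) the synonym becomes a prefix-match term.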
 +
 +
 +**2. Create a dictionary in the public schema**
 +
 +<code sql>
 +psql -U evergreen -h localhost
 +CREATE TEXT SEARCH DICTIONARY public.synonym_noble (template=pg_catalog.synonym, synonyms='synonym_noble');
 +</code>
 +
 +This command creates a dictionary based on the template ‘pg_catalog.synonym’.  The dictionary will be created in the public schema of the Evergreen database. The parameter ‘synonyms’ names the file in the tsearch_data directory used in step 1 (for example /usr/share/postgresql/9.1/tsearch_data), given without the ‘.syn’ file suffix.  This is the file created in step 1 above.
 +
 +A synonym dictionary replaces one word with another word.  Phrases are not supported.  If the original word is also desired, put both forms in the dictionary, for example:
 +
 +<code>
 +color colour
 +colour color
 +</code>
 +
 +To see all dictionaries in the Evergreen database:
 +
 +<code sql>
 +select * from pg_ts_dict;
 +</code>
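The raw pg_ts_dict listing shows numeric ids for the schema and template. As a hedged sketch, the following join against the standard PostgreSQL catalogs resolves those ids to names, so it is easier to confirm that synonym_noble lives in the public schema and uses the synonym template. Only PostgreSQL system catalogs are referenced; nothing Evergreen-specific is assumed.

```sql
-- List text search dictionaries with readable schema and template names.
SELECT n.nspname        AS schema_name,
       d.dictname       AS dictionary,
       t.tmplname       AS template,
       d.dictinitoption AS options
FROM pg_ts_dict d
JOIN pg_namespace n   ON n.oid = d.dictnamespace
JOIN pg_ts_template t ON t.oid = d.dicttemplate
ORDER BY n.nspname, d.dictname;
```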
 +
 +There are more dictionaries in the pg_catalog schema which you can see from the psql shell invoked as the evergreen user.  
 +
 +Note: the postgres user’s password is disabled and needs to remain that way.  
 +
 +<code sql> 
 +% psql -U evergreen -h localhost
 +evergreen=# \dFd (show dictionaries)
 +evergreen=# \dF  (show configurations)
 +evergreen=# \q
 +</code>
 +
 +Test the dictionary by passing a term to it with ts_lexize:
 +<code sql>
 +select ts_lexize('synonym_noble', 'colour');
 + ts_lexize
 +-----------
 + {color}
 +</code>
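Two further checks can be useful here, sketched under the assumption that synonym_noble.syn contains the sample entries from step 1 (in particular the line `indices index*`) and has no entry for the word 'library'.

```sql
-- Prefix entry: the PostgreSQL documentation shows {index} for an
-- equivalent "indices index*" entry; the asterisk is not part of the lexeme.
SELECT ts_lexize('public.synonym_noble', 'indices');

-- A word with no entry is not recognized by a synonym dictionary,
-- so ts_lexize returns NULL for it.
SELECT ts_lexize('public.synonym_noble', 'library') IS NULL AS not_in_dictionary;
```

A database session normally reads a dictionary file only once, when the dictionary is first used. If the .syn file is edited later, open a new session before testing, or issue an ALTER TEXT SEARCH DICTIONARY command on the dictionary so that existing sessions reload it.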
 +
 +**3. Create a Configuration in the public schema**
 +
 +A text search configuration binds a parser together with a set of dictionaries to process the parser's output tokens. For each token type that the parser can return (e.g. asciiword, numword), a separate list of dictionaries is specified by the configuration.  
 +
 +Create the new configuration using the default configuration as a template:
 +
 +<code sql>
 +psql -U evergreen -h localhost
 +CREATE TEXT SEARCH CONFIGURATION public.synonym_noble (copy=default);
 +</code>
 +
 +The copy parameter (copy=default) names the existing configuration that is copied to create this new configuration.
 +
 +To see all configurations in the Evergreen database:
 +
 +<code sql>
 +select * from pg_ts_config;
 +</code>
 +
 +After copying the default configuration, alter the new configuration to point to our new dictionary.  We are pointing three token types to our new dictionary: asciiword, asciihword, and hword_asciipart.  These were the three token types that were pointing to the public.english_nostop dictionary. 
 +
 +<code sql>
 +ALTER TEXT SEARCH CONFIGURATION public.synonym_noble 
 +  ALTER MAPPING FOR asciiword 
 +  WITH synonym_noble;
 +
 +ALTER TEXT SEARCH CONFIGURATION public.synonym_noble 
 +  ALTER MAPPING FOR asciihword 
 +  WITH synonym_noble;
 +
 +ALTER TEXT SEARCH CONFIGURATION public.synonym_noble 
 +  ALTER MAPPING FOR hword_asciipart 
 +  WITH synonym_noble;
 +
 +</code>
 +
 +You can change all the mappings with one command:
 +
 +<code sql> 
 +ALTER TEXT SEARCH CONFIGURATION public.synonym_noble 
 +  ALTER MAPPING FOR asciiword, asciihword, hword_asciipart 
 +  WITH synonym_noble;
 +</code>
 +
 +To see the new configuration in the Evergreen database go into the psql shell:
 +
 +<code sql>
 +% psql -U evergreen -h localhost
 +evergreen=# \dF+ synonym_noble
 +
 +
 +Text search configuration "public.synonym_noble"
 +Parser: "pg_catalog.default"
 +      Token      | Dictionaries
 +-----------------+---------------
 + asciihword      | synonym_noble
 + asciiword       | synonym_noble
 + email           | simple
 + file            | simple
 + float           | simple
 + host            | simple
 + hword           | simple
 + hword_asciipart | synonym_noble
 + hword_numpart   | simple
 + hword_part      | simple
 + int             | simple
 + numhword        | simple
 + numword         | simple
 + sfloat          | simple
 + uint            | simple
 + url             | simple
 + url_path        | simple
 + version         | simple
 + word            | simple
 +
 +</code>
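Once the mappings are in place, the configuration can be exercised end to end. The statements below are a sketch that assumes the sample synonym file from step 1; the sample phrase passed to ts_debug is made up for illustration.

```sql
-- Indexing side: 'colour' and 'indices' should be replaced by their synonyms.
SELECT to_tsvector('public.synonym_noble', 'colour indices');

-- Query side: per the PostgreSQL documentation, a prefix entry such as
-- "indices index*" yields a prefix-match query term ('index':*).
SELECT to_tsquery('public.synonym_noble', 'indices');

-- Show the token type assigned to each word and the dictionary consulted.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('public.synonym_noble', 'colour printing in the 19th century');
```

The ts_debug output is worth reading closely, because only asciiword, asciihword, and hword_asciipart are routed to synonym_noble. A term that mixes digits and letters, such as 19th, may be classified by the parser as numword, in which case it would go to the simple dictionary rather than the synonym dictionary.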
 +
 +
 +**4. Add the configuration to config.ts_config_list**
 +
 +This table in Evergreen’s config schema lists each full-text configuration that will be referenced in config.metabib_class_ts_map.  The row added below has the id ‘synonym_noble’ and a short descriptive name for the configuration.
 +
 +<code sql>
 +
 +INSERT INTO config.ts_config_list (id, name) VALUES ('synonym_noble', 'NOBLE Synonym List');
 +
 +</code>
 +
 +Verify that the configuration was added to the list:
 +
 +<code sql>
 +select * from config.ts_config_list;
 +
 +       id          |    name
 +-------------------+--------------------
 + simple            | Non-Stemmed Simple
 + danish_nostop     | Danish Stemmed
 + dutch_nostop      | Dutch Stemmed
 + english_nostop    | English Stemmed
 + finnish_nostop    | Finnish Stemmed
 + french_nostop     | French Stemmed
 + german_nostop     | German Stemmed
 + hungarian_nostop  | Hungarian Stemmed
 + italian_nostop    | Italian Stemmed
 + norwegian_nostop  | Norwegian Stemmed
 + portuguese_nostop | Portuguese Stemmed
 + romanian_nostop   | Romanian Stemmed
 + russian_nostop    | Russian Stemmed
 + spanish_nostop    | Spanish Stemmed
 + swedish_nostop    | Swedish Stemmed
 + turkish_nostop    | Turkish Stemmed
 + synonym_noble     | NOBLE Synonym List
 +</code>
 +
 +**5. Create a mapping in config.metabib_class_ts_map**
 +
 +This mapping table relates each metabib class (keyword, title etc.) with the configuration that sends the indexing through the specified dictionary.  We are mapping keyword, title, and subject to our new synonym dictionary. 
 +
 +<code sql>
 +INSERT into config.metabib_class_ts_map (field_class, ts_config, index_weight) values 
 +('keyword', 'synonym_noble', 'C'),
 +('title', 'synonym_noble', 'C'),
 +('subject', 'synonym_noble', 'C');
 +</code>
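As with the previous step, the new rows can be checked before starting the reingest. This sketch uses only the table and columns named in the INSERT above.

```sql
-- Confirm that keyword, title and subject are mapped to the new configuration.
SELECT field_class, ts_config, index_weight
FROM config.metabib_class_ts_map
WHERE ts_config = 'synonym_noble'
ORDER BY field_class;
```

Three rows are expected, one for each of the classes inserted above.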
 +
 +**6. [[scratchpad:random_magic_spells#reingesting_bib_and_authority_records_20_and_later|Perform a full reingest]] on the records in your database**
 +
 +See sample [[scratchpad:sample_synonym_dictionary|Synonym Dictionary from NOBLE]]
scratchpad/brush_up_search.1463163712.txt.gz · Last modified: 2022/02/10 13:33 (external edit)

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International

© 2008-2022 GPLS and others. Evergreen is open source software, freely licensed under GNU GPLv2 or later.
The Evergreen Project is a U.S. 501(c)3 non-profit organization.