documentation:indexing

Differences

This shows you the differences between two versions of the page.

documentation:indexing [2016/07/27 09:07] – [Indexed Field Definitions] rjs7
documentation:indexing [2022/02/10 13:34] (current) – external edit 127.0.0.1
Line 1: Line 1:
 +FIXME - Add Information on Virtual Index Definitions in 3.1+
 +
 ======= Bibliographic Indexing in Evergreen ========
 Indexing and searching bibliographic data in Evergreen are complex processes.  Also, Evergreen is extremely configurable in these areas and can be tuned to the specific needs of each installation.  Because of this, there are many moving parts which are tightly integrated and interdependent.
Line 111: Line 113:
 Indexed data from different fields will probably be considered to have different importance when calculating the relevance of a matched query term.  For instance, a match in a translated title may be considered less important than a match in the title proper.  The **weight** allows control over this.
  
-By supplying a higher or lower relative **weight**, one field can be made more or less important, in relevance ranking terms, than others.  This value is used as a multiplier to the baseline cover-density ranking (discussed below), and so setting this value to 0 will allow matches, but always rank them at the bottom of the list.  Likewise, supplying a very large **weight** multiplier will cause matches to appear at the top of the list.  Floating point values are allowed, and by supplying values between 0 and 1 can be used apply fine-grained, percentage-based adjustments.
+By supplying a higher or lower relative **weight**, one field can be made more or less important, in relevance ranking terms, than others.  This value is used as a multiplier to the baseline cover-density ranking (discussed below), and so setting this value to 0 will allow matches, but always rank them at the bottom of the list.  Likewise, supplying a very large **weight** multiplier will cause matches to appear at the top of the list.  Floating point values are allowed, and values between 0 and 1 can be used to apply fine-grained, percentage-based adjustments.
  
 Evergreen ships with all Indexed Field Definition weights set to 1 by default.
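
As a rough illustration of the multiplier behaviour described in this hunk, the following Python sketch scales a baseline rank by a per-field weight and sorts the matches. The record names, field names, baseline scores and the ''weighted_rank'' helper are all assumptions made up for the example; Evergreen performs this calculation inside the database, and this is not its implementation.

```python
# Hypothetical sketch: a field weight acts as a multiplier on a baseline rank.
# Records, field names and baseline scores are invented for illustration.

def weighted_rank(base_rank: float, weight: float = 1.0) -> float:
    """Scale a baseline relevance score by the weight of the matched field."""
    return base_rank * weight

matches = [
    # (record, matched field, baseline rank, field weight)
    ("rec A", "title proper", 0.5, 1.0),      # default weight
    ("rec B", "translated title", 0.5, 0.5),  # counts half as much
    ("rec C", "notes", 0.5, 0.0),             # still matches, but ranks last
]

# Highest weighted rank first.
ranked = sorted(matches, key=lambda m: weighted_rank(m[2], m[3]), reverse=True)
ranked_ids = [m[0] for m in ranked]
```

With identical baseline scores, only the weights decide the order: the weight-0 match is still returned but sorts to the bottom.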
Line 163: Line 165:
 First, these aliases provide a mechanism for internationalizing the user-supplied search constraints; for instance, "author" can be aliased to "skrywer" to support searching by native speakers of Afrikaans without having to know the English term "author".
  
-In a similar manner, aliases can be used to map [[http://www.loc.gov/standards/sru/specs/cql.html|CQL]] context set match points, which have standard names external to any specific search backend, to appropriate match points in any given Evergreen installation.  For instance, the [[http://www.loc.gov/standards/sru/cql-bibliographic-searching.html|proposed CQL 'bib' context set]] defines, among others, title indexes called **dc.title**, **bib.titleUniform** and **bib.titleSeries**.  In a stock Evergreen installation, **dc.title** would be aliased to the entire **title** search class, **bib.titleUniform** to the **uniform** field within the **title** search class, and **bib.titleSeries** to the **seriestitle** field within the **series** search class.
+In a similar manner, aliases can be used to map [[http://www.loc.gov/standards/sru/specs/cql.html|CQL]] context set match points, which have standard names external to any specific search backend, to appropriate match points in any given Evergreen installation.  For instance, the [[http://www.loc.gov/standards/sru/cql/contextSets/bib-context-set.html|proposed CQL 'bib' context set]] defines, among others, title indexes called **dc.title**, **bib.titleUniform** and **bib.titleSeries**.  In a stock Evergreen installation, **dc.title** would be aliased to the entire **title** search class, **bib.titleUniform** to the **uniform** field within the **title** search class, and **bib.titleSeries** to the **seriestitle** field within the **series** search class.
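
The alias mappings named in this hunk can be pictured as a simple lookup from an alias to a search class and, optionally, a field within it. The sketch below is only a mental model: the ''ALIASES'' table, the ''resolve'' helper and the pass-through behaviour for unknown names are assumptions, not Evergreen's actual storage or lookup code.

```python
from typing import Dict, Optional, Tuple

# Hypothetical alias table: alias -> (search class, field), where a field of
# None means the alias covers the entire search class.
ALIASES: Dict[str, Tuple[str, Optional[str]]] = {
    "skrywer": ("author", None),
    "dc.title": ("title", None),
    "bib.titleUniform": ("title", "uniform"),
    "bib.titleSeries": ("series", "seriestitle"),
}

def resolve(name: str) -> Tuple[str, Optional[str]]:
    """Return the (class, field) an alias points at.

    Names with no alias entry are assumed to already be search class names
    and are passed through unchanged.
    """
    return ALIASES.get(name, (name, None))
```

For example, ''resolve("bib.titleSeries")'' yields the **seriestitle** field of the **series** class, while ''resolve("skrywer")'' yields the whole **author** class.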
  
  
Line 195: Line 197:
 Normalizer functions are in-database stored procedures, and can be written in any programming language supported by Postgres.  The stock normalizers are all written in either PL/PerlU, PL/pgSQL or SQL.
  
-Twelve normalizer functions are registered in the stock Evergreen installation.  They are
+Twenty-one normalizer functions are registered in the stock Evergreen installation.  They are
  
 ^ Name ^ Description ^
+|Approximate High Date Normalize|Normalize the value to the nearest date-ish value, rounding up|
+|Approximate Low Date Normalize|Normalize the value to the nearest date-ish value, rounding down|
+|Coded Value Map Normalizer|Applies coded_value_map mapping of values|
 |Down-case|Convert text to lower case.|
 |Extract Dewey-like number|Extract a string of numeric characters that resembles a DDC number.|
 |First word|Include only the first space-separated word of a string.|
+|Generic Mapping Normalizer|Map values or sets of values to new values.|
 |ISBN 10/13 conversion|Translate ISBN10 to ISBN13, and vice versa, for indexing purposes.|
 |Left truncation|Discard the specified number of characters from the left side of the string.|
Line 206: Line 212:
 |NACO Normalize -- retain first comma|Apply NACO normalization rules to the extracted text, retaining the first comma.  See https://www.loc.gov/aba/pcc/naco/normrule-2.html for details.|
 |Normalize date range|Split date ranges in the form of "XXXX-YYYY" into "XXXX YYYY" for proper indexing.|
-|Replace|Replace all occurances of first parameter in the string with the second parameter.|
+|Normalize date range|Normalize the value to NULL if it is not a number|
+|Replace|Replace all occurrences of first parameter in the string with the second parameter.|
+|Remove Parenthesized Substring|Remove any parenthesized substrings from the extracted text, such as the agency code preceding authority record control numbers in subfield 0.|
 |Right truncation|Include only the specified number of characters from the left side of the string.|
+|Search Normalize|Apply search normalization rules to the extracted text. A less extreme version of NACO normalization.|
 |Strip Diacritics|Convert text to NFD form and remove non-spacing combining marks.|
+|Trim Surrounding Space|Trim leading and trailing spaces from extracted text.|
+|Trim Trailing Punctuation|Eliminate extraneous trailing commas and periods in text.|
 |Up-case|Convert text to upper case.|
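
To make a few of the descriptions in the table concrete, here is a minimal Python sketch of some of the simpler normalizers and of applying them in sequence. It follows only the one-line descriptions above; the function names, the ''apply_chain'' helper and the sample input are assumptions for illustration. The stock normalizers are Postgres stored procedures, and their exact behaviour may differ from this sketch.

```python
import unicodedata

def down_case(text):
    """Convert text to lower case."""
    return text.lower()

def strip_diacritics(text):
    """Convert to NFD form and drop non-spacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def first_word(text):
    """Keep only the first space-separated word (empty string if none)."""
    words = text.split()
    return words[0] if words else ""

def replace(text, old, new):
    """Replace all occurrences of the first parameter with the second."""
    return text.replace(old, new)

def left_truncation(text, count):
    """Discard `count` characters from the left side of the string."""
    return text[count:]

def right_truncation(text, count):
    """Keep only `count` characters from the left side of the string."""
    return text[:count]

def trim_surrounding_space(text):
    """Trim leading and trailing spaces."""
    return text.strip()

def apply_chain(text, chain):
    """Apply (function, args) pairs in order, feeding each result to the next."""
    for func, args in chain:
        text = func(text, *args)
    return text

# Hypothetical chain for one indexed field; the ordering is an assumption.
chain = [
    (trim_surrounding_space, ()),
    (strip_diacritics, ()),
    (down_case, ()),
    (first_word, ()),
]
result = apply_chain("  Éléphant Rosé  ", chain)  # → "elephant"
```

Note that "Right truncation" keeps the left-hand characters, because it is the right side of the string that gets cut off.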
  
documentation/indexing.1469624844.txt.gz · Last modified: 2022/02/10 13:33 (external edit)

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International

© 2008-2022 GPLS and others. Evergreen is open source software, freely licensed under GNU GPLv2 or later.
The Evergreen Project is a U.S. 501(c)3 non-profit organization.