--- documentation:indexing [2012/10/17 09:40] – [Normalization Functions] typo fix csharp
+++ documentation:indexing [2022/02/10 13:34] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
+FIXME - Add Information on Virtual Index Definitions in 3.1+
 ======= Bibliographic Indexing in Evergreen ========
 Indexing and searching bibliographic data in Evergreen are complex processes.  Also, Evergreen is extremely configurable in these areas and can be tuned to the specific needs of each installation.  Because of this, there are many moving parts which are tightly integrated and interdependent.
@@ Line 100: / Line 102: @@
 </code>
-Each indexed datum coming from a bibliographic record in Evergreen is extracted based on an Indexed Field Definition.  There are approximately twenty default definitions in a stock Evergreen installation spread across all six search classes.  These definitions provide Evergreen with the information it needs in order to extract interesting data from bibliographic records.
+Each indexed datum coming from a bibliographic record in Evergreen is extracted based on an Indexed Field Definition.  There are approximately thirty default definitions in a stock Evergreen installation spread across all six search classes.  These definitions provide Evergreen with the information it needs in order to extract interesting data from bibliographic records.
 ==== Field Class ====
@@ Line 111: / Line 113: @@
 Indexed data from different fields will probably be considered to have different importance when calculating the relevance of a matched query term.  For instance, a match in a translated title may be considered less important than a match in the title proper.  The **weight** allows control over this.
-By supplying a higher or lower relative **weight**, one field can be made more or less important, in relevance ranking terms, than others.  This value is used as a multiplier to the baseline cover-density ranking (discussed below), and so setting this value to 0 will allow matches, but always rank them at the bottom of the list.  Likewise, supplying a very large **weight** multiplier will cause matches to appear at the top of the list.  Floating point values are allowed, and by supplying values between 0 and 1 can be used apply fine-grained, percentage-based adjustments.
+By supplying a higher or lower relative **weight**, one field can be made more or less important, in relevance ranking terms, than others.  This value is used as a multiplier to the baseline cover-density ranking (discussed below), and so setting this value to 0 will allow matches, but always rank them at the bottom of the list.  Likewise, supplying a very large **weight** multiplier will cause matches to appear at the top of the list.  Floating point values are allowed, and values between 0 and 1 can be used to apply fine-grained, percentage-based adjustments.
 Evergreen ships with all Indexed Field Definition weights set to 1 by default.
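As a rough illustration of the multiplier behaviour described above, the following sketch scales an assumed baseline rank by a per-field weight and sorts the results. The function name, field names, and scores are hypothetical and chosen only for illustration; this is not Evergreen's ranking code, which runs inside the database.

```python
# Hypothetical illustration: names and scores are invented, not Evergreen internals.
def weighted_rank(baseline_rank: float, weight: float) -> float:
    """Scale a baseline cover-density rank by a field's weight multiplier."""
    return baseline_rank * weight


# (field name, assumed baseline rank, weight) for three matching fields
matches = [
    ("suppressed field", 0.5, 0.0),   # weight 0: still matches, ranks last
    ("translated title", 0.5, 0.25),  # fractional weight: 25% of baseline
    ("title proper", 0.5, 1.0),       # stock default weight of 1
]

# Highest weighted rank first
ranked = sorted(matches, key=lambda m: weighted_rank(m[1], m[2]), reverse=True)
ranked_names = [name for name, _, _ in ranked]
```

With identical baseline ranks, the ordering is decided entirely by the weights: the field with weight 0 still appears in the results but sorts to the bottom.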
@@ Line 163: / Line 165: @@
 First, these aliases provide a mechanism for internationalizing the user-supplied search constraints; for instance, "author" can be aliased to "skrywer" to support searching by native speakers of Afrikaans without having to know the English term "author".
-In a similar manner, aliases can be used to map [[http://www.loc.gov/standards/sru/specs/cql.html|CQL]] context set match points, which have standard names external to any specific search backend, to appropriate match points in any given Evergreen installation.  For instance, the [[http://www.loc.gov/standards/sru/cql-bibliographic-searching.html|proposed CQL 'bib' context set]] defines, among others, title indexes called **dc.title**, **bib.titleUniform** and **bib.titleSeries**.  In a stock Evergreen installation, **dc.title** would be aliased to the entire **title** search class, **bib.titleUniform** to the **uniform** field within the **title** search class, and **bib.titleSeries** to the **seriestitle** field within the **series** search class.
+In a similar manner, aliases can be used to map [[http://www.loc.gov/standards/sru/specs/cql.html|CQL]] context set match points, which have standard names external to any specific search backend, to appropriate match points in any given Evergreen installation.  For instance, the [[http://www.loc.gov/standards/sru/cql/contextSets/bib-context-set.html|proposed CQL 'bib' context set]] defines, among others, title indexes called **dc.title**, **bib.titleUniform** and **bib.titleSeries**.  In a stock Evergreen installation, **dc.title** would be aliased to the entire **title** search class, **bib.titleUniform** to the **uniform** field within the **title** search class, and **bib.titleSeries** to the **seriestitle** field within the **series** search class.
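The alias mappings mentioned above can be pictured as a simple lookup table. The sketch below is only a mental model: `MatchPoint`, `ALIASES`, and `resolve` are hypothetical names, and in Evergreen the aliases live in database configuration rather than in application code.

```python
from typing import NamedTuple, Optional


class MatchPoint(NamedTuple):
    """A search class, optionally narrowed to one field within it."""
    search_class: str
    field: Optional[str]  # None means the entire search class


# Alias name -> match point, using the examples given in the text above
ALIASES = {
    "skrywer": MatchPoint("author", None),
    "dc.title": MatchPoint("title", None),
    "bib.titleUniform": MatchPoint("title", "uniform"),
    "bib.titleSeries": MatchPoint("series", "seriestitle"),
}


def resolve(name: str) -> MatchPoint:
    """Return the match point for an alias; unknown names are assumed
    (for this sketch only) to be search class names already."""
    return ALIASES.get(name, MatchPoint(name, None))
```

For example, `resolve("bib.titleSeries")` yields the **seriestitle** field of the **series** class, while `resolve("skrywer")` yields the whole **author** class.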
@@ Line 195: / Line 197: @@
 Normalizer functions are in-database stored procedures, and can be written in any programming language supported by Postgres.  The stock normalizers are all written in either PL/PerlU, PL/pgSQL or SQL.
-Twelve normalizer functions are registered in the stock Evergreen installation.  They are
+Twenty-one normalizer functions are registered in the stock Evergreen installation.  They are
 ^ Name ^ Description ^
+|Approximate High Date Normalize|Normalize the value to the nearest date-ish value, rounding up|
+|Approximate Low Date Normalize|Normalize the value to the nearest date-ish value, rounding down|
+|Coded Value Map Normalizer|Applies coded_value_map mapping of values|
 |Down-case|Convert text to lower case.|
 |Extract Dewey-like number|Extract a string of numeric characters that resembles a DDC number.|
 |First word|Include only the first space-separated word of a string.|
+|Generic Mapping Normalizer|Map values or sets of values to new values.|
 |ISBN 10/13 conversion|Translate ISBN10 to ISBN13, and vice versa, for indexing purposes.|
 |Left truncation|Discard the specified number of characters from the left side of the string.|
-|NACO Normalize|Apply NACO normalization rules to the extracted text.  See http://www.loc.gov/catdir/pcc/naco/normrule-2.html for details.|
+|NACO Normalize|Apply NACO normalization rules to the extracted text.  See https://www.loc.gov/aba/pcc/naco/normrule-2.html for details.|
-|NACO Normalize -- retain first comma|Apply NACO normalization rules to the extracted text, retaining the first comma.  See http://www.loc.gov/catdir/pcc/naco/normrule-2.html for details.|
+|NACO Normalize -- retain first comma|Apply NACO normalization rules to the extracted text, retaining the first comma.  See https://www.loc.gov/aba/pcc/naco/normrule-2.html for details.|
 |Normalize date range|Split date ranges in the form of "XXXX-YYYY" into "XXXX YYYY" for proper indexing.|
-|Replace|Replace all occurances of first parameter in the string with the second parameter.|
+|Number or NULL Normalize|Normalize the value to NULL if it is not a number|
+|Replace|Replace all occurrences of first parameter in the string with the second parameter.|
+|Remove Parenthesized Substring|Remove any parenthesized substrings from the extracted text, such as the agency code preceding authority record control numbers in subfield 0.|
 |Right truncation|Include only the specified number of characters from the left side of the string.|
+|Search Normalize|Apply search normalization rules to the extracted text. A less extreme version of NACO normalization.|
 |Strip Diacritics|Convert text to NFD form and remove non-spacing combining marks.|
+|Trim Surrounding Space|Trim leading and trailing spaces from extracted text.|
+|Trim Trailing Punctuation|Eliminate extraneous trailing commas and periods in text.|
 |Up-case|Convert text to upper case.|
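To make a few of the table entries concrete, here is a minimal Python sketch of how some of the simpler normalizers behave and how they might be chained. These are approximations written from the descriptions in the table, not the stock stored procedures (which are written in PL/PerlU, PL/pgSQL or SQL); all function names and the `apply_normalizers` helper are hypothetical.

```python
import unicodedata


def down_case(text: str) -> str:
    """Down-case: convert text to lower case."""
    return text.lower()


def strip_diacritics(text: str) -> str:
    """Strip Diacritics: convert to NFD and drop non-spacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def first_word(text: str) -> str:
    """First word: keep only the first space-separated word."""
    return text.split(" ", 1)[0]


def left_truncation(text: str, count: int) -> str:
    """Left truncation: discard `count` characters from the left side."""
    return text[count:]


def right_truncation(text: str, count: int) -> str:
    """Right truncation: keep only `count` characters from the left side."""
    return text[:count]


def replace(text: str, old: str, new: str) -> str:
    """Replace: substitute every occurrence of `old` with `new`."""
    return text.replace(old, new)


def apply_normalizers(text, steps):
    """Apply (function, params) pairs in order, feeding each result to the next."""
    for func, params in steps:
        text = func(text, *params)
    return text


# Example chain: strip accents, lower-case, then keep the first word
normalized = apply_normalizers(
    "\u00c9mile Zola",
    [(strip_diacritics, ()), (down_case, ()), (first_word, ())],
)
```

The order of the steps matters: applying `first_word` before `left_truncation`, for instance, would give a different result than the reverse.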