FIXME - Add Information on Virtual Index Definitions in 3.1+

====== Bibliographic Indexing in Evergreen ======

Indexing and searching bibliographic data in Evergreen are complex processes. Also, Evergreen is extremely configurable in these areas and can be tuned to the specific needs of each installation. Because of this, there are many moving parts which are tightly integrated and interdependent.

====== Components ======

Here we will discuss the nuts and bolts of indexing. In each following subsection, you will see the database definition associated with each component, normally a table, followed by an explanation of the database entity's fields, what those fields control, and how those fields are linked and used by other components.

===== Search Classes =====

  Table "config.metabib_class"
   Column | Type | Modifiers
  --------+------+-----------
   name   | text | not null
   label  | text | not null
  Indexes:
      "metabib_class_pkey" PRIMARY KEY, btree (name)
      "metabib_class_label_key" UNIQUE, btree (label)
  Referenced by:
      TABLE "config.metabib_field" CONSTRAINT "field_class_fkey" FOREIGN KEY (field_class) REFERENCES config.metabib_class(name)
      TABLE "config.metabib_search_alias" CONSTRAINT "metabib_search_alias_field_class_fkey" FOREIGN KEY (field_class) REFERENCES config.metabib_class(name)

Search classes are the largest basic grouping mechanism for indexed bibliographic data. When indexing and searching, data from a bib record or a search query is always constrained to a specific search class. The primary technical reason for this is that each search class has some basic normalization and weighting rules that can be applied globally. For example: when dealing with titles, strings are short and word order is important; when dealing with contributor names, word order can generally be disregarded but year range normalization is important.

Evergreen ships with six search classes which are used to group categories of fields into logical sets for searching in aggregate.
The stock setup of these classes is as follows:

  * Title -- Generally, [[http://www.loc.gov/standards/mods/|MODS]] **titleInfo** elements
  * Author -- Generally, [[http://www.loc.gov/standards/mods/|MODS]] **name** elements that are direct children of the **mods** top-level element
  * Subject -- Generally, descendants of the [[http://www.loc.gov/standards/mods/|MODS]] **subject** element
  * Series -- [[http://www.loc.gov/standards/mods/|MODS]] **titleInfo** elements within a **relatedItem** element that has a **type** attribute of //series//
  * Keyword -- Everything within the [[http://www.loc.gov/standards/mods/|MODS]] record, except the **originInfo** element and descendants
  * Identifier -- Record accession number, most MARC standard number fields, and bib-level free-text call numbers. Examples of identifiers that would be included in this search class are ISBNs, ISSNs, ISMNs, LCCNs, and OCLC numbers.

==== Name ====

The name of a search class is the value which can be used in the advanced search syntax (more on this in the [[indexing#field_aliases|Field Aliases]] section). In the stock Evergreen code, it is the lowercase version of the class names in the list above. This name is used as the key for the search class, and is linked to by the [[indexing#indexed_field_definitions|Indexed Field Definitions]] and [[indexing#field_aliases|Field Aliases]].

==== Label ====

An internationalizable field containing the display label for the search class. This shows up in the advanced search dropdown box and faceting sidebar in the OPAC. The translation of this field depends on the locale that the user has chosen in the OPAC.

===== XML Transformations =====

  Table "config.xml_transform"
   Column        | Type | Modifiers
  ---------------+------+-----------
   name          | text | not null
   namespace_uri | text | not null
   prefix        | text | not null
   xslt          | text | not null
  Indexes:
      "xml_transform_pkey" PRIMARY KEY, btree (name)

Evergreen stores [[http://www.loc.gov/standards/marcxml/|MARCXML]] internally for all records, but provides the ability to index records after transforming them to other XML formats. For instance, the [[http://www.loc.gov/standards/mods/|MODS]] format can be much simpler to work with than [[http://www.loc.gov/standards/marcxml/|MARCXML]], even for advanced catalogers, because [[http://www.loc.gov/standards/mods/|MODS]] provides semantic interpretation via XSLT that is difficult or impossible to express in simple XPath.

==== Name ====

Each XML Transformation row requires a unique name. This field is linked from the [[indexing#indexed_field_definitions|Indexed Field Definitions]].

==== XSLT ====

An [[http://www.w3.org/TR/xslt|XSLT]] document which provides a transformation from one bibliographic format to another. Today in Evergreen, because [[http://www.loc.gov/standards/marcxml/|MARCXML]] is used as the canonical storage format, these transformation documents must go from [[http://www.loc.gov/standards/marcxml/|MARCXML]] to something else. Evergreen supplies [[http://www.w3.org/TR/xslt|XSLT]] documents that target versions 3.0, 3.1, 3.2 and 3.3 of [[http://www.loc.gov/standards/mods/|MODS]], and the stock indexing definitions use either MODS 3.2 or MARCXML.

==== Namespace URI ====

When defining a new XML Transformation for Evergreen, a target XML Namespace URI must be provided. All viable XML-based bibliographic formats use an XML Namespace to scope their contents.

==== Namespace Prefix ====

In addition to a Namespace URI, Evergreen needs a prefix for each Namespace URI.
This prefix can be anything that is legal for use as an XML Namespace Prefix, and need not be unique, though uniqueness is recommended for clarity.

===== Indexed Field Definitions =====

  Table "config.metabib_field"
   Column       | Type    | Modifiers
  --------------+---------+-------------------------------------------------------------------
   id           | integer | not null default nextval('config.metabib_field_id_seq'::regclass)
   field_class  | text    | not null
   name         | text    | not null
   xpath        | text    | not null
   weight       | integer | not null default 1
   format       | text    | not null default 'mods33'::text
   search_field | boolean | not null default true
   facet_field  | boolean | not null default false
   label        | text    | not null
   facet_xpath  | text    |
  Indexes:
      "metabib_field_pkey" PRIMARY KEY, btree (id)
      "config_metabib_field_class_name_idx" UNIQUE, btree (field_class, name)
  Foreign-key constraints:
      "field_class_fkey" FOREIGN KEY (field_class) REFERENCES config.metabib_class(name)
  Referenced by:
      TABLE "metabib.author_field_entry" CONSTRAINT "metabib_author_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
      TABLE "config.metabib_field_index_norm_map" CONSTRAINT "metabib_field_index_norm_map_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
      TABLE "metabib.identifier_field_entry" CONSTRAINT "metabib_identifier_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
      TABLE "metabib.keyword_field_entry" CONSTRAINT "metabib_keyword_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
      TABLE "config.metabib_search_alias" CONSTRAINT "metabib_search_alias_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id)
      TABLE "metabib.subject_field_entry" CONSTRAINT "metabib_subject_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
      TABLE "metabib.title_field_entry" CONSTRAINT "metabib_title_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
      TABLE "search.relevance_adjustment" CONSTRAINT "relevance_adjustment_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) DEFERRABLE INITIALLY DEFERRED

Each indexed datum coming from a bibliographic record in Evergreen is extracted based on an Indexed Field Definition. There are approximately thirty default definitions in a stock Evergreen installation, spread across all six search classes. These definitions provide Evergreen with the information it needs in order to extract interesting data from bibliographic records.

==== Field Class ====

Each Indexed Field Definition references a [[indexing#search_classes|Search Class]] via this field. When searching class-wide, all fields that share a class are searched at the same time.

==== Field Name ====

Each Definition also has a name, which, like the name field on [[indexing#search_classes|Search Classes]], is available for use in the advanced search syntax.

==== Weight ====

Indexed data from different fields will probably be considered to have different importance when calculating the relevance of a matched query term. For instance, a match in a translated title may be considered less important than a match in the title proper. The **weight** allows control over this. By supplying a higher or lower relative **weight**, one field can be made more or less important, in relevance ranking terms, than others. This value is used as a multiplier to the baseline cover-density ranking (discussed below), and so setting this value to 0 will allow matches, but always rank them at the bottom of the list. Likewise, supplying a very large **weight** multiplier will cause matches to appear at the top of the list.
Floating point values are allowed, and values between 0 and 1 can be used to apply fine-grained, percentage-based adjustments. Evergreen ships with all Indexed Field Definition weights set to 1 by default.

==== XML Format ====

As mentioned above in the [[indexing#xml_transformations|XML Transformations]] section, Evergreen allows indexing of bibliographic records after [[http://www.w3.org/TR/xslt|XSLT]] processing, which can aid in complex indexing requirements. The **format** field specifies what XML format the **xpath** and **facet_xpath** expressions expect to work in, and from this Evergreen can decide what, if any, [[http://www.w3.org/TR/xslt|XSLT]] must be applied to the record being indexed. The XML format must match the name of an XSLT transformation specified in ''config.xml_transform''.

==== XPath ====

The **xpath** field holds an [[http://www.w3.org/TR/xpath/|XPath]] expression that is used to extract sections of data from within an XML record, after any **format**-required [[http://www.w3.org/TR/xslt|XSLT]] has been applied. The text content of the node set returned by this XPath is then concatenated to provide an initial indexed value. For instance, a title-related XPath expression such as

  //mods32:mods/mods32:titleInfo[mods32:title and not (@type)]

will extract an XML subset similar to

  Two lyric pieces for piano solo

From this, the text content is extracted and concatenated to form

  Two lyric pieces for piano solo

which is then indexed.

==== Flags (search/facet) ====

Values extracted from bibliographic records by Indexed Field Definitions can be used for searching, faceting, or both, as controlled by the **search_field** and **facet_field** flags.

==== Facet XPath ====

For fields that are used for faceting, where **facet_field** is true, it is often necessary to further refine the set of XML nodes extracted by the **xpath** expression. The **facet_xpath** field holds an optional expression for accomplishing this goal.
For instance, an author-related XPath expression such as

  //mods32:mods/mods32:name[@type='personal' and mods32:role/mods32:roleTerm[text()='creator']]

will extract an XML subset similar to

  Hobson, Bruce (R. Bruce) creator

which then condenses to

  Hobson, Bruce (R. Bruce) creator

which is the entire set of text content. However, for faceting, the "creator" string is harmful. By supplying a **facet_xpath** expression of

  *[local-name()='namePart']

the node set is further refined to just

  Hobson, Bruce (R. Bruce)

for which the text content is

  Hobson, Bruce (R. Bruce)

which is what we want for indexing.

===== Field Aliases =====

  Table "config.metabib_search_alias"
   Column      | Type    | Modifiers
  -------------+---------+-----------
   alias       | text    | not null
   field_class | text    | not null
   field       | integer |
  Indexes:
      "metabib_search_alias_pkey" PRIMARY KEY, btree (alias)
  Foreign-key constraints:
      "metabib_search_alias_field_class_fkey" FOREIGN KEY (field_class) REFERENCES config.metabib_class(name)
      "metabib_search_alias_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id)

All [[indexing#search_classes|Search Classes]] and [[indexing#indexed_field_definitions|Indexed Field Definitions]] can be given aliases for use in several interfaces. First, these aliases provide a mechanism for internationalizing the user-supplied search constraints; for instance, "author" can be aliased to "skrywer" so that native speakers of Afrikaans can search without having to know the English term "author". In a similar manner, aliases can be used to map [[http://www.loc.gov/standards/sru/specs/cql.html|CQL]] context set match points, which have standard names external to any specific search backend, to appropriate match points in any given Evergreen installation. For instance, the [[http://www.loc.gov/standards/sru/cql/contextSets/bib-context-set.html|proposed CQL 'bib' context set]] defines, among others, title indexes called **dc.title**, **bib.titleUniform** and **bib.titleSeries**.
In a stock Evergreen installation, **dc.title** would be aliased to the entire **title** search class, **bib.titleUniform** to the **uniform** field within the **title** search class, and **bib.titleSeries** to the **seriestitle** field within the **series** search class.

==== Alias ====

The **alias** field is required and defines the string to be used to stand in for a Search Class or Indexed Field Definition.

==== Class ====

The **field_class** value is required and specifies which Search Class this alias belongs to.

==== Field ====

The **field** is optional and specifies which Indexed Field Definition this alias stands in for. If left NULL, the entire class specified in the **field_class** field is aliased.

===== Normalization Functions =====

  Table "config.index_normalizer"
   Column      | Type    | Modifiers
  -------------+---------+----------------------------------------------------------------------
   id          | integer | not null default nextval('config.index_normalizer_id_seq'::regclass)
   name        | text    | not null
   description | text    |
   func        | text    | not null
   param_count | integer | not null default 0
  Indexes:
      "index_normalizer_pkey" PRIMARY KEY, btree (id)
      "index_normalizer_name_key" UNIQUE, btree (name)
  Referenced by:
      TABLE "config.metabib_field_index_norm_map" CONSTRAINT "metabib_field_index_norm_map_norm_fkey" FOREIGN KEY (norm) REFERENCES config.index_normalizer(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED

Data extracted from bibliographic records for indexing purposes will normally require some normalization. At the least, stripping of diacritics is normally required for searches to act as users expect. Normalizer functions are in-database stored procedures, and can be written in any programming language supported by Postgres. The stock normalizers are all written in either PL/PerlU, PL/pgSQL or SQL.

Twenty-one normalizer functions are registered in the stock Evergreen installation.
They are

^ Name ^ Description ^
|Approximate High Date Normalize|Normalize the value to the nearest date-ish value, rounding up|
|Approximate Low Date Normalize|Normalize the value to the nearest date-ish value, rounding down|
|Coded Value Map Normalizer|Applies coded_value_map mapping of values|
|Down-case|Convert text to lower case.|
|Extract Dewey-like number|Extract a string of numeric characters that resembles a DDC number.|
|First word|Include only the first space-separated word of a string.|
|Generic Mapping Normalizer|Map values or sets of values to new values.|
|ISBN 10/13 conversion|Translate ISBN10 to ISBN13, and vice versa, for indexing purposes.|
|Left truncation|Discard the specified number of characters from the left side of the string.|
|NACO Normalize|Apply NACO normalization rules to the extracted text. See https://www.loc.gov/aba/pcc/naco/normrule-2.html for details.|
|NACO Normalize -- retain first comma|Apply NACO normalization rules to the extracted text, retaining the first comma. See https://www.loc.gov/aba/pcc/naco/normrule-2.html for details.|
|Normalize date range|Split date ranges in the form of "XXXX-YYYY" into "XXXX YYYY" for proper indexing.|
|Normalize date range|Normalize the value to NULL if it is not a number|
|Replace|Replace all occurrences of first parameter in the string with the second parameter.|
|Remove Parenthesized Substring|Remove any parenthesized substrings from the extracted text, such as the agency code preceding authority record control numbers in subfield 0.|
|Right truncation|Include only the specified number of characters from the left side of the string.|
|Search Normalize|Apply search normalization rules to the extracted text. A less extreme version of NACO normalization.|
|Strip Diacritics|Convert text to NFD form and remove non-spacing combining marks.|
|Trim Surrounding Space|Trim leading and trailing spaces from extracted text.|
|Trim Trailing Punctuation|Eliminate extraneous trailing commas and periods in text.|
|Up-case|Convert text to upper case.|

==== Name ====

The **name** field provides a short, human-readable label for a normalization function, such as **NACO Normalize**.

==== Description ====

The **description** field provides a description of how a normalization function will change text at indexing and search time.

==== Function ====

The **func** field supplies the function name, schema-qualified if necessary.

==== Parameter Count ====

Normalization functions always take the text to be normalized as the first parameter. If extra parameters are required, the number of extra parameters is listed here. For instance, the **Right truncation** normalizer requires one extra parameter -- the number of characters to retain from the beginning of the text.
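
The calling convention can be illustrated with a minimal Python sketch. This is an illustration only: the stock normalizers are stored procedures inside the database, not Python, and the function names ''strip_diacritics'' and ''right_truncate'' are hypothetical stand-ins written from the table descriptions above. The first takes only the text (a **param_count** of 0); the second takes one extra parameter (a **param_count** of 1).

```python
import unicodedata


def strip_diacritics(text):
    """Hypothetical stand-in for "Strip Diacritics" (param_count = 0).

    Converts the text to NFD form, then drops non-spacing combining
    marks (Unicode general category "Mn").
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def right_truncate(text, keep):
    """Hypothetical stand-in for "Right truncation" (param_count = 1).

    The text is always the first argument; `keep` is the single extra
    parameter: the number of characters to retain from the beginning.
    """
    return text[:keep]


if __name__ == "__main__":
    print(strip_diacritics("Dvo\u0159\u00e1k"))  # prints Dvorak
    print(right_truncate("Evergreen", 4))        # prints Ever
```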

===== Field Normalization Settings =====

  Table "config.metabib_field_index_norm_map"
   Column | Type    | Modifiers
  --------+---------+----------------------------------------------------------------------------------
   id     | integer | not null default nextval('config.metabib_field_index_norm_map_id_seq'::regclass)
   field  | integer | not null
   norm   | integer | not null
   params | text    |
   pos    | integer | not null default 0
  Indexes:
      "metabib_field_index_norm_map_pkey" PRIMARY KEY, btree (id)
  Foreign-key constraints:
      "metabib_field_index_norm_map_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
      "metabib_field_index_norm_map_norm_fkey" FOREIGN KEY (norm) REFERENCES config.index_normalizer(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED

In order to apply a [[indexing#normalization_functions|Normalization Function]] to an [[indexing#indexed_field_definitions|Indexed Field Definition]], it is necessary to create a mapping between the two. When searching, these registered normalizations are applied to search terms in the same order as they are applied to indexed terms. This parity is required to ensure that search terms end up in the same normalization state as indexed terms, or else matching is not possible.

**Note:** Only normalizations with a negative **pos** value are applied to the facet version of indexed terms!

==== Field ====

The **field** value specifies which [[indexing#indexed_field_definitions|Indexed Field Definition]] should have a particular [[indexing#normalization_functions|Normalization Function]] applied to its extracted text.

==== Normalizer ====

The **norm** field specifies which [[indexing#normalization_functions|Normalization Function]] is to be applied.
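
The parity requirement for this mapping table, and the special treatment of facet values, can be sketched in Python under some loudly stated assumptions. The rows in ''NORM_MAP'', the three toy normalizers, and the ''normalize'' helper are all hypothetical; Evergreen does this work inside the database. The sketch only mirrors the columns shown above: each mapping carries a function, optional JSON **params**, and a **pos** used for ordering.

```python
import json


def trim(text):
    return text.strip()


def downcase(text):
    return text.lower()


def replace(text, old, new):
    return text.replace(old, new)


# Hypothetical mapping rows for a single indexed field. "params" is a
# JSON array of the extra arguments, or None when the function takes none.
NORM_MAP = [
    {"func": downcase, "params": None, "pos": 1},
    {"func": trim, "params": None, "pos": -1},
    {"func": replace, "params": '["-", " "]', "pos": 0},
]


def normalize(text, norm_map, facet=False):
    """Apply the mapped normalizers in ascending pos order.

    When facet is True, only mappings with a negative pos are applied,
    following the note above about facet values.
    """
    for row in sorted(norm_map, key=lambda r: r["pos"]):
        if facet and row["pos"] >= 0:
            continue
        extra = json.loads(row["params"]) if row["params"] else []
        text = row["func"](text, *extra)
    return text


indexed = normalize("  Spider-Man  ", NORM_MAP)            # value stored for searching
query = normalize("spider-man", NORM_MAP)                  # user's search term
facet = normalize("  Spider-Man  ", NORM_MAP, facet=True)  # value shown as a facet

print(indexed == query)  # prints True
print(facet)             # prints Spider-Man
```

Because the indexed value and the search term pass through the same functions in the same order, they end up in the same state and can match, while the facet value keeps its original case and punctuation.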

==== Parameters ====

If the [[indexing#normalization_functions|Normalization Function]] requires more parameters than simply the text to be normalized, as specified in the definition of the function, then the **params** field contains a [[http://json.org/|JSON]] array holding the specific values to use when the linked normalization function is applied to the linked field.

==== Normalization Ordering ====

When more than one normalization function is mapped onto a specific field, and the order in which those functions are applied is significant, the **pos** field is used to order the normalization operations. The operations are applied in ascending **pos** order, from lowest to highest. Normalizations with a negative **pos** value are applied to the text before the value is stored, meaning both search and facet fields are affected, whereas those with a zero or positive **pos** value are only applied to search fields.

===== Search-oriented Index Definition Example =====

[[search_idx_def_example|Adding a Local Subjects (690) search index]]

===== Facet-oriented Index Definition Example =====

[[facet_idx_def_example|Adding a Material Type (947$t) facet index]]

===== Query Parser =====

[[documentation:technical:search_grammar|Bib Searching]]