Bibliographic Indexing in Evergreen
Indexing and searching bibliographic data in Evergreen are complex processes. Also, Evergreen is extremely configurable in these areas and can be tuned to the specific needs of each installation. Because of this, there are many moving parts which are tightly integrated and interdependent.
Here we will discuss the nuts and bolts of indexing. In each following subsection, you will see the database definition associated with each component, normally a table, followed by an explanation of the database entity's fields, what those fields control, and how those fields are linked and used by other components.
Table "config.metabib_class"
 Column | Type | Modifiers
--------+------+-----------
 name   | text | not null
 label  | text | not null
Indexes:
    "metabib_class_pkey" PRIMARY KEY, btree (name)
    "metabib_class_label_key" UNIQUE, btree (label)
Referenced by:
    TABLE "config.metabib_field" CONSTRAINT "field_class_fkey" FOREIGN KEY (field_class) REFERENCES config.metabib_class(name)
    TABLE "config.metabib_search_alias" CONSTRAINT "metabib_search_alias_field_class_fkey" FOREIGN KEY (field_class) REFERENCES config.metabib_class(name)
Search classes are the largest basic grouping mechanism for indexed bibliographic data. When indexing and searching, data from a bib record or a search query is always constrained to a specific search class. The primary technical reason for this is that each search class has some basic normalization and weighting rules that can be applied globally. For example: when dealing with titles, strings are short and word order is important; when dealing with contributor names, word order can generally be disregarded but year range normalization is important.
Evergreen ships with six search classes, which group categories of fields into logical sets for searching in aggregate. The stock setup of these classes is as follows:
- Title – Generally, MODS titleInfo elements
- Author – Generally, MODS name elements that are direct children of the mods top-level element
- Subject – Generally, descendants of the MODS subject element
- Series – MODS titleInfo elements within a relatedItem element that has a type attribute of series
- Keyword – Everything within the MODS record, except the originInfo element and descendants
- Identifier – Record accession number, most MARC standard number fields, and bib-level free-text call numbers. Examples of identifiers that would be included in this search class are ISBNs, ISSNs, ISMNs, LCCNs, and OCLC numbers.
The name of a search class is the value which can be used in the advanced search syntax (more on this in the Field Aliases section). In the stock Evergreen code, it is the lowercase version of the classes in the list above.
The label field is an internationalizable field containing the display label for the search class. This label shows up in the advanced search dropdown box and the faceting sidebar in the OPAC. Its translation depends on the locale that the user has chosen in the OPAC.
XML Transformations
Table "config.xml_transform"
    Column     | Type | Modifiers
---------------+------+-----------
 name          | text | not null
 namespace_uri | text | not null
 prefix        | text | not null
 xslt          | text | not null
Indexes:
    "xml_transform_pkey" PRIMARY KEY, btree (name)
Evergreen stores MARCXML internally for all records, but provides the ability to index records after transforming them to other XML formats. For instance, the MODS format can be much simpler to work with than MARCXML, even for advanced catalogers, because MODS provides semantic interpretation via XSLT that is difficult or impossible to express in simple XPath.
Each XML Transformation row requires a unique name. This field is linked from the Indexed Field Definitions.
The xslt field holds an XSLT document that transforms one bibliographic format into another. Because Evergreen currently uses MARCXML as the canonical storage format, these transformation documents must go from MARCXML to something else. Evergreen supplies XSLT documents that target versions 3.0, 3.1, 3.2 and 3.3 of MODS, and the stock indexing definitions use either MODS 3.2 or MARCXML.
When defining a new XML Transformation for Evergreen, a target XML Namespace URI must be provided. All viable XML-based bibliographic formats use an XML Namespace to scope their contents.
In addition to a Namespace URI, Evergreen needs a prefix for each Namespace URI. This prefix can be anything that is legal for use as an XML Namespace Prefix, and need not be unique, though uniqueness is recommended for clarity.
Indexed Field Definitions
Table "config.metabib_field"
    Column    |  Type   |                             Modifiers
--------------+---------+-------------------------------------------------------------------
 id           | integer | not null default nextval('config.metabib_field_id_seq'::regclass)
 field_class  | text    | not null
 name         | text    | not null
 xpath        | text    | not null
 weight       | integer | not null default 1
 format       | text    | not null default 'mods33'::text
 search_field | boolean | not null default true
 facet_field  | boolean | not null default false
 label        | text    | not null
 facet_xpath  | text    |
Indexes:
    "metabib_field_pkey" PRIMARY KEY, btree (id)
    "config_metabib_field_class_name_idx" UNIQUE, btree (field_class, name)
Foreign-key constraints:
    "field_class_fkey" FOREIGN KEY (field_class) REFERENCES config.metabib_class(name)
Referenced by:
    TABLE "metabib.author_field_entry" CONSTRAINT "metabib_author_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "config.metabib_field_index_norm_map" CONSTRAINT "metabib_field_index_norm_map_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "metabib.identifier_field_entry" CONSTRAINT "metabib_identifier_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "metabib.keyword_field_entry" CONSTRAINT "metabib_keyword_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "config.metabib_search_alias" CONSTRAINT "metabib_search_alias_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id)
    TABLE "metabib.subject_field_entry" CONSTRAINT "metabib_subject_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "metabib.title_field_entry" CONSTRAINT "metabib_title_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "search.relevance_adjustment" CONSTRAINT "relevance_adjustment_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) DEFERRABLE INITIALLY DEFERRED
Each indexed datum coming from a bibliographic record in Evergreen is extracted based on an Indexed Field Definition. There are approximately thirty default definitions in a stock Evergreen installation spread across all six search classes. These definitions provide Evergreen with the information it needs in order to extract interesting data from bibliographic records.
Each Indexed Field Definition references a Search Class via the field_class field. When searching class-wide, all fields that share a class are searched at the same time.
Each Definition also has a name, which like the name field on Search Classes, is available for use in the advanced search syntax.
Indexed data from different fields will probably be considered to have different importance when calculating the relevance of a matched query term. For instance, a match in a translated title may be considered less important than a match in the title proper. The weight allows control over this.
By supplying a higher or lower relative weight, one field can be made more or less important, in relevance ranking terms, than others. This value is used as a multiplier to the baseline cover-density ranking (discussed below), and so setting this value to 0 will allow matches, but always rank them at the bottom of the list. Likewise, supplying a very large weight multiplier will cause matches to appear at the top of the list. Floating point values are allowed, and values between 0 and 1 can be used to apply fine-grained, percentage-based adjustments.
Evergreen ships with all Indexed Field Definition weights set to 1 by default.
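As a rough illustration of how a weight acts as a multiplier on a baseline score, the sketch below scales some made-up scores by per-field weights and sorts the results. The field names, the scores, the weights and the `rank` helper are all assumptions for illustration; this is not Evergreen's ranking code.

```python
# Hypothetical baseline relevance scores for three matches, keyed by the
# field in which each match occurred. These numbers are invented.
matches = [
    {"record": 101, "field": "title|proper", "base_score": 0.40},
    {"record": 102, "field": "title|translated", "base_score": 0.55},
    {"record": 103, "field": "title|uniform", "base_score": 0.50},
]

# Hypothetical weights. A stock installation sets every weight to 1.
weights = {"title|proper": 2, "title|translated": 1, "title|uniform": 0}


def rank(matches, weights):
    """Return matches sorted by weight * base_score, highest first."""
    scored = [
        dict(m, score=weights.get(m["field"], 1) * m["base_score"])
        for m in matches
    ]
    return sorted(scored, key=lambda m: m["score"], reverse=True)


ranked = rank(matches, weights)
for m in ranked:
    print(m["record"], m["field"], round(m["score"], 2))
```

With these invented numbers, the doubled weight lifts record 101 above record 102 even though its baseline score is lower, while the zero weight keeps record 103 in the result set but pushes it to the bottom.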
As mentioned above in the XML Transformations section, Evergreen allows indexing of bibliographic records after XSLT processing, which can aid in complex indexing requirements. The format field specifies what XML format the xpath and facet_xpath expressions expect to work in, and from this Evergreen can decide what, if any, XSLT must be applied to the record being indexed. The XML format must match the name of an XSLT transformation specified in the config.xml_transform table.
The xpath field holds an XPath expression that is used to extract sections of data from within an XML record, after any format-required XSLT has been applied. The text content of the node set returned by this XPath is then concatenated to provide an initial indexed value.
For instance, a title-related XPath expression such as
//mods32:mods/mods32:titleInfo[mods32:title and not (@type)]
will extract an XML subset similar to
<titleInfo>
  <title>Two lyric pieces</title>
  <subTitle>for piano solo</subTitle>
</titleInfo>
From this, the text content is extracted and concatenated to form
Two lyric pieces for piano solo
which is then indexed.
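The extraction and concatenation steps can be approximated outside the database. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module, not Evergreen's own indexing code. The sample record, including its second titleInfo element, and the `extract_values` helper are hypothetical. ElementTree supports only a subset of XPath and has no `not()` function, so the `not (@type)` test is done in Python instead.

```python
import xml.etree.ElementTree as ET

# MODS namespace URI; "mods32" mirrors the prefix used in the expression above.
NS = {"mods32": "http://www.loc.gov/mods/v3"}

# A hypothetical, heavily trimmed MODS record used only for this example.
RECORD = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Two lyric pieces</title>
    <subTitle>for piano solo</subTitle>
  </titleInfo>
  <titleInfo type="translated">
    <title>Deux pieces lyriques</title>
  </titleInfo>
</mods>
"""


def extract_values(record_xml):
    """Return the whitespace-normalized text of each untyped titleInfo."""
    root = ET.fromstring(record_xml.strip())
    values = []
    for node in root.findall("mods32:titleInfo", NS):
        has_title = node.find("mods32:title", NS) is not None
        if not has_title or "type" in node.attrib:
            continue  # emulates [mods32:title and not (@type)]
        # Concatenate all descendant text, then collapse runs of whitespace.
        values.append(" ".join("".join(node.itertext()).split()))
    return values


print(extract_values(RECORD))
```

The translated title is skipped because it carries a type attribute, leaving only the title proper as the value to index.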
Values extracted from bibliographic records by Indexed Field Definitions can be used for either searching or faceting, or both.
For fields that are used for faceting, where facet_field is true, it is often necessary to further refine the set of XML nodes extracted by the xpath expression. The facet_xpath field holds an optional expression for accomplishing this goal.
For instance, an author-related XPath expression such as
//mods32:mods/mods32:name[@type='personal' and mods32:role/mods32:roleTerm[text()='creator']]
will extract an XML subset similar to
<name type="personal">
  <namePart>Hobson, Bruce (R. Bruce)</namePart>
  <role>
    <roleTerm authority="marcrelator" type="text">creator</roleTerm>
  </role>
</name>
which then condenses to
Hobson, Bruce (R. Bruce) creator
which is the entire set of text content. However, for faceting, the "creator" string is harmful. By supplying a facet_xpath expression that selects only the namePart element, the node set is further refined to just
<namePart>Hobson, Bruce (R. Bruce)</namePart>
for which the text content is
Hobson, Bruce (R. Bruce)
which is the value we want for the facet.
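The difference between the search value and the facet value can be reproduced with a short sketch, again using Python's standard `xml.etree.ElementTree` rather than Evergreen's own code. The `.//mods32:namePart` selection below is an assumption standing in for whatever facet_xpath expression an installation actually uses, and `text_content` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"mods32": "http://www.loc.gov/mods/v3"}

# The name element from the example above, placed in the MODS namespace.
NAME_XML = """
<name xmlns="http://www.loc.gov/mods/v3" type="personal">
  <namePart>Hobson, Bruce (R. Bruce)</namePart>
  <role>
    <roleTerm authority="marcrelator" type="text">creator</roleTerm>
  </role>
</name>
"""


def text_content(node):
    """Concatenate all descendant text and collapse whitespace."""
    return " ".join("".join(node.itertext()).split())


name = ET.fromstring(NAME_XML.strip())

# Value derived from the full node set selected by the xpath expression.
search_value = text_content(name)

# Assumed refinement: keep only namePart descendants for the facet value.
facet_value = " ".join(
    text_content(part) for part in name.iterfind(".//mods32:namePart", NS)
)

print(search_value)
print(facet_value)
```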
Field Aliases
Table "config.metabib_search_alias"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 alias       | text    | not null
 field_class | text    | not null
 field       | integer |
Indexes:
    "metabib_search_alias_pkey" PRIMARY KEY, btree (alias)
Foreign-key constraints:
    "metabib_search_alias_field_class_fkey" FOREIGN KEY (field_class) REFERENCES config.metabib_class(name)
    "metabib_search_alias_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id)
First, these aliases provide a mechanism for internationalizing the user-supplied search constraints; for instance, "author" can be aliased to "skrywer" to support searching by native speakers of Afrikaans without having to know the English term "author".
In a similar manner, aliases can be used to map CQL context set match points, which have standard names external to any specific search backend, to appropriate match points in any given Evergreen installation. For instance, the proposed CQL 'bib' context set defines, among others, title indexes called dc.title, bib.titleUniform and bib.titleSeries. In a stock Evergreen installation, dc.title would be aliased to the entire title search class, bib.titleUniform to the uniform field within the title search class, and bib.titleSeries to the seriestitle field within the series search class.
The alias field is required and defines the string to be used to stand in for a Search Class or Indexed Field Definition.
The field_class field is required and specifies which Search Class this alias belongs to.
The field field is optional and specifies which Indexed Field Definition this alias stands in for. If left NULL, the entire class specified in the field_class field is aliased.
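To make the alias lookup concrete, here is a minimal sketch of how an alias might resolve to a class, or to a class and field. The in-memory `ALIASES` table, the `Alias` type and the `resolve` helper are hypothetical. The real table stores an integer field id, while this sketch uses field names for readability, and treating an unknown term as a class name is an assumption.

```python
from typing import NamedTuple, Optional


class Alias(NamedTuple):
    field_class: str
    field: Optional[str]  # None means the whole class is aliased


# Hypothetical alias rows based on the examples given in the text.
ALIASES = {
    "skrywer": Alias("author", None),
    "dc.title": Alias("title", None),
    "bib.titleUniform": Alias("title", "uniform"),
    "bib.titleSeries": Alias("series", "seriestitle"),
}


def resolve(term):
    """Map a user-supplied name to (field_class, field)."""
    if term in ALIASES:
        return ALIASES[term]
    # Assumption: anything that is not an alias is taken as a class name.
    return Alias(term, None)


print(resolve("skrywer"))
print(resolve("bib.titleUniform"))
```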
Table "config.index_normalizer"
   Column    |  Type   |                              Modifiers
-------------+---------+----------------------------------------------------------------------
 id          | integer | not null default nextval('config.index_normalizer_id_seq'::regclass)
 name        | text    | not null
 description | text    |
 func        | text    | not null
 param_count | integer | not null default 0
Indexes:
    "index_normalizer_pkey" PRIMARY KEY, btree (id)
    "index_normalizer_name_key" UNIQUE, btree (name)
Referenced by:
    TABLE "config.metabib_field_index_norm_map" CONSTRAINT "metabib_field_index_norm_map_norm_fkey" FOREIGN KEY (norm) REFERENCES config.index_normalizer(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
Data extracted from bibliographic records for indexing purposes will normally require some normalization. At the least, diacritics usually need to be stripped for searches to behave as users expect.
Normalizer functions are in-database stored procedures, and can be written in any programming language supported by Postgres. The stock normalizers are all written in either PL/PerlU, PL/pgSQL or SQL.
Twenty-one normalizer functions are registered in the stock Evergreen installation. They are:
|Name||Description|
|Approximate High Date Normalize||Normalize the value to the nearest date-ish value, rounding up|
|Approximate Low Date Normalize||Normalize the value to the nearest date-ish value, rounding down|
|Coded Value Map Normalizer||Applies coded_value_map mapping of values|
|Down-case||Convert text to lower case.|
|Extract Dewey-like number||Extract a string of numeric characters that resembles a DDC number.|
|First word||Include only the first space-separated word of a string.|
|Generic Mapping Normalizer||Map values or sets of values to new values.|
|ISBN 10/13 conversion||Translate ISBN10 to ISBN13, and vice versa, for indexing purposes.|
|Left truncation||Discard the specified number of characters from the left side of the string.|
|NACO Normalize||Apply NACO normalization rules to the extracted text. See https://www.loc.gov/aba/pcc/naco/normrule-2.html for details.|
|NACO Normalize – retain first comma||Apply NACO normalization rules to the extracted text, retaining the first comma. See https://www.loc.gov/aba/pcc/naco/normrule-2.html for details.|
|Normalize date range||Split date ranges in the form of "XXXX-YYYY" into "XXXX YYYY" for proper indexing.|
|Number or NULL Normalize||Normalize the value to NULL if it is not a number.|
|Replace||Replace all occurrences of first parameter in the string with the second parameter.|
|Remove Parenthesized Substring||Remove any parenthesized substrings from the extracted text, such as the agency code preceding authority record control numbers in subfield 0.|
|Right truncation||Include only the specified number of characters from the left side of the string.|
|Search Normalize||Apply search normalization rules to the extracted text. A less extreme version of NACO normalization.|
|Strip Diacritics||Convert text to NFD form and remove non-spacing combining marks.|
|Trim Surrounding Space||Trim leading and trailing spaces from extracted text.|
|Trim Trailing Punctuation||Eliminate extraneous trailing commas and periods in text.|
|Up-case||Convert text to upper case.|
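As an example of what one of these normalizers has to compute, the following sketch converts between ISBN-10 and ISBN-13 using the standard check-digit rules, so that both forms can be indexed together. It is an illustration only, not the stock Evergreen function. The function names are hypothetical, and the input is assumed to be a well-formed ISBN.

```python
def isbn10_to_13(isbn10):
    """Prefix 978, keep the first nine digits, recompute the check digit."""
    core = "978" + isbn10[:9]
    # ISBN-13 weights alternate 1, 3, 1, 3, ... across the first 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)


def isbn13_to_10(isbn13):
    """Only 978-prefixed ISBN-13s have an ISBN-10 equivalent."""
    if not isbn13.startswith("978"):
        return None
    core = isbn13[3:12]
    # ISBN-10 weights run 10, 9, ..., 2 across the first nine digits.
    total = sum(int(d) * (10 - i) for i, d in enumerate(core))
    check = (11 - total % 11) % 11
    return core + ("X" if check == 10 else str(check))


def isbn_variants(isbn):
    """Return the cleaned ISBN plus its other form, when one exists."""
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if len(digits) == 10:
        other = isbn10_to_13(digits)
    elif len(digits) == 13:
        other = isbn13_to_10(digits)
    else:
        other = None
    return [digits] + ([other] if other else [])


print(isbn_variants("0-306-40615-2"))
```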
The name field provides a short, human-readable label for a normalization function, such as NACO Normalize.
The description field provides a description of how a normalization function will change text at indexing and search time.
The func field supplies the function name, schema-qualified if necessary.
Normalization functions always take the text to be normalized as the first parameter. If extra parameters are required, the param_count field gives the number of extra parameters. For instance, the Right truncation normalizer requires one extra parameter: the number of characters to retain from the beginning of the text.
Field Normalization Settings
Table "config.metabib_field_index_norm_map"
 Column |  Type   |                                    Modifiers
--------+---------+----------------------------------------------------------------------------------
 id     | integer | not null default nextval('config.metabib_field_index_norm_map_id_seq'::regclass)
 field  | integer | not null
 norm   | integer | not null
 params | text    |
 pos    | integer | not null default 0
Indexes:
    "metabib_field_index_norm_map_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "metabib_field_index_norm_map_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    "metabib_field_index_norm_map_norm_fkey" FOREIGN KEY (norm) REFERENCES config.index_normalizer(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
When searching, these registered normalizations are applied to search terms in the same order as they are applied to indexed terms. This parity is required to ensure that search terms end up in the same normalization state as indexed terms, or else matching is not possible.
Note: Only normalizations with a negative pos value are applied to the facet version of indexed terms!
The norm field specifies which Normalization Function is to be applied.
If the Normalization Function requires more parameters than simply the text to be normalized, as specified in the definition of the function, then the params field contains a JSON array containing the specific values to use when the linked normalization function is applied to the linked field.
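The relationship between a normalizer's extra-parameter count and the JSON array stored in params can be sketched as follows. The Python functions, the `NORMALIZERS` registry and the `apply_normalizer` helper are hypothetical stand-ins for the in-database stored procedures, and the check that the array length matches the declared count is an assumption about sensible behavior rather than something the source states.

```python
import json


def right_truncation(text, keep):
    """Keep only the first `keep` characters of the text."""
    return text[:int(keep)]


def replace(text, old, new):
    """Replace every occurrence of `old` with `new`."""
    return text.replace(old, new)


def down_case(text):
    return text.lower()


# name -> (function, number of extra parameters), loosely mirroring
# the func and param_count columns of config.index_normalizer.
NORMALIZERS = {
    "Right truncation": (right_truncation, 1),
    "Replace": (replace, 2),
    "Down-case": (down_case, 0),
}


def apply_normalizer(name, text, params_json=None):
    """Call a normalizer with the text first, then the JSON-array params."""
    func, param_count = NORMALIZERS[name]
    params = json.loads(params_json) if params_json else []
    if len(params) != param_count:
        raise ValueError(f"{name} expects {param_count} extra parameter(s)")
    return func(text, *params)


print(apply_normalizer("Right truncation", "Two lyric pieces", "[3]"))
print(apply_normalizer("Replace", "1990-1995", '["-", " "]'))
```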
When more than one normalization function is mapped onto a specific field, and the order in which those functions are applied is significant, the pos field is used to order the normalization operations. The operations are applied in ascending pos order, from lowest to highest. Normalizations with a negative pos value are applied to the text before the value is stored, meaning both search and facet fields are affected, whereas those with a zero or positive pos value are only applied to search fields.
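The ordering rule can be sketched in a few lines. The three normalizers below are simplified Python stand-ins for the stock functions of similar names, and the mapping list and the `normalize` helper are hypothetical. Only the ascending order and the negative versus non-negative split are taken from the description above.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD and drop non-spacing combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def trim(text):
    return text.strip()


def down_case(text):
    return text.lower()


# Hypothetical (pos, function) mappings for one field, deliberately unsorted.
MAPPINGS = [(1, down_case), (-1, trim), (0, strip_diacritics)]


def normalize(value, mappings):
    """Return (facet_value, search_value) after applying mappings by pos."""
    ordered = sorted(mappings, key=lambda m: m[0])
    facet = value
    for pos, func in ordered:
        if pos < 0:  # applied before storage: affects facet and search
            facet = func(facet)
    search = facet
    for pos, func in ordered:
        if pos >= 0:  # applied to the search version only
            search = func(search)
    return facet, search


facet_value, search_value = normalize("  Dvořák, Antonín ", MAPPINGS)
print(repr(facet_value))
print(repr(search_value))
```

In this sketch the facet value keeps its diacritics and capitalization because only the trim step has a negative pos, while the search value is additionally stripped of diacritics and down-cased.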