FIXME - Add Information on Virtual Index Definitions in 3.1+

Bibliographic Indexing in Evergreen

Indexing and searching bibliographic data in Evergreen are complex processes. Also, Evergreen is extremely configurable in these areas and can be tuned to the specific needs of each installation. Because of this, there are many moving parts which are tightly integrated and interdependent.

Components

Here we will discuss the nuts and bolts of indexing. In each following subsection, you will see the database definition associated with each component, normally a table, followed by an explanation of the database entity's fields, what those fields control, and how those fields are linked and used by other components.

Search Classes

Table "config.metabib_class"
 Column | Type | Modifiers 
--------+------+-----------
 name   | text | not null
 label  | text | not null
Indexes:
    "metabib_class_pkey" PRIMARY KEY, btree (name)
    "metabib_class_label_key" UNIQUE, btree (label)
Referenced by:
    TABLE "config.metabib_field" CONSTRAINT "field_class_fkey" FOREIGN KEY (field_class) REFERENCES config.metabib_class(name)
    TABLE "config.metabib_search_alias" CONSTRAINT "metabib_search_alias_field_class_fkey" FOREIGN KEY (field_class) REFERENCES config.metabib_class(name)

Search classes are the largest basic grouping mechanism for indexed bibliographic data. When indexing and searching, data from a bib record or a search query is always constrained to a specific search class. The primary technical reason for this is that each search class has some basic normalization and weighting rules that can be applied globally. For example: when dealing with titles, strings are short and word order is important; when dealing with contributor names, word order can generally be disregarded but year range normalization is important.

Evergreen ships with six search classes, which group categories of fields into logical sets for searching in aggregate. The stock setup of these classes is as follows:

  • Title – Generally, MODS titleInfo elements
  • Author – Generally, MODS name elements that are direct children of the mods top-level element
  • Subject – Generally, descendants of the MODS subject element
  • Series – MODS titleInfo elements within a relatedItem element that has a type attribute of series
  • Keyword – Everything within the MODS record, except the originInfo element and descendants
  • Identifier – Record accession number, most MARC standard number fields, and bib-level free-text call numbers. Examples of identifiers that would be included in this search class are ISBNs, ISSNs, ISMNs, LCCNs, and OCLC numbers.

Name

The name of a search class is the value which can be used in the advanced search syntax (more on this in the Field Aliases section). In the stock Evergreen code, it is the lowercase version of the class names in the list above.

This name is used as the key for the search class, and is linked to by the Indexed Field Definitions and Field Aliases.

Label

An internationalizable field containing the display label for the search class. This shows up in the advanced search dropdown box and faceting sidebar in the OPAC. The translation of this field is dependent on the locale that the user has chosen in the OPAC.

XML Transformations

   Table "config.xml_transform"
    Column     | Type | Modifiers 
---------------+------+-----------
 name          | text | not null
 namespace_uri | text | not null
 prefix        | text | not null
 xslt          | text | not null
Indexes:
    "xml_transform_pkey" PRIMARY KEY, btree (name)

Evergreen stores MARCXML internally for all records, but provides the ability to index records after transforming them to other XML formats. For instance, the MODS format can be much simpler to work with than MARCXML, even for advanced catalogers, because MODS provides semantic interpretation via XSLT that is difficult or impossible to express in simple XPath.

Name

Each XML Transformation row requires a unique name. This field is linked from the Indexed Field Definitions.

XSLT

An XSLT document which provides a transformation from one bibliographic format to another. Today in Evergreen, because MARCXML is used as the canonical storage format, these transformation documents must go from MARCXML to something else. Evergreen supplies XSLT documents that target versions 3.0, 3.1, 3.2 and 3.3 of MODS, and the stock indexing definitions use either MODS 3.2 or MARCXML.

Namespace URI

When defining a new XML Transformation for Evergreen, a target XML Namespace URI must be provided. All viable XML-based bibliographic formats use an XML Namespace to scope their contents.

Namespace Prefix

In addition to a Namespace URI, Evergreen needs a prefix for each Namespace URI. This prefix can be anything that is legal for use as an XML Namespace Prefix, and need not be unique, though uniqueness is recommended for clarity.

Indexed Field Definitions

                                Table "config.metabib_field"
    Column    |  Type   |                             Modifiers                             
--------------+---------+-------------------------------------------------------------------
 id           | integer | not null default nextval('config.metabib_field_id_seq'::regclass)
 field_class  | text    | not null
 name         | text    | not null
 xpath        | text    | not null
 weight       | integer | not null default 1
 format       | text    | not null default 'mods33'::text
 search_field | boolean | not null default true
 facet_field  | boolean | not null default false
 label        | text    | not null
 facet_xpath  | text    | 
Indexes:
    "metabib_field_pkey" PRIMARY KEY, btree (id)
    "config_metabib_field_class_name_idx" UNIQUE, btree (field_class, name)
Foreign-key constraints:
    "field_class_fkey" FOREIGN KEY (field_class) REFERENCES config.metabib_class(name)
Referenced by:
    TABLE "metabib.author_field_entry" CONSTRAINT "metabib_author_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "config.metabib_field_index_norm_map" CONSTRAINT "metabib_field_index_norm_map_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "metabib.identifier_field_entry" CONSTRAINT "metabib_identifier_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "metabib.keyword_field_entry" CONSTRAINT "metabib_keyword_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "config.metabib_search_alias" CONSTRAINT "metabib_search_alias_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id)
    TABLE "metabib.subject_field_entry" CONSTRAINT "metabib_subject_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "metabib.title_field_entry" CONSTRAINT "metabib_title_field_entry_field_pkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "search.relevance_adjustment" CONSTRAINT "relevance_adjustment_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) DEFERRABLE INITIALLY DEFERRED

Each indexed datum coming from a bibliographic record in Evergreen is extracted based on an Indexed Field Definition. There are approximately thirty default definitions in a stock Evergreen installation spread across all six search classes. These definitions provide Evergreen with the information it needs in order to extract interesting data from bibliographic records.

Field Class

Each Indexed Field Definition references a Search Class via this field. When searching class-wide, all fields that share a class are searched at the same time.

Field Name

Each Definition also has a name which, like the name field on Search Classes, is available for use in the advanced search syntax.

Weight

Indexed data from different fields will probably be considered to have different importance when calculating the relevance of a matched query term. For instance, a match in a translated title may be considered less important than a match in the title proper. The weight allows control over this.

By supplying a higher or lower relative weight, one field can be made more or less important, in relevance ranking terms, than others. This value is used as a multiplier to the baseline cover-density ranking (discussed below), and so setting this value to 0 will allow matches, but always rank them at the bottom of the list. Likewise, supplying a very large weight multiplier will cause matches to appear at the top of the list. Floating point values are allowed, and values between 0 and 1 can be used to apply fine-grained, percentage-based adjustments.

Evergreen ships with all Indexed Field Definition weights set to 1 by default.
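The effect of the multiplier can be illustrated with a minimal Python sketch. The base rank numbers and the field names used here are made up for illustration, and the real ranking is computed inside the database rather than by code like this.

```python
# Hypothetical matches: (field name, baseline rank, configured weight).
# The baseline ranks are invented numbers standing in for cover-density scores.
matches = [
    ("title proper", 0.5, 1),
    ("translated title", 0.8, 0),
    ("uniform title", 0.3, 10),
]


def rank_matches(rows):
    """Multiply each baseline rank by its field weight and sort best-first."""
    scored = [(name, base_rank * weight) for name, base_rank, weight in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for name, score in rank_matches(matches):
    print(name, score)
```

The weight of 0 still lets the translated title match, but it sorts last despite having the best baseline rank, while the large weight lifts the uniform title to the top.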

XML Format

As mentioned above in the XML Transformations section, Evergreen allows indexing of bibliographic records after XSLT processing, which can aid in complex indexing requirements. The format field specifies what XML format the xpath and facet_xpath expressions expect to work in, and from this Evergreen can decide what, if any, XSLT must be applied to the record being indexed. The XML format must match the name of an XSLT transformation specified in config.xml_transform.

XPath

The xpath field holds an XPath expression that is used to extract sections of data from within an XML record, after any format-required XSLT has been applied. The text content of the node set returned by this XPath is then concatenated to provide an initial indexed value.

For instance, a title-related XPath expression such as

//mods32:mods/mods32:titleInfo[mods32:title and not (@type)]

will extract an XML subset similar to

    <titleInfo> 
      <title>Two lyric pieces</title> 
      <subTitle>for piano solo</subTitle> 
    </titleInfo> 

From this, the text content is extracted and concatenated to form

Two lyric pieces for piano solo

which is then indexed.
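The extraction and concatenation steps above can be approximated outside the database with Python's standard library. This is only a sketch: ElementTree supports a limited XPath subset, so the "has a title child and no type attribute" test is done in Python rather than in the expression, and the sample record (including its second titleInfo element) is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# MODS version 3 namespace, bound to the mods32 prefix used in the expression.
NS = {"mods32": "http://www.loc.gov/mods/v3"}

# Hypothetical, heavily trimmed MODS record.
record = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Two lyric pieces</title>
    <subTitle>for piano solo</subTitle>
  </titleInfo>
  <titleInfo type="alternative">
    <title>Lyric pieces</title>
  </titleInfo>
</mods>
"""


def extract_title_values(xml_text):
    """Return the concatenated text of each untyped titleInfo with a title."""
    root = ET.fromstring(xml_text)
    values = []
    for node in root.findall("mods32:titleInfo", NS):
        has_title = node.find("mods32:title", NS) is not None
        if not has_title or "type" in node.attrib:
            continue
        # Concatenate all descendant text, skipping whitespace-only pieces.
        parts = [t.strip() for t in node.itertext() if t.strip()]
        values.append(" ".join(parts))
    return values


print(extract_title_values(record))
```

The typed titleInfo element is skipped, so only the title proper and its subtitle contribute to the extracted value.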

Flags (search/facet)

Values extracted from bibliographic records by Indexed Field Definitions can be used for either searching or faceting, or both.

Facet XPath

For fields that are used for faceting, where facet_field is true, it is often necessary to further refine the set of XML nodes extracted by the xpath expression. The facet_xpath field holds an optional expression for accomplishing this goal.

For instance, an author-related XPath expression such as

//mods32:mods/mods32:name[@type='personal' and mods32:role/mods32:roleTerm[text()='creator']]

will extract an XML subset similar to

    <name type="personal"> 
      <namePart>Hobson, Bruce (R. Bruce)</namePart> 
      <role> 
        <roleTerm authority="marcrelator" type="text">creator</roleTerm> 
      </role> 
    </name> 

which then condenses to

Hobson, Bruce (R. Bruce) creator

which is the entire set of text content. However, for faceting, the "creator" string is harmful. By supplying a facet_xpath expression of

*[local-name()='namePart']

the node set is further refined to just

<namePart>Hobson, Bruce (R. Bruce)</namePart>

for which the text content is

Hobson, Bruce (R. Bruce)

which is the value we want for the facet.
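The difference between the search value and the facet value can be sketched in Python. The helper names below are hypothetical, and the local-name filter is written in plain Python as a stand-in for the facet_xpath expression, since the standard library does not evaluate local-name().

```python
import xml.etree.ElementTree as ET

# The node set selected by the xpath expression in the example above.
name_xml = """
<name type="personal">
  <namePart>Hobson, Bruce (R. Bruce)</namePart>
  <role>
    <roleTerm authority="marcrelator" type="text">creator</roleTerm>
  </role>
</name>
"""


def text_content(node):
    """Concatenate all descendant text, skipping whitespace-only pieces."""
    return " ".join(t.strip() for t in node.itertext() if t.strip())


def local_name(tag):
    """Drop any '{namespace}' prefix that ElementTree adds to a tag."""
    return tag.rsplit("}", 1)[-1]


def facet_value(node, wanted="namePart"):
    """Keep only child elements whose local name matches, then take their text."""
    return " ".join(text_content(c) for c in node if local_name(c.tag) == wanted)


node = ET.fromstring(name_xml)
print(text_content(node))  # full text content, as used for searching
print(facet_value(node))   # refined text content, as used for the facet
```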

Field Aliases

Table "config.metabib_search_alias"
   Column    |  Type   | Modifiers 
-------------+---------+-----------
 alias       | text    | not null
 field_class | text    | not null
 field       | integer | 
Indexes:
    "metabib_search_alias_pkey" PRIMARY KEY, btree (alias)
Foreign-key constraints:
    "metabib_search_alias_field_class_fkey" FOREIGN KEY (field_class) REFERENCES config.metabib_class(name)
    "metabib_search_alias_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id)

All Search Classes and Indexed Field Definitions can be given aliases for use in several interfaces.

First, these aliases provide a mechanism for internationalizing the user-supplied search constraints; for instance, "author" can be aliased to "skrywer" to support searching by native speakers of Afrikaans without having to know the English term "author".

In a similar manner, aliases can be used to map CQL context set match points, which have standard names external to any specific search backend, to appropriate match points in any given Evergreen installation. For instance, the proposed CQL 'bib' context set defines, among others, title indexes called dc.title, bib.titleUniform and bib.titleSeries. In a stock Evergreen installation, dc.title would be aliased to the entire title search class, bib.titleUniform to the uniform field within the title search class, and bib.titleSeries to the seriestitle field within the series search class.

Alias

The alias field is required and defines the string to be used to stand in for a Search Class or Indexed Field Definition.

Class

The class is required and specifies which Search Class this alias belongs to.

Field

The field is optional and specifies which Indexed Field Definition this alias stands in for. If left NULL, the entire class specified in the class field is aliased.
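A small Python sketch of how alias rows might be resolved follows. The alias strings come from the examples in this section, but the numeric field ids and the lookup function are hypothetical, and the real resolution happens in Evergreen's search code rather than here.

```python
from typing import NamedTuple, Optional, Tuple


class SearchAlias(NamedTuple):
    """One row shaped like config.metabib_search_alias."""
    alias: str
    field_class: str
    field: Optional[int]  # None stands in for NULL: the whole class is aliased


# Hypothetical ids for two Indexed Field Definitions.
FIELD_NAMES = {5: "uniform", 12: "seriestitle"}

ALIASES = {
    row.alias: row
    for row in [
        SearchAlias("skrywer", "author", None),
        SearchAlias("dc.title", "title", None),
        SearchAlias("bib.titleUniform", "title", 5),
        SearchAlias("bib.titleSeries", "series", 12),
    ]
}


def resolve_alias(alias: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (search class, field name or None), or None for an unknown alias."""
    row = ALIASES.get(alias)
    if row is None:
        return None
    field_name = FIELD_NAMES[row.field] if row.field is not None else None
    return row.field_class, field_name


print(resolve_alias("skrywer"))
print(resolve_alias("bib.titleUniform"))
```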

Normalization Functions

                               Table "config.index_normalizer"
   Column    |  Type   |                              Modifiers                               
-------------+---------+----------------------------------------------------------------------
 id          | integer | not null default nextval('config.index_normalizer_id_seq'::regclass)
 name        | text    | not null
 description | text    | 
 func        | text    | not null
 param_count | integer | not null default 0
Indexes:
    "index_normalizer_pkey" PRIMARY KEY, btree (id)
    "index_normalizer_name_key" UNIQUE, btree (name)
Referenced by:
    TABLE "config.metabib_field_index_norm_map" CONSTRAINT "metabib_field_index_norm_map_norm_fkey" FOREIGN KEY (norm) REFERENCES config.index_normalizer(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED

Data extracted from bibliographic records for indexing purposes will normally require some normalization. At the least, stripping of diacritics is normally required for searches to act as users expect.

Normalizer functions are in-database stored procedures, and can be written in any programming language supported by Postgres. The stock normalizers are all written in either PL/PerlU, PL/pgSQL or SQL.

Twenty-one normalizer functions are registered in the stock Evergreen installation. They are

 Name                                | Description
-------------------------------------+-------------
 Approximate High Date Normalize     | Normalize the value to the nearest date-ish value, rounding up.
 Approximate Low Date Normalize      | Normalize the value to the nearest date-ish value, rounding down.
 Coded Value Map Normalizer          | Applies coded_value_map mapping of values.
 Down-case                           | Convert text to lower case.
 Extract Dewey-like number           | Extract a string of numeric characters that resembles a DDC number.
 First word                          | Include only the first space-separated word of a string.
 Generic Mapping Normalizer          | Map values or sets of values to new values.
 ISBN 10/13 conversion               | Translate ISBN10 to ISBN13, and vice versa, for indexing purposes.
 Left truncation                     | Discard the specified number of characters from the left side of the string.
 NACO Normalize                      | Apply NACO normalization rules to the extracted text. See https://www.loc.gov/aba/pcc/naco/normrule-2.html for details.
 NACO Normalize – retain first comma | Apply NACO normalization rules to the extracted text, retaining the first comma. See https://www.loc.gov/aba/pcc/naco/normrule-2.html for details.
 Normalize date range                | Split date ranges in the form of "XXXX-YYYY" into "XXXX YYYY" for proper indexing.
 Normalize date range                | Normalize the value to NULL if it is not a number.
 Replace                             | Replace all occurrences of first parameter in the string with the second parameter.
 Remove Parenthesized Substring      | Remove any parenthesized substrings from the extracted text, such as the agency code preceding authority record control numbers in subfield 0.
 Right truncation                    | Include only the specified number of characters from the left side of the string.
 Search Normalize                    | Apply search normalization rules to the extracted text. A less extreme version of NACO normalization.
 Strip Diacritics                    | Convert text to NFD form and remove non-spacing combining marks.
 Trim Surrounding Space              | Trim leading and trailing spaces from extracted text.
 Trim Trailing Punctuation           | Eliminate extraneous trailing commas and periods in text.
 Up-case                             | Convert text to upper case.
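To make two of the descriptions above concrete, here is a Python sketch of what the Strip Diacritics and date-range-splitting normalizers are described as doing. These are illustrative re-implementations written from the descriptions alone, not the stock stored procedures, and the function names are assumptions.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Convert to NFD form and drop non-spacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def split_date_range(text):
    """Rewrite four-digit ranges like 'XXXX-YYYY' as 'XXXX YYYY'."""
    return re.sub(r"\b(\d{4})-(\d{4})\b", r"\1 \2", text)


print(strip_diacritics("Dvořák, Antonín"))
print(split_date_range("Dvořák, Antonín, 1841-1904"))
```

With diacritics stripped at both index and search time, a user who types the name without accents can still match the accented form.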

Name

The name field provides a short, human-readable label for a normalization function, such as NACO Normalize.

Description

The description field provides a description of how a normalization function will change text at indexing and search time.

Function

The func field supplies the function name, schema-qualified if necessary.

Parameter Count

Normalization functions always take the text to be normalized as the first parameter. If extra parameters are required, the number of extra parameters is listed here. For instance, the Right Truncation normalizer requires one extra parameter: the number of characters to retain from the beginning of the text.

Field Normalization Settings

                             Table "config.metabib_field_index_norm_map"
 Column |  Type   |                                    Modifiers                                     
--------+---------+----------------------------------------------------------------------------------
 id     | integer | not null default nextval('config.metabib_field_index_norm_map_id_seq'::regclass)
 field  | integer | not null
 norm   | integer | not null
 params | text    | 
 pos    | integer | not null default 0
Indexes:
    "metabib_field_index_norm_map_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "metabib_field_index_norm_map_field_fkey" FOREIGN KEY (field) REFERENCES config.metabib_field(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    "metabib_field_index_norm_map_norm_fkey" FOREIGN KEY (norm) REFERENCES config.index_normalizer(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED

In order to apply a Normalization Function to an Indexed Field Definition, it is necessary to create a mapping between the two.

When searching, these registered normalizations are applied to search terms in the same order as they are applied to indexed terms. This parity is required to ensure that search terms end up in the same normalization state as indexed terms, or else matching is not possible.

Note: Only normalizations with a negative pos value are applied to the facet version of indexed terms!

Field

The field value specifies which Indexed Field Definition should have a particular Normalization Function applied to its extracted text.

Normalizer

The norm field specifies which Normalization Function is to be applied.

Parameters

If the Normalization Function requires more parameters than simply the text to be normalized, as specified in the definition of the function, then the params field contains a JSON array containing the specific values to use when the linked normalization function is applied to the linked field.
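The relationship between param_count and the params JSON array can be sketched as follows. The two Python functions are stand-ins written from the normalizer descriptions, the registry dictionary is hypothetical, and whether stock mappings store numbers or strings inside the array is not stated here, so the sketch accepts either.

```python
import json


def right_truncation(text, length):
    """Keep only the first `length` characters of the text."""
    return text[: int(length)]


def replace(text, old, new):
    """Replace all occurrences of `old` with `new`."""
    return text.replace(old, new)


# Hypothetical registry: normalizer name -> (function, param_count).
NORMALIZERS = {
    "Right truncation": (right_truncation, 1),
    "Replace": (replace, 2),
}


def apply_mapping(text, normalizer_name, params_json=None):
    """Decode the params JSON array and pass its items after the text."""
    func, param_count = NORMALIZERS[normalizer_name]
    extra = json.loads(params_json) if params_json else []
    if len(extra) != param_count:
        raise ValueError(
            f"{normalizer_name} expects {param_count} extra parameter(s), "
            f"got {len(extra)}"
        )
    return func(text, *extra)


print(apply_mapping("Two lyric pieces", "Right truncation", "[3]"))
print(apply_mapping("a-b-c", "Replace", '["-", " "]'))
```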

Normalization Ordering

When more than one normalization function is mapped onto a specific field, and the order in which those functions are applied is significant, the pos field is used to order the normalization operations. The operations are applied in ascending pos order, from lowest to highest. Normalizations with a negative pos value are applied to the text before the value is stored, meaning both search and facet fields are affected, whereas those with a zero or positive pos value are only applied to search fields.
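The ordering rule, and the requirement that search terms go through the same normalizations as indexed terms, can be sketched in Python. The two normalizer functions are simplified stand-ins and the mapping list is hypothetical; this illustrates the rule as described, not Evergreen's actual ingest code.

```python
def down_case(text):
    return text.lower()


def trim_surrounding_space(text):
    return text.strip()


# Hypothetical mappings for one field: (pos, normalizer function).
MAPPINGS = [(0, down_case), (-1, trim_surrounding_space)]


def index_values(text, mappings):
    """Return (search value, facet value) for one extracted string."""
    ordered = sorted(mappings, key=lambda m: m[0])
    facet = text
    for pos, func in ordered:
        if pos < 0:  # applied before storage, so the facet value sees it too
            facet = func(facet)
    search = facet
    for pos, func in ordered:
        if pos >= 0:  # applied to the search value only
            search = func(search)
    return search, facet


def normalize_query(term, mappings):
    """Apply every mapped normalizer to a search term in ascending pos order."""
    for _pos, func in sorted(mappings, key=lambda m: m[0]):
        term = func(term)
    return term


search, facet = index_values("  Hobson, Bruce (R. Bruce) ", MAPPINGS)
print(repr(search), repr(facet))
print(normalize_query(" HOBSON, Bruce (R. Bruce)", MAPPINGS) == search)
```

The facet keeps its original capitalization because down-casing sits at pos 0, while the query and the search value end up in the same normalized state and therefore match.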

Search-oriented Index Definition Example

Facet-oriented Index Definition Example

Query Parser

documentation/indexing.txt · Last modified: 2022/02/10 13:34 by 127.0.0.1

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International

© 2008-2022 GPLS and others. Evergreen is open source software, freely licensed under GNU GPLv2 or later.
The Evergreen Project is a U.S. 501(c)3 non-profit organization.