Index

Most of PageSeeder’s data is fully indexed so that it can be used for searches, autocomplete features and quick lookups.

What data is indexed

For each group, PageSeeder indexes the following data:

Documents:

PSML (content, metadata, path, properties, labels, etc.).
PDF (content, metadata and path).
Microsoft Word DOCX format (content, metadata and path).
HTML files (content, metadata and path).
Images (metadata and path).

Folders (path).
Individual comments (content, properties and context).
Tasks (content, properties, status, due date, priority, context).

The media type of each document determines what is indexed. For example, PageSeeder can extract some metadata from images and the content of PDF and Word documents, while a richer array of semantics is available for PSML documents.

There is one index per group. Items belonging to multiple groups are indexed multiple times. It is possible to search multiple indexes at once through the service API.
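The per-group layout can be pictured with a minimal sketch. It is not PageSeeder code: the `indexes` structure, the `index_item` and `search` functions, and the group names are hypothetical stand-ins, and a real index is a Lucene index rather than a dictionary.

```python
from collections import defaultdict

# Hypothetical in-memory model: one index (item id -> text) per group.
indexes = defaultdict(dict)

def index_item(item_id, text, groups):
    """Index the item once for every group it belongs to."""
    for group in groups:
        indexes[group][item_id] = text

def search(term, groups):
    """Search several group indexes at once; returns sorted (group, item_id) pairs."""
    term = term.lower()
    return sorted(
        (group, item_id)
        for group in groups
        for item_id, text in indexes[group].items()
        if term in text.lower()
    )

# A document shared by two groups is stored in both indexes.
index_item("doc-1", "Quarterly report", ["acme-docs", "acme-archive"])
index_item("doc-2", "Release notes", ["acme-docs"])
hits = search("report", ["acme-docs", "acme-archive"])
```

Here `hits` contains one entry per group for `doc-1`, which mirrors the point that a shared item is indexed once for each group it belongs to.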

For an overview of what data is indexed for each type of item, see index fields.

How the index works

PageSeeder’s index is powered by Apache Lucene and PageSeeder’s Flint (open source). The key components for customizing the PageSeeder index are document types and labels. These are robust tools for adding content-specific semantics to the index of a document collection.

Index trigger

Whenever an indexable item is created, modified or deleted, the index is updated to reflect that change. Any of these events could trigger an index update:

The document content, metadata or properties are edited.
A document is uploaded.
A document is created.
A document or comment is moved.
A document is archived or deleted.
A comment is posted or replied to.
A task changes status, content or assignment.

A single trigger can cause multiple items to be indexed. For example, if the title of a document is changed, every document pointing to it might need to be reindexed. Similarly, any change in a shared item causes the item to be reindexed for each group it belongs to.
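The fan-out described above can be sketched as follows. This is an illustration under stated assumptions, not PageSeeder's implementation: the `reindex_jobs` function, the `inbound_links` and `groups_of` mappings, and the file and group names are all hypothetical.

```python
def reindex_jobs(changed, title_changed, inbound_links, groups_of):
    """Return the sorted (group, item) pairs to reindex after one change."""
    items = {changed}
    if title_changed:
        # Documents pointing to the changed document may need reindexing too.
        items.update(inbound_links.get(changed, ()))
    # Each affected item is reindexed once per group it belongs to.
    return sorted((group, item) for item in items for group in groups_of.get(item, ()))

# Hypothetical data: b.psml and c.psml both link to a.psml.
inbound_links = {"a.psml": ["b.psml", "c.psml"]}
groups_of = {"a.psml": ["g1", "g2"], "b.psml": ["g1"], "c.psml": ["g2"]}

title_jobs = reindex_jobs("a.psml", True, inbound_links, groups_of)
content_jobs = reindex_jobs("a.psml", False, inbound_links, groups_of)
```

In this sketch a title change to `a.psml` produces four jobs (the document in both of its groups, plus each linking document), while a change that does not affect the title produces only two.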

Type of item

Each item is processed by a different index template. PageSeeder dispatches each item for indexing based on its media type.

For documents, the media type is based on the file extension. For example, a Word document (ending in .doc) is processed by the application/word template; a PSML document (ending in .psml) is processed by the application/vnd.pageseeder.psml+xml template. When multiple templates match, the most specific is used, so recognized XML media types are processed differently from generic XML.
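A rough sketch of "most specific template wins" might look like the following. The extension table, the template names, the `.foo` extension with its `application/vnd.example.foo+xml` type, and the `+xml` fallback rule are assumptions made for illustration; only the PSML media type comes from the text above.

```python
import os

# Assumed extension-to-media-type table (illustrative only).
MEDIA_TYPES = {
    ".psml": "application/vnd.pageseeder.psml+xml",
    ".xml": "application/xml",
    ".foo": "application/vnd.example.foo+xml",  # hypothetical unrecognized XML type
}

# Assumed template registry; the template names are made up.
TEMPLATES = {
    "application/vnd.pageseeder.psml+xml": "psml-template",
    "application/xml": "generic-xml-template",
}

def select_template(filename):
    """Pick the most specific template for a file, or None if nothing matches."""
    extension = os.path.splitext(filename)[1].lower()
    media_type = MEDIA_TYPES.get(extension)
    if media_type is None:
        return None
    if media_type in TEMPLATES:
        # An exact match is the most specific choice.
        return TEMPLATES[media_type]
    if media_type.endswith("+xml"):
        # Fall back to the generic XML template for other XML-based types.
        return TEMPLATES.get("application/xml")
    return None

psml_choice = select_template("guide.psml")
fallback_choice = select_template("data.foo")
unknown_choice = select_template("notes.txt")
```

The PSML file gets its dedicated template even though it is also XML, while the hypothetical `.foo` file falls through to the generic XML template.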

Comments, tasks, and folders have their own set of templates.

Custom index templates that create custom fields are obsolete, and support for them is removed in PageSeeder v6. Custom fields can often be replaced with PSML properties or metadata.

Index fields

Each index template defines which fields are created and how. Index fields are ultimately what the search engine uses to find matching documents.

Index fields can have a number of attributes that affect how they are treated during a search.

For example, numeric fields are designed to make sorting more efficient, while analyzed fields are designed for full-text searches.
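The difference between the two kinds of field can be illustrated with a small sketch. It does not use Lucene or Flint: the `analyze` function is a crude stand-in for an analyzer, and the documents and field names are invented. It shows only why text is tokenized for full-text matching and why numbers are better kept as numbers for sorting. It does not model the efficiency gains mentioned above.

```python
import re

def analyze(text):
    """Crude stand-in for an analyzer: lowercase and split into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Invented documents with one text field and one numeric field.
docs = [
    {"id": "a", "title": "Index Fields Overview", "size": 900},
    {"id": "b", "title": "Search the index", "size": 12000},
    {"id": "c", "title": "Labels", "size": 150},
]

# Analyzed field: each token maps to the documents that contain it.
inverted = {}
for doc in docs:
    for token in analyze(doc["title"]):
        inverted.setdefault(token, set()).add(doc["id"])

matches = sorted(inverted.get("index", set()))

# Numeric field: values compare as numbers rather than as strings.
by_size = [d["id"] for d in sorted(docs, key=lambda d: d["size"])]
by_size_as_text = [d["id"] for d in sorted(docs, key=lambda d: str(d["size"]))]
```

Looking up `index` finds both titles regardless of letter case, and sorting on the numeric value gives a different (and correct) order from sorting the same values as text.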

For more explanations about field attributes, see Legend & explanations.

For a complete definition of each index field in PageSeeder, see index fields definitions.

Created on 6 December 2010, last edited on 7 June 2024 at 18:15