Index

Most of PageSeeder's data is fully indexed so that it can be used for searches, autocomplete features and quick lookups.

What data is indexed

For each group, PageSeeder indexes the following data:

  • documents
    • PSML (content, metadata, path, properties, labels, etc.)
    • PDF (content, metadata and path)
    • Microsoft Word docx format (content, metadata and path)
    • HTML files (content, metadata and path)
    • Images (metadata and path)
  • folders (path)
  • individual comments (content, properties and context)
  • tasks (content, properties, status, due date, priority, context)

The media type of each document determines what is indexed. For example, PageSeeder can extract some metadata from images and the content of PDF and Word documents, while a richer array of semantics is available for PSML documents.

There is one index per group. Items belonging to multiple groups are indexed multiple times. It is possible to search multiple indexes at once via the service API.
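
The per-group model can be illustrated with a toy sketch. This is not PageSeeder's implementation or its service API: the `indexes` structure, the `index_item` and `search` functions, and the group names are all hypothetical, chosen only to show that a shared item is stored once per group and that a search can span several indexes.

```python
from collections import defaultdict

# Hypothetical in-memory model: one index (item id -> text) per group.
indexes = defaultdict(dict)

def index_item(item_id, text, groups):
    """Add the item to the index of every group it belongs to."""
    for group in groups:
        indexes[group][item_id] = text

def search(term, groups):
    """Search several group indexes at once; return sorted (group, item_id) hits."""
    hits = []
    for group in groups:
        for item_id, text in indexes[group].items():
            if term.lower() in text.lower():
                hits.append((group, item_id))
    return sorted(hits)

# "doc-1" is shared by two groups, so it is indexed twice.
index_item("doc-1", "Quarterly report", ["acme-editors", "acme-reviewers"])
index_item("doc-2", "Style guide", ["acme-editors"])

results = search("report", ["acme-editors", "acme-reviewers"])
```

In this sketch the shared document appears once per group in `results`, which mirrors the statement that items belonging to multiple groups are indexed multiple times.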

For an overview of what data is indexed for each type of item, see index fields.

How the index works

PageSeeder's index is powered by Apache Lucene and PageSeeder's Flint (open source).

Index trigger

Whenever an indexable item is created, modified or deleted, the index is updated to reflect that change. Any of these events could trigger an index update:

  • the document content, metadata or properties are edited;
  • a document is uploaded;
  • a document is created;
  • a document or comment is moved;
  • a document is archived or deleted;
  • a comment is posted or replied to;
  • a task changes status, content or assignment.

Note

A single trigger can cause multiple items to be indexed. For example, if the title of a document is changed, every document pointing to it may need to be reindexed. Similarly, any change in a shared item will cause the item to be reindexed for each group it belongs to.
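
A rough sketch of this fan-out, under simplifying assumptions: the `links` and `memberships` structures and the `reindex_jobs` function are hypothetical and only model the two effects described above (documents pointing to the changed document, and one job per group).

```python
def reindex_jobs(changed_doc, links, memberships):
    """Return the sorted (group, document) pairs to reindex after a title change."""
    affected = {changed_doc}
    # Documents that point to the changed document may need reindexing too.
    affected |= {src for src, targets in links.items() if changed_doc in targets}
    # Each affected document is reindexed once per group it belongs to.
    return sorted(
        (group, doc) for doc in affected for group in memberships.get(doc, ())
    )

# Hypothetical data: which documents each document points to,
# and which groups each document belongs to.
links = {
    "guide.psml": {"glossary.psml"},
    "faq.psml": {"glossary.psml"},
    "glossary.psml": set(),
}
memberships = {
    "guide.psml": ["docs"],
    "faq.psml": ["docs", "support"],
    "glossary.psml": ["docs"],
}

jobs = reindex_jobs("glossary.psml", links, memberships)
```

Here a single change to `glossary.psml` produces four jobs: the document itself, the two documents pointing to it, and a second job for `faq.psml` because it is shared with another group.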

Type of item

Each type of item is processed by a different index template. PageSeeder dispatches each item for indexing based on its media type.

For documents, the media type is based on the file extension. For example, a Word document (ending in .doc) will be processed by the application/word template; a PSML document (ending in .psml) will be processed by the application/vnd.pageseeder.psml+xml template. When multiple templates match, the most specific one is used, so recognized XML media types are processed differently from generic XML.
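
The dispatch rule can be sketched as follows. This is an illustration only: the extension table, the set of available templates, the `application/xml` fallback and the function names are assumptions, not PageSeeder's actual configuration. Only the two media types quoted above come from the text.

```python
import os

# Hypothetical extension table; the real mapping is defined by PageSeeder.
MEDIA_TYPES = {
    ".psml": "application/vnd.pageseeder.psml+xml",
    ".doc": "application/word",
    ".svg": "image/svg+xml",
    ".xml": "application/xml",
}

# Hypothetical set of media types that have an index template.
TEMPLATES = {
    "application/vnd.pageseeder.psml+xml",
    "application/word",
    "application/xml",
}

def media_type_for(filename):
    """Derive the media type from the file extension."""
    ext = os.path.splitext(filename)[1].lower()
    return MEDIA_TYPES.get(ext, "application/octet-stream")

def select_template(filename):
    """Pick the most specific template: exact match first, then generic XML."""
    media_type = media_type_for(filename)
    if media_type in TEMPLATES:
        return media_type
    if media_type.endswith("+xml") and "application/xml" in TEMPLATES:
        return "application/xml"
    return None  # no template in this sketch

psml_template = select_template("manual.psml")
svg_template = select_template("logo.svg")
```

In this sketch a PSML file gets its dedicated template even though it is also XML, while an XML-based type without a dedicated template (SVG here) falls back to the generic XML template.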

Comments, tasks and folders have their own set of templates.

Index fields

Each index template defines which fields are created and how. Index fields are ultimately what the search engine uses to find matching documents.

Index fields can have a number of attributes which affect how they will be considered during the search.

For example, numeric fields are designed to make sorting more efficient while analyzed fields are designed for full-text searches.
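
The difference between the two kinds of field can be shown with a toy model. It does not use Lucene or Flint and does not reflect their data structures; the documents, the field names `title` and `size`, and the tokenizing rule are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical documents with one text field and one numeric field.
docs = {
    "a": {"title": "Indexing PSML documents", "size": 1200},
    "b": {"title": "PSML reference", "size": 90},
    "c": {"title": "Release notes", "size": 450},
}

# "Analyzed" field: the text is split into lowercase tokens so that
# a full-text query can match individual words.
inverted = defaultdict(set)
for doc_id, fields in docs.items():
    for token in re.findall(r"\w+", fields["title"].lower()):
        inverted[token].add(doc_id)

# Full-text lookup on the analyzed field.
matches = inverted["psml"]

# "Numeric" field: kept as a number so results can be ordered by value.
by_size = sorted(matches, key=lambda doc_id: docs[doc_id]["size"])
```

The analyzed `title` field answers the question "which documents mention this word", while the numeric `size` field is only used to order the matching documents.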

For more information about field attributes, see Legend & explanations.

For a complete definition of each index field in PageSeeder, see index fields definitions.
