Most of PageSeeder's data is fully indexed so that it can be used for searches, autocomplete features and quick lookups.
What data is indexed
For each group, PageSeeder indexes the following data:
- PSML (content, metadata, path, properties, labels, etc.),
- PDF (content, metadata and path),
- Microsoft Word docx format (content, metadata and path),
- HTML files (content, metadata and path),
- images (metadata and path),
- folders (path),
- individual comments (content, properties and context),
- tasks (content, properties, status, due date, priority, context).
The media type of each document determines what is indexed. For example, PageSeeder can extract some metadata from images and the content of PDF and Word documents, while a richer array of semantics is available for PSML documents.
There is one index per group. Items belonging to multiple groups are indexed multiple times. It is possible to search multiple indexes at once via the service API.
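The one-index-per-group model can be pictured with a small sketch. This is only a toy in-memory illustration, not PageSeeder's implementation or its service API: the `GroupIndexes` class, its method names and the group names are all hypothetical.

```python
from collections import defaultdict


class GroupIndexes:
    """Toy model of one index per group (hypothetical, for illustration only)."""

    def __init__(self):
        # group name -> {item id -> indexed text}
        self.indexes = defaultdict(dict)

    def index_item(self, item_id, text, groups):
        # An item that belongs to several groups is indexed once per group.
        for group in groups:
            self.indexes[group][item_id] = text.lower()

    def search(self, term, groups):
        # Query several group indexes at once and merge the hits.
        term = term.lower()
        hits = []
        for group in groups:
            for item_id, text in self.indexes.get(group, {}).items():
                if term in text:
                    hits.append((group, item_id))
        return sorted(hits)


idx = GroupIndexes()
idx.index_item("doc1", "Release notes", ["acme-docs", "acme-public"])
idx.index_item("doc2", "Internal notes", ["acme-docs"])
results = idx.search("notes", ["acme-docs", "acme-public"])
```

In this sketch `doc1` appears twice in `results`, once for each group index it was written to, which mirrors the statement that items belonging to multiple groups are indexed multiple times.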
For an overview of what data is indexed for each type of item, see index fields.
How the index works
PageSeeder's index is powered by Apache Lucene and PageSeeder's Flint (open source). The key components for customizing the PageSeeder index are document types and labels. These are robust tools for adding content-specific semantics to the index of a document collection.
Whenever an indexable item is created, modified or deleted, the index is updated to reflect that change. Any of these events could trigger an index update:
- the document content, metadata or properties are edited;
- a document is uploaded;
- a document is created;
- a document or comment is moved;
- a document is archived or deleted;
- a new comment or a reply to a comment is posted;
- a task changes status, content or assignment.
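As a rough sketch of the idea that every change event leads to an index update, the function below applies one event to a toy index held in a dictionary. The event shape (`kind`, `id`, `fields`) and the function name are assumptions made for the example. The sketch also assumes that a deletion removes the entry and that every other event simply re-indexes the item, which may differ from what PageSeeder does internally.

```python
def apply_event(index, event):
    """Apply one change event to a toy index (dict of item id -> fields).

    Hypothetical event shape: {"kind": str, "id": str, "fields": dict}.
    """
    item_id = event["id"]
    if event["kind"] == "deleted":
        # Assumption: a deleted item is dropped from the index.
        index.pop(item_id, None)
    else:
        # Assumption: created, edited, moved, etc. all re-index the item.
        index[item_id] = dict(event.get("fields", {}))
    return index


index = {}
apply_event(index, {"kind": "created", "id": "task1", "fields": {"status": "Open"}})
apply_event(index, {"kind": "edited", "id": "task1", "fields": {"status": "Closed"}})
```

After these two events the toy index holds a single entry for `task1` reflecting the latest status.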
Type of item
Each type of item is processed by a different index template. PageSeeder dispatches each item for indexing based on its media type.
For documents, the media type is based on the file extension. For example, a Word document (ending in .doc) will be processed by the application/word template; a PSML document (ending in .psml) will be processed by the application/vnd.pageseeder.psml+xml template. When multiple templates match, the most specific is used, so recognized XML media types are processed differently from generic XML.
Comments, tasks and folders have their own set of templates.
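The dispatch rule for documents can be sketched as a lookup with a fallback. This is a minimal illustration under stated assumptions, not PageSeeder's code: the `.doc` and `.psml` pairs are taken from the text above, while the `.xml` entry, the `TEMPLATES` registry, the `+xml` fallback rule and the function name `resolve_template` are hypothetical.

```python
import os

# Extension -> media type. The first two pairs come from the text above;
# the ".xml" entry is an assumption added for the example.
MEDIA_TYPES = {
    ".doc": "application/word",
    ".psml": "application/vnd.pageseeder.psml+xml",
    ".xml": "application/xml",
}

# Hypothetical registry of media types that have a dedicated template.
TEMPLATES = {
    "application/word",
    "application/vnd.pageseeder.psml+xml",
    "application/xml",
}


def resolve_template(filename, templates=TEMPLATES):
    """Return the media type of the template to use, or None if none matches."""
    ext = os.path.splitext(filename)[1].lower()
    media_type = MEDIA_TYPES.get(ext)
    if media_type is None:
        return None
    # Most specific match first: a template registered for the exact type.
    if media_type in templates:
        return media_type
    # Assumed fallback: any "+xml" type can be handled by the generic XML template.
    if media_type.endswith("+xml") and "application/xml" in templates:
        return "application/xml"
    return None
```

With the full registry, `resolve_template("guide.psml")` selects the PSML template; if only the generic XML template were registered, the same file would fall back to `application/xml`.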
Each index template defines which fields will be created and how. Index fields are ultimately what will be used by the search engine to find matching documents.
Index fields can have a number of attributes which affect how they will be considered during the search.
For example, numeric fields are designed to make sorting more efficient while analyzed fields are designed for full-text searches.
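To make the difference between field attributes more concrete, here is a small sketch. It does not use Lucene or Flint, and it illustrates behavior rather than efficiency; the `analyze`, `matches` and `sort_by_numeric` helpers and the sample values are hypothetical stand-ins.

```python
import re


def analyze(text):
    """Split text into lowercase word tokens, a crude stand-in for an analyzer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(field_value, term, analyzed):
    # Analyzed field: the term can match any single token (full-text search).
    if analyzed:
        return term.lower() in analyze(field_value)
    # Non-analyzed field: only the exact stored value matches.
    return field_value == term


def sort_by_numeric(docs, field):
    # Numeric field: values are compared as numbers, not as strings.
    return sorted(docs, key=lambda doc: doc[field])


title = "Quarterly Sales Report"
found_analyzed = matches(title, "sales", analyzed=True)
found_exact = matches(title, "sales", analyzed=False)
ordered = sort_by_numeric(
    [{"id": "a", "priority": 10}, {"id": "b", "priority": 9}], "priority"
)
```

In this sketch the analyzed title matches the single word "sales", the non-analyzed title does not, and the document with priority 9 sorts before the one with priority 10.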
For more details on field attributes, see Legend & explanations.
For a complete definition of each index field in PageSeeder, see index fields definitions.