URL types
URL types can be used to:
- Store metadata about any URL using all the features of PSML properties.
- Extract, and store as editable properties, metadata from a URL where the source is HTML.
URL types can only be used from the version 6 user interface in PageSeeder v5.99 or higher.
Extraction
PageSeeder can extract metadata when a URL is created, or later from existing URLs. To reprocess existing URLs, on the Global template page, click Manage global types, then the Reprocess button.
Reprocessing is useful:
- When your URL template has been changed, or
- When the source websites might have been updated and have new or changed metadata.
To preview the raw metadata, click show more on the display page for a URL. Alternatively, the metadata properties derived from the raw metadata can be viewed anywhere the Document properties are displayed.
When uploading PSML containing <link> elements, the URL metadata is automatically extracted where possible but, by default, won’t overwrite metadata for an existing URL.
If the user has permission to edit all URLs, then:
- To overwrite existing metadata, on the Upload document dialog, select Developer options, then select the “Overwrite metadata and document properties (title, docid, publication id/type, description, labels)” option.
- Any PSML metadata for URLs included in the upload is used instead of extracting it from the URL source.
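As an illustration of the kind of upload described above, the following is a minimal sketch of a PSML document containing a <link> element. The address https://example.org/report.html, the fragment ID, and the link text are placeholders invented for this example, and the sketch assumes the link target is given in the href attribute.

```xml
<!-- Hypothetical PSML document: the href value, IDs, and text are placeholders -->
<document level="portable">
  <section id="content">
    <fragment id="1">
      <para>See the <link href="https://example.org/report.html">annual report</link> for details.</para>
    </fragment>
  </section>
</document>
```

On upload, metadata for the linked URL would be extracted where possible, subject to the overwrite behavior described above.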
Following are example URL metadata field names for HTML sources but these vary widely, especially between different domains.
apple-touch-icon, byl, canonical, content-language, description, image, license, news_keywords, og:description, og:image, og:image:alt, og:title, og:type, og:url, pdate, shortcut icon, size, thumbnail, title, twitter:card, twitter:description, twitter:image, twitter:image:alt, twitter:title, twitter:url, url
Configuration
Configuration of a URL type is global for the whole server and is done with the following files in the Global template under the url/[url type] folder:
- url-config.xml
- url-template.psml
- editor-config.xml
- *.sch (Schematron files)
To create these files on the Global template page, click Manage global types, then Create URL type. Then click create in the column for that type.
URL config
When URLs of any type are created, the options specified in the URL config are applied. The URL config follows this structure:
<url-config>
  <creation> ... </creation>
  <labeling> ... </labeling>
  <publishing> ... </publishing>
</url-config>
These elements are used to configure the following:
- <creation> – which domains and media types the type can be used for (required).
- <labeling> – which labels are available on this URL (optional). It has the same format as for the document config except that labels/@type must be url.
- <publishing> – which publishing options are available to a particular type (optional).
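The following sketch shows how a <labeling> element might sit alongside <creation> in a URL config. The domain and the label names are invented for illustration, and the comma-separated label list is an assumption based on the statement that the format matches the document config.

```xml
<url-config>
  <creation>
    <domain name="example.org"/>
  </creation>
  <!-- Assumed label names; labels/@type must be "url" for a URL type -->
  <labeling>
    <labels type="url">reviewed,archived</labels>
  </labeling>
</url-config>
```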
<creation>
The <creation> element has the following structure:
<creation [disable="true"]>
  <title> ... </title>?
  <description> ... </description>?
  <domain name="..."/>*
  <media type="..."/>*
</creation>
All the following elements are optional:
Element | Description |
---|---|
<title> | The title of this URL type |
<description> | Description of this URL type |
<domain> | Which domains this type can be used on |
<media> | Which media types this type can be used for |
- @disable – set this boolean attribute to true to stop the creation of this URL type. This can be used to disable built-in URL types.
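As a sketch of the @disable attribute, the following url-config.xml would be placed in the folder of the URL type to be disabled. It assumes that an otherwise empty <creation> element is acceptable, which follows from all of its child elements being optional.

```xml
<!-- url-config.xml in the folder of the URL type to be disabled -->
<url-config>
  <!-- disable="true" stops the creation of this URL type -->
  <creation disable="true"/>
</url-config>
```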
<title>
The title is used by the PageSeeder UI as a user-friendly name for the URL type. Unlike the name of the URL type, it has no character restrictions and can contain any character.
It defaults to the name of the URL type.
<description>
The description is displayed to the user and is a good way to document what the URL type is for. It is recommended that every URL type have a description.
<domain> & <media>
The <domain> and <media> elements determine which type is used for a URL, based on its domain (for example youtube.com) and media type (for example application/pdf), using the following rules.
The type assigned to a URL by default is the first URL type, in alphabetical order of name:
- With matching domain and media type, then
- With matching domain and no media type, then
- With no domain and matching media type.
If no type matches, then the default type is used.
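To make the matching rules concrete, here is a sketch of a URL config that specifies both a domain and a media type. The type name report, the domain example.org, and the title and description text are all hypothetical.

```xml
<!-- url/report/url-config.xml: "report" is a hypothetical type name -->
<url-config>
  <creation>
    <title>Report</title>
    <description>PDF reports hosted on example.org.</description>
    <domain name="example.org"/>
    <media type="application/pdf"/>
  </creation>
</url-config>
```

Under the rules above, a PDF on example.org matches both the domain and the media type, so this type would be chosen ahead of a type that lists only the domain example.org, or only the media type application/pdf.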
Following is an example url-config.xml file.
<url-config>
  <creation>
    <title>YouTube</title>
    <description>Allows YouTube videos to be played in PageSeeder.</description>
    <domain name="youtu.be"/>
    <domain name="youtube.com"/>
    <domain name="www.youtube.com"/>
  </creation>
</url-config>
URL template
The url-template.psml controls the processing of the metadata fields for each URL type. By default, there are no url-template.psml files.
They follow the same format as document templates except that @level on <document> must be metadata. The metadata fields are inserted by using {$meta.[field name]} for attributes and <t:value name="meta.[field name]" /> for content.
After creating or modifying a URL template, all relevant URLs can be updated on the Global template page. Click the Manage global types button then the Reprocess button for the type and choose one or more of the following options:
- “Overwrite URL properties (title, description, labels)”.
- “Update the type of existing URLs to this one if their domain/media type matches”.
- Select the metadata properties to have their value updated.
When reprocessing the metadata, properties that aren’t currently in the template are always deleted and new properties that are not selected to be updated are added empty.
Defaults
Any of the following that are not specified in the template are extracted from the URL source metadata. This is the case for the default URL type unless it has been overridden in the Global template.
- uri/@title: is set from twitter:title if it exists, otherwise og:title, otherwise dc:title, otherwise title.
- uri/description: is set from twitter:description if it exists, otherwise og:description, otherwise dc:description, otherwise dc:description.abstract, otherwise description.
- uri/@size: is set from size.
- uri/@mediatype: is set from media-type.
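The interaction between a template and these defaults can be sketched as follows. The type name article is hypothetical, and a real template may need further elements. Only the title is specified here, so it would be taken from og:title instead of the fallback chain.

```xml
<!-- Hypothetical url-template.psml: "article" is an assumed type name -->
<document type="article" level="metadata" xmlns:t="http://pageseeder.com/psml/template">
  <documentinfo>
    <!-- Title is specified in the template, so the default fallback chain is not used for it -->
    <uri title="{$meta.og:title}"/>
  </documentinfo>
</document>
```

Because the description, size, and media type are not specified in this sketch, they would be filled from the URL source metadata using the defaults listed above.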
Following is an example url-template.psml file.
<document type="video" level="metadata" xmlns:t="http://pageseeder.com/psml/template">
  <documentinfo>
    <uri title="{$meta.twitter:title}">
      <description>
        <t:value name="meta.twitter:description" />
      </description>
      <labels>video</labels>
    </uri>
  </documentinfo>
  <metadata>
    <properties>
      <property name="content" title="Content" value="{$meta.og:video:content}" />
      <property name="width" title="Width" value="{$meta.og:video:width}" />
      <property name="height" title="Height" value="{$meta.og:video:height}" />
      <property name="format" title="Format" value="{$meta.og:video:type}" />
      <property name="category" title="Category" multiple="true">
        <value>
          <t:value name="meta.category" />
        </value>
      </property>
    </properties>
  </metadata>
</document>
Editor config
Metadata for a URL can be edited through the Document info & metadata panel. Alternatively, the Edit sheet supports bulk editing of properties in a grid view.
The metadata editor behavior is configurable through the editor-config.xml for the editor name PSMLMetadata using the same options as the PSML properties editor.
Following is an example editor-config.xml file.
<editor-configs>
  <editor-config name="PSMLMetadata">
    <field name="width" type="text" label="Width" pattern="[0-9]" />
    <field name="height" type="text" label="Height" pattern="[0-9]" />
    <field name="format" type="select" label="Format">
      <value>video/3gpp</value>
      <value>video/mp4</value>
      <value>video/mpeg</value>
      <value>video/quicktime</value>
      <value>video/vivio</value>
    </field>
  </editor-config>
</editor-configs>