Microsoft Word DOCX - Import Config

For creating a PSML file from a DOCX document.

For the default Word import config, see the default config example.

For further information regarding the use of this file, see:

Overview

The word-import-config provides a way to modify how Word DOCX files are converted to PSML. It is suitable for developers or technically minded end users. Changing the file requires some familiarity with XML syntax and a solid understanding of Microsoft Word. Depending on the workflow, using a modified config can require anything from very little permission to elevated permission.

The default version of the file is located at the path below and can be accessed through the Document types page. This requires System Admin rights.

/WEB-INF/template/default/document/docx/word-import-config.xml

While not all properties of Word's docx file format are accommodated by the Import Config, there is ample processing capability within Microsoft Word itself. The areas that are best leveraged by the Import Config are the following:

  • Hierarchical document structure – as expressed by Word heading levels, outline levels, sections, numbered lists and paragraphs.

  • Semantics – captured in paragraph, character styles, list styles, table styles, bookmarks, document types and various IDs.

  • Linking – although links can arguably be both structure and semantics, PageSeeder’s unique linking capabilities are worth mentioning separately. The import config can process Word’s cross-references into PageSeeder xrefs. It also builds multi-level links from the references document to the component documents.

There are also aspects of Word documents that aren’t currently imported into PageSeeder. Anywhere the following markup is meaningful, search and replace it with something that the import config can process:

  • Fonts / typefaces – good typography should always be applauded, but dependencies on operating systems, applications and devices, combined with a spectrum of licensing options, mean that typographical information can only be imported when it is attached to styles, not set directly.
    • Style settings – be aware that the actual settings of any styles, such as font specifications, are not imported into PageSeeder, only the style name.
  • Measurements – tab stops, margins, indents, column widths, line spacing, borders or the many other measurements in Word aren’t imported into PageSeeder. Exceptions include table widths and image size.
  • Headers / footers – in the Word package, headers and footers are hard to reconcile with the rest of the document. If they have been used to represent structure in a document, move the header and footer into the body of the document before importing. 
  • Macros – no Word macro code will work on PageSeeder. However, if macros are required when the document is exported, include them in the template. Be aware that some virus checkers or firewalls might be configured to block macro-enabled Word files, in which case the user might be required to manage the document extension.
  • Content controls – where a forms-based user interface is required in PageSeeder, the templates must be created separately, not imported.
  • Fields – some field data such as bookmarks and cross-references are automatically imported and some fields such as Index entries can be supported with additional configuration. However, by default the majority of Word field data are discarded. 
  • Security – PageSeeder has a rich, role and group-based, permission model that provides different levels of access to files. These must be separately set through the PageSeeder interface, not read as part of the import process.

Usage

To replace the PageSeeder default import config, follow the steps below:

  1. Sign in to PageSeeder and select a project or group.
  2. Select Enable developer mode under Account menu > Preferences.
  3. To change the config for yourself only (available to group managers only), go to a group. Click the Upload document... option (located under the Plus button on several group pages) and drop or browse to a docx file. Click the Import options icon, then Import document as PageSeeder PSML. Then either:
    1. Click +Show more options and click Edit config,
    2. OR click the Preview button, then click the config button.
  4. To change the default configuration for everyone in a project (available to project managers and administrators only), click Administration menu > Template > Types, then in the Media types table, Word import column, docx row, select override under the default option (only available on the Types page). To revert to the default config, click delete in the same Word import column and docx row, or delete your customized default folder under Administration menu > [Project]Template > Template files.

 

When editing this config file in PageSeeder, pressing ctrl-space displays autocomplete options to make editing easier.

Background

Commonly referred to as ‘dock-ex’, the format for storing Microsoft Word documents is rich and comprehensive. PSML, the format of PageSeeder documents, is deliberately designed to be simpler than docx without compromising the quality of the final document. This is accomplished by moving complexity out of the file format and into the PageSeeder development environment.

This shift makes the system easier for end users to learn and easier for developers to control. Developers don’t have to spend as much time coding to prevent problems and can spend more time coding features. Both stakeholders become more productive and get started much faster. By providing a well-designed development environment, PageSeeder makes developers productive and innovative. However, in between users and developers are legacy documents. Processing legacy documents can require considerable attention, and who is best placed to fix the documents (developers or users) is not always clear.

The Word Import Config has been designed so that technically confident users and developers can customize the import. Before configuring the import function, it is important to understand how Word files work. 

  • The starting point is to become familiar with Microsoft’s Open XML and Open Packaging Conventions. A handy tool for this is the OpenXML viewer for Google’s Chrome browser, which is free to download.
  • Secondly, it’s important to understand how the Word files have been formatted, particularly how consistently and comprehensively styles and other Word features have been used.

Once that information is understood, converting DOCX files to PageSeeder is much easier.

Finally, nine times out of ten, the best return on investment is to clean up the Word document in Word.

Configurable Components

For a companion reference to this page (still in progress), see Word docx Import Config Schema reference.

This document covers the primary markup that is converted from Word. These components are as follows:

  • <split> – tells PageSeeder how to handle MathML (equations), footnotes and endnotes.
  • <lists> – configures how lists are handled.
  • <styles> – converts the implied semantics and structure of Word styles into PageSeeder instructions.

Word style names in the import config refer to the style ID which is not the same as the style name that appears in the Word user interface. Style IDs are usually the same as the style name but with the spaces removed (e.g. name List Number 2 has ID ListNumber2).
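
To make the distinction between ID and name concrete, the sketch below shows roughly how a style is declared inside a DOCX package (word/styles.xml), followed by a rule that refers to it. The w:style fragment is trimmed to the two relevant parts and omits its namespace declaration, and the mapping to para is only an illustrative choice, not a recommended setting.

```xml
<!-- Inside the DOCX package, word/styles.xml (trimmed): ID and UI name differ -->
<w:style w:type="paragraph" w:styleId="ListNumber2">
  <w:name w:val="List Number 2"/>
</w:style>

<!-- In word-import-config.xml: @name takes the style ID, not the UI name -->
<wordstyle name="ListNumber2" psmlelement="para"/>
```

When a rule appears to have no effect, checking the w:styleId in the package is usually a quicker test than guessing from the name shown in Word.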

Store – <split>

The <main>, <document> and <section> elements previously under <split> are deprecated as of PageSeeder v5.99.

Importing a Word document involves two configuration files. First the DOCX document content is processed using the word-import-config.xml, then it is split using the psml-split-config.xml to give the final imported PSML.

For more detail, see PSML split config usage.

<split>
  <mathml select="true"
          output="generate-files"
          convert-to-mml="true"/>
  <footnotes select="true"
             output="generate-files"/>
  <endnotes select="true"
            output="generate-files"/>
</split>

<mathml>

This option generates MathML files for any MathML objects inside the Word document.

<mathml select="true" 
        output="generate-files" 
        convert-to-mml="true"/>

The attributes for <mathml> are:

  • @select – to disable generation of MathML content, set select to false, which ignores MathML content (default true).
  • @output – if the value is generate-files, each MathML object is placed in a separate file under a mathml folder. If the value is generate-fragments, each MathML object is placed in a fragment inside its own document with the path mathml/mathml-[n].psml (default is generate-fragments; requires pso-docx version 0.7.8 or higher).
  • @convert-to-mml – if true, the objects are converted to standard MathML. If false, the objects retain the Office Open XML math syntax (default true, and always true for the generate-fragments option).

<footnotes>

This option generates footnote files for any footnotes referenced inside the Word document.

<footnotes select="true" output="generate-files"/>

The attributes for <footnotes> are:

  • @select – to disable generation of footnote content, set select to false, which ignores footnote content (default true).
  • @output – if generate-files is selected, each footnote reference is placed in a separate file under a footnotes folder. If generate-fragments is selected, each footnote is placed in a fragment inside the footnotes/footnotes.psml file (default generate-fragments).

<endnotes>

This option generates endnote files for any endnotes referenced inside the Word document.

<endnotes select="true" output="generate-files"/>

The attributes for <endnotes> are:

  • @select – to disable generation of endnote content, set select to false, which ignores endnote content (default true).
  • @output – if generate-files is selected, each endnote reference is placed in a separate file under an endnotes folder. If generate-fragments is selected, each endnote is placed in a fragment inside the endnotes/endnotes.psml file (default generate-fragments).
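
As a contrast to the generate-files example at the top of this section, here is a minimal sketch of a <split> block that relies on fragments and switches equations off. It only combines attribute values described above; whether this combination suits a particular publication is an assumption to verify with a test import.

```xml
<split>
  <!-- select="false": MathML content is ignored during import -->
  <mathml select="false"
          output="generate-fragments"
          convert-to-mml="true"/>
  <!-- each footnote becomes a fragment in footnotes/footnotes.psml -->
  <footnotes select="true"
             output="generate-fragments"/>
  <!-- each endnote becomes a fragment in endnotes/endnotes.psml -->
  <endnotes select="true"
            output="generate-fragments"/>
</split>
```

Fragments keep all notes in one document each, which tends to be easier to review than a folder of many small files.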

Order and organize – <lists>

Provides options for interpreting Word lists, numbered headings and paragraphs, plus any numbering added to the documents.

<lists>
  <add-numbering-to-document-titles select="true"/>
  <convert-to-list-roles select="false"/>
  <!-- generate numbered attribute to paragraphs for lists -->
  <convert-to-numbered-paragraphs select="true">
    <!-- attach prefix or numbering or inline=[label] or text -->
    <level value="1" output="prefix"/>
    <level value="2" output="prefix"/>
    <level value="3" output="prefix"/>
    <level value="4" output="numbering"/>
    <level value="5" output="numbering"/>
    <level value="6" output="inline=level6"/>
  </convert-to-numbered-paragraphs>

  <convert-manual-numbering select="true">
    <value match="^[\(|\[|\{][a-z]+[\)|\]|\}]">
      <inline label="numbering-lowercase" />
    </value>
    <value match="^[\(|\[|\{][A-Z]+[\)|\]|\}]">
      <prefix/>
    </value>
    <value match="^[\(|\[|\{][ivx]+[\)|\]|\}]">
      <list role="numbering-roman"/>
    </value>
  </convert-manual-numbering>
</lists>

<add-numbering-to-document-titles>

To add numbering to the title of each split document, the @select attribute must be set to true. With any other value, numbers are not added.

<add-numbering-to-document-titles select="true"/>

<convert-to-list-roles>

Allows lists to carry a @role attribute set to the value of the Word paragraph style. To enable this, the @select attribute must be set to true. By default it is set to false.

<convert-to-list-roles select="false"/>

A use case for list roles is to mimic the List Styles feature of Word. In List Styles, the Level 1 style restarts the numbering for all levels below. This functionality can be replicated in PageSeeder using the list role.
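
A rough sketch of the effect, assuming a hypothetical Word paragraph style with the ID LegalList: the first line enables the option and the second element shows the kind of PSML that could result. The exact output shape (list versus nlist, item content) depends on the source document and is an assumption here.

```xml
<!-- Config: carry the Word paragraph style over as the list role -->
<convert-to-list-roles select="true"/>

<!-- Possible PSML for paragraphs styled with the hypothetical style ID "LegalList" -->
<nlist role="LegalList">
  <item>First clause</item>
  <item>Second clause</item>
</nlist>
```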

<convert-to-numbered-paragraphs>

Controls the conversion of numbered paragraph styles to numbered paragraphs or lists in PageSeeder. To convert to numbered paragraphs, the @select attribute must be set to true. With any other value, they are converted to <list> or <nlist>, depending on the type of numbering.

<convert-to-numbered-paragraphs select="true">

The conversion to numbered paragraphs uses the following @output values for each of the list levels:

  • prefix – generates a prefix containing the current numbering value for each Word numbered paragraph.

  • numbering – generates numbered paragraphs.

  • inline=[labelname] – wraps the paragraph number from Word in an inline label.

  • text – places the current numbering value in the paragraph content for each Word numbered paragraph.

<convert-to-numbered-paragraphs select="true">
        <!-- prefix or numbering or inline=[label] or text -->
        <level value="1" output="prefix"/>
        <level value="2" output="prefix"/>
        <level value="3" output="text"/>
        <level value="4" output="numbering"/>
        <level value="5" output="numbering"/>
        <level value="6" output="inline=level6"/>
</convert-to-numbered-paragraphs>

In the preceding example, levels 1 and 2 are transformed into prefixes, level 3 is transformed into text, levels 4 and 5 are transformed into numbered paragraphs and level 6 is wrapped in an inline label named level6.
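
To visualize the four @output values, the sketch below shows the kind of PSML each one might produce for the same Word paragraph. The numbers and text are invented for illustration, and real output may carry additional attributes, so treat this as an approximation and not a guaranteed result.

```xml
<!-- output="prefix": the number moves to a @prefix attribute -->
<para prefix="1.2">Scope of the agreement</para>

<!-- output="text": the number stays in the paragraph content -->
<para>1.2.1 Scope of the agreement</para>

<!-- output="numbering": the paragraph carries @numbered="true" -->
<para numbered="true">Scope of the agreement</para>

<!-- output="inline=level6": the number is wrapped in an inline label -->
<para><inline label="level6">(a)</inline> Scope of the agreement</para>
```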

<convert-manual-numbering>

Controls the handling of manually typed (non-automated) numbering that can exist in the Word file. To convert manual numbering in paragraphs, the @select attribute must be set to true. With any other value, manual numbering is not converted.

<convert-manual-numbering select="true">

The <value> element has a @match attribute, which must follow the conventions of XSLT regular expressions (www.w3.org/TR/xslt20/#regular-expressions).

The <value> element accepts three options:

  • <prefix/> – generates a @prefix attribute containing the current numbering value for each Word paragraph that matches the regular expression.

  • <inline label="[numbering-format]" /> – generates an <inline> element whose content is the current numbering value for each Word paragraph that matches the regular expression.

  • <list role="[list role]"/> – generates a <list> element from Word paragraphs that match the regular expression (still in beta).

<convert-manual-numbering select="true">
        <value match="^[\(|\[|\{][a-z]+[\)|\]|\}]">
          <inline label="numbering-lowercase" />
        </value>
        <value match="^[\(|\[|\{][A-Z]+[\)|\]|\}]">
          <prefix/>
        </value>
        <value match="^[\(|\[|\{][ivx]+[\)|\]|\}]">
          <list role="numbering-roman"/>
        </value>
        <value match="Part&#160;[A-Z0-9]+">
          <prefix />
        </value>
        <value match="Note:\s*">
          <prefix />
        </value>
        <value match="\s*[0-9]+[A-Z]*$">
          <prefix />
        </value>
</convert-manual-numbering>

Values that match the following regular expression are output in an <inline> element with an attribute of @label="numbering-lowercase".

^[\(|\[|\{][a-z]+[\)|\]|\}]

Values that match the following regular expression are output as the value of a @prefix attribute in a <para> element. 

^[\(|\[|\{][A-Z]+[\)|\]|\}]

Values that match the following regular expression are converted into a list with @role="numbering-roman".

^[\(|\[|\{][ivx]+[\)|\]|\}]

Any number of <value> elements can be added, as long as each @match attribute is a valid regular expression (check www.w3.org/TR/xslt20/#regular-expressions).
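
One detail worth knowing when adapting these patterns: inside a character class, | is an ordinary character and not an alternation operator, so the bracket classes above also accept a leading or trailing pipe. If that is not wanted, a tighter variant might look like the sketch below. This is a suggested variation, not the shipped default, and it reuses the label from the example above.

```xml
<convert-manual-numbering select="true">
  <!-- (a), [a] or {a}: the classes list only the three bracket characters -->
  <value match="^[\(\[\{][a-z]+[\)\]\}]">
    <inline label="numbering-lowercase"/>
  </value>
  <!-- (A), [A] or {A} -->
  <value match="^[\(\[\{][A-Z]+[\)\]\}]">
    <prefix/>
  </value>
</convert-manual-numbering>
```

Testing a new pattern against a handful of real paragraphs before a bulk import is a cheap way to catch over-matching.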

Semantics and formatting – <styles>

Contains the settings for general transformations of DOCX to PSML. These are <ignore>, <default> and <wordstyle>.

<styles>
        <ignore>
            <wordstyle value="TOC1" />
            <wordstyle value="TOC2" />
            <wordstyle value="TOC3" />
            <wordstyle value="TOC4" />
        </ignore>
        <default>
            <paragraphStyles value="block" />
            <!-- possible values: 'para' or 'block' -->
            <characterStyles value="inline" />
            <!-- possible values: 'none' or 'inline' -->
            <smart-tag keep="true" />
            <!-- possible values: 'false' or 'true' -->
            <references psmlelement="link" />
            <!-- required when using PSML split config -->
            <!-- <property name="prefix" value="true" /> possible values: 'false' or 'true' -->
        </default>

        <!-- Values accepted: name of style available in word file -->
        <wordstyle name="Title" psmlelement="block">
          <label type="block" value="title"/>
        </wordstyle>

        <wordstyle name="Subtitle" psmlelement="block">
          <label type="block" value="subtitle"/>
        </wordstyle>
        <wordstyle name="Heading1" psmlelement="heading">
            <level value="1" />
        </wordstyle>

        <wordstyle name="Heading2" psmlelement="para">
        </wordstyle>

        <wordstyle name="Heading3" psmlelement="inline">
            <label value="Heading3" />
        </wordstyle>

        <wordstyle name="Heading4" psmlelement="block">
            <label value="Heading4" />
        </wordstyle>

        <wordstyle name="Heading5" psmlelement="heading">
            <level value="5" />
            <label type="block" value="Heading5" />
        </wordstyle>

        <wordstyle name="Heading6" psmlelement="heading">
            <level value="6" />
        </wordstyle>

    </styles>

<ignore>

Determines which content should not be processed. For example, the Word Table of Contents paragraphs can often be discarded. To do this, use something like the following:

        <ignore>
            <wordstyle value="TOC1" />
            <wordstyle value="TOC2" />
            <wordstyle value="TOC3" />
            <wordstyle value="TOC4" />
        </ignore>
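
The same pattern extends to other generated content. As a sketch, the block below also drops the heading that Word places above a table of contents; the style ID TOCHeading is an assumption based on Word's built-in "TOC Heading" style and should be checked against the actual document.

```xml
<ignore>
  <!-- TOC entry levels, as in the example above -->
  <wordstyle value="TOC1"/>
  <wordstyle value="TOC2"/>
  <!-- assumed style ID for Word's built-in "TOC Heading" style -->
  <wordstyle value="TOCHeading"/>
</ignore>
```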

<default>

Defines settings for the following:

<paragraphStyles>

Defines a mapping for a paragraph style not mapped by <wordstyle> or <lists>, where @value supports the following: 

  • para – transforms all un-mapped Word paragraph styles to a PSML <para> element.

  • block – transforms all un-mapped Word paragraph styles to a PSML <block> element with a label equal to the Word paragraph style ID (note: the ID is different from Word paragraph style name).

To get the style ID, Word strips underscores and spaces from the style name but preserves hyphens.

<characterStyles>

Defines a general rule for any character style not mapped with <wordstyle>. @value supports the following:

  • none – strips the markup for un-mapped Word character styles.

  • inline – transforms un-mapped Word character styles to a PSML <inline> element with a label equal to the Word character style ID (note: the ID is different from Word character style name).

<smart-tag>

Word smart tag information can be either discarded or captured in PageSeeder as an inline label with a value equal to that of the smart tag. To capture it, the @keep attribute must be set to true. With any other value, the smart-tag markup is discarded.

<references>

This element sets whether internal references in the DOCX are imported as PSML <link> or <xref> elements. The @psmlelement attribute can have the value link or xref (default is xref). The link option also imports all bookmarks as <anchor> elements. This requires pso-docx v0.8.3 or higher.
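
Putting the four settings together, here is a minimal sketch of a <default> block aimed at a "clean" import that discards anything not explicitly mapped. It uses only values described above; link is kept for <references> because the default config notes that it is required when using the PSML split config.

```xml
<default>
  <!-- unmapped paragraph styles become plain <para> elements -->
  <paragraphStyles value="para"/>
  <!-- markup for unmapped character styles is stripped -->
  <characterStyles value="none"/>
  <!-- smart tag markup is discarded -->
  <smart-tag keep="false"/>
  <!-- internal references become <link>, bookmarks become <anchor> -->
  <references psmlelement="link"/>
</default>
```

The trade-off is that style information not covered by a <wordstyle> rule is lost, so this suits documents whose styles are already well understood.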

<wordstyle>

These rules transform Word paragraph or character styles into PSML elements. Example PSML elements include:

  • <para>
  • <heading>
  • <monospace>
  • <preformat>
  • <caption>
  • <block>
  • <inline>

Consider the following example:

 <wordstyle name="Heading1" psmlelement="heading">
   <level value="1" />
 </wordstyle>

 <wordstyle name="Heading2" psmlelement="para">
 </wordstyle>

 <wordstyle name="Heading3" psmlelement="inline">
   <label value="Heading3" />
 </wordstyle>
 
 <wordstyle name="Heading4" psmlelement="block">
   <label value="Heading4" />
 </wordstyle>

 <wordstyle name="Heading5" psmlelement="heading">
    <level value="5" />
    <label type="block" value="Section_highlight" />
 </wordstyle>

 <wordstyle name="HTMLCode" psmlelement="monospace"/>

 <wordstyle name="HTMLPreformatted" psmlelement="preformat"/>

 <wordstyle name="TableCaption" psmlelement="caption" table="default"/>

Given the rules expressed in the code above:

  • Word paragraphs with a style ID of Heading1 are transformed to a <heading> element with an attribute @level="1".

  • Word paragraphs with style ID Heading2 are transformed to <para>.

  • Word paragraphs with style ID Heading3 are transformed to an <inline> element with an attribute @label="Heading3".

  • Word paragraphs with style ID Heading4 are transformed to a <block> element with an attribute @label="Heading4".

  • Word paragraphs with style ID Heading5 are transformed to a <heading> element with an attribute @level="5", wrapped in a <block> element with attribute @label="Section_highlight".

  • Word text with style ID HTMLCode is transformed to a <monospace> element.

  • Word paragraphs with style ID HTMLPreformatted are transformed to a <preformat> element.

  • Word paragraphs with style ID TableCaption are transformed to a <caption> element for all tables.
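
For orientation, the sketch below shows the kind of PSML a few of these rules might produce. The text content is invented and the exact whitespace or extra attributes may differ, so read it as an approximation of the rules above, not as captured output.

```xml
<!-- Heading1 rule -->
<heading level="1">Introduction</heading>

<!-- Heading2 rule -->
<para>Background</para>

<!-- Heading5 rule: heading wrapped in a labelled block -->
<block label="Section_highlight">
  <heading level="5">Important notice</heading>
</block>

<!-- HTMLPreformatted rule -->
<preformat>line one
line two</preformat>
```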

@psmlelement
heading

Possible child elements are:

<wordstyle name="Heading1" psmlelement="heading">
  <level value="1" />
  <label type="block" value="heading1" />
</wordstyle>

  • <level> with attribute @value ranging from 1 to 6.

  • <label> with attributes:

    • @type with values block or inline.

    • @value with value of  a [valid label name].

para

Possible child elements are:

<wordstyle name="paragraph1" psmlelement="para">
  <indent value="1" /> 
  <label type="block" value="para1"/>
  <numbering select="true" value="inline"/>
</wordstyle>
<wordstyle name="paragraph2" psmlelement="para">
  <indent value="2" /> 
  <label type="inline" value="para2"/>
  <numbering select="true" value="prefix"/>
</wordstyle>
<wordstyle name="paragraph3" psmlelement="para">
  <indent value="3" /> 
  <numbering select="true" value="inline">
    <label value="num3"/>
  </numbering>
</wordstyle>

  • <indent> with attribute:

    • @value with values of 1 to 6.

  • <label> with attributes of:

    • @type with values of block or inline.

    • @value with value of a [valid label name].

  • <numbering> with attributes of:

    • @select with values of true or false.

    • @value with value of:

      • inline: wrap number in an inline label specified by nested <label value="[valid label name]"> element.

      • text: include number in paragraph text.

      • prefix: insert number in @prefix attribute.

      • numbering: add @numbered="true".

caption

A caption mapping should have a @table attribute, either with the value default, meaning it applies to all tables, or with the ID of the specific table style to which it applies.
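
The two forms can be sketched side by side as follows. TableCaption comes from the earlier example, while FinancialCaption and FinancialTable are hypothetical style IDs used only for illustration.

```xml
<!-- applies to the captions of all tables -->
<wordstyle name="TableCaption" psmlelement="caption" table="default"/>

<!-- applies only to tables using the hypothetical table style ID "FinancialTable" -->
<wordstyle name="FinancialCaption" psmlelement="caption" table="FinancialTable"/>
```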
