Microsoft Word docx Import Config

Overview

The docx import config provides a way for both developers and less-technical users to control how Microsoft Word (docx) documents are imported to PageSeeder. As a simple XML file, editing values in the import config is the easiest way to change the treatment of specific aspects of a docx file. 

While not all properties of Word's docx file format are accommodated by the Import Config, there is ample processing capability within Microsoft Word itself. The areas that are best leveraged by the Import Config are the following:

  • hierarchical document structure – as expressed by Word heading levels, outline levels, sections, numbered lists and paragraphs.

  • semantics – captured in paragraph, character styles, list styles, table styles, bookmarks, document types and various IDs.

  • linking – although links can arguably be both structure and semantics, PageSeeder's unique linking capabilities are worth mentioning separately. The import config can process Word's cross-references into PageSeeder XRefs, and it also builds multi-level links from the references document to the component documents.

There are also aspects of Word documents that are not currently imported into PageSeeder. Anywhere the following markup is meaningful, search and replace it with something that the import config can process:

  • fonts / typefaces – good typography should always be applauded, but dependencies on operating systems, applications and devices, combined with a spectrum of licensing options, mean that typographical information can only be imported when it is attached to styles, not set directly.
    • style settings – be aware that the actual settings of any styles, such as font specifications, are not imported into PageSeeder, only the style name.
  • measurements – tab stops, margins, indents, column widths, line spacing, borders or the many other measurements in Word are not imported into PageSeeder. Exceptions include table widths and image size.
  • headers / footers – in the Word package, headers and footers are hard to reconcile with the rest of the document. If they have been used to represent structure in a document, move the header and footer into the body of the document before importing. 
  • macros – no Word macro code will work on PageSeeder. However, if macros are required when the document is exported, include them in the template. Be aware that some virus checkers or firewalls may be configured to block macro-enabled Word files, in which case the user may be required to manage the document extension.
  • content controls – where a forms-based user interface is required in PageSeeder, the templates must be created separately, not imported.
  • fields – some field data such as bookmarks and cross-references are automatically imported and some fields such as Index entries can be supported with additional configuration. However, by default the majority of Word field data are discarded. 
  • security – PageSeeder has a rich role- and group-based permission model that provides different levels of access to files. These permissions must be set separately through the PageSeeder interface; they are not read as part of the import process.

Background

Commonly referred to as 'dock-ex', the format for storing Microsoft Word documents is rich and comprehensive. PageSeeder documents are designed to be much simpler than Word documents. By reducing the surface area for end users and increasing it for developers, PageSeeder implementations shift complexity away from users. This makes users more productive and much faster at getting started. By providing a well-designed development environment, PageSeeder makes developers productive and innovative. However, in between users and developers are legacy documents. Processing legacy documents can require considerable attention, and who is best placed to fix the documents (developers or users) is not always clear.

Because every organisation and every dataset is different, the Word Import Config tries to support both users and developers. However, before configuring the import function, it is important to understand how Word files work.

  • the starting point for this is to become familiar with Microsoft's Open XML and Open Packaging Conventions. A handy tool for this is the OpenXML viewer for Google's Chrome browser, which is free to download.
  • secondly, it is important to understand how the Word files have been formatted, in particular how consistently and comprehensively styles and other aspects of Word have been used.

Once that information is understood, converting docx files to PageSeeder will be much easier.

Finally, nine times out of ten the best return on investment is to clean up the Word document in Word. For some tips on how to do this, read Before you Import.

Using the Word Import Config file

To override the default import config in PageSeeder, follow the steps below:

  1. Log in to PageSeeder and select a project or group.
  2. Select the Developer Perspective from the top left of the page.
  3. To change the config for yourself only, click the Upload document icon and drop or browse to a docx file. Click Options next to it, then Import/Preview, and click Edit config. This is available to group managers only.
  4. To change the config for everyone in the project, select Document config under the Dev menu and continue with steps 5 and 6. This is available to project managers and administrators only.
  5. Click the Publish configurations link at the top right of the page.
  6. Click create next to word-import-config.xml under Media types.

 

Note

When editing this config file in PageSeeder, pressing ctrl-space displays autocomplete options to make editing easier.

 

Note

For anyone who does not have Manager or Administrator permissions, the config file can be passed to PageSeeder by uploading a word-import-config.xml along with the docx document.

See Task import-docx for further information.

Key Components

This document covers the primary markup that is converted from Word. These components are as follows:

  • split – tells PageSeeder where to split the Word file into documents and editable fragments.
  • lists – configures how lists are handled.
  • styles – converts the implied semantics and structure of Word styles into PageSeeder instructions.

<split>

This element controls the points at which the Word file is split into component documents and, within these, where the content is split into editable fragments. Splitting for both of these objects (documents and fragments) can be controlled by three different Word constructs: Section Breaks, Outline Levels and Paragraph Styles.

<split>
    <document select="true">
      <!-- <sectionbreak select="evenPage" /> -->
      <!-- Values accepted: evenPage, oddPage -->
      <outlinelevel select="0" />
      <outlinelevel select="1" />
      <!-- Values accepted: outline levels 0 to 5 -->
      <wordstyle select="Heading1" />
      <wordstyle select="Heading2" />
      <!-- Values accepted: name of a style available in the Word file -->
    </document>
    <section select="true">
      <!-- <sectionbreak select="evenPage" /> -->
      <!-- Values accepted: continuous, evenPage, oddPage -->
      <outlinelevel select="0" />
      <outlinelevel select="1" />
      <outlinelevel select="2" />
      <outlinelevel select="3" />
      <!-- Values accepted: outline levels 0 to 5 -->
      <wordstyle select="Heading1" />
      <wordstyle select="Heading2" />
      <wordstyle select="Heading3" />
      <wordstyle select="Heading4" />
      <!-- Values accepted: name of a style available in the Word file -->
    </section>
</split>

The default values for the <split> element are printed in the example above.

Splitting a Word file into a references document plus component documents means that in the Import Config file, the @select attribute of the <document> element must be set as follows:

 <document select="true">

To split the content of the component documents into fragments means the @select attribute of <section> must be set as follows:

<section select="true">

Any other attribute value on either element will prevent the split rules from being processed.
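
As an illustration of these two switches, the following sketch uses only the elements shown above, with arbitrarily chosen style names. It splits the Word file into component documents at 'Heading1' while leaving fragment splitting turned off:

```xml
<split>
  <!-- select="true": the document split rules below are processed -->
  <document select="true">
    <wordstyle select="Heading1" />
  </document>
  <!-- any value other than "true": these rules are not processed -->
  <section select="false">
    <wordstyle select="Heading2" />
  </section>
</split>
```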

After these decisions have been made, the following markup in the docx file can be used to determine the transformation to PSML.

<main> level split

The main level split contains the options to be applied to the main references document that is generated.

<main>
  <type>references</type>
  <label>production,test</label>
</main>
  • <type> sets the type of the PSML document.
  • <label> sets the list of labels to be applied to that document.

<mathml> Generation

This option generates MathML files for any math objects that exist inside the Word document.

<mathml select="true" 
        output="generate-files" 
        convert-to-mml="true"/>

The options for mathml are:

  • select: true or false. To enable generation of MathML content, select must be set to true; otherwise, MathML content will be ignored.
  • output: generate-files or generate-fragments. If generate-files is selected, each MathML object is placed in its own file under the mathml folder. If generate-fragments is selected, each MathML object is placed in its own fragment inside the mathml/mathm.psml file.
  • convert-to-mml: true or false. If true, math objects are converted to standard MathML. If false, they keep the Office Open XML math syntax used by Word.
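
For comparison, here is a sketch of the alternative settings, assuming the same attribute names as the example above. All math objects are collected as fragments in a single file and are left unconverted:

```xml
<!-- one fragment per math object, no conversion to MathML -->
<mathml select="true"
        output="generate-fragments"
        convert-to-mml="false"/>
```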

<footnotes> Generation

This option generates footnote files for any footnotes referenced inside the Word document.

<footnotes select="true" 
           output="generate-files"/>

The options for footnotes are:

  • select: true or false. To enable generation of footnote content, select must be set to true; otherwise, footnote content will be ignored.
  • output: generate-files or generate-fragments. If generate-files is selected, each footnote reference is placed in its own file under the footnotes folder. If generate-fragments is selected, each footnote reference is placed in its own fragment inside the footnotes/footnotes.psml file.

<endnotes> Generation

This option generates endnote files for any endnotes referenced inside the Word document.

<endnotes select="true" 
           output="generate-files"/>

The options for endnotes are:

  • select: true or false. To enable generation of endnote content, select must be set to true; otherwise, endnote content will be ignored.
  • output: generate-files or generate-fragments. If generate-files is selected, each endnote reference is placed in its own file under the endnotes folder. If generate-fragments is selected, each endnote reference is placed in its own fragment inside the endnotes/endnotes.psml file.
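
The footnote and endnote options can be set independently. As a sketch using only the attributes described above, the following keeps footnotes as fragments in a single file and leaves endnotes out of the import:

```xml
<!-- each footnote becomes a fragment in footnotes/footnotes.psml -->
<footnotes select="true"
           output="generate-fragments"/>
<!-- any value other than "true": endnote content is ignored -->
<endnotes select="false"
          output="generate-files"/>
```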

<document> level split

A Word file can be imported to PSML using:

<document select="true">

This option can also contain the following:

<document select="true" use-real-titles="true">

The use-real-titles option generates a specific filename for each of the split documents, based on the content of the first paragraph. It uses any numbering format applied to the paragraph, plus the content of the paragraph, up to a maximum of 249 characters. The filename is therefore:

[Numbering format][Content].psml

All non-word characters ([\W] in a regular expression) are converted to underscores.

This way of splitting should be used to generate files that are expected to be overwritten or altered by a future import of a Word document, for example a list of definitions. It ensures that each split file overwrites the same file, as long as its title has not changed.
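
The following is a sketch of how this might look for a definitions list. The style name 'DefinitionTerm' is hypothetical, and the filename in the comment simply applies the underscore rule described above to an unnumbered paragraph:

```xml
<split>
  <document select="true" use-real-titles="true">
    <!-- hypothetical style applied to the first paragraph of each definition -->
    <wordstyle select="DefinitionTerm" />
  </document>
</split>
<!-- a first paragraph reading "Force majeure" (no numbering) would be
     expected to give the filename Force_majeure.psml -->
```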

Generic Split options

A Word document can be imported to PSML and split into multiple documents using specific markup:

  • <sectionbreak>
  • <outlinelevel>
  • <splitstyle>
  • <wordstyle>

 

<sectionbreak>

Section breaks in docx can be mapped to a number of functions in PageSeeder. To split a Word document at both the 'evenPage' and 'oddPage' section breaks, the import config must be set as follows:

<split>
    <document select="true">
      <sectionbreak select="evenPage" />
      <sectionbreak select="oddPage" />
    </document>
</split>

 

<outlinelevel>

Accepts any of the following values in its @select attribute:

0 or 1 or 2 or 3 or 4 or 5

Each level has to be listed individually. Splitting at both the '0' and '1' levels therefore requires the following:

<split>
    <document select="true">
      <outlinelevel select="0" />
      <outlinelevel select="1" />
    </document>
  </split>

 

<wordstyle>

Using, for example, both 'Heading1' and 'Heading2' Word styles to split the content on import, would be expressed as follows:

<split>
    <document select="true">     
      <wordstyle  select="Heading1"  >
        <label>Private,Public</label>
        <type>contract</type>
        <level value="0"/>
      </wordstyle>
      <wordstyle  select="Heading2" />
    </document>
  </split>

 

<wordstyle> is a special case of splitting, as it also accepts the <label>, <type> and <level> options. These options work as follows:

<label>: applies labels to that specific document. Multiple labels can be given as a comma-separated list.

<type>: specifies the type of the PSML document.

<level>: specifies the indentation level of this document inside the references file. Indentation levels may alter heading levels further down the hierarchy.

 

<splitstyle>

Using the 'splittingStyle1' Word style to split the content on import would be expressed as follows:

<split>
    <document select="true">     
      <splitstyle select="splittingStyle1"/>
    </document>
  </split>

 

<splitstyle> is a unique way of splitting Word files, as it uses a specific style from the document to split the file into PSML documents, but ignores the content. This way of splitting should be used for files that are not rich enough to be imported to PSML using only the styles provided.

<section> level split

The content of each component document can be split into fragments using specific markup:

  • <sectionbreak>
  • <outlinelevel>
  • <splitstyle>
  • <wordstyle>

 

<sectionbreak>

Section breaks in docx can be mapped to a number of functions in PageSeeder. To split the content into fragments at both the 'evenPage' and 'oddPage' section breaks, the import config must be set as follows:

<split>
    <section select="true">
      <sectionbreak select="evenPage" />
      <sectionbreak select="oddPage" />
    </section>
</split>

 

<outlinelevel>

Accepts any of the following values in its @select attribute:

0 or 1 or 2 or 3 or 4 or 5

Each level has to be listed individually. Splitting at both the '2' and '3' levels therefore requires the following:

<split>
    <section select="true">
      <outlinelevel select="2" />
      <outlinelevel select="3" />
    </section>
  </split>

 

<wordstyle>

Using, for example, both 'Heading3' and 'Heading4' Word styles to split the content on import, would be expressed as follows:

<split>
    <section select="true">     
      <wordstyle  select="Heading3"  >
        <type>definition</type>
      </wordstyle>
      <wordstyle  select="Heading4" />
    </section >
  </split>

 

<wordstyle> is a special case of splitting, as it also accepts the <type> option:

  • <type>: specifies the type of the PSML fragment.

 

<splitstyle>

Using the 'splittingStyle2' Word style to split the content on import would be expressed as follows:

<split>
    <section select="true">     
      <splitstyle select="splittingStyle2"/>
    </section >
  </split>

<splitstyle> is a unique way of splitting Word files, as it uses a specific style from the document to split each fragment, but ignores the content. This way of splitting should be used for files that are not rich enough to be imported to PSML using only the styles provided.

<lists>

Contains all options that relate to the interpretation of Word list paragraphs and numbered paragraphs, plus any numbering added to the documents.

<lists>
      <add-numbering-to-document-titles select="true"/>
      <convert-to-list-roles select="false"/>
           <!-- generate numbered attribute to
                   paragraphs for lists -->
      <convert-to-numbered-paragraphs select="true">
        <level value="1" output="prefix"/>
           <!-- attach prefix or numbering or inline=[label]
                    or text -->
        <level value="2" output="prefix"/>
        <level value="3" output="prefix"/>
        <level value="4" output="numbering"/>
        <level value="5" output="numbering"/>
        <level value="6" output="inline=level6"/>
      </convert-to-numbered-paragraphs>

      <convert-manual-numbering select="true">
        <value match="^[\(|\[|\{][a-z]+[\)|\]|\}]">
          <inline label="numbering-lowercase" />
        </value>
        <value match="^[\(|\[|\{][A-Z]+[\)|\]|\}]">
          <prefix/>
        </value>
        <value match="^[\(|\[|\{][ivx]+[\)|\]|\}]">
          <list role="numbering-roman"/>
        </value>
      </convert-manual-numbering>
  </lists>

<add-numbering-to-document-titles>

To add numbering to the titles of each of the split documents, the @select attribute must be set to 'true'. With any other value, numbers will not be added.

<add-numbering-to-document-titles select="true"/>

<convert-to-list-roles>

Allows lists to carry a @role attribute set to the value of the Word paragraph style. To enable this, the @select attribute must be set to "true". By default it is set to "false".

<convert-to-list-roles select="false"/>

A use case for list roles is to mimic the List Styles feature of Word. In List Styles, the Level 1 style restarts the numbering for all levels below. This functionality can be replicated in PageSeeder using the list role.
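
A minimal sketch of turning the option on follows. The PSML in the comment is an assumption about the general shape of the output for a Word list that uses a paragraph style with the hypothetical name 'ListBullet'; the exact result depends on the document:

```xml
<lists>
  <convert-to-list-roles select="true"/>
</lists>
<!-- assumed shape of the resulting PSML:
     <list role="ListBullet">
       <item>First point</item>
       <item>Second point</item>
     </list>
-->
```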

<convert-to-numbered-paragraphs>

Controls the conversion of numbered paragraph styles to numbered paragraphs or lists in PageSeeder. To convert to numbered paragraphs, the @select attribute must be set to "true". With any other value, the content is converted to <list> or <nlist>, depending on the type of numbering.

<convert-to-numbered-paragraphs select="true">

The conversion to numbered paragraphs uses one of the following @output values for each of the list levels:

  • numbering – generates numbered paragraphs;

  • prefix – generates a prefix containing the current numbering value for each of the Word numbered paragraphs;

  • inline=[labelname] – wraps the paragraph number from Word in an inline label with the given name;

  • text – outputs the current numbering value as text for each of the Word numbered paragraphs.

<convert-to-numbered-paragraphs select="true">
        <!-- prefix or numbering or inline=[label] or text -->
        <level value="1" output="prefix"/>
        <level value="2" output="prefix"/>
        <level value="3" output="text"/>
        <level value="4" output="numbering"/>
        <level value="5" output="numbering"/>
        <level value="6" output="inline=level6"/>
</convert-to-numbered-paragraphs>

In the preceding example, levels 1 and 2 are transformed into prefixes, level 3 is transformed into text, levels 4 and 5 are transformed into numbered paragraphs and level 6 is transformed into an inline label with the name "level6".

<convert-manual-numbering>

Controls non-automated numbering values that can exist in the Word file. To convert manually numbered values in paragraphs, the @select attribute must be set to "true". With any other value, manual numbering is not converted.

<convert-manual-numbering select="true">

It accepts three options:

  • <prefix/> – generates a <prefix> element containing the current numbering value for each Word paragraph that matches the regular expression;

  • <inline label="[numbering-format]" /> – generates an <inline> element with the configured @label, containing the current numbering value, for each Word paragraph that matches the regular expression;

  • <list role="[list role]"/> – generates a <list> element from Word paragraphs that match the regular expression (still in beta).

<convert-manual-numbering select="true">
        <value match="^[\(|\[|\{][a-z]+[\)|\]|\}]">
          <inline label="numbering-lowercase" />
        </value>
        <value match="^[\(|\[|\{][A-Z]+[\)|\]|\}]">
          <prefix/>
        </value>
        <value match="^[\(|\[|\{][ivx]+[\)|\]|\}]">
          <list role="numbering-roman"/>
        </value>
</convert-manual-numbering>
  • In this case, any value found matching the regular expression

^[\(|\[|\{][a-z]+[\)|\]|\}]

will be output in an <inline> element with an attribute of @label="numbering-lowercase".

  • any value matched by the following regular expression

^[\(|\[|\{][A-Z]+[\)|\]|\}]

will be output as the value of a <prefix>.

  • any paragraph found with a value matching the regular expression

^[\(|\[|\{][ivx]+[\)|\]|\}]

will be converted into a list with @role = numbering-roman.

Any number of <value> elements can be added, as long as the @match attribute is a valid regular expression (see www.w3.org/TR/xslt20/#regular-expressions).
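
As a sketch of adding a further rule, the following <value> (an illustration, not part of the default config) would treat manually typed numbers such as '1.' or '12.' at the start of a paragraph as a prefix:

```xml
<convert-manual-numbering select="true">
  <!-- illustrative rule: one or more digits followed by a full stop -->
  <value match="^[0-9]+\.">
    <prefix/>
  </value>
</convert-manual-numbering>
```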

<styles>

Contains the settings for general transformations of docx to PSML. These are <ignore>, <default> and <wordstyle>.

<styles>
        <ignore>
            <wordstyle value="TOC1" />
            <wordstyle value="TOC2" />
            <wordstyle value="TOC3" />
            <wordstyle value="TOC4" />
        </ignore>
        <default>
            <paragraphStyles value="block" />
            <!-- possible values: 'para' or 'block' -->
            <characterStyles value="inline" />
            <!-- possible values: 'none' or 'inline' -->
            <smart-tag keep="true" />
            <!--  possible values: 'false' or 'true' -->
        </default>

        <wordstyle name="Title" psmlelement="title"/>

        <!-- Values accepted: name of style available in word file -->
        <wordstyle name="Heading1" psmlelement="heading">
            <level value="1" />
        </wordstyle>

        <wordstyle name="Heading2" psmlelement="para">
        </wordstyle>

        <wordstyle name="Heading3" psmlelement="inline">
            <label value="Heading3" />
        </wordstyle>

        <wordstyle name="Heading4" psmlelement="block">
            <label value="Heading4" />
        </wordstyle>

        <wordstyle name="Heading5" psmlelement="heading">
            <level value="5" />
            <block label="Heading5" />
        </wordstyle>

        <wordstyle name="Heading6" psmlelement="heading">
            <level value="6" />
        </wordstyle>

    </styles>

<ignore>

Tells the converter which content should not be processed. For example, Word table of contents paragraphs are usually generated and are not needed as content, because their references are dynamic. To remove TOC styles from the content:

        <ignore>
            <wordstyle value="TOC1" />
            <wordstyle value="TOC2" />
            <wordstyle value="TOC3" />
            <wordstyle value="TOC4" />
        </ignore>

 

 

<default>

Defines settings for paragraph styles, character styles and smart tags.

  • <paragraphStyles> – defines the fallback for a paragraph style not mapped by <wordstyle> or <lists>. Its @value attribute has two allowed values:

    • para – will output any Word paragraph styles that haven't been explicitly mapped as a PSML <para> element.

    • block – will output any Word paragraph styles that haven't been explicitly mapped as a PSML <block> element with a label equal to the Word paragraph style ID (note: the ID is different from the Word paragraph style name).

  • <characterStyles> – defines the fallback for a character style not mapped with <wordstyle>. Its @value attribute has two allowed values:
    • none – the content of all unmapped Word character styles will be output as text with no markup.
    • inline – the content of all unmapped Word character styles will generate a PSML <inline> element with a label equal to the Word character style ID (different from the Word character style name).
  • <smart-tag> – Word smart tag information can be either discarded or uploaded to PageSeeder as an inline label, with a value equal to that of the smart tag. To keep it, the @keep attribute must be set to "true". With any other value, the smart tag markup will be discarded.
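
As a sketch of the opposite choices to the defaults shown earlier, using only the values listed above, the following <default> block keeps unmapped styles out of the PSML as far as possible:

```xml
<default>
  <!-- unmapped paragraph styles become plain <para> elements -->
  <paragraphStyles value="para" />
  <!-- unmapped character styles are output as text with no markup -->
  <characterStyles value="none" />
  <!-- any value other than "true" discards smart tag markup -->
  <smart-tag keep="false" />
</default>
```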

<wordstyle>

These rules transform Word paragraph or character styles into PSML elements. Example PSML elements include <para>, <heading>, <monospace>, <preformat>, <caption>, <block> and <inline>.

Consider the following example:

        <wordstyle name="Heading1" psmlelement="heading">
            <level value="1" />
        </wordstyle>

        <wordstyle name="Heading2" psmlelement="para">
        </wordstyle>

        <wordstyle name="Heading3" psmlelement="inline">
            <label value="Heading3" />
        </wordstyle>

        <wordstyle name="Heading4" psmlelement="block">
            <label value="Heading4" />
        </wordstyle>

        <wordstyle name="Heading5" psmlelement="heading">
            <level value="5" />
            <block label="Section_highlight" />
        </wordstyle>

        <wordstyle name="Strong" psmlelement="monospace"/>

        <wordstyle name="HTMLpreformat" psmlelement="preformat"/>

        <wordstyle name="TableCaption" psmlelement="caption" table="default"/>

 

 

Given the rules expressed in the code above:

  • any Word paragraph with style ID "Heading1" will be transformed into a PSML <heading> element with an attribute @level="1";
  • any Word paragraph with style ID "Heading2" will be transformed into a PSML <para>;
  • any Word paragraph with style ID "Heading3" will be transformed into a PSML <inline> element with an attribute @label="Heading3";
  • any Word paragraph with style ID "Heading4" will be transformed into a PSML <block> element with an attribute @label="Heading4";
  • any Word paragraph with style ID "Heading5" will be transformed into a PSML <block> element with attribute @label="Section_highlight" wrapped around a PSML <heading> element with an attribute @level="5";
  • any Word paragraph with style ID "Strong" will be transformed into a PSML <monospace> element;
  • any Word paragraph with style ID "HTMLpreformat" will be transformed into a PSML <preformat> element; and
  • any Word paragraph with style ID "TableCaption" will be transformed into a <caption> element for all tables.
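
To make the mapping concrete, the following sketch shows the approximate PSML that the Heading1, Heading2 and Heading5 rules describe. The text content is invented for illustration, and the surrounding fragment markup is omitted:

```xml
<!-- Word paragraph styled Heading1 -->
<heading level="1">Introduction</heading>

<!-- Word paragraph styled Heading2 -->
<para>Scope of this document</para>

<!-- Word paragraph styled Heading5 -->
<block label="Section_highlight">
  <heading level="5">Key points</heading>
</block>
```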
