
 Version 5

Legacy documentation for PageSeeder v5

Store - <split>

The <split> element is deprecated as of PageSeeder v5.98. Use PSML split config instead.

This element controls the points at which the Word file is split into component documents and, within these, where the content is split into editable fragments. Splitting for both of these objects (documents and fragments) can be controlled by three different Word constructs: Section Breaks, Outline Levels and Paragraph Styles.

<split>
  <!-- <main>...</main> -->
  <document select="true">
    <!-- <sectionbreak select="evenPage" /> -->
    <!-- Values accepted: evenPage, oddPage -->
    <outlinelevel select="0" />
    <outlinelevel select="1" />
    <!-- Values accepted: outlineLevel[0-5] -->
    <wordstyle select="Heading1" />
    <wordstyle select="Heading2" />
    <!-- Values accepted: name of a style available in the Word file -->
  </document>
  <section select="true">
    <!-- <sectionbreak select="evenPage" /> -->
    <!-- Values accepted: continuous, evenPage, oddPage -->
    <outlinelevel select="0" />
    <outlinelevel select="1" />
    <outlinelevel select="2" />
    <outlinelevel select="3" />
    <!-- Values accepted: outlineLevel[0-5] -->
    <wordstyle select="Heading1" />
    <wordstyle select="Heading2" />
    <wordstyle select="Heading3" />
    <wordstyle select="Heading4" />
    <!-- Values accepted: name of a style available in the Word file -->
  </section>
  <!-- <mathml/> -->
  <!-- <footnotes/> -->
  <!-- <endnotes/> -->
</split>

The default values for the <split> element are printed in the example above.

Splitting a Word file into a references document plus component documents means that in the Import Config file, the @select attribute of the <document> element must be set as follows:

 <document select="true">

To split the content of the component documents into fragments means the @select attribute of <section> must be set as follows:

<section select="true">

Any other attribute value on either element prevents the split rules from being processed.
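
As a rough illustration of the rule above, the sketch below splits the Word file into component documents but leaves each component unfragmented. The value "false" is only one example of a value other than "true"; the outline levels are arbitrary choices for the illustration.

```xml
<split>
  <!-- Document rules are processed because select is "true" -->
  <document select="true">
    <outlinelevel select="0" />
  </document>
  <!-- select is not "true", so these fragment rules are not processed -->
  <section select="false">
    <outlinelevel select="1" />
  </section>
</split>
```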

After these decisions have been made, the following markup in the docx file can be used to determine the transformation to PSML.

The document root

<main>

When importing a docx file, the root PSML file can be either a single file or it can be a references document. The single PSML file is only appropriate for simple, small source documents. The options for this document can be set as follows:

<main>
  <type>references</type>
  <label>production,test</label>
</main>

  • <type> – determines the PSML document type.
  • <label> – lists the labels to be applied to that document.

The title of the main PSML document is taken from the Word document title property (dc:title) if it exists, otherwise it is the filename without the extension.
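
A minimal sketch of where <main> might sit in a complete configuration follows. It assumes <main> is the first child of <split>, as the commented placeholder in the default example suggests; the choice of Heading1 as the split point is purely illustrative.

```xml
<split>
  <!-- Root document: a references document carrying two labels -->
  <main>
    <type>references</type>
    <label>production,test</label>
  </main>
  <!-- Component documents start at each Heading1 paragraph -->
  <document select="true">
    <wordstyle select="Heading1" />
  </document>
</split>
```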

<mathml>

This option generates MathML files for any MathML objects that exist inside the Word document.

<mathml select="true" 
        output="generate-files" 
        convert-to-mml="true"/>

The attributes for <mathml> are:

  • @select – to enable generation of MathML content, select must be set to true; otherwise, MathML content is ignored.
  • @output – if the value is generate-files, each MathML object is placed in a separate file under a mathml folder. If the value is generate-fragments, each MathML object is placed in a fragment inside its own document with the path mathml/mathml-[n].psml (requires pso-docx version 0.7.8 or higher).
  • @convert-to-mml – if true, the objects are converted to standard MathML. If false, the objects retain the Office Open MathML syntax. This is always true for the generate-fragments option.
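
For comparison with the generate-files example, the fragment-based variant could look like the sketch below. It uses only the attribute values listed above and assumes pso-docx 0.7.8 or higher is in use.

```xml
<!-- Each MathML object becomes a fragment in mathml/mathml-[n].psml -->
<!-- convert-to-mml is always treated as true for this output mode -->
<mathml select="true"
        output="generate-fragments"
        convert-to-mml="true"/>
```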

<footnotes>

This option generates footnote files for any footnotes referenced inside the Word document.

<footnotes select="true" output="generate-files"/>

The attributes for <footnotes> are:

  • @select – to enable generation of footnote content, select must be set to true; otherwise, footnote content is ignored.
  • @output – if generate-files is selected, each footnote is placed in a separate file under a footnotes folder. If generate-fragments is selected, each footnote is placed in a fragment inside the footnotes/footnotes.psml file.

<endnotes>

This option generates endnote files for any endnotes referenced inside the Word document.

<endnotes select="true" output="generate-files"/>

The attributes for <endnotes> are:

  • @select – to enable generation of endnote content, select must be set to true; otherwise, endnote content is ignored.
  • @output – if generate-files is selected, each endnote is placed in a separate file under an endnotes folder. If generate-fragments is selected, each endnote is placed in a fragment inside the endnotes/endnotes.psml file.
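
The sketch below shows how the footnote and endnote options might be combined in one configuration, with both collected as fragments rather than separate files. The placement after <document> follows the commented placeholders in the default example; the outline level is an arbitrary choice for the illustration.

```xml
<split>
  <document select="true">
    <outlinelevel select="0" />
  </document>
  <!-- Footnotes become fragments in footnotes/footnotes.psml -->
  <footnotes select="true" output="generate-fragments"/>
  <!-- Endnotes become fragments in endnotes/endnotes.psml -->
  <endnotes select="true" output="generate-fragments"/>
</split>
```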

The component documents

Component documents are the nodes or leaves of references documents. The import process creates these through the <document> element in the import config.

<document>

If @select is set to "true", the following options are supported:

<document select="true" use-real-titles="true">

The title of each component PSML document is based on the first paragraph of its content. It uses the numbering format applied to the paragraph style, plus the content of the paragraph, up to a maximum of 249 characters.

  • @use-real-titles="true" – this option generates a filename for each component from the component document title described above. The filename processing converts all special characters (those matching the regular expression [\W]) to underscores, and the resulting format is:
[Numbering format][Paragraph content].psml

Splitting a single Word document into multiple PSML documents on import can be done with the following options:

<sectionbreak>

Section Breaks in the docx format can have the following values: evenPage, oddPage, continuous, nextColumn or nextPage. These values are mirrored in the import-config, but using them to split a Word document requires a separate <sectionbreak> declaration for each. In the example below, evenPage and oddPage are each declared individually.

<split>
    <document select="true">
      <sectionbreak select="evenPage" />
      <sectionbreak select="oddPage" />
    </document>
</split>

<outlinelevel>

This value is most often identified with Word’s default heading levels 1-9, but outline levels can be applied to any Word style. When using <outlinelevel> to split, watch for ambiguous or unexpected results if multiple styles share a level. The level is expressed as follows:

[0-5]

Also, be aware that while Word levels go up to nine (9), PageSeeder supports only six levels, numbered 0 to 5, with zero (0) being the first level.

Each level has to be declared individually. Splitting at levels 0 and 1 requires the following:

<split>
  <document select="true">
    <outlinelevel select="0" />
    <outlinelevel select="1" />
  </document>
</split>

<wordstyle>

To split the content on import at the Word styles Heading1 and Heading2, use the following:

<split>
  <document select="true">
    <wordstyle select="Heading1">
      <label>Private,Public</label>
      <type>contract</type>
      <level value="0"/>
    </wordstyle>
    <wordstyle select="Heading2" />
  </document>
</split>

Splitting with <wordstyle> is a special case. It also accepts the following <label>, <type> and <level> options:

  • <label> – applies labels to that specific document. Multiple labels can be given as a comma-separated list.
  • <type> – specifies the PSML document type.
  • <level> – indents the XRef in the references document. This controls the heading levels in the target document. 
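
As an illustration of how <level> might be used across two styles, the sketch below places documents split at Heading1 at level 0 and documents split at Heading2 one level deeper. The value "1" is an assumption made for this example (only value="0" appears in the original sample), and the label and type names are placeholders.

```xml
<split>
  <document select="true">
    <!-- Top-level components: XRef at level 0 in the references document -->
    <wordstyle select="Heading1">
      <label>Public</label>
      <type>contract</type>
      <level value="0"/>
    </wordstyle>
    <!-- Nested components: XRef indented one level (assumed value) -->
    <wordstyle select="Heading2">
      <level value="1"/>
    </wordstyle>
  </document>
</split>
```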

Manual splitting – <splitstyle>

Sometimes users need to process documents that are not rich enough to meet the requirements. When this happens, it might be easier to manually designate where the document should split rather than use existing styles or structure. 

This is what splittingStyle1 was designed for. Essentially, this is a non-standard, disposable style name whose specific role is to mark where the document should be split. Through the import-config, it can be invoked as follows:

<split>
  <document select="true">
    <splitstyle select="splittingStyle1"/>
  </document>
</split>

<splitstyle> use cases could be splitting a document after every ten pages or splitting a document where the use of styles has been inconsistent. In these circumstances, the task of manually placing the styles to split at could be preferable to dealing with an imprecise conversion or the side-effects of processing poor quality markup. 

Managing fragments

Fragments are created through the import config as follows:

<section>

A Word document can be imported to PSML and split into multiple fragments by first setting @select="true" on <section>. The following markup determines what triggers the creation of a fragment.

<sectionbreak>

Section Breaks in docx can have the following values: evenPage, oddPage, continuous, nextColumn or nextPage.

<split>
  <section select="true">
    <sectionbreak select="evenPage" />
    <sectionbreak select="oddPage" />
  </section>
</split>

These values are mirrored in the import-config, but using these to split a Word document requires a <sectionbreak> declaration for each. In the example above, both evenPage and oddPage are individually expressed. 

<outlinelevel>

The @select attribute of <outlinelevel> can contain any number that matches this regular expression:

[0-5]

Each level has to be declared individually. Splitting at both levels 2 and 3 requires the following:

<split>
  <section select="true">
    <outlinelevel select="2" />
    <outlinelevel select="3" />
  </section>
</split>

Also, be aware that while Word levels go up to nine (9), PageSeeder supports only six levels, numbered 0 to 5, with zero (0) being the first level.
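
Putting the document and fragment rules together, a configuration that relies only on outline levels might look like the sketch below. It combines the two <outlinelevel> examples already shown and introduces no new options; the choice of levels is illustrative.

```xml
<split>
  <!-- Levels 0 and 1 start a new component document -->
  <document select="true">
    <outlinelevel select="0" />
    <outlinelevel select="1" />
  </document>
  <!-- Levels 2 and 3 start a new fragment inside a component -->
  <section select="true">
    <outlinelevel select="2" />
    <outlinelevel select="3" />
  </section>
</split>
```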

<wordstyle>

Using, for example, both Heading3 and Heading4 Word styles to split the content on import, would be expressed as follows:

<split>
  <section select="true">
    <wordstyle select="Heading3">
      <type>definition</type>
    </wordstyle>
    <wordstyle select="Heading4" />
  </section>
</split>

<wordstyle> is a special case of splitting, as it also accepts the <type> option:

  • <type> – specifies the @type of the PSML fragment.

Manual splitting – <splitstyle>

Sometimes users need to process documents that are not rich enough to meet the requirements. When this happens, it might be easier to manually designate where the document should split rather than use existing styles or structure.

This is what splittingStyle1 was designed for. Essentially, this is a non-standard, disposable style name whose specific role is to mark where the document should be split. Through the import-config, it can be invoked as follows:

<split>
  <section select="true">
    <splitstyle select="splittingStyle1"/>
  </section>
</split>

<splitstyle> use cases could be splitting a document after every ten pages or splitting a document where the use of styles has been inconsistent. In these circumstances, the task of manually placing the styles could be preferable to dealing with an imprecise conversion or the side-effects of processing poor quality markup.
