Task import-docx

This task converts a Microsoft Word document from the docx format into the PSML format according to rules expressed in an XML file called word-import-config.xml.

A list of the latest changes is available on GitHub.

To use this Ant extension standalone, outside PageSeeder, download the pso-docx-ant-x.jar and pso-docx-core-x.jar files from Maven Central.

Task definition

The following are two examples of task definitions.

Minimal

<import-docx src="[source]" />

Full

<import-docx src="[source]"
             dest="[destination]"
             working="[working-directory]"
             mediafolder="[media folder name]"
             componentfolder="[component folder name]"
             config="[config.xml]" />

Attributes

 

Attribute | Description | Required
src | Path of the source file to process. It must point to a docx file. | Yes
dest | Path of the destination folder or file. If no value is specified, this defaults to the folder of the source file. If it doesn’t end with ‘.psml’, it is treated as a folder and the destination filename is the source filename in lowercase with spaces changed to underscores. | No
working | The folder holding temporary files. Defaults to [java.io.tmpdir]/psdocx-[number]. | No
componentfolder | The name of the subfolder where documents referenced by the main document are placed ("" means no subfolder). Defaults to components. Requires pso-docx version 0.8.0 or higher. | No
mediafolder | The name of the subfolder where images are placed. Defaults to images ("" means [filename]_files). | No
config | Path to the .xml config file. | No
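
As an illustration of the dest and mediafolder rules, the following sketch uses hypothetical names (My Report.docx, output, report.psml) that do not come from this documentation. Going by the attribute descriptions above, the first call should produce output/my_report.psml and the second should write to the exact file named.

```xml
<!-- dest does not end with .psml, so it is treated as a folder;
     by the rule above the result is expected at output/my_report.psml -->
<import-docx src="My Report.docx" dest="output" />

<!-- dest ends with .psml, so it is used as the destination file;
     mediafolder="" means images go to [filename]_files rather than images -->
<import-docx src="My Report.docx" dest="output/report.psml" mediafolder="" />
```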

Usage

Invoking through the user interface

As part of the PageSeeder Upload interface, the import-docx task is associated with files that have the extension “.docx”. After such a file is uploaded into the PageSeeder loading zone, the following Import options are available.

Import DOCX file (default options)

  • No processing – uploads the file as a docx document.
  • Import document as PageSeeder PSML document – runs the import-docx Ant task with the default configuration.

These additional Developer options are available:

Import DOCX file – As PSML with developer options

  • Display upload confirmation – receive confirmation of which files will be overwritten or created.
  • Overwrite metadata and document properties (title, docid, publication id/type, description, labels) – select to overwrite the metadata and properties of existing documents with the same filename, with the metadata and properties attached to the new document you are uploading.
  • Resolve references – resolve references to documents, images and URLs in the document you are uploading. Clear the checkbox if the referenced objects will be uploaded later and then resolve references later from the group maintenance page.
  • Validate with default Schematron – runs a Schematron validation on the docx file to ensure the contents can be converted.
  • Index – index your documents. Clearing this checkbox is not recommended but indexing can be done later from the Group maintenance page.

Invoking using Ant

Typically, this task is run through PageSeeder, using the ps-upload-get task (obsolete) to download the docx file and the ps-upload-put task (obsolete) to upload the generated PSML and image files. Running this task without being connected to PageSeeder requires a <taskdef/>.

<project ... xmlns:psd="antlib:org.pageseeder.docx.ant">

  <!-- only required for standalone -->
  <taskdef uri="antlib:org.pageseeder.docx.ant"
           resource="org/pageseeder/docx/ant/antlib.xml"
           classpath="pso-docx-ant-0.5.9.jar"/>

  <target ... >
    <!-- Invoke Task -->
    <psd:import-docx src="test.docx"
                     dest="result"/>
  </target>
</project>

Using a namespace is not required, but it is good practice because it documents where the task comes from. The recommended namespace prefix is ‘psd’.
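
For standalone use, a complete build file might look like the sketch below. It assumes both downloaded jars (pso-docx-ant and pso-docx-core) have been placed in a folder named lib. That folder name, the project and target names, and the source and destination paths are hypothetical. Depending on the version, the jars may need further dependencies on the classpath.

```xml
<project name="docx-import" default="import"
         xmlns:psd="antlib:org.pageseeder.docx.ant">

  <!-- Standalone only: load the task definitions from the jars in lib/
       (lib/ is an assumed location, not one the documentation prescribes) -->
  <taskdef uri="antlib:org.pageseeder.docx.ant"
           resource="org/pageseeder/docx/ant/antlib.xml">
    <classpath>
      <fileset dir="lib" includes="pso-docx-*.jar"/>
    </classpath>
  </taskdef>

  <target name="import">
    <!-- Convert test.docx; the output goes to the result folder -->
    <psd:import-docx src="test.docx"
                     dest="result"
                     config="word-import-config.xml"/>
  </target>
</project>
```

With Ant installed, running ant from the folder containing this build file would invoke the import target.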

How does it work?

The Ant task does the following:

  1. Loads the config file.
  2. Unzips the docx file.
  3. Transforms the docx XML files into PSML text and organizes the image references.
  4. Uploads the content back to a PageSeeder group.
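
To inspect the intermediate files from the unzip and transform steps, one option is to point working at a known folder instead of relying on the default temporary directory. The folder name build/docx-work is hypothetical, and this page does not say whether the task keeps its temporary files after it finishes, so treat this as a sketch only.

```xml
<!-- working is set explicitly so the temporary files are easy to locate;
     build/docx-work is an assumed folder name -->
<import-docx src="test.docx"
             dest="result"
             working="build/docx-work"
             config="word-import-config.xml" />
```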