Task import-docx

This task converts a Microsoft Word document from the docx format into the PSML format according to rules expressed in an XML file called word-import-config.xml.

A list of the latest changes is available on GitHub.

To use this Ant extension standalone, outside PageSeeder, download the pso-docx-ant-x.jar and pso-docx-core-x.jar files from Bintray.

Task definition

The following are two examples of task definitions.

Minimal

<import-docx src="[source]" />

Full

<import-docx src="[source]"
             dest="[destination]"
             working="[working-directory]"
             mediafolder="[media folder name]"
             componentfolder="[component folder name]"
             config="[config.xml]" />

Attributes

 

| Attribute | Description | Required |
|---|---|---|
| src | Path of the source file to process. It must point to a docx file. | Yes |
| dest | Path of the destination folder or file. If no value is specified, this defaults to the folder of the source file. If it doesn’t end with ‘.psml’, it is treated as a folder and the destination filename is the source filename in lowercase with spaces changed to underscores. | No |
| working | The folder holding temporary files. Defaults to [java.io.tmpdir]/psdocx-[number]. | No |
| componentfolder | The name of the subfolder where documents referenced by the main document are placed ("" means no subfolder). Defaults to components. Requires pso-docx version 0.8.0 or higher. | No |
| mediafolder | The name of the subfolder where images are placed. Defaults to images ("" means [filename]_files). | No |
| config | Path to the .xml config file. | No |
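The attribute rules above can be illustrated with a short build file. This is a minimal sketch, not a tested configuration: the project name, target name, source file "My Report.docx" and the "out" folder are hypothetical, and it assumes the task is already available (inside PageSeeder, or through a <taskdef/> when run standalone).

```xml
<project name="import-docx-attributes" default="import"
         xmlns:psd="antlib:org.pageseeder.docx.ant">

  <target name="import">
    <!-- dest does not end with .psml, so it is treated as a folder.
         The output name is derived from the source name: lowercase,
         spaces changed to underscores (presumably out/my_report.psml). -->
    <psd:import-docx src="My Report.docx" dest="out"/>

    <!-- dest ends with .psml, so it names the output file directly. -->
    <psd:import-docx src="My Report.docx" dest="out/report.psml"/>

    <!-- mediafolder="" places images in [filename]_files;
         componentfolder="" places referenced documents in no subfolder
         (componentfolder requires pso-docx 0.8.0 or higher). -->
    <psd:import-docx src="My Report.docx" dest="out"
                     mediafolder="" componentfolder=""/>
  </target>
</project>
```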

Usage

Invoking through the user interface

As part of the PageSeeder Upload interface, the import-docx task is associated with files that have the extension “.docx”. After a file is uploaded to the PageSeeder loading zone, the following options become available:

[Screenshot: docx-loadingzone.png]

  • No processing – uploads the file as a docx document.
  • Import document as PageSeeder PSML document – runs the import-docx Ant task with the default configuration.
  • Validate DOCX document – runs a Schematron validation on the docx file to ensure the contents can be converted.

In the Developer perspective, the following additional options are available:

[Screenshot: docx-developer-options.png]

  • Remove original document – deletes the docx file after the transformation has run.
  • Create subfolder for document – because a large docx file can translate into hundreds of PSML files, this option creates a folder to store the documents.
  • Use external configuration – overrides the default import settings. To use it, create an XML file according to Import Microsoft Word DOCX config usage and give it the following name:
     word-import-config.xml

Invoking using Ant

Typically, this task is run through PageSeeder using the Task ps-upload-get (deprecated) to download the docx and Task ps-upload-put (deprecated) to upload the generated PSML and image files. Running this task without being connected to PageSeeder requires a <taskdef/>.

<project ... xmlns:psd="antlib:org.pageseeder.docx.ant">

  <!-- only required for standalone -->
  <taskdef uri="antlib:org.pageseeder.docx.ant"
           resource="org/pageseeder/docx/ant/antlib.xml"
           classpath="pso-docx-ant-0.5.9.jar"/>

  <target ... >
    <!-- Invoke Task -->
    <psd:import-docx src="test.docx"
                     dest="result"/>
  </target>
</project>

Using a namespace is not required, but it is a good practice for documenting the task. The recommended namespace is ‘psd’.
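For comparison, a standalone build file without a namespace might look like the sketch below. It relies on standard Ant behaviour: a <taskdef/> whose resource ends in .xml is read as an antlib, and without a uri attribute its tasks are defined unprefixed. The project and target names are hypothetical, and the jar name is copied from the example above. The classpath may also need the pso-docx-core jar mentioned earlier, which is an assumption not confirmed here.

```xml
<project name="import-docx-no-namespace" default="import">

  <!-- only required for standalone; no uri, so the task is unprefixed -->
  <taskdef resource="org/pageseeder/docx/ant/antlib.xml"
           classpath="pso-docx-ant-0.5.9.jar"/>

  <target name="import">
    <!-- Same call as the namespaced example, without the psd: prefix -->
    <import-docx src="test.docx"
                 dest="result"/>
  </target>
</project>
```

The namespaced form makes it clear which library a task comes from, which is why it is the recommended practice.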

How does it work?

The Ant task does the following:

  1. Loads the config file.
  2. Unzips the docx file.
  3. Transforms the docx XML files into PSML text and organizes the image references.
  4. Uploads the content back to a PageSeeder group.
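To see what the unzip step works with, a docx file can be unpacked using Ant's built-in unzip task, since docx is a ZIP archive. This sketch does not show the internals of import-docx; it is only a way to inspect the input. The project name, target name and the "unpacked" folder are hypothetical.

```xml
<project name="inspect-docx" default="inspect">

  <target name="inspect">
    <!-- A docx file is a ZIP archive; unpack it to look at its XML parts -->
    <unzip src="test.docx" dest="unpacked"/>

    <!-- word/document.xml holds the main body of the document;
         embedded images are normally stored under word/media/ -->
    <echo message="Main document part: unpacked/word/document.xml"/>
  </target>
</project>
```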