Task import-docx

This task converts a Microsoft Word document from the docx format into the PSML format, following rules expressed in an XML configuration file (see Microsoft Word docx Import Config).

A list of the latest changes is available on GitHub.

To use this Ant extension standalone, outside PageSeeder, download the pso-docx-ant-x.jar and pso-docx-core-x.jar files from Bintray.

Task Definition

The following are two examples of task definitions.

Minimal

<import-docx src="[source]" />

Full

<import-docx
    src="[source]"
    dest="[destination]"
    working="[working-directory]"
    mediafolder="[media folder name]"
    config="[config.xml]" />

Attributes

 

| Attribute | Description | Required |
|---|---|---|
| @src | Path of the source file to process. It should point to a docx file. | Yes |
| @dest | Path of the destination folder. If no value is specified, this defaults to the location of the source file. | No |
| @working | The directory holding temporary files. Defaults to [java.io.tmpdir]/psdocx-[number]. | No |
| @mediafolder | The name of the subfolder where images will be placed. Defaults to [source filename]_files. | No |
| @config | Path to the .xml config file. | No |
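
As an illustration of how the attributes listed above combine, the following sketch fills in the full definition with concrete values. Every path and folder name here (docs/report.docx, build/psml, build/tmp, images, config/word-import-config.xml) is hypothetical; only the attribute names come from this page.

```xml
<!-- Hypothetical values throughout; only the attribute names are documented -->
<import-docx
    src="docs/report.docx"
    dest="build/psml"
    working="build/tmp"
    mediafolder="images"
    config="config/word-import-config.xml" />
```

With these values, the generated PSML would be expected under build/psml and the extracted images in a subfolder named images. Leaving out everything except src falls back to the defaults described for each attribute.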

Usage

Invoking via the user interface

In the PageSeeder Upload interface, the import-docx task is associated with files that have the ".docx" extension. Once such a file has been uploaded into the PageSeeder loading zone, the following options become available:

docx-loadingzone.png

  • No processing – uploads the file as a docx document.
  • Import document as PageSeeder PSML document – runs the import-docx Ant task with the default configuration.
  • Validate DOCX document – runs a Schematron validation on the docx file to ensure the contents can be converted.

In the Developer perspective, the following additional options are available:

docx-developer-options.png

  • Remove original document – deletes the docx file after the transformation has run.
  • Create subfolder for document – because a large docx file can translate into hundreds of PSML files, this option creates a folder to store the documents.
  • Use external configuration – overrides the default import settings with an XML file that follows Microsoft Word docx Import Config and is named word-import-config.xml.

Invoking via Ant

Typically, this task is run through PageSeeder, using Task ps-upload-get to download the docx and Task ps-upload-put to upload the generated PSML and image files. Running this task without being connected to PageSeeder requires a <taskdef/>.

<project ... xmlns:psd="antlib:org.pageseeder.docx.ant">

  <!-- only required for standalone -->
  <taskdef uri="antlib:org.pageseeder.docx.ant" 
      resource="org/pageseeder/docx/ant/antlib.xml" 
     classpath="pso-docx-ant-0.5.9.jar"/>

  <target ... >
    <!-- Invoke Task -->
    <psd:import-docx src="test.docx" dest="result"/>
  </target>
</project>

Using a namespace is not required, but it is good practice because it documents where the task comes from. The recommended namespace prefix is 'psd'.
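
For standalone use, the elided parts of the build file have to be filled in. The following is a minimal sketch of a complete build file, not a reference setup. It assumes that both downloaded jars (pso-docx-ant and pso-docx-core) have been placed in a lib directory next to the build file, and that no further libraries are needed on the classpath. The project name, target name, lib directory and file names are all hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: project name, target name, lib directory and file names are hypothetical -->
<project name="docx-import-example" default="import"
         xmlns:psd="antlib:org.pageseeder.docx.ant">

  <!-- Assumption: both downloaded pso-docx jars sit in ./lib -->
  <path id="docx.classpath">
    <fileset dir="lib" includes="pso-docx-*.jar"/>
  </path>

  <!-- Register the tasks under the psd namespace -->
  <taskdef uri="antlib:org.pageseeder.docx.ant"
           resource="org/pageseeder/docx/ant/antlib.xml"
           classpathref="docx.classpath"/>

  <target name="import">
    <!-- Created up front in case the task expects the folder to exist -->
    <mkdir dir="result"/>
    <psd:import-docx src="test.docx" dest="result"/>
  </target>
</project>
```

Running `ant import` (or just `ant`, since import is the default target) from the directory that holds the build file should then process test.docx. Using a classpathref rather than a single jar name keeps the version number out of the build file, which is a convenience choice rather than a requirement.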

How does it work?

The Ant task above will do the following:

  1. Load the config file.
  2. Unzip the docx file.
  3. Transform the docx XML files into PSML text and organize the image references.
  4. Upload the content back to a PageSeeder group.
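
Step 2 can be reproduced by hand, which is sometimes useful for seeing what the transformation in step 3 has to work with. The sketch below uses only Ant's built-in unzip task; it does not show how import-docx is implemented internally, and the project name, target name and file names are hypothetical.

```xml
<project name="inspect-docx" default="unpack">
  <!-- A docx file is a ZIP archive, so the built-in unzip task can open it.
       File and directory names are hypothetical. -->
  <target name="unpack">
    <unzip src="test.docx" dest="unpacked"/>
    <!-- The main document body is normally found at unpacked/word/document.xml -->
    <echo message="Unpacked to ${basedir}/unpacked"/>
  </target>
</project>
```

The unpacked folder typically contains the document XML under word/ and any embedded images under word/media/, which are the inputs the task converts into PSML and image references.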
