Word regex

Often the best way to import Word documents into PageSeeder is to clean them up first, and the best place to do that clean-up is in Word itself.

Many users struggle with the clean-up task because they are unaware that Word’s Find and Replace supports a form of regular expressions. This article provides some practical examples of the Word regex syntax known as “wildcards”.

Wildcards are not quite the same syntax that most developers are familiar with, but their inbuilt knowledge of the DOCX format makes the differences a little more bearable. 

Examples of this built-in knowledge include the following:

Character               Wildcard expression
Opening field bracket   ^19 (for example, “^19 REF” finds every cross-reference)
Graphic                 ^g
Tab character           ^t
Page or section break   ^m
Column break            ^n

For a quality set of examples, see this page.

Find and Replace

The following is an example of how wildcard Find works in Word and how it can be used to capture dates that have been expressed inconsistently in a document.

word_regex-date.jpg

([ ][0-9]{1,2})[ ](<[AFJMNSOD]*>)[ ]([0-9]{4})
  • “(” and “)” group parts of the pattern. The groups are numbered from left to right in the “Find what” field and can be output in any order in the “Replace with” field. For example, “\1 \2 \3” outputs the groups in the order they were found and “\3 \2 \1” reverses them.
  • “[” and “]” delimit a set of characters to match, and a hyphen inside them specifies a range of characters.
  • Within the pattern, “<” and “>” mark the start and end of a word, an asterisk “*” matches any string of characters and a question mark “?” matches any single character.
  • “{” and “}” specify how many times whatever precedes them must occur, where a comma “,” separates the minimum and maximum values.

Therefore, when looking for the date pattern above, the expression would be processed as follows:

  1. ([ ][0-9]{1,2})[ ] finds a space, then one or two digits between zero and nine, followed by another space. This matches the day of the month.
  2. (<[AFJMNSOD]*>)[ ] then looks for the start of a word that begins with one of the capital letters A, F, J, M, N, S, O or D, continues to the end of that word, and is followed by a space. This matches the months of the year (and any other word that starts with one of those capitals).
  3. [ ]([0-9]{4}) looks for a space followed by four digits. This matches a year expressed as four digits.

Of course, once the dates have been found, it is best to wrap them in a character style so they are easier to find again.
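
The same search can be run from a macro that applies a character style in one pass. The following is a minimal sketch rather than a tested implementation. It assumes a character style named “DateText” already exists in the document; both that style name and the macro name are hypothetical.

```vba
Sub TagDates()
    ' Assumption: a character style with this (hypothetical) name already exists.
    ' If it does not, the Styles lookup below raises a run-time error.
    Const STYLE_NAME As String = "DateText"

    Dim rng As Range
    Set rng = ActiveDocument.Content

    With rng.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        ' Same wildcard expression as in the example above
        .Text = "([ ][0-9]{1,2})[ ](<[AFJMNSOD]*>)[ ]([0-9]{4})"
        ' Output the three groups in the order they were found
        .Replacement.Text = "\1 \2 \3"
        ' Apply the character style to the replacement text
        .Replacement.Style = ActiveDocument.Styles(STYLE_NAME)
        .Forward = True
        .Wrap = wdFindStop
        .Format = True
        .MatchWildcards = True
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

Because the first group includes the leading space, that space is styled along with the date. Run the macro on a copy of the document first and check the results before relying on them.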

Removing excess returns

This macro removes excess paragraph marks from a document. 

Sub ReplacePara()
   ' Start from the top of the document
   Selection.HomeKey Unit:=wdStory
   Selection.Find.ClearFormatting
   With Selection.Find
      .Text = "^p^p"
      .Replacement.Text = ""
      .Forward = True
      .Wrap = wdFindContinue
      .Format = False
      .MatchCase = False
      .MatchWholeWord = False
      .MatchWildcards = False
      .MatchSoundsLike = False
      .MatchAllWordForms = False
   End With
   ' Find only; nothing is replaced at this point
   Selection.Find.Execute
   While Selection.Find.Found
      ' Delete the second of the two paragraph marks
      Selection.MoveRight Unit:=wdCharacter, Count:=1
      Selection.TypeBackspace
      ' Step back so any further consecutive marks are found by the next search
      Selection.MoveLeft Unit:=wdCharacter, Count:=2
      Selection.Find.Execute
   Wend
End Sub

The first part of the macro uses Word’s built-in Find and Replace capabilities to find all instances of two paragraph marks in sequence. The macro doesn’t replace the sequential paragraph marks; it only finds them. The second part loops for as long as the Selection.Find.Found property is true, deleting the second of the two sequential paragraph marks each time.

This approach is used because it leaves the formatting intact on the remaining paragraph mark. If consecutive paragraph marks are replaced with a single paragraph mark, important formatting might be lost.

For a similar reason, never use the ‘^13’ character in the Replace field. Because Word stores paragraph formatting on the paragraph mark, collapsing paragraphs doesn’t have the same consequences as collapsing spaces, where every character is identical.

Find and remove excess spaces

  1. In the Find What box, enter a single space followed by the characters ‘{2,}’.
  2. In the Replace With box, type a single space character.
  3. Make sure the Use Wildcards checkbox is selected.
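
The three steps above can also be sketched as a macro. This is an illustrative example only; the macro name is hypothetical and it runs over the whole active document.

```vba
Sub CollapseSpaces()
    Dim rng As Range
    Set rng = ActiveDocument.Content

    With rng.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        ' A single space followed by {2,} matches two or more spaces in a row
        .Text = " {2,}"
        ' Replace each run with one space
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        ' Equivalent to selecting the Use Wildcards checkbox
        .MatchWildcards = True
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```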

Find numbering

word_regex_numbering_3.PNG

Find numbering prefixed by upper/lower case letter

word_regex_numbering_4.PNG

Find word then numbering

word_regex_numbering_5.PNG

“Smart” quotation marks

Left and right curly quotation marks (also known as “smart quotes”) can be typed on a Windows keyboard using Alt+0147 for “ and Alt+0148 for ” on the numeric keypad. Alternatively, use the AutoCorrect function in Word to replace straight quotation marks.
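
The AutoCorrect route can be applied to an existing document with a macro. The sketch below relies on the commonly reported behaviour that Find and Replace honours the “straight quotes with smart quotes” setting while it is switched on; treat that as an assumption and test on a copy. The macro name is hypothetical.

```vba
Sub StraightToCurlyQuotes()
    Dim previousSetting As Boolean

    ' Remember the user's setting, then switch smart quotes on
    previousSetting = Options.AutoFormatAsYouTypeReplaceQuotes
    Options.AutoFormatAsYouTypeReplaceQuotes = True

    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        ' """" is VBA for a string holding one straight double quotation mark
        .Text = """"
        .Replacement.Text = """"
        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        .MatchWildcards = False
        .Execute Replace:=wdReplaceAll
    End With

    ' Restore the original setting
    Options.AutoFormatAsYouTypeReplaceQuotes = previousSetting
End Sub
```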

Paragraph numbers into text

There are times when it’s best to freeze the numbers on headings and paragraphs rather than allowing them to update. 

Selection.Range.ListFormat.ConvertNumbersToText

This one-line macro turns the automatically incrementing numbers on the selected Word headings and paragraphs into plain text so they won’t change. Use carefully!
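
To run the line it must sit inside a Sub. The following sketch shows one wrapper for the current selection and one for the whole document; the macro names are hypothetical, and it is safest to run either on a copy of the file.

```vba
Sub FreezeSelectedNumbers()
    ' Converts automatic numbering in the current selection to typed text
    Selection.Range.ListFormat.ConvertNumbersToText
End Sub

Sub FreezeAllNumbers()
    ' Converts automatic numbering throughout the active document
    ActiveDocument.ConvertNumbersToText
End Sub
```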
