Word regex
Often the best way to import Word documents into PageSeeder is to clean them up first, and the best place to do that clean-up is in Word itself.
Unaware that the Word Find and Replace supports a form of Regular Expressions, many users struggle with the clean-up task. This article provides some practical examples of the Word regex syntax known as “wildcards”.
Wildcards are not quite the same syntax that most developers are familiar with, but their inbuilt knowledge of the DOCX format makes the differences a little more bearable.
Examples of this built-in knowledge include the following special characters:
Character | Wildcard expression |
---|---|
Opening field bracket | ^19 (for example, ^19 REF finds every cross-reference) |
Graphic | ^g |
Tab character | ^t |
Page or section break | ^m |
Column break | ^n |
For a quality set of examples, see this page.
Find and Replace
The following is an example of how wildcard Find works in Word and how it can be used to capture dates that have been expressed inconsistently in a document.
([0-9]{1,2})[ ](<[AFJMNSOD]*>)[ ]([0-9]{4})
- “(” and “)” group parts of the pattern. Groups are numbered from left to right in the “Find what” field and can be output in any order in the “Replace with” field. For example, “\1 \2 \3” outputs the groups in the order they were found, and “\3 \2 \1” reverses them.
- “[” and “]” delimit a set of characters, any one of which can match, and a hyphen inside the brackets specifies a range.
- Within the pattern, “<” and “>” mark the start and end of a word, the asterisk “*” matches any sequence of characters and the question mark “?” matches any single character.
- “{” and “}” specify how many times the preceding item must occur, with a comma “,” separating the minimum and maximum counts.
Therefore, when looking for the date pattern above, the expression would be processed as follows:
- ([0-9]{1,2})[ ] finds any digit between zero and nine that occurs one or two times, followed by a space. This matches the day of the month.
- (<[AFJMNSOD]*>)[ ] says that after matching the first pattern, Word looks for the start of a word, then one of the listed capital letters (the initial letters of the English month names), then any characters up to the end of the word, followed by a space. This matches the month of the year.
- ([0-9]{4}) looks for four digits after that space. This matches a year expressed as four digits.
Of course, once the dates have been found, it is best to wrap them in a character style so they are easier to find again.
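The search and the styling can be combined in a short macro. The sketch below is an illustration rather than part of the original procedure: it uses the three-group form of the date pattern from the breakdown above, and the style name FoundDate is hypothetical (the macro creates it as a character style if the document does not have it). In regional settings where the list separator is a semicolon, Word may expect {1;2} rather than {1,2}.

```vba
Sub TagDatesWithCharacterStyle()
    ' Hypothetical style name; change it to suit your document.
    Const STYLE_NAME As String = "FoundDate"
    Dim dateStyle As Style

    ' Look the style up; create it as a character style if it is missing.
    On Error Resume Next
    Set dateStyle = ActiveDocument.Styles(STYLE_NAME)
    On Error GoTo 0
    If dateStyle Is Nothing Then
        Set dateStyle = ActiveDocument.Styles.Add( _
            Name:=STYLE_NAME, Type:=wdStyleTypeCharacter)
    End If

    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        ' Day, month and year are captured as groups 1, 2 and 3.
        .Text = "([0-9]{1,2})[ ](<[AFJMNSOD]*>)[ ]([0-9]{4})"
        ' Write the groups back in their original order.
        .Replacement.Text = "\1 \2 \3"
        ' Apply the character style to each replaced match.
        .Replacement.Style = STYLE_NAME
        .Forward = True
        .Wrap = wdFindStop
        .Format = True
        .MatchWildcards = True
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

Changing the replacement text to "\3 \2 \1" would reverse the order of the parts. Review the results afterwards: because “*” matches any sequence of characters, the pattern can occasionally match more than a month name.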
Removing excess returns
This macro removes excess paragraph marks from a document.
Sub ReplacePara()
    ' Start searching from the top of the document.
    Selection.HomeKey Unit:=wdStory
    Selection.Find.ClearFormatting
    With Selection.Find
        .Text = "^p^p"
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute
    While Selection.Find.Found
        ' Delete the second of the two paragraph marks just found.
        Selection.MoveRight Unit:=wdCharacter, Count:=1
        Selection.TypeBackspace
        ' Step back before searching again.
        Selection.MoveLeft Unit:=wdCharacter, Count:=2
        Selection.Find.Execute
    Wend
End Sub
The first part of the macro uses Word’s built-in Find capabilities to locate each instance of two paragraph marks in sequence. It doesn’t replace the sequential paragraph marks; it only finds them. The second part loops while the Selection.Find.Found property is true, deleting the second of the two paragraph marks each time.
This approach is used because it leaves the formatting intact on the remaining paragraph mark. If consecutive paragraph marks are replaced with a single paragraph mark, important formatting might be lost.
For a similar reason, never use the ‘^13’ character in the Replace field. Because Word stores the formatting on the paragraph mark, collapsing paragraphs doesn’t have the same consequences as collapsing spaces, where all values are the same.
Find and remove excess spaces
- In the Find What box, enter a single space followed by the characters ‘{2,}’.
- In the Replace With box, type a single space character.
- Make sure the Use Wildcards checkbox is selected.
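The same three steps can be run as a macro. This is a minimal sketch, assuming the whole document should be processed; the macro name is arbitrary.

```vba
Sub CollapseExtraSpaces()
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        ' A single space followed by {2,}: two or more spaces in a row.
        .Text = " {2,}"
        ' Replace each run with one space.
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        ' Equivalent to ticking the Use Wildcards checkbox.
        .MatchWildcards = True
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

Unlike paragraph marks, spaces carry no paragraph formatting, so a direct Replace All is a reasonable choice here.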
Find numbering
Description
Find numbering prefixed by upper/lower case letter
Description
Find word then numbering
Description
“Smart” quotation marks
Left and right curly quotation marks (also known as “smart quotes”) can be generated on a Windows keyboard using Alt+0147 for “ and Alt+0148 for ”, typed on the numeric keypad. Alternatively, use the AutoCorrect function in Word to replace straight quotation marks.
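For a document that already contains straight quotation marks, the automatic replacement can be applied in one pass. The sketch below relies on an assumption: that Word applies its “straight quotes with smart quotes” AutoFormat As You Type setting to text inserted by Find and Replace. This is commonly reported behaviour, but test it on a copy first. The macro name is arbitrary.

```vba
Sub StraightToCurlyQuotes()
    Dim previousSetting As Boolean

    ' Remember the user's current setting so it can be restored afterwards.
    previousSetting = Options.AutoFormatAsYouTypeReplaceQuotes
    Options.AutoFormatAsYouTypeReplaceQuotes = True

    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        ' In VBA, """" is a string containing one double quotation mark.
        .Text = """"
        .Replacement.Text = """"
        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        .MatchWildcards = False
        .Execute Replace:=wdReplaceAll
    End With

    Options.AutoFormatAsYouTypeReplaceQuotes = previousSetting
End Sub
```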
Paragraph numbers into text
There are times when it’s best to freeze the numbers on headings and paragraphs rather than allowing them to update.
Selection.Range.ListFormat.ConvertNumbersToText
This one-line macro turns the automatically incrementing numbers on Word headings and paragraphs into plain text so they won’t change. Use it carefully!
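To run the line from the Macros dialog, it needs to sit inside a Sub. A minimal sketch with hypothetical macro names: the first converts the numbering in the current selection, and the second applies the same conversion to the whole document through ActiveDocument.ConvertNumbersToText.

```vba
Sub FreezeSelectedNumbers()
    ' Convert automatic numbering in the current selection to typed text.
    Selection.Range.ListFormat.ConvertNumbersToText
End Sub

Sub FreezeAllNumbers()
    ' Convert automatic numbering throughout the active document.
    ActiveDocument.ConvertNumbersToText
End Sub
```

Working on a copy of the document is a sensible precaution, since the converted numbers no longer update.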