SonghaySystem(::)

Clean HTML for Word 2000/2002

Clean HTML is a Microsoft Word Template that creates “clean” HTML equivalents of the contents of Word 2000/2002 files. The template translates Microsoft Word 2000/2002 documents into W3C-compliant HTML strings.1 These HTML strings, created from highlighted portions or entire Word documents, can then be copied and pasted into other applications using the Windows Clipboard, or saved to a text file. At the same time, Clean HTML translates a large number of typographic symbols (like curly quotes “” or the em dash —) and “extended” Latin characters into the appropriate HTML entities. Clean HTML takes advantage of the rich user interface of Microsoft Word 2000/2002 without sacrificing vendor-independent standards. Clean HTML also writes HTML 4.0 text files that can automatically link to external CSS and script files.

Clean HTML Known Issues

The following table summarizes the changes made to the Clean HTML template file and the “known issues.” Because Microsoft Word VBA Projects do not have versioning features by default, the table below marks updates by date:

Date Description
9/10/04 Updated the code to reflect locale-specific, built-in Style names for international versions of Word.
9/7/04 Clean HTML works correctly in Office Word 2003, but the Clean HTML template has to be loaded at the start of every editing session, after a security warning prompt.

Promise: A future version of this product will be based on .NET and its security technology to avoid this issue (among others). Spending upwards of $400US on Authenticode stuff is out of the question!

2/24/2003 The way Clean HTML handles Word tables has been completely revised and is documented below.
2/24/2003 Improved support for Clean HTML Character Styles in combination with Paragraph Styles added. Support for the Word 2002 Target Frame… entry added to Hyperlink translation.
8/29/2002 Having multiple document windows open in Word 2002 may cause Clean HTML to disappear behind an open window. Flipping through windows using Alt + Tab helps.
8/29/2002 Currently Clean HTML ignores Bookmarks. This implies that a heading (e.g. a paragraph of style Heading 2) will not be translated if it is being used as a Bookmark.

Note: Pasting HTML into a Microsoft Word file may cause Bookmarks to appear, even if Match Destination Formatting is used in Word 2002 and above.

7/11/2002 Fixed problems with List Bullet and List Number styles.
7/11/2002 Fixed various problems associated with the first character of the document being formatted.
7/11/2002 Clean HTML had fatal problems with international versions of Microsoft Word that are not based on the English language. Clean HTML is not known to run correctly with international versions of Microsoft Word.

NOTE: We believe this problem is corrected. See 9/10/04 above.

7/11/2002 “Hidden” characters are now shown during Clean HTML translation to prevent problems associated with Word 2002.
1/20/2002 The ability to save Clean HTML text files to disk was added.
1/20/2002 Clean HTML is unable to translate paragraphs highlighted within a table.

Workaround: copy and paste the text into a new document without the table and run Clean HTML.

1/20/2002 Clean HTML moves through Footnote characters very, very slowly.

Clean HTML Installation

Because of the awesome negative impact of “macro viruses,” Clean HTML will not deploy itself automatically after running some kind of setup program. Installation of Clean HTML must be done manually by copying Clean HTML.dot to a Word 2000/2002 Startup folder and loading it with the Templates and Add-Ins… command under the Word 2000/2002 Tools menu.2 The Templates and Add-Ins… dialog shown below should summarize the installation of Clean HTML:

Install Clean HTML.dot in your StartUp folder.

After Clean HTML has been properly installed, it can be run from the Tools > Macro > Macros… dialog:

Run Clean HTML from this dialog.

Or, if you are familiar with the customizing features of Word 2000/2002, you can assign Clean HTML to a button and place this button on one of the Tool Bars. You can set up a button that looks like this:

Press this button or enter Alt + c to run Clean HTML

This means that you can run Clean HTML by pressing the button or typing Alt + c.

Clean HTML Functionality

Clean HTML processes a Word 2000/2002 document with two logical loops. The first loop moves through the Collection of Paragraphs in a Word document. Within each Paragraph, the second loop moves through each Word.3 Additionally, there is a command to wrap the Clean HTML in document-level elements (including customizable meta elements) and save to an external file. Each loop looks for certain conditions to determine what to translate to HTML. Let’s call the first loop the Paragraph Loop—and the second loop the Word Loop. Let’s specify what each loop does:

Paragraph Loop

Condition Response
A paragraph with Style Code Block.

Send this paragraph to the Word Loop (see below). Place the output of this loop in a pre element.

This is a custom Style created for Clean HTML.

A paragraph with Style Hidden Block.

Clean HTML ignores this paragraph.

This is a custom Style created for Clean HTML.

A paragraph with Style HTML Block.

Directly translate the text in this paragraph as HTML.

This is a custom Style created for Clean HTML.

A paragraph with Style List Bullet. Send this paragraph to the Word Loop (see below). Wrap the output of this loop in li elements. If it is the first paragraph of this style, prefix an opening ul element. Suffix the last paragraph with a closing ul element.
A paragraph with Style List Number. Send this paragraph to the Word Loop (see below). Wrap the output of this loop in li elements. If it is the first paragraph of this style, prefix an opening ol element. Suffix the last paragraph with a closing ol element.
A paragraph with Style Normal. Send this paragraph to the Word Loop (see below). Wrap the output of this loop in p elements.
A paragraph with Style Style Block.

Send this paragraph to the Word Loop (see below). Do not wrap the output of this loop in p elements.

This style allows Word 2000/2002 paragraphs to be wrapped in customized HTML written in the Style HTML Block.

This is a custom Style created for Clean HTML.

A paragraph with Style Heading 1. Send this paragraph to the Word Loop (see below). Wrap the output of this loop in h1 elements.
A paragraph with Style Heading 2. Send this paragraph to the Word Loop (see below). Wrap the output of this loop in h2 elements.
A paragraph with Style Heading 3. Send this paragraph to the Word Loop (see below). Wrap the output of this loop in h3 elements.
A paragraph with Style Heading 4. Send this paragraph to the Word Loop (see below). Wrap the output of this loop in h4 elements.
A paragraph with Style Heading 5. Send this paragraph to the Word Loop (see below). Wrap the output of this loop in h5 elements.
A paragraph with Style Heading 6. Send this paragraph to the Word Loop (see below). Wrap the output of this loop in h6 elements. Additional headings are ignored.
A paragraph consisting of one in-line Shape.

Translate to an HTML block containing one img element.

If the in-line Shape is not connected to an external file, the src attribute is determined by reading a URI in the Alternative text box under the Format Image…>Web tab. If Alternative Text is specified then we have a pipe-delimited list of the form <uri>|<alt attribute>.

Failing this, the formatting stops and an alert message displays.

The width and height attributes of the img element are calculated with the following formulae:

VBA.Round(.Width / 0.75, 0)

and

VBA.Round(.Height / 0.75, 0)

where 0.75 is a conversion factor to pixels from the units used by the Shape.

A paragraph inside of a Table.

If this paragraph is the first cell of the table then it is translated to a temporary marker of the form <tablen> where n starts at 1 and counts up to the number of “top-level” tables in the document.4

At the end of the Paragraph Loop these temporary tags are replaced with the ordered Collection of tables in the Word document.

The first row of a table is translated to th elements. As of this writing, vertically merged rows are not supported by Microsoft Word VBA (see below for more details).

The table element has the following attributes:

<table
    align="<left>|<center>|<right>"
    class="cleanHTMLTable"
    id="cleanHTMLTable_n_CSS_ID">

where n corresponds to the index of the table in the Word Tables Collection; CSS_ID is an optional custom file property used to uniquely identify the table among tables translated out of different Word documents (see “Formatting Tables with CSS Level 2” below).

The align attribute of the table element is determined by the setting under Table Properties…> Table > Alignment. To prevent the table from floating next to subsequent block elements, <br clear="all"> is appended at the end of the table.

The tr opening element has the following attributes:

<tr
    class="cleanHTMLTableRow"
    id="cleanHTMLTableRow_n_r_CSS_ID">

where n corresponds to the index of the table in the Word Tables Collection and r is the respective Index property of the Word Row object of the Table; CSS_ID is an optional custom file property used to uniquely identify the table among tables translated out of different documents (see “Formatting Tables with CSS Level 2” below).

The th or td opening element has the following attributes:

<th
    align="<center>|<left>|<right>"
    class="cleanHTMLTableHeader"
    colspan="n"
    id="cleanHTMLTableHeader_n_r_c_CSS_ID"
    valign="<top>|<middle>|<bottom>">

or

<td
    align="<center>|<left>|<right>"
    class="cleanHTMLTableData"
    colspan="n"
    id="cleanHTMLTableData_n_r_c_CSS_ID"
    valign="<top>|<middle>|<bottom>">

where the align attribute is either center, left or right (justified paragraphs are not supported by Clean HTML until hyphenation is supported in the “mainstream” web browsers). The colspan attribute applies only to a cell merged across the whole row, so n for colspan always equals the maximum number of columns in the table (a number greater than one). This is illustrated below in “Tables with Horizontally Merged Cells.”

n for the id attribute corresponds to the index of the table in the Word Tables Collection; r is the respective Index property of the Word Row object of the Table; c is the respective ColumnIndex property of the Word Cell object; CSS_ID is an optional custom file property used to uniquely identify the table among tables translated out of different documents (see “Formatting Tables with CSS Level 2” below).

The valign attribute is determined by the setting under Table Properties…> Cell > Vertical alignment.
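The Paragraph Loop rules in the table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the VBA code inside the template: the function names, the (style, text) input pairs and the fallback to p elements for unlisted styles are all hypothetical, and tables and pictures are left out.

```python
# Hypothetical sketch of the Paragraph Loop; names are illustrative only.
HEADINGS = {f"Heading {n}": f"h{n}" for n in range(1, 7)}
LISTS = {"List Bullet": "ul", "List Number": "ol"}


def word_loop(text):
    """Stand-in for the Word Loop; this sketch returns the text unchanged."""
    return text


def paragraph_loop(paragraphs):
    """Translate (style, text) pairs into a list of HTML lines."""
    out = []
    open_list = None  # "ul", "ol" or None
    for style, text in paragraphs:
        wanted = LISTS.get(style)
        if open_list and open_list != wanted:
            out.append(f"</{open_list}>")  # suffix after the last list paragraph
            open_list = None
        if wanted and open_list is None:
            out.append(f"<{wanted}>")  # prefix before the first list paragraph
            open_list = wanted
        if style == "Hidden Block":
            continue  # ignored
        if style == "HTML Block":
            out.append(text)  # taken directly as HTML
        elif style == "Style Block":
            out.append(word_loop(text))  # no p wrapper
        elif style == "Code Block":
            out.append(f"<pre>{word_loop(text)}</pre>")
        elif wanted:
            out.append(f"<li>{word_loop(text)}</li>")
        elif style in HEADINGS:
            tag = HEADINGS[style]
            out.append(f"<{tag}>{word_loop(text)}</{tag}>")
        else:
            # Normal; treating every other style the same way is an assumption
            out.append(f"<p>{word_loop(text)}</p>")
    if open_list:
        out.append(f"</{open_list}>")
    return out
```

Calling paragraph_loop with a Heading 1 paragraph followed by two List Bullet paragraphs yields one h1 line and a single ul element holding two li elements.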


Formatting Tables with CSS Level 2

Clean HTML table formatting depends almost entirely on CSS Level 2 and its support in “mainstream” browsers. As of the revision date of this document, the support for CSS Level 2 is satisfactory, especially for the formatting of HTML tables. This dependency gives Clean HTML the highest level of flexibility. Clean HTML assigns a unique ID to each table, each row and even each cell on the HTML page. By using ID selectors (from CSS Level 1), a cascading style sheet can reach any table element, including the table itself.

Tables translated to Clean HTML from different Word documents can still be uniquely identified by using the CSS_ID custom file property discussed below under “Translation of Word 2000/2002 File Properties into Clean HTML.”
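To make the ID scheme concrete, here is a minimal Python sketch that builds ids in the documented form and turns one into a CSS ID selector. The helper names are hypothetical, and leaving off the trailing part when no CSS_ID property is set is an assumption; the source does not show what the id looks like in that case.

```python
def element_id(kind, *indexes, css_id=None):
    """Build an id such as cleanHTMLTableData_1_2_3_essay.

    kind is "Table", "TableRow", "TableHeader" or "TableData";
    indexes are the table, row and column numbers, as far as they apply.
    """
    parts = [f"cleanHTML{kind}", *map(str, indexes)]
    if css_id:  # optional CSS_ID custom file property
        parts.append(css_id)
    return "_".join(parts)


def id_selector(element_id_value, declarations):
    """Return a CSS rule that targets exactly one element by its id."""
    return f"#{element_id_value}{{{declarations}}}"
```

For example, id_selector(element_id("TableData", 1, 2, 3, css_id="essay"), "background:#eeeeee;") produces a rule for the third cell in the second row of the first table of the document whose CSS_ID is essay.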

The style block below is one of the simplest designs for HTML tables:

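<style type="text/css"><!--
    /* one outer border; adjacent cell borders merge */
    .cleanHTMLTable{
        border:solid 1px #000000;
        border-collapse:collapse;
    }
    /* the same border on every header and data cell */
    .cleanHTMLTableHeader,.cleanHTMLTableData{
        border:solid 1px #000000;
    }
-->
</style>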

Most importantly, the border-collapse property is the key to making an HTML table “look like” an HTML table rendered without Cascading Style Sheets. More advanced table formatting techniques are discussed in detail at W3.org.

Tables with Horizontally Merged Cells

The following table should translate into Clean HTML and produce the expected results:

A Word Table with Horizontally Merged Cells

Cell One Cell Two Cell Three Cell Four
As of this writing, Clean HTML only supports horizontally merged cells across all columns of the table.
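The colspan rule for merged rows can be sketched as follows. This is a hedged Python illustration, not the template's code: translate_row is a hypothetical name, and the sketch assumes that a row merged across all columns arrives as a single cell.

```python
def translate_row(cells, max_columns, tag="td"):
    """Translate one table row; cells is the list of cell texts."""
    if len(cells) == max_columns:
        # ordinary row: one element per cell, no colspan
        return "".join(f"<{tag}>{c}</{tag}>" for c in cells)
    if len(cells) == 1 and max_columns > 1:
        # one cell merged across every column of the table
        return f'<{tag} colspan="{max_columns}">{cells[0]}</{tag}>'
    raise ValueError("only cells merged across all columns are supported")
```

With the sample table above, the first row gives four cells without colspan and the second row gives one cell with colspan="4"; a row merged across only some of the columns raises an error in this sketch.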

Word Loop

Condition Response
Characters formatted as Bold. Translate to strong elements. This translation will not take place for paragraphs with heading styles (e.g. Heading 1).
Characters formatted as Italic. Translate to em elements. This translation will not take place for paragraphs with heading styles (e.g. Heading 1).
Characters formatted as Underline (or any of the other underline character formats—e.g.: Double Underline). Translate to span elements with style attribute. (u elements have been deprecated in HTML 4.0.)
Characters formatted as Strikethrough (e.g.: strikethrough). Translate to span elements with style attribute. (strike elements have been deprecated in HTML 4.0.)
Characters with Style Code Block. Translate to code elements. Note that this is a paragraph-level Word style.
Characters with Style Code Line. Translate to code elements.
Characters with Style Footnote Reference. Word automatically creates this Style when Footnotes are inserted.

Translate to a elements pointing to named anchors (“HTML endnotes”) programmatically appended to the end of the Clean HTML.

Native Word 2000/2002 Endnotes are ignored.

Characters with Style Hyperlink.

Translate to a elements with the Screen Tip… entry placed in the alt attribute and the Word 2002 Target Frame… entry placed in the target attribute.

Word Bookmark locations are not supported.

The Line Break character. If this character is found in a paragraph of style Code Block, translate to carriage return and line feed characters; otherwise translate to the <br> element.
Characters generated by a Field object. Translate according to the rules for the Style names described above. Warning: only fields that produce results (e.g. formulas generating string values) have been tested with Clean HTML.
Characters within Bookmarks. Bookmarks are ignored by Clean HTML.
Characters within Comments. Comments are ignored by Clean HTML.
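A few of the Word Loop rules above can be sketched in Python. The names are hypothetical, and the exact value of the style attribute on span elements is an assumption, since the table only says that a style attribute is used. Each run carries a single format here, which matches the advice against “multiple formatting” later in this document.

```python
# Hypothetical mapping from one character format to its wrapping element.
WRAPPERS = {
    "bold": ("<strong>", "</strong>"),
    "italic": ("<em>", "</em>"),
    # the style values below are assumed, not taken from Clean HTML
    "underline": ('<span style="text-decoration:underline">', "</span>"),
    "strikethrough": ('<span style="text-decoration:line-through">', "</span>"),
    "code": ("<code>", "</code>"),
}


def translate_run(text, fmt=None, paragraph_style="Normal"):
    """Wrap one run of characters according to its single format."""
    if fmt is None:
        return text
    if paragraph_style.startswith("Heading") and fmt in ("bold", "italic"):
        return text  # no strong or em inside heading paragraphs
    open_tag, close_tag = WRAPPERS[fmt]
    return f"{open_tag}{text}{close_tag}"


def translate_line_break(paragraph_style):
    """Line Break: CR+LF inside a Code Block, a br element elsewhere."""
    return "\r\n" if paragraph_style == "Code Block" else "<br>"
```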

HTML Entities

The Word Loop looks for “special characters” to translate into HTML entities. The following table summarizes the characters supported by Clean HTML:

Character Codes Translated into Clean HTML

0–63: " & < >
128–159: ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ ‘ ’ “ ” • – — ˜ ™ š › œ Ÿ
160–191: ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
192–223: À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
224–255: à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
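The entity translation can be approximated with Python's standard html.entities table. This is a swapped-in technique, not Clean HTML's own lookup: the standard table is keyed by Unicode code point rather than by the Windows character codes listed above, and it covers more characters than Clean HTML supports. The function name to_entities is hypothetical.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace every character that has a named HTML 4 entity."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))  # e.g. 233 -> "eacute"
        out.append(f"&{name};" if name else ch)
    return "".join(out)
```

Plain letters, digits and spaces pass through unchanged, while curly quotes, the em dash, accented Latin letters and the four markup characters come out as entities such as &ldquo;, &mdash;, &eacute; and &lt;.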

Pictures

Here’s a “picture paragraph” of Clean HTML output:

This is how an alt attribute is included for this picture.

Its original linking information, paragraph alignment and Alternative text settings are translated into Clean HTML. As of this writing, Hyperlink information assigned to objects other than Range objects is ignored. The next picture paragraph shows the Alternative text settings:

Alternative text settings
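The picture rules from the Paragraph Loop (a pipe-delimited Alternative text entry and the 0.75 conversion factor) can be sketched like this. The function names and the attribute order are hypothetical; the sketch covers only the case where the src attribute comes from the Alternative text box. Python's round, like VBA's Round, rounds halves to the nearest even number.

```python
def points_to_pixels(points):
    """Same arithmetic as VBA.Round(value / 0.75, 0)."""
    return round(points / 0.75)


def img_element(alternative_text, width_pt, height_pt):
    """Build an img element from '<uri>|<alt attribute>' and a size in points."""
    uri, sep, alt = alternative_text.partition("|")
    if not uri:
        # the source says formatting stops with an alert in this case
        raise ValueError("no URI found in the Alternative text")
    attrs = [f'src="{uri}"']
    if sep:
        attrs.append(f'alt="{alt}"')
    attrs.append(f'width="{points_to_pixels(width_pt)}"')
    attrs.append(f'height="{points_to_pixels(height_pt)}"')
    return "<img " + " ".join(attrs) + ">"
```

A Shape measuring 150 by 75 points therefore becomes an img element of 200 by 100 pixels.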

Saving Clean HTML to a Text File

The Clean HTML output window shows the Save button. This command saves a text file of Clean HTML adding “document-level” tags according to the following rules:

Rule Remarks
The file will be written in Unicode text format. Any information suggesting that the Unicode format has a negative impact on a system is currently beyond the scope of this document.
The HTML will be level 4.0 transitional. This is denoted by the DOCTYPE declaration.
Word 2000/2002 File Properties are translated into meta and base elements. Word 2000/2002 has both “built-in” file properties and Custom File Properties. Both of these property types are found in the Properties dialog under the File menu (see below for more details).
Clean HTML recognizes Custom File Properties that create references to external style sheets and script files. Some of these Properties are undocumented (see below for more details).
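The saving rules above can be sketched in Python as follows. Two things here are assumptions rather than facts from this document: that “Unicode text format” means UTF-16 with a byte order mark (the usual Windows meaning), and the exact layout of the wrapper elements. The function names are hypothetical.

```python
# Public identifier for HTML 4.0 Transitional
DOCTYPE = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">'


def wrap_document(body_html, head_html=""):
    """Add the document-level elements around translated HTML."""
    return "\n".join([
        DOCTYPE,
        "<html>",
        "<head>",
        head_html,  # title, meta, base and link elements go here
        "</head>",
        "<body>",
        body_html,
        "</body>",
        "</html>",
    ])


def save_document(path, body_html, head_html=""):
    """Write the wrapped document as UTF-16 text with CR+LF line ends."""
    with open(path, "w", encoding="utf-16", newline="\r\n") as f:
        f.write(wrap_document(body_html, head_html))
```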

Translation of Word 2000/2002 File Properties into Clean HTML

The dialog tabs shown below show a portion of the built-in and custom file properties:

Word 2000/2002 Built-in Properties Word 2000/2002 Custom Properties

The table below summarizes the translation of these properties into Clean HTML elements:

Word Property Clean HTML Element
General > Created
General > Modified

Each is directly translated into one meta element:

<meta
    name="date created"
    content="5/16/2001 6:58:00 PM">
<meta
    name="date modified"
    content="1/20/2002 9:34:00 PM">
Summary > Title

Direct translation into the text within the title element:

<title>Clean HTML</title>
Summary > Subject

Direct translation into one meta element:

<meta
    name="subject"
    content="W3C Compliant HTML
from Word 2000/2002">
Summary > Author

If a pipe-delimiter is not found then translate into one meta element:

<meta
    name="author"
    content="Bryan D. Wilhite">

If a pipe-delimiter is found then assume that the text is a delimited string of form author|email and add a reply-to meta element making two meta elements:

<meta
    name="author"
    content="Bryan D. Wilhite">
<meta
    name="reply-to"
    content="rasx@kintespace.com">
Summary > Manager

Direct translation into one meta element:

<meta name="manager" content="None">
Summary > Company

Direct translation into one meta element:

<meta
    name="company"
    content="Songhay System">
Summary > Category

Direct translation into one meta element:

<meta
    name="category"
    content="Utilities">
Summary > Keywords

Direct translation into one meta element:

<meta
    name="keywords"
    content="HTML, Word 2000/2002,
W3C, Clean HTML for Word 2000/2002,
Songhay System">
Summary > Comments

Direct translation into one meta element:

<meta
    name="description"
    content="These Summary attributes
should appear in the meta tags of the
HTML document.">
Summary > Hyperlink base

Direct translation into one base element:

<base
    href="http://www.kintespace.com"
    target="_self" >
Custom > Name > [name]
Custom > Value > [value]

These default, custom name-value pairs are scanned by Clean HTML and each pair directly translates into one meta element:

<meta name="[name]" content="[value]">
Custom > Name > CSS
Custom > Value > [URI]

This is a special name-value pair that Clean HTML translates into a link element that refers to an external style sheet:

<link
    rel="stylesheet"
    type="text/css"
    href="[URI]">

Specifically, based on the example in the image above, we have:

<link
    rel="stylesheet"
    type="text/css"
    href="../../../root.css">
Custom > Name > CSS_ID
Custom > Value > [string]
This is a special name-value pair that Clean HTML uses to make the id attributes of table elements, tr elements, th elements and td elements unique.
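The property translations in the table above can be sketched in Python. The dictionary keys, the function names and the HTML escaping of values are all assumptions added for illustration; this document does not say how Clean HTML names the properties internally or whether it escapes their values.

```python
from html import escape

# Summary > Comments is written with the name "description"
META_NAMES = {"comments": "description"}


def meta(name, content):
    return f'<meta name="{escape(name)}" content="{escape(content)}">'


def head_elements(properties):
    """Translate a dict of file properties into head elements."""
    lines = []
    for name, value in properties.items():
        if name == "title":
            lines.append(f"<title>{escape(value, quote=False)}</title>")
        elif name == "author":
            # "author|email" yields an extra reply-to meta element
            author, sep, email = value.partition("|")
            lines.append(meta("author", author))
            if sep:
                lines.append(meta("reply-to", email))
        elif name == "hyperlink base":
            lines.append(f'<base href="{escape(value)}" target="_self">')
        elif name == "CSS":
            lines.append(
                f'<link rel="stylesheet" type="text/css" href="{escape(value)}">'
            )
        elif name == "CSS_ID":
            continue  # only used inside table ids
        else:
            lines.append(meta(META_NAMES.get(name, name), value))
    return lines
```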

Flippant Remarks, Tips and Tricks

The idea behind Clean HTML is not to produce HTML that emulates the formatting of Office documents. That’s Microsoft’s job. Office XP subscribers may see some improvements on Microsoft’s previous attempts to produce useful HTML documents. In addition to product interoperability, we also need a clean HTML representation of the contents of Word documents to interact with systems based on vendor agnostic standards.

The elements in Clean HTML are in lowercase. However, as of this writing, Clean HTML does not produce XHTML or HTML streams to be inserted into well-formed XML documents.

As of this writing, Clean HTML does not explicitly support HTML 4.01. This W3C specification is known as a “subversion” of HTML 4.0. Many of the popular browsers (like Mozilla) read the 4.01 DOCTYPE and cause display “problems.” These problems may actually be the browser switching off “quirks mode” and showing the HTML according to standards. More Clean HTML research in this area will be needed!

The specifications in this document imply that Clean HTML works best with relatively “simple” documents. A “simple” document would only have the default Word Styles (like Normal and Heading 1) plus the styles from Clean HTML. When a Word Document has a table of contents, columns, comments, text boxes, multiple versions, customized styles, form fields, etc., Clean HTML may not work properly or may produce unexpected results.

Clean HTML will not produce well-formed HTML from Word Range objects containing characters with “multiple formatting.” For example, a word formatted with Bold, Italic and Underline will produce poorly formed HTML. In terms of both traditional typography and Clean HTML processing, using multiple formatting is not recommended. However, the following table summarizes some workarounds:

Word Formatting Clean HTML
<em>Experimental Typography (Working With Computer Type , No 4)</em> Experimental Typography (Working With Computer Type , No 4)
<em><strong>bold-italic</strong></em> bold-italic

Heading <em>Three</em>

Heading Three


Future enhancements to Clean HTML should include XHTML support. As of this writing, Clean HTML has been extensively tested on Word 2000. For obvious historical reasons, Clean HTML has not been tested extensively on Word XP (or Word 2002). A digital signature for Clean HTML will be forthcoming with increased sales of Clean HTML. For Clean HTML news, registration information, bug reports, etc., please mail.

Clean HTML is for the happy few who need to separate their lengthy prose from its visual HTML presentation. Such people enjoyed writing a good essay in Microsoft Word but did not enjoy how Microsoft decided to “help” publish that essay to the Internet. Now, with Clean HTML, we can still use Word and feel a little more secure that our word processing documents can move to the Internet very quickly, with links, endnotes, images and typographically correct characters, all under a standard Web-Consortium format, open to as many browsers as possible.

This document is rendered entirely via Clean HTML. Please support standards-enabling software and purchase Clean HTML today.

Endnotes

1 Clean HTML is not compatible with any other version of Microsoft Word (past or future). Clean HTML was only tested in Microsoft Windows installations capable of running Microsoft Office.

2 This manual procedure will also verify that you have enough security permissions to use Clean HTML without related errors.

3 Actually it loops through each character in each paragraph. Interestingly, there is no “Word” object in the Word VBA Object model. There is a Words Collection where each Item returns a Range Object. It turns out that the Range Object is as important to Word VBA as the Recordset Object is for Access VBA (or ADO).

4 “Top-level” implies that Word’s nested tables are not supported as of this writing.

 
This document was last reviewed on Friday, September 10, 2004 at 04:53 PM PDT.
Copyright © 2008 by Bryan D. Wilhite. All rights reserved. No part of this material may be used or reproduced in any form or by any means, or stored in a database or retrieval system, without prior written permission of the publisher except in the case of brief quotations embodied in critical articles and reviews. Making copies of any part of this material for any purpose other than your own personal use is a violation of United States copyright laws.

The information provided by Bryan D. Wilhite at kintespace.com is provided “as is” without warranty of any kind. In no event shall Bryan D. Wilhite or any of his affiliates be liable for any damages whatsoever including, but not limited to, direct, indirect, incidental, consequential, loss of business profits or special damages due to material published by Bryan D. Wilhite or any of his affiliates.