The subject of this document centers upon a labor-intensive way of making Word documents interchangeable with standard XML schemas related to formatting. The previous, January 2005, version of this document was marked up by hand with a modified XHTML 1.0 Strict Schema from the W3C. The document you are reading now is still authored in Word 2003, but how it was transformed into XHTML is outside the scope of this document and is fodder for a Blog entry or two…
Using XML in Word is very exciting. The exciting part of the process is not covered in this writing, so I strongly recommend looking up the MSDN Nugget “Using Word as an XML editor in Office 2003.” This motion picture presentation by the MSDN UK people, featuring Mike Taulty, provides an excellent introduction to “what the hell” I am talking about here.
Coyness aside, this primitive method of marking up a Word document with an XHTML schema works quite well for relatively short passages of prose without complex formatting. The rumors out there suggest that most Microsoft Word users (myself included) barely use 10 percent of the formatting features in the program. So the effort required to write this article using this technique appears worthwhile. By the end of this document, I should be able to share my opinion regarding how painful this process is! I am not hurting yet.
The XHTML 1.0 Strict Schema from the W3C is not considered “well-formed” by Microsoft Office Word 2003 (11.6359.6360 SP1) or Microsoft Office InfoPath 2003 (11.6357.6360 SP1). Fortunately, InfoPath provides some clues as to why this is the case. As of this writing, Microsoft regards any namespace beginning with xml, in any letter-case combination, as reserved. This makes sense when we agree that user-defined schemas used by Office System should not redefine XML itself. So I opened the XHTML 1.0 Strict Schema in Visual Studio .NET and ran a find/change operation to change references to the namespace xml to w3xml. Office System Word 2003 was then able to open the schema. We can see how it appeared in the Schema Library at right.
Note how there is a second schema named XML 1998 in the list. The XHTML 1.0 Strict Schema makes an external reference to this schema, but a message box in Word clearly states that it will not be used because it is reserved. I left this external reference in the document out of respect for the original author(s), assuming it won’t come back to haunt me later! For more information about XML schemas and Office System Word documents, please see the MSDN Article “New XML Features of the Microsoft Office Word 2003 Object Model.”
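For readers who would rather script the find/change operation than run it in an editor, here is a minimal sketch in Python. It assumes that renaming the xml prefix wherever it is used as a prefix (in the xmlns:xml declaration and in qualified names such as xml:lang) is what the change amounts to; the function name and the sample fragment are hypothetical, and the exact replacements made in Visual Studio .NET may have differed.

```python
import re


def rename_xml_prefix(schema_text: str, new_prefix: str = "w3xml") -> str:
    """Rename the reserved `xml` prefix wherever it is used as a prefix.

    Assumption: only the prefix needs to change; namespace URIs and the
    XML declaration (<?xml ...?>) are left untouched.
    """
    # Rewrite the prefix declaration: xmlns:xml="..." -> xmlns:w3xml="..."
    text = re.sub(r"\bxmlns:xml(?==)", f"xmlns:{new_prefix}", schema_text)
    # Rewrite qualified names such as ref="xml:lang" -> ref="w3xml:lang".
    # The lookbehind skips names that merely end in "xml".
    text = re.sub(r"(?<![\w:.-])xml:(?=[A-Za-z_])", f"{new_prefix}:", text)
    return text


# Hypothetical fragment, shaped like a schema that references xml:lang.
sample = (
    '<xs:schema xmlns:xml="http://www.w3.org/XML/1998/namespace">'
    '<xs:attribute ref="xml:lang"/>'
    "</xs:schema>"
)
print(rename_xml_prefix(sample))
```

Run this against a copy of the schema so the original file from the W3C stays intact.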
Automating This Process
The XML features in Office System Word 2003 are new and leave much to be desired. It would be great, for example, to apply XML elements in formatting- and style-based find/change operations. In the meantime, Microsoft has released a few add-ons to Word such as the Microsoft Office Word 2003 XML Toolbox and the Word XML SDK.
The title Element Is Required
In XHTML, the title element is required. This means that validation errors may appear when the html element is present without the title element nested in the head element.
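The title requirement can be checked outside of Word once the markup has been saved. This is a minimal sketch using Python's standard library; it assumes the saved output is well-formed XML in the XHTML namespace, and the function name is hypothetical. It only tests for the presence of the element and is not a substitute for schema validation.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def has_title(xhtml_text: str) -> bool:
    """Return True when html/head/title exists in the given XHTML text."""
    root = ET.fromstring(xhtml_text)
    # ElementTree spells namespaced names as {namespace-uri}local-name.
    title = root.find(f"{{{XHTML_NS}}}head/{{{XHTML_NS}}}title")
    return title is not None


with_title = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Example</title></head><body/></html>"
)
without_title = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>'
)
print(has_title(with_title), has_title(without_title))
```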
Doing Twice the Formatting
Inserting hyperlinks definitely requires entering data twice. The href attribute of the a element needs the same data that’s in the Address: field in the Edit Hyperlink… dialog. I am sure that at least one or two minds at Microsoft are perfectly fine with this level of inconvenience so that we users are ‘inspired’ to stick with WordML.
Tables and Images
Marking up tables with a formatting schema can get tedious, but some mitigating factors include tabbing into a new row and having markup automatically inserted. What is very impressive is that the open and closing … The … Most modern Web browsers expect the …
Cutting, Copying and Pasting
Cutting, copying and pasting selections that include XML elements work as expected when the operation takes place in the same document. To have the expected results across documents, the same schema must be loaded in the destination document as that in the source document.
Post-processing
In order to publish this document on the Internet I had to do more work by hand, which has my mind frantically working to devise ways to make the process better. On the other hand, the post work was not so bad that I am unwilling to wait for a decent version of Visual Studio Tools for Office. Out of the box, you will have to save a copy of your Word document to an XML file. The image below summarizes the selected save options:
[Image: selected save options]
When you save the file you should receive a warning message that might be considered humorous in this context:
[Image: warning message]
Typographic Entities
From the very beginning of my wired life, since HTML 3.x, I have always taken my work habits from desktop publishing, typography and graphic design along with me. This means I need to see em dashes and curly quotes in my English prose. This typographic desire corresponds, in part, with the subject of entities in the world of XML. The XML.com article “Entities: Handling Special Content” by John E. Simpson summarizes what a proper XML parser should know about entities, “If what appears between the …
What this means for Songhay enterprise workflow is that the final destination of entities in textual data is in the presentation, most likely our contemporary web browser. Simultaneously, this textual data must store the actual glyphs; otherwise, in view of the available technology, editing textual data full of ampersands and semicolons used to represent glyphs will be even more frustrating and tedious.
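One way to picture this split between stored glyphs and presentation-time entities is the following minimal sketch in Python. It assumes the stored text is plain text (not markup) and that the presentation layer is content with numeric character references rather than named entities; the function name is hypothetical and this is not the workflow used to publish this document.

```python
import html


def glyphs_to_references(text: str) -> str:
    """Convert stored glyphs to numeric character references for output."""
    # Escape markup-significant characters (&, <, >) first.
    escaped = html.escape(text, quote=False)
    # Every remaining non-ASCII glyph becomes a decimal reference (&#NNNN;).
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


# The stored text keeps the real glyphs; \u escapes are used here only
# so that this listing itself stays ASCII.
stored = "AT&T \u2014 \u201cWord\u201d"
print(glyphs_to_references(stored))
```

The stored text stays readable and editable, and the ampersands and semicolons appear only in the generated output.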
RAM Requirements
The XHTML schema is huge! When this schema is loaded into Office Word 2003, the RAM footprint skyrockets toward 80MB and above.
This technique is a brute-force solution for people who are not familiar with VSTO and/or XSLT but who still (somehow) want to produce XHTML. In fact, one of these Word documents can serve as a cut-and-paste data source for multiple Web documents. The View XML > Current Selection (Data Only) command in the Microsoft Office Word 2003 XML Toolbox can go a long way toward producing a crude workflow scenario where a few Word documents can support multiple Web sites (and Blog entries).
It should be said that I am not satisfied with this solution; it is published here in order to inform those writers and publishers out there who may not want to use Blogger™ for Word or Dreamweaver as a word processor.