This page shows the source for this entry, with WebCore formatting language tags and attributes highlighted.

Title

Office Formats

Description

Microsoft recently released documentation for their <a href="http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx">binary office formats</a> in both PDF and their own XPS format. The PDF for Word weighs in at 2.8MB and has 210 pages.

<a href="http://www.joelonsoftware.com/items/2008/02/19.html" author="Joel Spolsky">Why are the Microsoft Office file formats so complicated?</a> provides a lot of good reasons for <i>why</i> the formats are so complicated (most rooted in history), like speed, complexity of the task, purely internal formats (until now), etc.

Where Spolsky veers off the path (and he almost always does) is in reaching a bit too far with his "workarounds". Instead of trying to load the binary formats yourself, he suggests simply launching Word or Excel as a COM object <iq>directly, even from ASP or ASP.NET code running under IIS</iq>. The caveat comes only later and tells only of a <iq>few gotchas</iq>, like it <iq>not [being] officially supported by Microsoft</iq>. He includes a link to a knowledge base article which uses a lot of words to say the equivalent of "for the love of the sweet baby Jesus, don't do this." Clearly, Spolsky was so enchanted by his prose and clever examples that he didn't think Microsoft explicitly countermanding his idea was enough reason not to publish it to the world<fn>. His advice to hide this type of solution behind a web service for Linux servers actually goes for ASP.NET servers as well.

If you need to read the Office format (or generate it), there are libraries that do this without using Office itself. The <a href="http://poi.apache.org/">POI</a> Java library from Apache works quite well for generating Excel and Word documents. If you're using .NET, you can hide the POI library behind a web service and call that instead. Even a Tomcat server to run a little web service won't weigh more than running Office on a Windows 2003 Server.
If you do have to run a Windows 2003 Server in a Linux environment, consider running it in a virtual machine under <a href="http://xen.xensource.com/">Xen</a> or some other virtualization solution.

Some of the other suggestions also indicate that Spolsky was just trying to fill out his bullet lists, like <iq>[o]pening an Excel workbook, storing some data in input cells, recalculating, and pulling some results out of output cells</iq>---that sounds like the kind of stuff you could just write in .NET or Java directly, no?<fn> Or what about <iq>[u]sing Excel to generate charts in GIF format</iq>---there are libraries for that, aren't there? Do you really have to consider automation in a server process (including a likely bottlenecking nightmare) just to generate a chart?

Happily, he closes strongly with good suggestions for generating the least complex format possible for fulfilling the task, such as using RTF for formatted documents (it's a text format, reasonably legible, and is well-documented) or CSV for simple Excel data.

In the end, the formats for the Office applications are published. This is what Microsoft deals with in their Office products---there's no use complaining that they're too complicated. They are what they are and most people should be able to avoid having to deal with them---unless you do something silly like joining the Office development team in Redmond.

<hr>

<ft>Their <a href="http://support.microsoft.com/default.aspx?scid=kb;EN-US;257757">exact words</a> are <iq>Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when run in this environment.</iq> That's not even typically vague. Even when a vendor (not just Microsoft) is effusive about their solution, you should be careful.
When they tell you <i>not</i> to do something, you should really just walk away. Did Spolsky even read his linked article?</ft>

<ft>There are also rumors out there that at least one open source project is working on a way of compiling Excel formulae directly to Java bytecode for execution in server environments. Can't find a web page for it, though.</ft>
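
To make the "least complex format" suggestion concrete, here is a minimal sketch of writing simple tabular data as CSV in plain Java, with no Office automation and no third-party library. The class name <c>CsvSketch</c>, its methods and the sample rows are made up for illustration; the quoting follows the common RFC 4180 convention (quote a field only when it contains a comma, a quote or a line break, and double any embedded quotes), which is an assumption about what the consuming spreadsheet expects.

```java
class CsvSketch {
    // Quote a field only when it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled (RFC 4180 convention).
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join each row with commas and terminate every row with CRLF.
    static String toCsv(String[][] rows) {
        StringBuilder out = new StringBuilder();
        for (String[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) {
                    out.append(',');
                }
                out.append(escape(row[i]));
            }
            out.append("\r\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample data, only here to exercise the quoting rules.
        String[][] rows = {
            {"Quarter", "Revenue", "Note"},
            {"Q1", "1200.50", "includes \"one-off\" sale"},
            {"Q2", "980", "Europe, Asia"}
        };
        System.out.print(toCsv(rows));
    }
}
```

Anything beyond flat rows of values (formulas, formatting, multiple sheets) is where a library like POI becomes the better fit.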