Transforming WYSIWYG output with XSLT

Written by Chriztian Steinmeier.
Got comments? I’m @greystate on Twitter.

One of my absolute top pet peeves is the WYSIWYG editor, so here’s my take on it. Like most of my other articles, this one is based on an Umbraco feature but really applies to XSLT transforms in general, so go ahead ->

One thing I really don’t get about the WYSIWYG editor found in various Content Management Systems is that when its contents are saved, they’re saved as a string of XHTML wrapped in a strange construct called a CDATA Section, which makes it useless in an XSLT context. Think about that for a second: the actual content is stored in the least flexible way possible with regard to the rendering technology.

It looks like this:
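A representative sketch of a saved RichText property (the `bodyText` element name matches the property used later in this article; the content itself is made up for illustration):

```xml
<bodyText><![CDATA[<p>Hello <strong>world</strong>!</p>]]></bodyText>
```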

Embedding RichText content in XML should be fairly simple, right? It is just XML after all, so you really shouldn’t need to do anything but shove the tags in there? Unfortunately, most of the time, you’ll see RichText content being wrapped in a CDATA Section, probably because the editing component at some point delivered HTML, which could of course create “a Paradox in the Space-Time Continuum!”…

Let me tell you why this is bad: The contents of a CDATA Section can only be interpreted as pure text - its sole purpose is to mark a section of an XML document as “Character Data” which is not to be parsed. So the XML parser will make sure not to confuse the “<” signs in that section of the document with the “<” signs otherwise used throughout the document (i.e., signaling the start of an element). In fact, to the parser, the above line is identical to this one:
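For a hypothetical property whose content is a single paragraph, `<p>Hello <strong>world</strong>!</p>`, the escaped equivalent of the CDATA form would read as follows (element name and content are illustrative):

```xml
<bodyText>&lt;p&gt;Hello &lt;strong&gt;world&lt;/strong&gt;!&lt;/p&gt;</bodyText>
```

Either way, the parser hands XSLT a single text node, not a `<p>` element with a `<strong>` child.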

The CDATA Section is just a nice way to not have to escape those characters, and indeed makes it easy to embed code snippets etc. into an XML file. The really sad part is that all that nice, so-called “Rich Text” formatting goes out the window as soon as you stick it in a CDATA Section. From then on, you can no longer use the powerful XPath language to say things like: “Give me all the <p> elements that have a style attribute” (p[@style]) or: “I want the <figcaption> of the first <figure> that has a PNG image in it” (figure[contains(img/@src, '.png')][1]/figcaption).
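If the content were real XML, those two expressions could be dropped straight into a stylesheet. A minimal sketch, assuming a `bodyText` element that holds actual child elements rather than a CDATA Section (the `stats` class name and the surrounding output are made up for illustration):

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

	<xsl:template match="bodyText">
		<!-- Count the <p> elements that carry a style attribute -->
		<p class="stats">
			<xsl:value-of select="count(p[@style])" />
			<xsl:text> styled paragraph(s)</xsl:text>
		</p>

		<!-- Copy the caption of the first <figure> that holds a PNG image -->
		<xsl:copy-of select="figure[contains(img/@src, '.png')][1]/figcaption" />
	</xsl:template>

</xsl:stylesheet>
```

Run against CDATA-wrapped content, both expressions select nothing, because there are no `<p>` or `<figure>` elements to find, only text.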

To me, the “controller” (or data component) has a crystal clear responsibility here - the render component is using XSLT, so the data component should deliver XML (because that’s what XSLT eats for breakfast every day). It’s actually not a hard problem to create XHTML (i.e., valid XML) from HTML (potentially invalid XML), which seems to have been the only reason for doing this CDATA thing in the first place.

Until that happens, we’re stuck having to sometimes be really creative when trying to maintain the styles set forth in our style guide for the client. And frankly, I’m sick and tired of doing that, so here’s how I’m handling the RichText Editor these days:

The Steps

It’s essentially two steps:

1. Convert the output from the RichText Editor to XML
2. Process the XML as usual (apply-templates, etc.)

The second is just another day at the XSLT Office™, but the first one is of course a bit outside of my territory, since it requires writing an extension - either as an App_Code Extension or as a compiled extension.

At first, I forced Kim, a former colleague of mine, to write an extension as part of the project we were working on, but then I ran into the obvious solution when dealing with Umbraco… uComponents, of course!

Another one for uComponents

There’s a nice little extension in the ucomponents.xml namespace, called ParseXhtml() ^[1] which simply takes care of converting a string of XHTML into XML, for use in our XSLT. And thanks to Lee Kelleher, the following paragraph is now hollow and empty, due to him fixing the caveats I ran into with the first iteration - and you won’t have to wade through entities “the size of gorillas” to understand what was going on. Yay Lee, #h5yr! :-)

<p>&nbsp;</p>

“So what we’re gonna do is:”

Download and install the uComponents and XSLTouch^[2] packages
Download _WYSIWYG.xslt
Include _WYSIWYG.xslt in your XSLT file:
```
<xsl:include href="_WYSIWYG.xslt" />
```

In your XSLT file, replace

<xsl:value-of select="bodyText" disable-output-escaping="yes" />

with this:

<xsl:apply-templates select="bodyText" mode="WYSIWYG" />

Yay - you’re ready to transform the output from the WYSIWYG editor and do all sorts of cool things.

_WYSIWYG.xslt
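The conversion step could be sketched roughly like this. This is not the actual file: the `ucom` prefix, the `urn:ucomponents.xml` namespace URI, and the assumption that `ParseXhtml()` takes one string and returns a node-set are mine, so check them against your uComponents installation.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch only, not the real _WYSIWYG.xslt -->
<xsl:stylesheet
	version="1.0"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:ucom="urn:ucomponents.xml"
	exclude-result-prefixes="ucom"
>

	<!-- Entry point: called via <xsl:apply-templates select="bodyText" mode="WYSIWYG" /> -->
	<xsl:template match="*" mode="WYSIWYG">
		<!-- Step 1: turn the stored string into XML (assumed signature, see above) -->
		<xsl:variable name="content" select="ucom:ParseXhtml(string(.))" />

		<!-- Step 2: process the resulting nodes like any other XML -->
		<xsl:apply-templates select="$content" />
	</xsl:template>

</xsl:stylesheet>
```

The second `apply-templates` deliberately uses no mode here, so that unmoded override templates such as the `font` one shown under "How it works" get a chance to match.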

But why even bother?

You may indeed ask that (very legitimate) question - to which I have to say: Because I like delivering great solutions, and every time I see this:
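A made-up but typical example of such editor output (the text and style values are purely illustrative):

```xml
<p><span style="font-size: x-large; color: #c00;"><strong>Our new products</strong></span></p>
```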

I know in my heart that for all intents and purposes, this is what it really means:
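For instance, a paragraph containing nothing but large, colored, bold text was almost certainly meant as a heading (an illustrative guess at the intent, with made-up text):

```xml
<h2>Our new products</h2>
```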

Unfortunately, [redacted] has taught everyone and their mother how to edit text by assigning colors and sizes instead of structure. But I still want to render the “correct” markup for what the editor was trying to do, so I set out to fix the problem for myself.

Performing that transformation with the CDATA wrapped content would require lots of string manipulation - preferably Regular Expressions (btw.: I do not suffer from Fear of RegExes, but it’s not the right tool for the job at hand).

Using _WYSIWYG.xslt I finally have a solution that works great, and which enables me to deliver solutions where the CSS doesn’t need to jump through impossible hoops, trying to keep the formatting, colors, sizes etc.

How it works

This very simple chunk of XSLT works by using what’s known in XSLT as an “Identity Transform”: a set of templates that simply copies the source XML verbatim from the input tree to the output tree, but does so by processing every node along the way. That enables us to intervene at any given point to perform some sort of override. For example, to remove any trace of a <font> tag, we can just create a template that simply continues processing the contents, preserving the text (which shouldn’t be punished for its creator wrapping that horrid element around it, no?):

<xsl:template match="font | FONT">
	<xsl:apply-templates />
</xsl:template>

This template doesn’t copy its matching element, but makes sure that any content inside gets processed. Even if there were nested <font> tags inside, they’d be silently nuked by this same template. Gotta love that.
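For reference, the identity transform itself is conventionally written as a single template in XSLT 1.0. A minimal sketch, not necessarily how _WYSIWYG.xslt spells it:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

	<!-- Copy the current attribute or node, then process whatever it contains -->
	<xsl:template match="@* | node()">
		<xsl:copy>
			<xsl:apply-templates select="@* | node()" />
		</xsl:copy>
	</xsl:template>

</xsl:stylesheet>
```

Because this pattern has the lowest default priority, any more specific template (like the `font` one) wins automatically for the nodes it matches.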

Making it sing

As you can probably see, you’re suddenly back in the driver’s seat now, with the possibility to “fix” the stuff that the WYSIWYG component generates. I’ll leave you with some examples of what I’ve been doing from project to project - some requiring lots of fixes, others merely simple reformatting:
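A few overrides of the kind described might look like the sketch below. These are illustrative templates of my own choosing, not the ones from any particular project, and the heading rule rests on the assumption that a paragraph containing only a font-sized span was meant as a heading.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

	<!-- Identity transform: copy everything that is not overridden below -->
	<xsl:template match="@* | node()">
		<xsl:copy>
			<xsl:apply-templates select="@* | node()" />
		</xsl:copy>
	</xsl:template>

	<!-- Drop inline styles; the site's CSS decides how things look -->
	<xsl:template match="@style" />

	<!-- Replace presentational <b> with <strong>, keeping attributes and content -->
	<xsl:template match="b | B">
		<strong>
			<xsl:apply-templates select="@* | node()" />
		</strong>
	</xsl:template>

	<!-- Remove paragraphs that hold nothing but whitespace or non-breaking spaces -->
	<xsl:template match="p[not(*) and normalize-space(translate(., '&#160;', ' ')) = '']" />

	<!-- Assumed intent: a paragraph whose only child is a font-sized span is a heading -->
	<xsl:template match="p[count(node()) = 1][span[contains(@style, 'font-size')]]">
		<h2>
			<xsl:value-of select="normalize-space(.)" />
		</h2>
	</xsl:template>

</xsl:stylesheet>
```

The heading rule is intentionally strict: stray whitespace or any other node next to the span makes `count(node()) = 1` fail, so ordinary paragraphs that merely contain a resized phrase are left alone.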

Off you go, transforming XML …

So, we’re down to the bye-bye part again, and I’d like to point out a few things:

I actually do this on pretty-darn close to every project, yes
I don’t always have to mingle the data, but fairly often I actually do - and then it’s a breeze to add a couple of templates for it
I worry a little bit about how I’m going to do something similar to (and as flexible as) this in, say, V5 of Umbraco …

Big thanks to you as always - I put a star next to your name for every article you’ve read the whole way through, you know?

[1] Why is the casing like that in so many C# projects? Methods like that one really should be named ParseXHTML(), period. I can’t stand it the other way :-)
[2] XSLTouch saves you a lot of hair when using include and/or import statements in (.NET) XSLT, by automatically touching the master file to recompile the stylesheet when the included file changes. Don’t thank me, thank Pete!