tl;dr - how do you future-proof database content, i.e. avoid specific HTML in the database? (Sorry if this has been discussed properly before, I couldn’t find anything relevant.)
I’ve been maintaining a Rails project for the past couple of years, and we’re running into a couple of problems with what you could call content maintainability. We publish several new pages a day (public event pages and some news stories) and we have a large repository of more static pages too. We’re using Refinery CMS, which is… okay. A good choice in 2014 when we first implemented this version of the site, a bad choice in 2021. But the real problem isn’t the CMS per se, but the fact that the body text for each page is stored as site-specific HTML. We’ve tried to use page parts and custom fields that isolate specific content like images and video, but it’s unreasonable and counterproductive to stop editors (there are about 7 people actively creating content) from using images or other content in the main document flow, e.g. in a long interview.
Why is HTML the problem? Let’s start with an obvious case: I’ve created an editor button that wraps a YouTube embed in a <div class="fitvid"> element to make iframes responsive with JS. Since our iframes are meant to be part of the flow of the article, they need to be part of the body text, and thus (at least in Refinery) saved as HTML. However, I’m refactoring most of the JavaScript to use Stimulus. This is fine for the code that is part of templates, but the body text that is already in the db as HTML is a real problem. It’s manageable if you have a couple of specific HTML elements and know which pages they’re on, but what if you have hundreds and hundreds of pages to fix? Iframes are not the only case: we’ve also recently changed our markup for pull quotes, which means that we need to go back to every pull quote in the CMS, or keep bad markup and legacy CSS.
All this could have been avoided with a better way to store content. Jeff Eaton’s excellent 2014 article about this is still largely correct (except the stuff about Web Components being the future, which, I think, turned out to be wrong). CMSes, headless or traditional, should separate the database-stored content from the final HTML with a custom transformation layer in between. When presentational markup (like in my example above) changes, you change the code in the transformation layer, not in 100 separate CMS pages.
The headless CMS Sanity.io hides HTML completely: you think in blocks and content types, and then use the stored data however you want (an emphasis in Sanity might result in an em tag or an i tag on your frontend; it’s up to you, Sanity doesn’t care). The actual output is a JSON format called Portable Text that can be used in various ways, but sadly JavaScript frontend frameworks (which I don’t want to use, apart from Stimulus and possibly a Svelte component if needed) are the first-class citizens with Sanity.
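To make the "it’s up to you" part concrete, here is a minimal sketch of rendering Portable Text-style JSON to HTML on the server in plain Ruby, with no JavaScript framework involved. It covers only a simplified subset of the format (blocks with a style, spans with the "em" and "strong" decorators); links, lists and everything else are ignored. The method names, the tag mappings and the "pullquote" custom type are made up for illustration and are not part of Sanity or any gem.

```ruby
require "json"
require "erb"

# Assumed mappings from Portable Text decorators/styles to HTML tags.
# Changing "em" => "i" here changes every page at once.
MARK_TAGS  = { "em" => "em", "strong" => "strong" }.freeze
STYLE_TAGS = { "normal" => "p", "h2" => "h2", "h3" => "h3",
               "blockquote" => "blockquote" }.freeze

# Escape the span text, then wrap it in one tag per known mark.
# Unknown marks (e.g. link annotations) are skipped in this sketch.
def render_span(span)
  text = ERB::Util.html_escape(span["text"])
  Array(span["marks"]).reduce(text) do |inner, mark|
    tag = MARK_TAGS[mark]
    tag ? "<#{tag}>#{inner}</#{tag}>" : inner
  end
end

def render_block(block)
  case block["_type"]
  when "block"
    tag = STYLE_TAGS.fetch(block["style"], "p")
    inner = block.fetch("children", []).map { |span| render_span(span) }.join
    "<#{tag}>#{inner}</#{tag}>"
  when "pullquote" # hypothetical custom type with a plain "text" field
    quote = ERB::Util.html_escape(block["text"])
    %(<aside class="pullquote"><blockquote><p>#{quote}</p></blockquote></aside>)
  else
    "" # unknown block types are dropped in this sketch
  end
end

# Takes the stored JSON string and returns the rendered HTML.
def render_portable_text(json)
  JSON.parse(json).map { |block| render_block(block) }.join("\n")
end
```

In a Rails app something like this could live in a helper and be called from the page template, so the database only ever holds the JSON.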
Jeff Eaton’s 2014 solution was to use XML under the hood for custom content editor elements and let the backend compile/transform it to the desired HTML. This seems more cumbersome, but at least this way you can keep HTML for most of the content (like bog-standard p’s and regular links) and use custom XML elements for specifics (like pull quotes, specific tables, video embeds, etc.). For example, <Pullquote><p>Quote here</p></Pullquote> is compiled to the desired <aside class="pullquote"><blockquote><p>Quote here</p></blockquote></aside>.
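A minimal sketch of what such a transformation layer could look like in Ruby. To stay dependency-free it uses REXML from the standard Ruby distribution, which means it assumes the stored body is well-formed XML; for real-world HTML a forgiving parser such as Nokogiri would be the safer choice. The TEMPLATES registry, the VideoEmbed element, the "fitvid" Stimulus controller name and the method names are all hypothetical.

```ruby
require "rexml/document"
require "erb"

# Hypothetical registry: custom element name => lambda producing final HTML.
# When presentational markup changes, only these lambdas change.
TEMPLATES = {
  "Pullquote" => lambda { |inner, _attrs|
    %(<aside class="pullquote"><blockquote>#{inner}</blockquote></aside>)
  },
  "VideoEmbed" => lambda { |_inner, attrs|
    src = ERB::Util.html_escape(attrs["src"])
    %(<div data-controller="fitvid"><iframe src="#{src}"></iframe></div>)
  }
}.freeze

# Elements that must not get a closing tag in HTML (partial list).
VOID_ELEMENTS = %w[br hr img].freeze

# Walk the tree: custom elements go through their template,
# everything else is re-emitted as ordinary HTML.
def render_node(node)
  case node
  when REXML::Element
    inner = node.children.map { |child| render_node(child) }.join
    template = TEMPLATES[node.name]
    return template.call(inner, node.attributes) if template

    attrs = node.attributes.map do |name, value|
      %( #{name}="#{ERB::Util.html_escape(value)}")
    end.join
    return "<#{node.name}#{attrs}>" if VOID_ELEMENTS.include?(node.name)

    "<#{node.name}#{attrs}>#{inner}</#{node.name}>"
  when REXML::Text
    node.to_s # parsed text keeps its escaped source form
  else
    "" # comments and other node types are dropped in this sketch
  end
end

# Wrap the stored fragment in a root element so REXML can parse it.
def render_body(stored)
  doc = REXML::Document.new("<body>#{stored}</body>")
  doc.root.children.map { |node| render_node(node) }.join
end

puts render_body("<p>Intro</p><Pullquote><p>Quote here</p></Pullquote>")
```

The stored content never mentions aside, blockquote or any CSS class, so a later markup change only touches the registry.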
Writing my own, say, Nokogiri-based, Rails helper logic is doable, but there must be some Rails content solution that has considered these questions and even solved them? I could probably implement something like this for Refinery, but it would be trying to fit a square peg into a round hole. At minimum, it would break preview and I would have to loosen the strict HTML allowlist, and I don’t particularly want to do this. Most (all?) CMS or CMS-adjacent solutions (like Trestle with the TinyMCE extension) store raw HTML in the database, so this must be a problem for a lot of people. Any ideas on how to tackle this, obvious or not so obvious?
PS: If a good solution for future-proofing this exists, I’m willing to spend a lot of time fixing existing pages/db entries if it means I won’t have to do it again in two years.
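For the one-off cleanup of existing entries, a rough sketch of rewriting the legacy fitvid wrapper into a neutral custom element. It relies on the assumption that the editor button always produced the same markup shape, which is the only reason a regex is tolerable here; anything hand-edited would need a real HTML parser. The VideoEmbed element and the method name are hypothetical, and how the body text is loaded and saved depends on the CMS, so that part is left out.

```ruby
# Matches <div class="fitvid"> wrapping a single iframe and captures its src.
LEGACY_FITVID = %r{<div class="fitvid">\s*<iframe[^>]*?\ssrc="([^"]+)"[^>]*>\s*</iframe>\s*</div>}m

# Returns the body with every legacy wrapper replaced by a custom element.
# Bodies without a match come back unchanged.
def migrate_fitvid(body)
  body.gsub(LEGACY_FITVID) { %(<VideoEmbed src="#{Regexp.last_match(1)}"/>) }
end

legacy = '<p>Before</p><div class="fitvid"><iframe width="560" ' \
         'src="https://www.youtube.com/embed/abc123" frameborder="0"></iframe></div><p>After</p>'
puts migrate_fitvid(legacy)
```

Running it first in a dry-run mode that only counts how many pages would change seems like a sensible precaution before writing anything back.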
PPS: There might be some things to leverage from ActionText, but sadly it’s off the table for me because of this Trix bug.