Storing rich text in db

tl;dr - how do you future-proof database content, i.e. avoid specific HTML in the database? (Sorry if this has been discussed properly before, I couldn’t find anything relevant.)

I’ve been maintaining a Rails project for the past couple of years, and we’re running into a couple of problems with what you could call content maintainability. We publish several new pages a day (public event pages and some news stories) and we have a large repository of more static pages too. We’re using Refinery CMS, which is… okay. A good choice in 2014 when we first implemented this version of the site, a bad choice in 2021. But the real problem isn’t the CMS per se, but the fact that the body text for each page is stored as site-specific HTML. We’ve tried to use page parts and custom fields that isolate specific content like images and video, but it’s unreasonable and counterproductive to stop editors (there are about 7 people actively creating content) from using images or other content in the main document flow, e.g. in a long interview.

Why is HTML the problem? Let’s start with an obvious case: I’ve created an editor button that wraps a YouTube embed in a <div class="fitvid"> element to make iframes responsive with JS. Since our iframes are meant to be part of the flow of the article, they need to be part of the body text, and thus (at least in Refinery) saved as HTML. However, I’m refactoring most of the JavaScript to use Stimulus. This is fine for the code that lives in templates, but the body text that is already in the db as HTML is a real problem. It’s no problem if you have a couple of specific HTML elements and know which pages they’re on, but what if you have hundreds and hundreds of pages to fix? Iframes are not the only case: we’ve also recently changed our markup for pull quotes, which means that we need to go back to every pull quote in the CMS, or keep bad markup and legacy CSS.

All this could have been avoided with a better way to store content. Jeff Eaton’s excellent 2014 article about this is still largely correct (except the stuff about Web Components being the future, which, I think, turned out to be wrong). CMSes, headless or traditional, should abstract the database-stored content from the final HTML with a custom transformation layer in between. When presentational markup (like in my example above) changes, you change the code in the transformation layer abstraction, not in 100 separate CMS pages.

The headless CMS Sanity.io hides HTML completely: you think in blocks and content types, and then use the stored data however you want (an emphasis in Sanity might result in an em tag or an i tag on your frontend, it’s up to you, Sanity doesn’t care). The actual output is a JSON format called Portable Text that can be used in various ways, but sadly JavaScript frontend frameworks (which I don’t want to use, apart from Stimulus and possibly a Svelte component if needed) are the main first-class citizens with Sanity.
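For what it’s worth, consuming that kind of JSON from plain Ruby doesn’t look hard. Here is a minimal sketch, assuming a simplified subset of Portable Text (blocks with a `style` and `children` spans carrying `marks`). The mapping tables and method names are my own invention for illustration, not part of any Sanity library, and link marks, lists and custom block types are left out entirely:

```ruby
require 'json'
require 'erb'

# Hypothetical mappings: this is the only place that knows about frontend tags.
MARK_TAGS  = { 'em' => 'em', 'strong' => 'strong' }.freeze
STYLE_TAGS = { 'normal' => 'p', 'h2' => 'h2', 'blockquote' => 'blockquote' }.freeze

# Render one span: escape its text, then wrap it in a tag for each known mark.
# Marks without an entry in MARK_TAGS (e.g. link references) are ignored here.
def render_span(span)
  html = ERB::Util.html_escape(span.fetch('text', ''))
  Array(span['marks']).reduce(html) do |inner, mark|
    tag = MARK_TAGS[mark]
    tag ? "<#{tag}>#{inner}</#{tag}>" : inner
  end
end

# Render one block; unknown or missing styles fall back to a paragraph.
def render_block(block)
  tag = STYLE_TAGS.fetch(block['style'], 'p')
  inner = Array(block['children']).map { |span| render_span(span) }.join
  "<#{tag}>#{inner}</#{tag}>"
end

# Render a JSON array of blocks; non-"block" types are skipped in this sketch.
def render_portable_text(json)
  JSON.parse(json)
      .select { |node| node['_type'] == 'block' }
      .map { |block| render_block(block) }
      .join("\n")
end

stored = <<~JSON
  [
    {
      "_type": "block",
      "style": "normal",
      "children": [
        { "_type": "span", "text": "Storing ", "marks": [] },
        { "_type": "span", "text": "meaning", "marks": ["em"] },
        { "_type": "span", "text": ", not markup.", "marks": [] }
      ]
    }
  ]
JSON

puts render_portable_text(stored)
# prints: <p>Storing <em>meaning</em>, not markup.</p>
```

Switching emphasis from em to i would then be a one-line change in `MARK_TAGS`, with nothing to touch in the stored content.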

Jeff Eaton’s 2014 solution was to use XML under the hood for custom content editor elements, and let the backend compile/transform it to the desired HTML. This seems more cumbersome, but at least this way you can keep HTML for most of the content (like bog-standard p’s and regular links) and use custom XML elements for specifics (like pull quotes, specific tables, video embeds, etc.). For example, <Pullquote><p>Quote here</p></Pullquote> is compiled to the desired <aside class="pullquote"><blockquote><p>Quote here</p></blockquote></aside>.
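A rough sketch of what such a compile step could look like in Ruby, just to make the idea concrete. It uses REXML (shipped with a standard Ruby install) so it runs without extra gems, which also means it assumes the stored body is well-formed XML (so `<br/>`, not `<br>`); a Nokogiri version would be more forgiving. `CUSTOM_ELEMENTS` and `compile_body` are hypothetical names of mine, not something from Eaton’s article or any CMS:

```ruby
require 'rexml/document'

# Hypothetical registry: custom element name => builder for the frontend markup.
# When the pull quote markup changes, only this lambda changes.
CUSTOM_ELEMENTS = {
  'Pullquote' => lambda do |node|
    aside = REXML::Element.new('aside')
    aside.add_attribute('class', 'pullquote')
    blockquote = aside.add_element('blockquote')
    # to_a returns a copy, so moving children does not disturb the iteration.
    node.to_a.each { |child| blockquote.add(child) }
    aside
  end
}.freeze

# Compile stored content (plain HTML plus custom elements) into final HTML.
def compile_body(stored)
  # The stored body is a fragment, so wrap it in a temporary root element.
  doc = REXML::Document.new("<body>#{stored}</body>")
  CUSTOM_ELEMENTS.each do |name, builder|
    # match returns an array up front, so replacing nodes while looping is safe.
    REXML::XPath.match(doc, "//#{name}").each do |node|
      node.replace_with(builder.call(node))
    end
  end
  # Serialize only the children, dropping the temporary root again.
  doc.root.children.map(&:to_s).join
end

stored = '<p>Intro.</p><Pullquote><p>Quote here</p></Pullquote>'
puts compile_body(stored)
```

Ordinary elements like the p pass through untouched, and the Pullquote element is replaced by the aside/blockquote structure. REXML typically serializes attributes with single quotes, which is still valid HTML. In a Rails view the result would still have to go through the usual sanitizing before being marked as safe.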

Writing my own (say, Nokogiri-based) Rails helper logic is doable, but there must be some Rails content solution that has considered these questions and even solved them? I could probably implement something like this for Refinery, but it would be trying to fit a square peg into a round hole. At minimum, it would break preview and I would have to loosen the strict HTML allowlist, and I don’t particularly want to do this. Most (all?) CMS or CMS-adjacent solutions (like Trestle with the TinyMCE extension) store raw HTML in the database, so this must be a problem for a lot of people. Any ideas how to tackle this, obvious or not so obvious?

PS: If a good solution for future-proofing this exists, I’m willing to spend a lot of time fixing existing pages/db entries if it means I won’t have to do it again in two years.

PPS: There might be some things to leverage from ActionText, but sadly it’s off the table for me because of this Trix bug.

There are minor annoyances with Trix but the other benefits of Action Text such as custom attachments are worth it for me.

Concerning the bug you described, I’ve used the workaround below for styling. Then the challenge is converting your existing data to that format.

Thanks for the link, but that’s not a workaround, it’s putting lipstick on the proverbial pig. I’m not going to scrap web semantics for an editor, it’s not 2002 anymore. HTML elements have meaning, and line breaks aren’t paragraphs. (But please let’s not make this a thread about Trix. I only mentioned it because it makes Action Text useless for me.)

I understand. In my case, in spite of this imperfection I considered Action Text to be the most future-proof solution (custom attachments, integration with Active Storage, it’s the “official Rails solution”, etc). I also developed an integration to translate its rich text with Mobility (mobility-actiontext).

I read somewhere else here that the maintainers have nothing against having someone adapt the code to support other editors, if you’d like to explore that route.

But doesn’t ActionText store this as Trix’s HTML output? If it does, it doesn’t solve anything for me, I might as well use Refinery. Attachment handling sounds nice, but Refinery already has Dragonfly integration in the editor.

To illustrate my point better: if the Trix bug is fixed in the future, won’t you have lots of leftover <br> tags in your database, as well as (bad) legacy CSS?

Unless I’m misunderstanding what Action Text does, using it will result in a worse version of the problem I already have. Right now I have iframe classes in my db that I don’t want there. I don’t like it, but I can live with it. But br instead of p? So much worse.

The point of this thread was to discuss how to compile/transpile editor output into good HTML, with the classes my frontend uses, without coding my own editor.

Attachments can be embedded anywhere in the flow of the content. They are stored in the DB using <action-text-attachment> tags. There is out-of-the-box support for e.g. embedding uploaded images, and you can develop custom attachments, e.g. to embed a YouTube video, to @-mention a user, to link to some other part of your website, etc. Personally I would like to see a way to embed blobs already uploaded to Active Storage (some kind of gallery selector).

In addition to attachments uploaded through Active Storage, Action Text can embed anything that can be resolved by a Signed GlobalID.

Thanks for those pointers. This is very interesting, but it won’t change Trix’s markup, which worries me a lot more than attachments when it comes to future-proofing.

I did, however, try to write a Nokogiri-based helper method for Trix content now just for fun, and I think it’s possible to get most of the evil out of it, though probably not in a safe or maintainable way. This is what I have so far, though there are plenty of combinations of elements unaccounted for, and it’s generally hideous (I’m using Spina CMS since it was easier to test with a real Rails app and real page content):

# app/helpers/application_helper.rb
require 'nokogiri'

module ApplicationHelper
  def proper_body_text
    # Get SpinaCMS's body text block
    trix_content = Nokogiri::HTML::DocumentFragment.parse(content(:text))
    # Replace Trix div tags with p tags
    trix_content.css('div:not([class])').each do |div|
      div.name = 'p'
    end
    # Replace two br's with p and ensure figure is a block element
    body_text_string = trix_content.to_s.gsub('<br><br>', '</p><p>').gsub('<br><figure', '</p><figure')
    # Output html
    body_text_string.html_safe
  end
end

# app/views/default/pages/homepage.html.erb
<h1><%= current_page.title %></h1>
<%= proper_body_text %>

(So ugly, especially the gsub bits. But it shows a little bit of what might be possible to salvage, at least.)

Mark did some fixes for Action Text in this other thread, maybe you could ping him to see if he would have ideas to improve the situation: