Help with regular expression

ryan wrote:

# doesn't work at all (for either one)
def strip_pre_and_quote_blocks
  text.gsub(/(^<pre>[^<]*<\/pre>$|^<blockquote>.*<\/blockquote>$)/,'')
end

Here are some general suggestions. Write a test case for this regexp, have it print out a result, and run it over and over again as you incrementally add elements. I note this because you appear to have leapt directly from a simple to a complex regexp without incrementally examining the reaction to each individual symbol. Get used to writing test cases as experiments, and running them frequently (preferably with one keystroke).
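A minimal sketch of that kind of experiment, in plain Ruby. The `try_pattern` helper and the sample text are made up for illustration; each call adds one element to the previous pattern, so you can see which step changes the result.

```ruby
# Hypothetical sample input; substitute a fragment of your real data.
sample = "intro\n<pre>\ncode here\n</pre>\noutro\n"

# Hypothetical helper: apply one candidate pattern and print what is left.
def try_pattern(label, pattern, text)
  result = text.gsub(pattern, '')
  puts "#{label}: #{result.inspect}"
  result
end

# Grow the pattern one element at a time and rerun after every change.
step1 = try_pattern('open tag only',   /<pre>/,              sample)
step2 = try_pattern('open tag + body', /<pre>[^<]*/,         sample)
step3 = try_pattern('whole block',     /<pre>[^<]*<\/pre>/,  sample)
```

Save it as a script and bind "run this file" to a key in your editor, so every edit to the pattern gets an immediate answer.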

Next, put the ^ and $ outside the (), on principle.

Next, try (<pre>|<blockquote>).

Next, [^<]* is always tempting, but you can get less greedy IIRC with .*?. That will match as little as possible, so it stops at the first closing tag that follows instead of running on to the last one.
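To see the difference between the three quantifier styles, here is a small comparison. The sample strings are invented for the demonstration.

```ruby
text = "<pre>one</pre> keep <pre>two</pre>"

greedy  = text.gsub(/<pre>.*<\/pre>/, '')     # .* runs on to the LAST </pre>
lazy    = text.gsub(/<pre>.*?<\/pre>/, '')    # .*? stops at the FIRST </pre>
negated = text.gsub(/<pre>[^<]*<\/pre>/, '')  # [^<]* cannot cross any "<"

puts greedy.inspect
puts lazy.inspect
puts negated.inspect

# With a tag nested inside the block, [^<]* can no longer reach </pre>,
# so nothing is removed, while the lazy version still matches.
nested = "<pre>a <b>bold</b> c</pre>"
nested_negated = nested.gsub(/<pre>[^<]*<\/pre>/, '')
nested_lazy    = nested.gsub(/<pre>.*?<\/pre>/, '')
```

The greedy pattern swallows the text between the two blocks as well, which is usually not what you want when stripping.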

Next, post this to a Ruby or Regex newsgroup, because it's not about Rails.

Next, treat your HTML as XHTML and parse it with REXML. You can use XPath to reach in and nab each pre, and change its tag or contents to whatever you want. Then write it all back as XHTML.
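A rough sketch of the REXML route, assuming the input really is well-formed XHTML (REXML will typically raise a parse error otherwise) and that the `rexml` library is available in your Ruby install. The fragment below is invented for the example.

```ruby
require 'rexml/document'

# Hypothetical well-formed fragment with a single root element.
xhtml = "<div><p>Before</p><pre>some code</pre>" \
        "<blockquote>a quote</blockquote><p>After</p></div>"

doc = REXML::Document.new(xhtml)

# XPath.match returns an array, so removing nodes while iterating is safe.
%w[pre blockquote].each do |name|
  REXML::XPath.match(doc, "//#{name}").each(&:remove)
end

stripped = doc.to_s
puts stripped
```

Instead of removing each node you could rename it or replace its text at the same point in the loop.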

Next, each < probably needs an escape, like \<

When I try that, it doesn't replace either one of the "pre" or "blockquote" blocks with '', but the first method works perfect for the "pre" block only. I can't even get the blockquote one to work by itself.

You could run two gsubs; one with <pre> and one with <blockquote>.

To strip, you could just yank all 4 items with 4 gsubs, too.
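Both variants might look roughly like this. The sample text is invented, and the `m` flag on the first pair is my addition so that `.` can cross the newlines inside each block.

```ruby
text = "a\n<pre>\ncode\n</pre>\nb\n<blockquote>\nquote\n</blockquote>\nc\n"

# Two passes, one per element, removing each block together with its contents.
no_blocks = text.gsub(/<pre>.*?<\/pre>/m, '')
                .gsub(/<blockquote>.*?<\/blockquote>/m, '')

# Four passes, one per tag, removing only the tags and keeping the contents.
no_tags = text.gsub('<pre>', '')
              .gsub('</pre>', '')
              .gsub('<blockquote>', '')
              .gsub('</blockquote>', '')

puts no_blocks.inspect
puts no_tags.inspect
```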

--   Phlip   http://www.greencheese.us/ZeekLand <-- NOT a blog!!!

If you're trying to get rid of just the tags, then this might work. However, if you intend to remove the content between the tags, too ("The <pre>stuff doesn't</pre> work" => "The work"), then you either need to specifically match the newlines or just use the 'm' modifier on your regexp to use multi-line mode.
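Here is a quick illustration of the newline issue, using a made-up string with a line break inside the `<pre>` block. It probably also explains the earlier symptom: `[^<]*` happily matches newlines, while a plain `.` does not.

```ruby
html = "The <pre>stuff\ndoesn't</pre> work"

# Without /m, "." refuses to match "\n", so the block is never matched.
without_m = html.gsub(/<pre>.*?<\/pre>/, '')

# With /m, "." matches newlines too.
with_m = html.gsub(/<pre>.*?<\/pre>/m, '')

# Alternative: leave "." alone and match the newline explicitly.
explicit = html.gsub(/<pre>(?:.|\n)*?<\/pre>/, '')

puts without_m.inspect
puts with_m.inspect
puts explicit.inspect
```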

If you want to remove <pre> and <blockquote> along with the contents (inner_html) of those, then you can try something like this:

regexp = %r{<(pre|blockquote)\b[^>]*>.*?</\1>}m
=> /<(pre|blockquote)\b[^>]*>.*?<\/\1>/m

html_fragment = <<EOF
And then someone said:
<blockquote>
When I try that, it doesn't replace either one of the "pre" or
"blockquote" blocks with '', but the first method works perfect for the
"pre" block only. I can't even get the blockquote one to work by
itself.
</blockquote>
So I told them to use this code:
<pre class="ruby">
   regexp = %r{<(pre|blockquote)\\b[^>]*>.*?</\\1>}m
   my_string = <<-EOS
And then someone said:
<blockquote>
When I try that, it doesn't replace either one of the "pre" or
"blockquote" blocks with '', but the first method works perfect for the
"pre" block only. I can't even get the blockquote one to work by
itself.
</blockquote>
   EOS
   my_string.gsub!(regexp, '...')
</pre>
EOF
=> "And then someone said:\n<blockquote>\nWhen I try that, it doesn't replace either one of the \"pre\" or\n\"blockquote\" blocks with '', but the first method works perfect for the\n\"pre\" block only. I can't even get the blockquote one to work by\nitself.\n</blockquote>\nSo I told them to use this code:\n<pre class=\"ruby\">\n   regexp = %r{<(pre|blockquote)\\b[^>]*>.*?</\\1>}m\n   my_string = <<-EOS\nAnd then someone said:\n<blockquote>\nWhen I try that, it doesn't replace either one of the \"pre\" or\n\"blockquote\" blocks with '', but the first method works perfect for the\n\"pre\" block only. I can't even get the blockquote one to work by\nitself.\n</blockquote>\n   EOS\n   my_string.gsub!(regexp, '...')\n</pre>\n"

html_fragment.gsub(regexp, '...')
=> "And then someone said:\n...\nSo I told them to use this code:\n...\n"

puts _
And then someone said:
...
So I told them to use this code:
...
=> nil
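Folded back into the method from the original question, it could look something like this. Passing the text in as an argument is my assumption (the original read `text` from somewhere not shown), and the sample string is invented.

```ruby
# Sketch: remove every <pre> or <blockquote> element, attributes and
# contents included. The `text` parameter is an assumption.
def strip_pre_and_quote_blocks(text)
  text.gsub(%r{<(pre|blockquote)\b[^>]*>.*?</\1>}m, '')
end

sample = "Intro\n<blockquote>\nquoted\n</blockquote>\nMiddle\n" \
         "<pre class=\"ruby\">\nputs 1\n</pre>\nEnd\n"

cleaned = strip_pre_and_quote_blocks(sample)
puts cleaned
```

Because `.*?` stops at the first matching close tag, an element nested inside another of the same name (a blockquote within a blockquote) would leave the outer closing tag behind; that case is better left to a parser.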

In addition to using an XML parser as Phlip suggests, you might want to actually read about regular expressions. You can start with the pickaxe (Programming Ruby, The Pragmatic Programmers' Guide, 2nd ed.) pages 68-70, 324-328, 600-603.

-Rob

Rob Biedenharn http://agileconsultingllc.com Rob@AgileConsultingLLC.com