Parsing HTML tags to Ruby characters

Hi,

I have a string containing some Ruby code with HTML tags in between. For example:

str = "require 'my_class.rb'<br>require 'your_class.rb'<br>&nbsp;&nbsp;:key=&gt;'hello'"

I want these HTML tags and entities ('<br>', '&nbsp;', '&gt;', '&lt;', '<p>', '<font>', etc.) to be replaced by the equivalent Ruby characters ("\n", " ", ">", "<", etc.).

The tags vary with the input, so I cannot know them in advance.

Is there any way to parse these HTML tags into the equivalent Ruby characters?

Thanks in advance...

Thanks, Ryan. But I can't predict which tags I will get, because they are dynamic; any possible tag can appear. If I use the 'gsub' method, I will have to write one call for each and every HTML tag, and that will get big.

So I am looking for an easier way to implement this (something like an HTML parser).

You never specified what you wanted the <p> and <font> tags replaced with either.

Sorry, that's my mistake. What I finally want from the string is runnable Ruby code, so the <p> and <font> tags can be removed from the string without any replacement.

Now I think the only way to implement this is to use the 'gsub' method for each and every possible tag.

Well, assuming the only tag with special meaning is <br>, you can
just convert entities to their respective characters (there are tables
of these), convert <br> to "\n", and then replace every other tag with ''.
No need for one regexp per tag for that!

Fred
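
A minimal sketch of the approach described above, assuming a small hand-written entity table. The names `ENTITIES` and `html_to_source` are hypothetical, not from the thread, and the table would need extending for whatever entities the real input contains. Tags are stripped before entities are decoded, so that a decoded `<` or `>` belonging to the Ruby code is not mistaken for part of a tag.

```ruby
# Hypothetical entity table; extend it with whatever entities the input uses.
ENTITIES = {
  '&nbsp;' => ' ',
  '&lt;'   => '<',
  '&gt;'   => '>',
  '&quot;' => '"',
  '&amp;'  => '&'
}.freeze

# Hypothetical helper name, not from the thread.
def html_to_source(markup)
  # <br>, <br/> and <br /> become newlines.
  text = markup.gsub(/<br\s*\/?>/i, "\n")
  # Any other tag is dropped, attributes included.
  # Assumption: attribute values never contain a literal '>'.
  text = text.gsub(/<\/?[a-z][^>]*>/i, '')
  # Entities are decoded last and in a single pass,
  # so "&amp;lt;" yields "&lt;" rather than "<".
  text.gsub(/&(?:nbsp|lt|gt|quot|amp);/) { |entity| ENTITIES[entity] }
end

str = "require 'my_class.rb'<br>require 'your_class.rb'<br>&nbsp;&nbsp;:key=&gt;'hello'"
puts html_to_source(str)
```

One regexp handles every tag other than <br>, so no per-tag gsub call is needed.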

But "&gt;" and "&lt;" need to be replaced with ">" and "<" respectively, because I will have some Ruby hash code in the string.

Also, I need to find all the HTML tags in that string. Is there any way to do that?

I'm not seeing the problem :) Replace entities and then look for
everything between < and >. Change it to a newline if it's a br, or
just replace it with an empty string and add it to your list of HTML tags.

Fred
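
For the "list of HTML tags" part, a rough sketch using String#scan follows. The helper name `tag_names` is hypothetical. It is probably safest to run it on the raw string, before any entities are decoded; otherwise a decoded `<` or `>` from the Ruby code (as in `:key=>'x'`) could be picked up as a tag delimiter.

```ruby
# Hypothetical helper: returns the distinct tag names found in markup,
# lowercased, in order of first appearance.
def tag_names(markup)
  # The capture group holds the tag name; attributes and the closing
  # slash are matched but not captured.
  markup.scan(/<\/?([a-z][a-z0-9]*)[^>]*>/i).flatten.map(&:downcase).uniq
end

sample = "<p>require 'a.rb'<br><FONT color=\"red\">:key=&gt;'x'</font></p>"
p tag_names(sample)
```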

Thanks for your replies. I have it working the way I wanted. Here is the code:

    markup = markup.gsub('<br>', "\n")
    markup = markup.gsub(/[\<]([\/])*([A-Za-z0-9])*[\>]/, '')
    markup = markup.gsub('&gt;', ">")
    markup = markup.gsub('&lt;', "<")
    markup = markup.gsub('&nbsp;', " ")
    markup = markup.gsub('&amp;', "&")

It works fine now, but I am not sure whether I have covered all the tags and characters.

Depends on what you are trying to do. There are far more HTML entities than that (a partial list is here: http://www.w3schools.com/tags/ref_entities.asp), and of course there are the Unicode-style ones (http://theorem.ca/~mvcorks/code/charsets/auto.html).

Fred
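
If the "Unicode-style" entities are taken to mean numeric character references such as `&#62;` or `&#x3E;` (an assumption on my part), they can be decoded without any lookup table. A minimal sketch, with the hypothetical helper name `decode_numeric_entities`:

```ruby
# Hypothetical helper: decodes decimal (&#62;) and hexadecimal (&#x3E;)
# character references; named entities such as &gt; are left untouched.
# Assumption: every referenced code point is a valid Unicode code point.
def decode_numeric_entities(text)
  text.gsub(/&#(?:x([0-9a-f]+)|([0-9]+));/i) do
    codepoint = $1 ? $1.hex : $2.to_i
    [codepoint].pack('U')   # UTF-8 string for that code point
  end
end

puts decode_numeric_entities(":key=&#62;'hello' &#x26; more")
```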

I have a very, very strong suspicion that the need is only to translate character encoding (e.g., '&amp;' => '&').

It might be worth considering iterating over an array of hashes rather than repeating the same code with different parameters:

    [
      {:regex=>/\<br\>/, :decoded=>"\n"},
      {:regex=>/[\<]([\/])*([A-Za-z0-9])*[\>]/, :decoded=>''},
      {:regex=>/&gt;/, :decoded=>'>'}
      ...
    ].each do |decoding_hash|
      markup.gsub!(decoding_hash[:regex], decoding_hash[:decoded])
    end

The advantage is in keeping the code DRY and making the intentions of the block a bit clearer.
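
A runnable variation on the same table-driven idea, sketched with ordered [pattern, replacement] pairs instead of hashes. The names `DECODINGS` and `decode_markup` are hypothetical, and the tag pattern is slightly broader than the one in the thread so that tags with attributes are also removed. Order matters here: tags go first, and '&amp;' goes last so that "&amp;lt;" is not decoded twice.

```ruby
# Ordered [pattern, replacement] pairs, applied top to bottom.
DECODINGS = [
  [/<br\s*\/?>/i,      "\n"],  # line breaks become newlines
  [/<\/?[a-z][^>]*>/i, ''],    # every other tag is dropped
  ['&gt;',             '>'],
  ['&lt;',             '<'],
  ['&nbsp;',           ' '],
  ['&amp;',            '&']    # kept last to avoid double decoding
].freeze

# Hypothetical helper: threads the string through each substitution in turn
# and returns a new string, leaving the argument unmodified.
def decode_markup(markup)
  DECODINGS.reduce(markup) do |text, (pattern, replacement)|
    text.gsub(pattern, replacement)
  end
end

puts decode_markup("<p>:key=&gt;'hello'<br>&nbsp;&nbsp;:other=&gt;'a &amp; b'</p>")
```

Adding a new entity then means adding one row to the table rather than another gsub line.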