Removing extraneous HTML

11175 · October 8, 2009, 3:52am

Morgan Morgan wrote:

I can't seem to find a way to do this. I have a bunch of HTML files, and in each one I need to remove everything from the <!DOCTYPE through the <BODY> tag at the top, and then everything from </body> through </html> at the bottom.

I looked at gsub, and I'm learning regular expressions, but I can't figure out how they work. So far I've been able to remove single words and single letters, but not whole blocks of letters and words.

It's mildly frustrating.

If anyone can help, it would be greatly appreciated. I'm off to my regex book.

Thanks in advance.

11175 · October 8, 2009, 5:21am

Marnen Laibow-Koser wrote:

Your regex book will be the best help, but here's a clue: I think you're going about it inside-out. It would probably be easiest to extract the entire <body> element rather than delete everything around it. It's relatively simple to write a regex that will cover most cases, but if you have to cover absolutely every valid case, you may want to use an HTML parser such as Nokogiri or Hpricot, or JavaScript DOM manipulation, instead.

Best, -- Marnen Laibow-Koser http://www.marnen.org marnen@marnen.org
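A minimal sketch of the "extract the <body> element" idea in plain Ruby, for anyone landing on this thread. The helper name `extract_body` and the sample markup are made up for illustration, and the pattern only aims at the "most cases" mentioned above: it assumes one <body>...</body> pair and will not cope with every valid document.

```ruby
# Hypothetical helper: returns the content between the first <body ...> tag
# and the first </body> that follows it, or nil when no such pair is found.
def extract_body(html)
  # /m lets . match newlines, /i matches <BODY> as well as <body>.
  # [^>]* allows attributes on the opening tag; .*? is non-greedy.
  match = html.match(/<body\b[^>]*>(.*?)<\/body\s*>/mi)
  match && match[1].strip
end

# Illustrative input, not taken from the original poster's files.
sample = <<~HTML
  <!DOCTYPE html>
  <html>
  <head><title>Example</title></head>
  <BODY class="page">
  <p>Keep this.</p>
  <p>And this.</p>
  </body>
  </html>
HTML

puts extract_body(sample)
```

The /m flag is probably the piece that makes "whole blocks" work: without it, `.` in a Ruby regex stops at a newline, so a pattern can match single words but never spans several lines.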

Hrmm. I was using gsub to blank out all the stuff I didn't want. Maybe I'll just pull out the stuff that I do want. The marvels of reversing your logic. Thanks.
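For comparison, here is a rough sketch of the original "blank out what I don't want" approach using `String#sub`. The method name `strip_wrapper` is hypothetical, and the same caveat applies: it assumes a single <body> element and is not a substitute for a real HTML parser.

```ruby
# Hypothetical helper: deletes everything up to and including the opening
# <body ...> tag, then everything from </body> to the end of the string.
def strip_wrapper(html)
  html
    .sub(/\A.*?<body\b[^>]*>/mi, "")  # from the start through the first <body ...>
    .sub(/<\/body\s*>.*\z/mi, "")     # from the first </body> through the end
    .strip
end

# Illustrative input only.
page = "<!DOCTYPE html>\n<html>\n<head></head>\n<BODY>\n<p>hi</p>\n</body>\n</html>\n"
puts strip_wrapper(page)
```

Both patterns need /m so that `.` can cross line breaks, and /i so that <BODY> and <body> are treated alike. If a file has no <body> tag at all, `sub` finds nothing to replace and the text comes back unchanged apart from the final `strip`.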
