Looking for an HTML parser

Stef_T · September 1, 2006, 3:52pm

Hey nuno, Urm, okay, call me stupid Ishmael, but why not merely subclass the current htmlparser and then, whenever you get a 'bad tag', do whatever you want to do with it? I dare say that if someone passes me a badly formed document, I -want- them to see an error; however, whatever -you- decide to do with it is up to (well) -you-. If you want to try and 'fix' certain errors in a bad document, that's surely down to 'you'.
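To make the subclassing idea concrete, here is a minimal sketch in plain Ruby. It is not the API of any real HTML parser: `StrictTagChecker`, `LenientTagChecker`, `MalformedHTMLError` and the `on_bad_tag` hook are all hypothetical names invented for illustration, and the checker only looks at whether tags balance. The base class raises on a bad tag; the subclass overrides the hook to do something else with it.

```ruby
# Hypothetical classes for illustration only; no real parser exposes this API.
class MalformedHTMLError < StandardError; end

class StrictTagChecker
  # Tags that never take a closing tag (a deliberately short list).
  VOID_TAGS = %w[br hr img input meta link].freeze
  # Captures: optional leading "/", the tag name, optional trailing "/".
  TAG = /<(\/)?([a-zA-Z][a-zA-Z0-9]*)[^<>]*?(\/)?>/

  # Walks the tags in document order and reports any that do not balance.
  def check(html)
    stack = []
    html.scan(TAG) do |closing, name, self_closing|
      name = name.downcase
      next if self_closing || VOID_TAGS.include?(name)

      if closing
        if stack.last == name
          stack.pop
        else
          on_bad_tag(name, "unexpected </#{name}>")
        end
      else
        stack.push(name)
      end
    end
    # Anything still on the stack was opened but never closed.
    stack.each { |name| on_bad_tag(name, "<#{name}> is never closed") }
    true
  end

  # Default policy: a badly formed document is an error.
  def on_bad_tag(name, reason)
    raise MalformedHTMLError, reason
  end
end

# Subclass that decides for itself what to do with a bad tag:
# here it records the problem and keeps going instead of raising.
class LenientTagChecker < StrictTagChecker
  attr_reader :problems

  def initialize
    @problems = []
  end

  def on_bad_tag(name, reason)
    @problems << reason
  end
end
```

With these definitions, `StrictTagChecker.new.check("<p><b>bold</p>")` raises `MalformedHTMLError`, while a `LenientTagChecker` given the same string collects the complaints in `problems` and carries on. What the subclass does beyond that (skip, repair, log) stays the caller's decision.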

You may get lucky and someone may have already trod this path, but surely in the case of 'bad data' you're not best placed to say what's 'valid' and what's not. Surely that's something only the originating user can do. Mean to say, you can deal with things like a missing '>' fairly simply, but what about character transposition (inptu instead of input) or character addition (<input name="freds"> instead of <input name="fred">)?

I think the -sanest- thing a parser can do is raise an error on badly formed input. Perhaps not the answer you want, and I look forward to being proved 'wrong', but, well, *polite shrug* there's my 2c ;p

Regards Stef

nuno wrote:

Stef_T · September 1, 2006, 4:31pm

Hello Michael, Whereas I agree with you in regards to the whole 'you can't control someone else's webpage when they don't conform to the standard', I do think that if you are scraping a webpage, you don't really want to fling it into an HTMLParser anyway. Surely it's much quicker to treat the HTML as a 'string' and then regex out what you need?
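As a rough illustration of the 'treat it as a string' approach, the sketch below pulls link targets out of a sloppy page fragment with a single regex. The sample markup and the `extract_links` helper are made up for the example, and the pattern only handles quoted `href` values, so take it as a quick hack under those assumptions and not a general-purpose extractor.

```ruby
# Hypothetical helper: returns the quoted href value of every <a> tag.
# It never checks whether the surrounding document is well formed.
def extract_links(html)
  # <a, whitespace, any other attributes, then href="..." or href='...'
  href_pattern = /<a\s[^>]*?href\s*=\s*["']([^"']*)["']/i
  html.scan(href_pattern).flatten
end

# Made-up fragment with unclosed tags and inconsistent case.
html = <<~HTML
  <html><body>
  <a href="/posts/1">First post</a>
  <A HREF='/posts/2'>Second post
  <p>No closing tags here
HTML

links = extract_links(html)
# links is ["/posts/1", "/posts/2"]
```

The missing closing tags make no difference here, which is the attraction; the price is that anything the pattern did not anticipate (an unquoted attribute, for instance) is silently skipped.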

Of course, this is probably either my Perl background, rampant pragmatism or bad programming showing, but whenever I have wanted to check the 'well formed-ness' of a document, it has almost always been one 'uploaded' to the system I am using. So that's where I base my whole 'fling an error on error' practice from. In essence, I guess it depends what the user is using the HTMLParser 'for'.

Regards Stef

Michael Modica wrote:
