Looking for an HTML parser

Hey nuno,
    Urm, okay, call me stupid Ishmael, but why not simply subclass the current htmlparser and then, whenever you get a 'bad tag', do whatever you want to do with it? I dare say that if someone passes me a badly formed document, I -want- them to see an error; whatever -you- decide to do with it, however, is up to (well) -you-. If you want to try and 'fix' certain errors in a bad document, that's surely down to you.
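To make the subclassing idea concrete, here is a minimal sketch. It assumes Python's standard `html.parser.HTMLParser` as a stand-in, since the thread does not say which parser is in use. `KNOWN_TAGS`, `BadTagError`, `StrictParser`, `ForgivingParser` and the `handle_bad_tag` hook are all names made up for illustration. Python's parser is lenient on its own, so the strictness here comes entirely from the whitelist check.

```python
from html.parser import HTMLParser

# Hypothetical whitelist, for illustration only; a real one would be longer.
KNOWN_TAGS = {"html", "head", "title", "body", "form", "input", "p", "a"}


class BadTagError(ValueError):
    """Raised when a tag outside KNOWN_TAGS is seen."""


class StrictParser(HTMLParser):
    """Default policy: a bad tag is an error the caller gets to see."""

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_TAGS:
            self.handle_bad_tag(tag, attrs)

    def handle_bad_tag(self, tag, attrs):
        # Hook invented for this sketch; subclasses override it.
        line, _column = self.getpos()
        raise BadTagError(f"unknown tag <{tag}> at line {line}")


class ForgivingParser(StrictParser):
    """Alternative policy: note the bad tag and carry on."""

    def __init__(self):
        super().__init__()
        self.bad_tags = []

    def handle_bad_tag(self, tag, attrs):
        self.bad_tags.append(tag)
```

Feeding `'<form><inptu name="fred"></form>'` to `StrictParser` raises `BadTagError`, while `ForgivingParser` just records `inptu` in `bad_tags`. What counts as 'bad' and what to do about it both stay with whoever writes the subclass.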

    You may get lucky and someone may have already trodden this path, but surely, in the case of 'bad data', you're not best placed to say what's 'valid' and what's not. Surely that's something only the originating user can do. I mean to say, you can deal with things like a missing '>' fairly simply, but what about character transposition (inptu instead of input) or character addition (<input name="freds"> instead of <input name="fred">)?
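A rough sketch of why the two kinds of typo differ: a misspelled tag name can at least be compared against a list of known tags, whereas an attribute value such as "freds" has nothing to be compared against. The example assumes Python and its standard `difflib` module; `KNOWN_TAGS` and `suggest_tag` are hypothetical names, and the 0.6 cutoff is an arbitrary choice.

```python
import difflib

# Hypothetical list of tags the application expects.
KNOWN_TAGS = ["html", "head", "body", "form", "input", "p", "a", "div"]


def suggest_tag(tag):
    """Return the tag if known, else the closest known tag, else None."""
    if tag in KNOWN_TAGS:
        return tag
    matches = difflib.get_close_matches(tag, KNOWN_TAGS, n=1, cutoff=0.6)
    return matches[0] if matches else None
```

Here `suggest_tag("inptu")` comes back with `"input"`, but that is still only a guess, and no comparable check exists for `name="freds"`: only the author of the page knows the field was meant to be called "fred".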

    I think the -sanest- thing a parser can do is raise an error on badly formed input. Perhaps not the answer you want, and I look forward to being proved 'wrong', but, well, *polite shrug* there's my 2c ;p

nuno wrote:

Hello Michael,
    Whereas I agree with you in regards to the whole 'you can't control someone else's webpage when they don't conform to the standard', I do think that if you are scraping a webpage, you don't really want to fling it into an HTMLParser anyway. Surely it's much quicker to treat the HTML as a 'string' and then regex out what you need?
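For what it's worth, the 'treat it as a string' approach might look something like the sketch below. It assumes Python's standard `re` module and that the thing being scraped is link targets; `HREF_RE` and `extract_links` are made-up names, and the pattern only handles quoted `href` values.

```python
import re

# Matches <a ... href="..."> or <a ... href='...'>, ignoring case.
# Unquoted href values are deliberately not handled in this sketch.
HREF_RE = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def extract_links(html):
    """Return every quoted href value found in anchor tags, in order."""
    return HREF_RE.findall(html)
```

The pattern never checks whether the rest of the page is well formed, which is exactly why it survives sloppy markup, and also why it is no use for validating anything.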

    Of course, this is probably either my Perl background, rampant pragmatism or bad programming showing .. but .. whenever I have wanted to check the 'well-formedness' of a document, it has almost always been 'uploaded' to the system I am using. So that's where I base my whole 'fling an error on error' practice from ;) In essence, I guess it depends on what the user is using the HTMLParser 'for' :)


Michael Modica wrote: