How to scrape a page without knowing its html structure

Kalyan · December 16, 2009, 6:12am

I’m building a module for my site that needs to import a user’s blog. I can read the blog through its RSS feed, but the feed doesn’t include the full content, so I need to scrape the blog page itself. How can I scrape a page without knowing its HTML structure? Can anyone help with this? Thanks in advance.

hassan · December 16, 2009, 6:18am

You asked this exact question 4 days ago and got 2 answers saying that basically you can't -- you have to know *something* about the way the pages are marked up.

It's still true.

jman · December 16, 2009, 10:40pm

Looking at the structure would be the easiest way, but if you wanted something more complex... your scraping program could infer the layout structure and separate it from the content. It would need to be fed multiple pages from the same site and would treat the portion that stays mostly the same from page to page as the layout. That's an oversimplification, but that's the general idea.
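
As a rough illustration of that idea, here is a minimal Ruby sketch using only the standard library. The method names (`text_blocks`, `infer_layout`, `extract_content`), the 0.8 threshold, and the sample pages are all made up for this example, and the regex-based splitting is a crude stand-in for a real HTML parser such as Nokogiri. It treats any text block that appears on most of the pages as layout and keeps the rest as content.

```ruby
require "set"

# Split a page into whitespace-normalized text runs, ignoring tags.
# Regex splitting is an assumption made for brevity; real-world HTML
# is better handled by a proper parser.
def text_blocks(html)
  html.gsub(/<(script|style)\b.*?<\/\1>/mi, " ") # drop script/style bodies
      .split(/<[^>]+>/)                          # cut at every tag
      .map { |t| t.strip.gsub(/\s+/, " ") }
      .reject(&:empty?)
end

# A block counts as layout if it appears on at least `threshold`
# (a fraction, hypothetical default 0.8) of the sample pages.
def infer_layout(pages, threshold: 0.8)
  counts = Hash.new(0)
  pages.each { |html| text_blocks(html).uniq.each { |b| counts[b] += 1 } }
  min_pages = (pages.size * threshold).ceil
  counts.select { |_, n| n >= min_pages }.keys.to_set
end

# Content is whatever is left after removing the inferred layout blocks.
def extract_content(html, layout)
  text_blocks(html).reject { |b| layout.include?(b) }
end

# Hypothetical sample pages sharing a header, navigation, and footer.
SAMPLE_PAGES = [
  "<html><body><div>My Blog</div><ul><li>Home</li><li>About</li></ul>" \
  "<h2>First post</h2><p>Hello world.</p><div>Copyright 2009</div></body></html>",
  "<html><body><div>My Blog</div><ul><li>Home</li><li>About</li></ul>" \
  "<h2>Second post</h2><p>More text here.</p><div>Copyright 2009</div></body></html>",
  "<html><body><div>My Blog</div><ul><li>Home</li><li>About</li></ul>" \
  "<h2>Third post</h2><p>Even more text.</p><div>Copyright 2009</div></body></html>"
]

layout = infer_layout(SAMPLE_PAGES)
p extract_content(SAMPLE_PAGES[0], layout)
# prints ["First post", "Hello world."]
```

This only works when you can fetch several pages from the same blog, and it will misclassify content that happens to repeat (a tagline quoted in every post, say) or layout that varies from page to page (dates, comment counts), which is the oversimplification mentioned above.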

Good luck.
