How to scrape a page without knowing its HTML structure

Hi,

I'm building a module for my site that imports a user's blog. I can use RSS feeds to read the blog information, but the feeds don't give me the entire content, so I need to scrape the user's blog page directly. How can I scrape a page without knowing its HTML structure? Can anyone help me with this? Thanks in advance.

Unless you want the entire page, you need to know something about the page structure.

Well. If the page is even reasonably marked up (DIVs/Ps-wise) and you create an array of block elements, you *might* get away with the assumption that the ones with significant amounts of text (for some value of "significant") are the actual blog post.

Might. I'd imagine a lot more going into that heuristic, since you're looking for an AI solution :slight_smile:

Good luck,

I think you'll find you need to know _something_ about the page layout. If there are a finite number of places you need to scrape from, you could do this pretty simply.

Assume you had a CSS selector to find the desired content in each URL of interest, and that it was stored in an ActiveRecord(-ish) model.

```ruby
require 'open-uri'
require 'nokogiri'

# ...
# Look up the stored selector for this URL
@selector = Selector.find_by_url(@the_url_to_scrape)

doc = Nokogiri::HTML(URI.open(@the_url_to_scrape))

# Search for nodes by CSS
doc.css(@selector).each do |link|
  puts link.content
end
# ...
```

I did a write-up on simple scraping with Nokogiri and SelectorGadget here:

iri/