I'm building a module for my site that needs to import a user's blog. I
can use RSS feeds to read the blog information, but the feeds don't give
me the entire content, so I need to scrape the user's blog page itself.
How do I scrape a page without knowing its HTML structure in advance?
Can anyone help me with this? Thanks in advance.
Unless you want the entire page, you need to know something about
the page structure.
Well, if the page is even reasonably marked up (DIV/P-wise) and you
build an array of its block elements, you *might* get away with the
assumption that the ones with significant amounts of text (for some
value of "significant") are the actual blog post.
Might. I'd expect a lot more to go into that heuristic, since what
you're asking for is really an AI-ish solution.
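A minimal sketch of that text-density heuristic, using Nokogiri (the
same library used in the answer below); the element list, the
200-character threshold, and the example URL are all assumptions you'd
tune for your own sources:

require 'nokogiri'
require 'open-uri'

# Hypothetical threshold for "significant" text -- tune per source.
MIN_TEXT_LENGTH = 200

doc = Nokogiri::HTML(URI.open('http://example.com/some-blog-post'))

# Collect block-level elements whose text is long enough to look like
# post content rather than navigation or boilerplate.
candidates = doc.css('article, section, div, p').select do |node|
  node.text.strip.length >= MIN_TEXT_LENGTH
end

# Crude guess: the candidate with the most text is the post body.
post = candidates.max_by { |node| node.text.strip.length }
puts post.text.strip if post

Note that node.text includes all descendant text, so nested wrappers
will always outscore their children; accounting for that is part of
the "lot more" this heuristic would need.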
I think you'll find you need to know _something_ about the page layout. If
there is a finite number of places you need to scrape from, you could do
this pretty simply.
Assume you had a CSS selector that finds the desired content in each URL
of interest, and that it was stored in an ActiveRecord(-ish) model.
require 'nokogiri'
require 'open-uri'

# ...
# Look up the selector record stored for this URL
@selector = Selector.find_by_url(@the_url_to_scrape)

# Fetch and parse the page
doc = Nokogiri::HTML(URI.open(@the_url_to_scrape))

# Search for nodes by CSS, using the stored selector string
# (assumed here to live in a `css` column on the record)
doc.css(@selector.css).each do |node|
  puts node.content
end
# ...
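For completeness, the Selector model assumed above might look something
like this; the url and css column names are my guesses, not anything
from the original post:

# Maps a URL to the CSS selector that extracts its main content.
# (Hypothetical schema: url: string, css: string)
class Selector < ActiveRecord::Base
  validates :url, :css, presence: true
end

# Storing a selector for a known blog:
Selector.create(url: 'http://example.com/blog', css: 'div.post-content')

With that in place, Selector.find_by_url returns the record and
@selector.css gives you the selector string to pass to doc.css.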
I did a write-up on simple scraping with Nokogiri and SelectorGadget here: