I'm trying to scrape images from a page. I'm using Hpricot to scrape the actual image URLs into an array but I've encountered a problem regarding resolving the full image paths.
Example:
The src of the images can be like any of the following:
http://external.site.com/images/image.jpg (Full URL)
/images/image.jpg (Absolute Path)
../images/image.jpg (Relative Path)
images/image.jpg (Relative Path)
Is there a way to resolve these paths to the proper URLs, so I can copy the images to my server or do whatever else I need with them?
Hope that makes sense.
Cheers,
Jim
You can use URI.join, which resolves each src against the URL of the page it was found on:
require 'uri'
=> true
page_and_images = {
?> 'http://external.site.com/somedir/somepage.html' => ['http://external.site.com/images/image.jpg'
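Since the session above is cut off, here is a minimal self-contained sketch of the same idea. The page URL and the four src values are taken from the question as examples; the variable names (page_url, srcs, resolved) are just placeholders, not anything Hpricot gives you.

```ruby
require 'uri'

# Example page URL (assumed): the page the <img> tags were scraped from.
page_url = 'http://external.site.com/somedir/somepage.html'

# The four src forms from the question.
srcs = [
  'http://external.site.com/images/image.jpg', # full URL
  '/images/image.jpg',                         # absolute path
  '../images/image.jpg',                       # relative path, one level up
  'images/image.jpg'                           # relative path
]

# URI.join resolves each src against the page URL and returns a URI object;
# to_s turns it back into a plain string.
resolved = srcs.map { |src| URI.join(page_url, src).to_s }

resolved.each { |url| puts url }
# http://external.site.com/images/image.jpg
# http://external.site.com/images/image.jpg
# http://external.site.com/images/image.jpg
# http://external.site.com/somedir/images/image.jpg
```

Note that URI.join raises URI::InvalidURIError if a src is not a valid URI (for example, if it contains unescaped spaces), so you may want to rescue that when scraping real pages.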