I'm trying to scrape images from a page. I'm using Hpricot to scrape the
actual image URLs into an array but I've encountered a problem regarding
resolving the full image paths.
The src of the images can be like any of the following:
http://external.site.com/images/image.jpg (Full URL)
/images/image.jpg (Absolute Path)
../images/image.jpg (Relative Path)
images/image.jpg (Relative Path)
Is there a way to resolve these paths to the proper URLs, so that I can copy
the images to my server or do whatever else I need with them?
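Parse the URL of the page you scraped into pieces: extract the scheme and domain name (for example http://www.example.com) and the "directory" part of the path (everything up to and including the last slash).
Then just match them up. If the image src starts with http, use it as is. If it starts with a slash, prepend the scheme and domain name. Otherwise use domain + directory_path + src, and collapse any ../ segments against the directory path so that each one steps up a level.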
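As a rough sketch of the same idea, Ruby's standard library can do the matching for you: URI.join resolves a reference against a base URL and handles all four cases from the question, including the ../ segments. The helper name resolve_image_url and the page URL http://www.example.com/gallery/index.html are made up for illustration; substitute the URL of the page you actually fetched.

```ruby
require "uri"

# Hypothetical helper (not part of Hpricot): resolve an image src
# against the URL of the page it was scraped from.
def resolve_image_url(page_url, src)
  URI.join(page_url, src).to_s
end

# Assumed page URL, for illustration only.
page_url = "http://www.example.com/gallery/index.html"

# The four src forms from the question.
srcs = [
  "http://external.site.com/images/image.jpg", # full URL
  "/images/image.jpg",                         # absolute path
  "../images/image.jpg",                       # relative path, one level up
  "images/image.jpg"                           # relative path
]

srcs.each { |src| puts resolve_image_url(page_url, src) }
# Prints:
# http://external.site.com/images/image.jpg
# http://www.example.com/images/image.jpg
# http://www.example.com/images/image.jpg
# http://www.example.com/gallery/images/image.jpg
```

You would call this once per src value collected by Hpricot. Note that URI.join raises URI::InvalidURIError on malformed input (a src containing spaces, for instance), so real-world pages may need a rescue or some cleanup first.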