I need to scrape an entire site while preserving its tree structure. Every page has links to its child pages. How can I build the site tree with Nokogiri? I assume it has to visit pages recursively and scrape all the directory links on each one, but I can’t work out the full algorithm. How should I do that?
P.S. I don’t need a “save the whole site to disk with HTTrack” answer. The data will be processed and copied into the new, redesigned version of the original site.
Some time ago I solved a similar problem (though I needed continuous scraping) by organizing several workers: https://medium.com/@vladimir_vg/dsl-74d0fcf03cae (in Russian)
You probably don't need anything that complex, but you may get some ideas from it.
def get_subtree(url)
  # fetch the page
  # parse it
  # for each link
  #   normalize the link
  #   if link not already visited
  #     add link to table of visited links
  #     get_subtree(link)
  #   end
  # end
end
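The outline above could be fleshed out roughly like this. It is only a sketch under a few assumptions of mine: the helper names `fetch_links` and `normalize_link` and the `links_for:` parameter are made up for illustration, "normalize" is taken to mean resolving relative hrefs, dropping `#fragments` and ignoring other hosts, and pages are fetched with plain `Net::HTTP` (no redirects, retries or rate limiting).

```ruby
require "set"
require "uri"

# Hypothetical default loader: download a page and return its raw href values.
# The gem and the HTTP client are required lazily so the traversal itself
# can be tried out without network access.
def fetch_links(url)
  require "net/http"
  require "nokogiri"
  html = Net::HTTP.get(URI(url)) # note: does not follow redirects
  Nokogiri::HTML(html).css("a[href]").map { |a| a["href"] }
end

# Resolve href against the page URL, drop the #fragment and keep only
# http(s) links on the same host. Returns nil for anything to skip.
def normalize_link(base_url, href)
  uri = URI.join(base_url, href.to_s.strip)
  return nil unless uri.is_a?(URI::HTTP) && uri.host == URI(base_url).host
  uri.fragment = nil
  uri.to_s
rescue URI::Error
  nil
end

# Returns the children of url as a nested hash: { child_url => { ... } }.
# visited is shared across the whole recursion, so each page is fetched once.
def get_subtree(url, visited = Set.new([url]), links_for: method(:fetch_links))
  children = {}
  links_for.call(url).each do |href|
    link = normalize_link(url, href)
    next if link.nil? || !visited.add?(link) # add? returns nil if already seen
    children[link] = get_subtree(link, visited, links_for: links_for)
  end
  children
end
```

Usage would be something like `tree = { root => get_subtree(root) }` with `root = "http://example.com/"` as a placeholder. Because the walk is depth-first, a page ends up under whichever parent reaches it first, which is not necessarily its "real" parent in the site's menu. For a large site you may prefer a queue (breadth-first) to avoid deep recursion.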