Is this your HTML, or are you scraping someone else's?
If it's yours, organize it differently. If you know you want to
process a section at a time, wrap each section in an identifiable
container, then scope your searches to that container:
  (doc/"div").each do |dv|
    this_h3 = (dv/"h3")
    if this_h3.inner_html == "blah2"
      (dv/"li").each do |li|
        puts li.inner_html
      end
    end
  end
This emits just c and d.
If it's someone else's HTML in that format, you'll probably have to go
element by element through the whole doc, with state machine-ish code
to track what you've seen previously, since there doesn't seem to be any
real 'path' from each h3 to its li's.
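
Here is a rough sketch of what that state-machine-ish walk could look
like. It is not Hpricot code: it scans the raw string with a regex so it
runs with nothing but plain Ruby, and it assumes the markup is as simple
as a flat run of h3 and li tags with no attributes or nesting. The sample
markup, FLAT_HTML, and the method name items_under are made up for
illustration.

```ruby
# Hypothetical flat markup: each h3 is followed by sibling li elements.
FLAT_HTML = <<~HTML
  <h3>blah1</h3>
  <li>a</li>
  <li>b</li>
  <h3>blah2</h3>
  <li>c</li>
  <li>d</li>
HTML

# Walk the tags in document order, remembering the last h3 seen.
# Assumes attribute-free, un-nested <h3> and <li> tags.
def items_under(html, heading)
  current = nil # state: text of the most recent h3
  found = []
  html.scan(%r{<(h3|li)>(.*?)</\1>}m) do |tag, text|
    if tag == "h3"
      current = text           # entering a new section
    elsif current == heading
      found << text            # an li inside the section we want
    end
  end
  found
end

puts items_under(FLAT_HTML, "blah2")
```

The same idea should carry over to a real parser: iterate the elements
in order, update the "current heading" whenever you hit an h3, and
collect li's only while that heading matches.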
Your html is still flat, so you have to work with the patterns that you
see.
You have:
span
li
li
li
span
li
li
li
etc...
An ugly, brute-force, one-case solution is to:
read the page with Hpricot
remove the header
convert it to a simple string representation
stick your opening tag '<see>' at the head
stick your closing tag and a div end '</div></see>' at the tail
change all '<span>' to '</div><div><span>'
doctor up the new head from '<see></div><div>' to just '<see><div>'
re-create your Hpricot doc from the modified string
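
The string-doctoring steps above might look roughly like this. It is only
a sketch under a few assumptions: the header is already gone, the string
starts with a '<span>', and the spans carry no attributes. The sample
string, FLAT_BODY, and the method name wrap_sections are made up for
illustration. The final re-parse is left as a comment so the snippet runs
without Hpricot installed.

```ruby
# Hypothetical body markup, header already removed.
FLAT_BODY = "<span>one</span><li>a</li><li>b</li>" \
            "<span>two</span><li>c</li><li>d</li>"

# Wrap each span-plus-li run in its own <div>, inside one <see> root.
# Assumes the string begins with an attribute-free <span>.
def wrap_sections(flat)
  # opening tag at the head, closing tag and a div end at the tail
  wrapped = "<see>" + flat + "</div></see>"
  # every <span> now closes the previous div and opens a new one
  wrapped = wrapped.gsub("<span>", "</div><div><span>")
  # the very first span has no previous div to close, so fix the head
  wrapped.sub("<see></div><div>", "<see><div>")
end

puts wrap_sections(FLAT_BODY)

# Then re-create the doc from the modified string, for example:
#   doc = Hpricot(wrap_sections(FLAT_BODY))
```

After that, each section sits in its own div, so a search scoped to one
div only sees that section's li's.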
Please don't top post, it annoys readers on this list and makes it
less likely that you will get help.
I have not used Hpricot, but if I were in your situation, the first
thing I would do is look carefully through the Hpricot
documentation. Have you done that?