Lazy regexp is not lazy enough

Consider the string

  xyz<h1>x P0 y</h1><h1>x Q1 y</h1><h1>1 Placeholder2 2</h1>abc

and the pattern

  <(h\d)>.*?Placeholder2.*?<\/\1>

The pattern matches

  <h1>x P0 y</h1><h1>x Q1 y</h1><h1>1 Placeholder2 2</h1>

I want it to match

  <h1>1 Placeholder2 2</h1>

How can I do this? That is, I want to find the nearest <h1> ... </h1> surrounding Placeholder2.
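
To make the behaviour concrete, here is a minimal Ruby sketch that reproduces it (the variable names `s`, `lazy`, and `m` are only for illustration). The lazy `.*?` controls how far a match extends once a starting point is fixed. The engine still tries starting points from left to right and accepts the first `<h1>` from which the whole pattern can succeed, so the match begins at the first heading.

```ruby
# The string and pattern from the question.
s = "xyz<h1>x P0 y</h1><h1>x Q1 y</h1><h1>1 Placeholder2 2</h1>abc"
lazy = /<(h\d)>.*?Placeholder2.*?<\/\1>/

m = lazy.match(s)
puts m.begin(0)  # index of the first <h1>, not the last one
puts m[0]        # spans all three headings
```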

I don't know if or how Ruby's regular expressions can be bent to do this, but trying to parse HTML with regular expressions is doomed to fail eventually. Use something like Nokogiri.

Fred

Ralph Shnelvar wrote in post #980334:

How can I do this? That is, I want to find the nearest <h1> ... </h1> surrounding Placeholder2.

First, I'll +1 Fred on using Nokogiri for parsing HTML.

But you can modify your regex so that any markup '<' characters are excluded, using [^<], as in:

  p = /<(h\d)>[^<]+Placeholder2.*?<\/\1>/
  s = "xyz<h1>x P0 y</h1><h1>x Q1 y</h1><h1>1 Placeholder2 2</h1>abc"
  p =~ s
  => 33
  $1
  => "h1"

Is that what you meant?

- ff

In terms of Nokogiri and Hpricot ...

I develop on a Windows machine and my ISP's machine runs Linux.

Nokogiri works great on my development machine. My ISP does not support Nokogiri on his ... unless I am willing to spend the money to have him install it ... which I'm not ... and for political reasons, I can't move to another ISP.

Hpricot has given me lots and lots of problems ...

So I have been reduced to parsing some HTML myself. I don't want to do it ... but I gotta.

Fearless, your solution seems to work ... but I am clueless as to how and why it works!

Ralph Shnelvar wrote in post #980369:

Fearless, your solution seems to work ... but I am clueless as to how and why it works!

I'm FAR from a regex wizard, but it's worth noting:

  [abc] means match any one character that is a or b or c
  [^abc] means match any one character that is NOT a or b or c

ergo

  [^<] means match any one character that is NOT an open bracket
  [^<]+ means match one or more characters that are not open brackets

so

  /<(h\d)>[^<]+Placeholder2.*?<\/\1>/

matches an open < followed by an h followed by a digit followed by a close >, then one or more characters as long as they are NOT <, followed by "Placeholder2" ... etc
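
Here is a small sketch of that walk-through in Ruby (the names `s`, `pattern`, and `m` are made up for the example). Because [^<]+ cannot cross a '<', a match that starts at the first or second <h1> runs into </h1> before it ever reaches "Placeholder2" and fails, so the engine moves on until it reaches the heading that actually contains the text.

```ruby
s = "xyz<h1>x P0 y</h1><h1>x Q1 y</h1><h1>1 Placeholder2 2</h1>abc"
pattern = /<(h\d)>[^<]+Placeholder2.*?<\/\1>/

m = pattern.match(s)
puts m.begin(0)  # 33, the same index that =~ returned
puts m[0]        # <h1>1 Placeholder2 2</h1>
puts m[1]        # h1

# [^<]+ needs at least one character before "Placeholder2",
# so a heading that starts with it is not matched.
bare = "<h1>Placeholder2</h1>"
p pattern.match(bare)  # nil
```

If the heading may begin directly with "Placeholder2", using [^<]* instead of [^<]+ would probably be the safer choice.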

Of course, this will break as soon as someone adds attributes to the <h1> tag, such as <h1 class="navbar">, which is why we all like Nokogiri. I'm sorry your ISP doesn't agree! :)
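
To illustrate that failure mode, here is a rough sketch with a made-up input string `html` and two hypothetical pattern names, `strict` and `lenient`. The lenient variant is only one possible workaround, not a general fix: it allows anything except '>' between the tag name and the closing '>', and uses [^<]* so the heading text may start with "Placeholder2". It would still break on an attribute value containing '>' or on nested tags inside the heading.

```ruby
# Hypothetical input where the headings carry attributes.
html = 'xyz<h1 class="a">x P0 y</h1><h1 class="navbar">1 Placeholder2 2</h1>abc'

strict  = /<(h\d)>[^<]+Placeholder2.*?<\/\1>/
lenient = /<(h\d)\b[^>]*>[^<]*Placeholder2.*?<\/\1>/

p strict.match(html)         # nil, since there is no bare <h1> in the string
puts lenient.match(html)[0]  # the heading that contains Placeholder2
```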