Lazy regexp is not lazy enough

11155 · February 8, 2011, 3:46pm

Consider the string

xyz<h1>x P0 y</h1><h1>x Q1 y</h1><h1>1 Placeholder2 2</h1>abc

and the pattern

<(h\d)>.*?Placeholder2.*?<\/\1>

the pattern matches <h1>x P0 y</h1><h1>x Q1 y</h1><h1>1 Placeholder2 2</h1>

I want it to match

<h1>1 Placeholder2 2</h1>

How can I do this? That is, I want to find the nearest <h1> ... </h1> surrounding Placeholder2.

Frederick_Cheung · February 8, 2011, 4:37pm

I don't know if or how to bend Ruby's regular expressions to do this, but trying to parse HTML with regular expressions is doomed to fail eventually. Use something like Nokogiri.

Fred

11155 · February 8, 2011, 5:34pm

Ralph Shnelvar wrote in post #980334:

How can I do this? That is, I want to find the nearest <h1> ... </h1> surrounding Placeholder2.

First, I'll +1 Fred on using Nokogiri for parsing HTML.

But you can modify your regex so that any markup '<' characters are excluded, using [^<], as in:

p = /<(h\d)>[^<]+Placeholder2.*?<\/\1>/
s = "xyz<h1>x P0 y</h1><h1>x Q1 y</h1><h1>1 Placeholder2 2</h1>abc"
p =~ s

=> 33

$1

=> "h1"

Is that what you meant?

- ff

11155 · February 8, 2011, 6:10pm

In terms of Nokogiri and Hpricot ...

I develop on a Windows machine and my ISP's machine is a Linux.

Nokogiri works great on my development machine. My ISP does not support Nokogiri on his ... unless I am willing to spend the money to have him install it ... which I'm not ... and for political reasons, I can't move to another ISP.

Hpricot has given me lots and lots of problems ...

So I have been reduced to parsing some HTML myself. I don't want to do it ... but I gotta.

Fearless, your solution seems to work ... but I am clueless as to how and why it works!

11155 · February 8, 2011, 6:34pm

Ralph Shnelvar wrote in post #980369:

Fearless, your solution seems to work ... but I am clueless as to how and why it works!

I'm FAR from a regex wizard, but it's worth noting:

[abc] means match any one character that is a or b or c
[^abc] means match any one character that is NOT a or b or c
ergo [^<] means match any one character that is NOT an open bracket
[^<]+ means match one or more characters that are not open brackets

so /<(h\d)>[^<]+Placeholder2.*?<\/\1>/

matches an open < followed by an h followed by a digit followed by a close >, then one or more characters as long as none of them is <, followed by "Placeholder2" ... etc
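To see the difference side by side, here is a small sketch that runs both patterns from this thread against the original string. The variable names (lazy, bounded, and so on) are just mine for illustration. The lazy version still starts at the first <h1> because the engine takes the leftmost possible starting point and .*? is free to cross tag boundaries; [^<]+ cannot cross a <, so the first two <h1> attempts fail and the match starts at the third.

```ruby
s = "xyz<h1>x P0 y</h1><h1>x Q1 y</h1><h1>1 Placeholder2 2</h1>abc"

# Original pattern: lazy, but .*? may still run across "</h1><h1>".
lazy = /<(h\d)>.*?Placeholder2.*?<\/\1>/

# Suggested pattern: [^<]+ refuses to step over any "<".
bounded = /<(h\d)>[^<]+Placeholder2.*?<\/\1>/

# String#[] with a regex returns the matched text (or nil).
lazy_match    = s[lazy]
bounded_match = s[bounded]

# =~ returns the index where the match starts (or nil).
lazy_index    = s =~ lazy
bounded_index = s =~ bounded

puts lazy_match     # starts at the first <h1>
puts bounded_match  # only the <h1> that directly wraps Placeholder2
```

Laziness only affects how far a match extends to the right once a starting point is chosen; it never makes the engine skip an earlier starting point that can succeed.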

Of course, this will break as soon as someone adds attributes to the <h1> tag, such as <h1 class="navbar">, which is why we all like Nokogiri. I'm sorry your ISP doesn't agree!
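A quick sketch of that failure, plus one possible loosening. The tolerant pattern below is my own assumption, not something tested beyond this example: it allows anything up to the closing > of the opening tag. It will still break on nested markup inside the heading (an <em> before Placeholder2, say), so treat it as a stopgap rather than a parser.

```ruby
pattern = /<(h\d)>[^<]+Placeholder2.*?<\/\1>/

plain     = "<h1>1 Placeholder2 2</h1>"
with_attr = '<h1 class="navbar">1 Placeholder2 2</h1>'

plain_hit = plain =~ pattern      # matches at index 0
attr_hit  = with_attr =~ pattern  # nil: "<h1 " is not "<h1>"

# Hypothetical variant: \b ends the tag name, [^>]* skips any attributes.
tolerant = /<(h\d)\b[^>]*>[^<]+Placeholder2.*?<\/\1>/

tolerant_match = with_attr[tolerant]
puts tolerant_match
```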