extract a substring

jean-francois_ferrie · September 21, 2008, 8:28pm

hi

I have a string: my_string="blablablabla<coordinates>substring</coordinates>blabla"

I need to extract the text between "<coordinates>" and "</coordinates>".

How can I do that? Thanks for your help.

JF

Hank_Beaver · September 21, 2008, 9:03pm

my_string="blablablabla<coordinates>substring</coordinates>blabla"
# the parentheses below define the capture group that holds the text we want
sub_string = /.*<coordinates>(.*)<\/coordinates>.*/.match(my_string)
# index 0 is the whole match; index 1 is the first capture group
puts sub_string[1]

Regex is the fastest and most effective approach for one-off text parsing. Another good option is Whytheluckystiff's Hpricot: http://code.whytheluckystiff.net/hpricot/

Hank

rab · September 22, 2008, 12:29am

You probably want the regexp to be:
/<coordinates>(.*)<\/coordinates>/
so there's less backtracking when the .* first tries to gobble everything.

You might also need something like:
/<coordinates\b[^>]*>(.*)<\/coordinates>/
if there can be any attributes on the coordinates tag.

Of course, if you really do have XML in my_string, a true parser like Hpricot or REXML will be more reliable than regular expressions. For example, if you had to match against:
"blahblah<coordinates>first one</coordinates>yadayadayada<coordinates>oops! another one</coordinates>yakyakyak"
would you want the substring to be:
"first one</coordinates>yadayadayada<coordinates>oops! another one"
(yeah, I didn't think so)
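To make the two-tag problem concrete, here is a minimal sketch. It assumes a non-greedy quantifier (.*?) and String#scan are acceptable, neither of which appears in the posts above, and the "unit" attribute in the last example is purely hypothetical. It is only an illustration, not a replacement for a real XML parser:

```ruby
# Sample input with two <coordinates> elements, as in the example above.
text = "blahblah<coordinates>first one</coordinates>yadayadayada" \
       "<coordinates>oops! another one</coordinates>yakyakyak"

# Greedy: (.*) runs on to the LAST closing tag, swallowing the markup in between.
greedy = text[/<coordinates>(.*)<\/coordinates>/, 1]

# Non-greedy: (.*?) stops at the FIRST closing tag.
lazy = text[/<coordinates>(.*?)<\/coordinates>/, 1]

# String#scan returns one capture per element when there are several.
every_match = text.scan(/<coordinates>(.*?)<\/coordinates>/).flatten

# Attribute-tolerant variant; the "unit" attribute is a made-up example.
with_attr = '<coordinates unit="deg">1,2</coordinates>'
attr_value = with_attr[/<coordinates\b[^>]*>(.*?)<\/coordinates>/, 1]

puts greedy
puts lazy
p every_match
puts attr_value
```

Note that . does not match newlines by default, so content spanning several lines would need the /m flag.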

-Rob

Rob Biedenharn http://agileconsultingllc.com Rob@AgileConsultingLLC.com

Con · September 22, 2008, 12:57am

Hi, I would recommend using Hpricot. You can find the documentation here: http://code.whytheluckystiff.net/doc/hpricot

Good luck,

-Conrad

jean-francois_ferrie · September 22, 2008, 12:20pm

hi

The regexp you gave works fine:
/.*<coordinates>(.*)<\/coordinates>.*/
I tested it with Rubular. The problem is that I can't retrieve the substring; I always get the whole string.

Here is what i did:

irb(main):001:0> st="<Point><coordinates>-0.954850,46.436960,0</coordinates></Point>"
=> "<Point><coordinates>-0.954850,46.436960,0</coordinates></Point>"
irb(main):002:0> sub=/.*<coordinates>(.*)<\/coordinates>.*/.match(st)
=> #<MatchData:0x7f2040045fd0>
irb(main):003:0> sub.inspect
=> "#<MatchData:0x7f2040045fd0>"
irb(main):004:0> sub.to_s
=> "<Point><coordinates>-0.954850,46.436960,0</coordinates></Point>"
irb(main):005:0> sub.string
=> "<Point><coordinates>-0.954850,46.436960,0</coordinates></Point>"
irb(main):006:0> st.match(/.*<coordinates>(.*)<\/coordinates>.*/)
=> #<MatchData:0x7f2040019fc0>
irb(main):007:0> st.match(/.*<coordinates>(.*)<\/coordinates>.*/).to_s
=> "<Point><coordinates>-0.954850,46.436960,0</coordinates></Point>"

thank you for your help


Dirk_Groten · September 22, 2008, 12:36pm

Regexp#match(string) returns a MatchData object, which is not just the match. It can be indexed like an Array:
sub[0] returns the entire matched string
sub[1], sub[2], ... return the values of the capture groups (the ones between parentheses)

sub[1] is therefore the thing you want to use. No need to use to_s.
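As a rough sketch of that indexing, using the string from the irb session above: the nil check and the split into numbers are additions for illustration only, not something the original posts asked for.

```ruby
st = "<Point><coordinates>-0.954850,46.436960,0</coordinates></Point>"

# Regexp#match returns a MatchData object, or nil when nothing matches.
m = /<coordinates>(.*)<\/coordinates>/.match(st)

if m
  whole    = m[0]        # the entire matched text, tags included
  coords   = m[1]        # the first capture group: the text between the tags
  captures = m.captures  # all capture groups as a plain Array

  # Illustrative extra step: turn the comma-separated text into floats.
  values = coords.split(",").map(&:to_f)

  puts whole
  puts coords
  p captures
  p values
end
```

Guarding against nil matters because calling [1] on a failed match raises NoMethodError.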

jean-francois_ferrie · September 22, 2008, 1:44pm

ah ok

thank you all for your help
