I've got a string like this:

&lt;block trace="true" name="AssignResources: Append Resources"&gt;

I need to get the word (or the characters up to the next whitespace) after &lt; (in this case block) and the words before = (here trace and name).

I tried several regex patterns, but every attempt returns the word with the delimiter characters included, e.g. ;block.

I'm sure it's not that hard, but I haven't found the solution yet.

Does anybody have a hint?
Thanks.

Btw: I want to replace the pattern matches with gsub.

EDIT:

Solved it with the following regexes:

1) /\s(\w+)="(.*?)"/ matches all attributes and their values, captured in $1 and $2.

2) /<!--.*-->/ matches comments.

3) /&lt;([\/|!|\?]?)([A-Za-z0-9]+)[^\s|&gt;|\/]*/ matches all tag names, whether they're in a closing tag, a self-closing tag, an <?xml> tag or a DTD tag. $1 contains the optional prefix (/, ! or ?, or nothing) and $2 contains the tag name.
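
As a rough sketch of how pattern 1 might then be handed to gsub to wrap attribute names and values in span tags: the sample string is the one from the question, while the class names attr-name and attr-value are made up for illustration and are not taken from the original post.

```ruby
# Sample input: an XML tag that has already been HTML-escaped into plain text.
str = '&lt;block trace="true" name="AssignResources: Append Resources"&gt;'

# Pattern 1 from above: group 1 is the attribute name, group 2 its value.
attr_re = /\s(\w+)="(.*?)"/

# Single-quoted replacement so \1 and \2 reach gsub as backreferences.
# The class names "attr-name" and "attr-value" are hypothetical.
replacement = ' <span class="attr-name">\1</span>="<span class="attr-value">\2</span>"'

highlighted = str.gsub(attr_re, replacement)
puts highlighted
```

Because the replacement re-inserts the leading space and the quotes itself, the surrounding text stays unchanged and only the captured parts get wrapped.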

5 Answers

2

This looks a lot like parsing HTML with regex to me.

Ruby has a very good HTML parser called Nokogiri.

Here is how to do that:

require 'nokogiri'

# Nokogiri::HTML wraps the input in <html> and <body>, so select the block element itself
html = Nokogiri::HTML('<block trace="true" name="AssignResources: Append Resources">')

html.xpath("//block").each do |s|
    puts s.node_name # block
    puts s.keys      # trace, name
    puts s.values    # true, AssignResources: Append Resources
end

9 Comments

Hey S.Mark, I already use Nokogiri for that (XML parsing) and it's great. I will think about my application flow again; maybe I can do the replacement earlier and with Nokogiri. At the point where I do the replacement, it's no longer XML: it has been converted to one huge string. That's necessary because it is presented as text, with the values of the former XML tag attributes turned into HTML <a> tags that link to other HTML pages defined by the attribute value. The replacement via gsub and pattern matching is done to surround parts of an XML tag with different <span> tags.
And no: doing the syntax highlighting via JavaScript is no solution in this case. At the moment I use "prettify", but with documents of more than 2,000 lines and many times more tags, it's no fun to use. That's why I want to prepare the output in my parsing app already.
Syntax highlighting? Have you considered using an existing library like shjs? shjs.sourceforge.net
Yes, I tried that; as I said, I'm using Prettify (code.google.com/p/google-code-prettify). I think the problems are the same: with huge content to highlight, the site is not usable anymore (30+ secs). Huge content means 7000+ lines of XML. Sometimes weird requirements ask for weird solutions ;)
I think regex can't be fast for 7000+ lines of data, though.
1

You can try:

&lt;([^ ]*)\s([^=]*)=
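
To make clear what this pattern captures, here is a small sketch run against the string from the question. The variable names are only illustrative. Note that, as written, the pattern reaches the tag name and the first attribute only.

```ruby
str = '&lt;block trace="true" name="AssignResources: Append Resources"&gt;'

# match returns a MatchData object (or nil when nothing matches).
m = str.match(/&lt;([^ ]*)\s([^=]*)=/)

tag_name   = m[1]  # characters after "&lt;" up to the first space
first_attr = m[2]  # characters between that whitespace and the first "="

puts tag_name    # block
puts first_attr  # trace
```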

0
'&lt;block trace="true" name="AssignResources: Append Resources"&gt;'[/&lt;(\w+)/, 1]
#=> "block"

If you pass a regex and an index i to String#[], it'll return the value of the ith capturing group.
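
The words before each = can be collected in a similar way. A small sketch, assuming attribute names consist only of word characters, that combines String#[] for the tag name with String#scan and a lookahead for the attribute names:

```ruby
str = '&lt;block trace="true" name="AssignResources: Append Resources"&gt;'

# Capturing group 1 holds the word directly after "&lt;".
tag_name = str[/&lt;(\w+)/, 1]

# scan returns every match; the lookahead (?==) requires a following "="
# without making it part of the match.
attr_names = str.scan(/\w+(?==)/)

p tag_name    # "block"
p attr_names  # ["trace", "name"]
```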

Edit:

In Ruby 1.9 you can use /(?<=&lt;)\w+/ to require the presence of the &lt; without matching it. In 1.8 there is no way to do that. The best you can do is to put the part you don't want to replace in a capturing group and access that group in the replacement, like this:

"lo&lt;la li".gsub(/(&lt;)(\w+)/, '\1 --\2--')
 #=> "lo&lt; --la-- li"

1 Comment

Thanks for that hint, but I need the regex pattern as parameter to gsub method, to replace all these pattern matches with another string. I'm thinking about how to make it fit to gsub.
0
&lt;block trace="true" name="AssignResources: Append Resources"&gt;

&lt;([^\s]+)\s+([^=]+)="([^"]*)"\s+([^=]+)="([^"]*)"\s*&gt;

#result:

$1 block
$2 trace
$3 true
$4 name
$5 AssignResources: Append Resources

Update: I don't know Ruby, but based on the description of gsub here, I believe something like the following should do the trick.

str = '&lt;block trace="true" name="AssignResources: Append Resources"&gt;'
repl = str.gsub(/&lt;([^\s]+)\s+([^=]+)="([^"]*)"\s+([^=]+)="([^"]*)"\s*&gt;/, 
    "tag name: \\1\n\\2 is \\3 and \\4 is \\5\n")
print repl

1 Comment

Thanks Amarghosh, very nice solution, but I forgot to mention, that I need it as pattern parameter for gsub... But thx anyway.
0

Most probably you should go with Nokogiri or something similar. I couldn't fit it into one gsub, but it works with two:

>> s = '&lt;block trace="true" name="AssignResources: Append Resources"&gt;'
>> m,r=0,["&lt;blockie ", " tracie=", " namie="]
>> s.gsub(/&lt;.*?([^\s]+)\s/, r[0]).gsub(/\s([^=]+)=/) {|ma| m+=1; r[m]}
=> "&lt;blockie tracie=\"true\" namie=\"AssignResources: Append Resources\"&gt;"
