How to replace every occurrence of a pattern in a string using Ruby?

Question

I have an XML file which is too big. To make it smaller, I want to replace all tags and attribute names with shorter versions of the same thing.

So, I implemented this:

string.gsub!(/<(\w+) /) do |match|
    case match
    when 'Image' then 'Img'
    when 'Text'  then 'Txt'
    end
end

puts string

which deletes all opening tags but does not do much else.

What am I doing wrong here?

"What am I doing wrong here?" Snide but serious answer 1: not using an XML processor. Snide but serious answer 2: two problems. Snide but serious answer 3: those changes are likely to yield only a very small decrease in size. Consider a container (gzip) or a binary XML compressor, if really needed. Happy coding.
– user166390, Commented Dec 15, 2010 at 18:53
@pst: Right you are, sir. Still, I need this script not only for XML but for other (partly custom) formats, too, so an XML processor won't cut it. An even more correct remark would actually be "4: using XML in the first place". Something like JSON would solve all my problems in a pinch, but when I proposed this, my bosses rejected it. Sad, but true.
– bastibe, Commented Dec 16, 2010 at 9:14
How could I forget #4? :( Happy coding within boss-confinements.
– user166390, Commented Dec 16, 2010 at 17:42

glenn mcdonald · Accepted Answer · 2010-12-15 18:49:40Z

2

Here's another way:

class String
  def minimize_tags!
    {"image" => "img", "text" => "txt"}.each do |from,to|
      gsub!(/<#{from}\b/i,"<#{to}")
      gsub!(/<\/#{from}>/i,"<\/#{to}>")
    end
    self
  end
end

This will probably be a little easier to maintain, since the replacement patterns are all in one place. And on strings of any significant size, it may be a lot faster than Kevin's way. I did a quick speed test of these two methods using the HTML source of this Stack Overflow page itself as the test string, and my way was about 6x faster...
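As a rough illustration of the table-driven approach, here is a minimal sketch that does the same thing without reopening String. The names `TAG_MAP` and `minimize_tags` and the sample input are assumptions made up for this example, not part of the answer above.

```ruby
# Hypothetical table of long => short tag names, kept in one place.
TAG_MAP = { "image" => "img", "text" => "txt" }.freeze

# Returns a new string with matching opening and closing tags renamed.
def minimize_tags(source, tag_map = TAG_MAP)
  tag_map.reduce(source) do |result, (from, to)|
    name = Regexp.escape(from)
    result
      .gsub(/<#{name}\b/i, "<#{to}")      # opening tags: <Image ...> or <Image>
      .gsub(%r{</#{name}>}i, "</#{to}>")  # closing tags: </Image>
  end
end

sample = '<Image ImagePath="a.png">hi</Image><ImageMap/><Text>t</Text>'
puts minimize_tags(sample)
# prints: <img ImagePath="a.png">hi</img><ImageMap/><txt>t</txt>
```

The `\b` keeps a longer name such as ImageMap from being rewritten. Note that `\b` also matches before characters like `-`, `.` and `:`, which are legal inside XML names, so a tag such as `<Image-Set>` would still be caught by this pattern.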

answered Dec 15, 2010 at 18:49

glenn mcdonald


1 Comment

Kevin Over a year ago

Yeah, this is better than mine.

the Tin Man · Accepted Answer · 2010-12-16 04:44:10Z

2

Here's the beauty of using a parser such as Nokogiri: it lets you manipulate selected tags (nodes) and their attributes:

require 'nokogiri'

xml = <<EOT
<xml>
  <Image ImagePath="path/to/image">image comment</Image>
  <Text TextFont="courier" TextSize="9">this is the text</Text>
</xml>
EOT

doc = Nokogiri::XML(xml)
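# Rename each Image element and its ImagePath attribute.
doc.search('Image').each do |n|
  n.name = 'img'
  n.attributes['ImagePath'].name = 'path'
end
# Rename each Text element and its font and size attributes.
doc.search('Text').each do |n|
  n.name = 'txt'
  n.attributes['TextFont'].name = 'font'
  n.attributes['TextSize'].name = 'size'
end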
print doc.to_xml
# >> <?xml version="1.0"?>
# >> <xml>
# >>   <img path="path/to/image">image comment</img>
# >>   <txt font="courier" size="9">this is the text</txt>
# >> </xml>

If you need to iterate through every node, maybe to do a universal transformation on the tag-name, you can use doc.search('*').each. That would be slower than searching for individual tags, but might result in less code if you need to change every tag.

The nice thing about using a parser is it'll work even if the layout of the XML changes since it doesn't care about whitespace, and will work even if attribute order changes, making your code more robust.

edited Dec 16, 2010 at 4:44

answered Dec 16, 2010 at 4:28

the Tin Man


2 Comments

bastibe Over a year ago

While this is a very nice solution indeed, I actually want to transform not only tag names and attributes, but selected strings, too. So, sadly, this solution won't work for me.

the Tin Man Over a year ago

@BastiBechtold, "but selected strings, too. So, sadly, this solution won't work for me." Only because you don't know how to do it and because you didn't say that was what you wanted to do in your question. It's actually doable with a parser in a very similar way to what I already demonstrated, because "text nodes" exist, are accessible and changeable. I wrote an answer yesterday doing just that.

Kevin · Accepted Answer · 2010-12-15 17:51:52Z

1

Try this:

string.gsub!(/(<\/?)(\w+)/) do |match|
  tag_mark = $1
  case $2
  when /^image$/i
    "#{tag_mark}Img"
  when /^text$/i
    "#{tag_mark}Txt"
  else
    match
  end
end
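For context on why the code in the question deleted the opening tags: the block passed to gsub receives the entire matched text (for example `"<Image "`), not the capture group. No `when` branch matches, the `case` evaluates to nil, and gsub substitutes an empty string. The sketch below shows this on a made-up sample string, followed by a version that reads the capture group and falls back to the original text.

```ruby
source = '<Image ImagePath="a.png">hi</Image>'

# The block argument is the whole match, including "<" and the trailing space.
seen = []
broken = source.gsub(/<(\w+) /) do |match|
  seen << match
  case match
  when 'Image' then 'Img'
  when 'Text'  then 'Txt'
  end # no branch matches, so the block returns nil
end

p seen    # => ["<Image "]
p broken  # => "ImagePath=\"a.png\">hi</Image>"

# Compare against the capture group and keep unmatched tags as they are.
fixed = source.gsub(/<(\w+) /) do |match|
  case $1
  when 'Image' then '<Img '
  when 'Text'  then '<Txt '
  else match
  end
end

p fixed   # => "<Img ImagePath=\"a.png\">hi</Image>"
```

The corrected version still only touches opening tags that are followed by a space, which is the limitation the answer's `(<\/?)(\w+)` pattern addresses.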

edited Dec 15, 2010 at 17:51

answered Dec 15, 2010 at 12:38

Kevin


5 Comments

glenn mcdonald Over a year ago

Closing tags won't have a space after the tagname, so this attempt to match both opening and closing tags won't work as written...

Kevin Over a year ago

Thank you @glenn, I've realized the space is not a typo. I made an update to my code.

glenn mcdonald Over a year ago

No, you can't just take the space out, unless you know that there are no tags with longer names that start with these. E.g. TEXTAREA or IMAGEMAP will get screwed up by your code now.

Kevin Over a year ago

OK, @glenn, I made a better update. Thank you for checking. I feel like a junior student submitting his homework to the teacher :) I can't imagine there are more scenarios the regexp misses.

glenn mcdonald Over a year ago

Yes, that looks like it'll work. Seems a little ugly, though: two finicky lines of code and an extra regexp evaluation for each tag you want to change. Maybe it's better to step back from the original question's premise of using the block argument to gsub, and turn the solution around. I'll add an alternative like that as another answer...
