
I have an XML file that is too big. To make it smaller, I want to replace all tag and attribute names with shorter versions.

So, I implemented this:

string.gsub!(/<(\w+) /) do |match|
    case match
    when 'Image' then 'Img'
    when 'Text'  then 'Txt'
    end
end

puts string

which deletes all opening tags but does not do much else.

What am I doing wrong here?

  • What am I doing wrong here? Snide, but serious answer 1: not using an XML processor. Snide, but serious answer 2: two problems. Snide, but serious answer 3: those changes are likely going to have a very small decrease in size. Consider a container (gzip) or binary XML-compressor, if really needed. Happy coding. Commented Dec 15, 2010 at 18:53
  • @pst: Right you are, sir. Still, I need this script not only for XML but other (partly custom) formats, too, so an XML processor won't cut it. An even more correct remark would actually be "4: using XML in the first place". Something like JSON would solve all my problems in a pinch, but when I proposed this, my bosses rejected it. Sad, but true. Commented Dec 16, 2010 at 9:14
  • How could I forget #4? :( Happy coding within boss-confinements. Commented Dec 16, 2010 at 17:42

3 Answers

Here's another way:

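class String
  # Rename tags in place, driven by a single table of long => short names.
  def minimize_tags!
    {"image" => "img", "text" => "txt"}.each do |from, to|
      gsub!(/<#{from}\b/i, "<#{to}")     # opening tags; \b leaves e.g. <imagemap> alone
      gsub!(/<\/#{from}>/i, "</#{to}>")  # closing tags
    end
    self
  end
end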

This will probably be a little easier to maintain, since the replacement patterns are all in one place. And on strings of any significant size, it may be a lot faster than Kevin's way (the block-based gsub! answer). I did a quick speed test of these two methods using the HTML source of this Stack Overflow page itself as the test string, and my way was about 6x faster...
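
For anyone who wants to repeat that kind of comparison, here is a rough sketch using Ruby's standard Benchmark module. It is not the test I actually ran: the `SAMPLE` string and the two helper names (`minimize_with_table`, `minimize_with_block`) are made up for illustration, and the timings will depend heavily on the input, so don't expect to see exactly the same ratio.

```ruby
require 'benchmark'

# Hypothetical input: a small XML fragment repeated many times.
SAMPLE = ('<Image ImagePath="a">x</Image><Text TextSize="9">y</Text>' * 2_000).freeze

# Table-driven approach: one pair of gsub calls per tag name.
def minimize_with_table(str)
  { "image" => "img", "text" => "txt" }.each do |from, to|
    str = str.gsub(/<#{from}\b/i, "<#{to}")
    str = str.gsub(/<\/#{from}>/i, "</#{to}>")
  end
  str
end

# Block-based approach: a single gsub that visits every tag in the string.
def minimize_with_block(str)
  str.gsub(/(<\/?)(\w+)/) do |match|
    prefix = $1 # save first: the `when` regexps below overwrite $1
    case $2
    when /^image$/i then "#{prefix}img"
    when /^text$/i  then "#{prefix}txt"
    else match
    end
  end
end

Benchmark.bm(8) do |bm|
  bm.report("table:") { minimize_with_table(SAMPLE) }
  bm.report("block:") { minimize_with_block(SAMPLE) }
end
```

Both helpers return a new string instead of modifying the receiver, so each run starts from the same input.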

1 Comment

Yeah, this is better than mine.

Here's the beauty of using a parser such as Nokogiri: it lets you manipulate selected tags (nodes) and their attributes:

require 'nokogiri'

xml = <<EOT
<xml>
  <Image ImagePath="path/to/image">image comment</Image>
  <Text TextFont="courier" TextSize="9">this is the text</Text>
</xml>
EOT

doc = Nokogiri::XML(xml)
doc.search('Image').each do |n| 
  n.name = 'img' 
  n.attributes['ImagePath'].name = 'path'
end
doc.search('Text').each do |n| 
  n.name = 'txt'
  n.attributes['TextFont'].name = 'font'
  n.attributes['TextSize'].name = 'size'
end
print doc.to_xml
# >> <?xml version="1.0"?>
# >> <xml>
# >>   <img path="path/to/image">image comment</img>
# >>   <txt font="courier" size="9">this is the text</txt>
# >> </xml>

If you need to iterate through every node, maybe to do a universal transformation on the tag-name, you can use doc.search('*').each. That would be slower than searching for individual tags, but might result in less code if you need to change every tag.

The nice thing about using a parser is it'll work even if the layout of the XML changes since it doesn't care about whitespace, and will work even if attribute order changes, making your code more robust.

2 Comments

While this is a very nice solution indeed, I actually want to transform not only tag names and attributes, but selected strings, too. So, sadly, this solution won't work for me.
@BastiBechtold, "but selected strings, too. So, sadly, this solution won't work for me." Only because you don't know how to do it and because you didn't say that was what you wanted to do in your question. It's actually doable with a parser in a very similar way to what I already demonstrated, because "text nodes" exist, are accessible and changeable. I wrote an answer yesterday doing just that.

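The block receives the whole matched text (for example "<Image "), not the captured group, so none of your when branches match, the case returns nil, and gsub! replaces the tag with an empty string. Use the capture globals and return the original match in the else branch. Try this: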

string.gsub!(/(<\/?)(\w+)/) do |match|
  tag_mark = $1
  case $2
  when /^image$/i
    "#{tag_mark}Img"
  when /^text$/i
    "#{tag_mark}Txt"
  else
    match
  end
end  
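
A quick way to check the pattern, as a sketch: the `shorten` wrapper and the sample string are made up for illustration. Because `\w+` captures the whole tag name and the `when` patterns are anchored, longer names such as `Textarea` pass through unchanged. `$1` has to be saved before the `case`, since each `when /.../` test performs a new match and resets the capture globals.

```ruby
# Wrap the substitution in a method so it can be tried on any string.
def shorten(string)
  string.gsub(/(<\/?)(\w+)/) do |match|
    tag_mark = $1 # "<" or "</"; saved before the `when` regexps overwrite $1
    case $2
    when /^image$/i then "#{tag_mark}Img"
    when /^text$/i  then "#{tag_mark}Txt"
    else match      # any other tag name is returned untouched
    end
  end
end

puts shorten('<Image Path="a">x</Image><Textarea>y</Textarea><text/>')
# prints: <Img Path="a">x</Img><Textarea>y</Textarea><Txt/>
```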

5 Comments

Closing tags won't have a space after the tagname, so this attempt to match both opening and closing tags won't work as written...
Thank you @glenn, I've realized the space is not a typo. I made an update to my code.
No, you can't just take the space out, unless you know that there are no longer tags that start with these. E.g. TEXTAREA or IMAGEMAP will get screwed up by your code now.
OK, @glenn, I made a better update. Thank you for checking. I feel like a junior student submitting his homework to the teacher :) I can't imagine there are more scenarios that the regexp misses.
Yes, that looks like it'll work. Seems a little ugly, though: two finicky lines of code and an extra regexp evaluation for each tag you want to change. Maybe it's better to step back from the original question's premise of using the block argument to gsub, and turn the solution around. I'll add an alternative like that as another answer...
