Unix sed expression to find xml value

Question

I have an XML file on my AIX system which has the following tag...

  <g:google_product_category>
Health &amp; Beauty &gt; Personal Care &gt; Cosmetics
    </g:google_product_category>

I am trying to use sed to get the value of this element. So far I can only work out an expression that prints the start tag and end tag...

sed -n '/google_product_category/{s/.*<google_product_category>//
        s/<\/google_product_category.*//;p;}' gpf_20150708063022.xml

Can someone please help me with this?

In general, it is not a particularly good idea to parse/process XML with sed. For the limited context you present, it will be OK, but be cautious.
– Jonathan Leffler, Commented Jul 8, 2015 at 3:05

zedfoxus · Accepted Answer · 2015-07-08 03:42:42Z

1

Assuming that text was in a file called test.txt, you could use a combination of tr and sed like so:

tr '\n' ' ' < test.txt | \
sed -e 's/<g:google_product_category>\(.*\)<\/g:google_product_category>/\1/g'

Result:
 Health &amp; Beauty &gt; Personal Care &gt; Cosmetics

answered Jul 8, 2015 at 3:42

zedfoxus


2 Comments

Jonathan Leffler Over a year ago

The chances are that there's more in the file than just the three lines, and this is going to leave all the other material around the words that were in the product category tags.

zedfoxus Over a year ago

You are right; that might be the case and this solution might not work.
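
The limitation discussed in these comments can be reproduced with a slightly larger input. This is only a sketch: the <item> and <title> elements and the temporary file are made up for illustration and are not from the question.

```shell
# Hypothetical input: the element of interest surrounded by other elements.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<item>
<title>Lipstick</title>
<g:google_product_category>
Health &amp; Beauty
</g:google_product_category>
</item>
EOF

# The same tr and sed combination as in the answer.
out=$(tr '\n' ' ' < "$sample" |
      sed -e 's/<g:google_product_category>\(.*\)<\/g:google_product_category>/\1/g')

# The category tags are removed, but <item> and <title> remain in the output.
printf '%s\n' "$out"
rm -f "$sample"
```

Because the substitution only strips the two tags, everything else on the joined line passes through unchanged.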

Jonathan Leffler · Accepted Answer · 2015-07-08 03:53:33Z

0

Original data

The original sample data was:

<g:google_product_category>
      Some text on the next line
</g:google_product_category>

For that data, this sed command works:

sed -n '/^<g:google_product_category>/,/^<\/g:google_product_category>/{
        /google_product_category/d; p; }'

Don't print by default. Between the lines matching the start and end tags (where the tags are not indented), if the line matches google_product_category, delete it; else print it.

Revised data

Since the question has been revised and the new sample data is:

  <g:google_product_category>
Health &amp; Beauty &gt; Personal Care &gt; Cosmetics
    </g:google_product_category>

with leading blanks on the tag lines (and a horribly sloppy layout to boot), then the carets ^ which anchor the match to the start of the line are not appropriate. A revised script, therefore, is:

sed -n '/<g:google_product_category>/,/<\/g:google_product_category>/{
        /google_product_category/d; p; }'

Don't print by default. Between the lines containing the start and end tags (where the tags may be indented, and may be preceded or followed by arbitrary material which will be ignored), if the line matches google_product_category, delete it; else print it.
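
To see the effect of the anchors, both versions can be run against the indented sample from the revised question. A minimal sketch, assuming a temporary file created with mktemp (the variable names are only for illustration):

```shell
# Indented sample, as in the revised question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
  <g:google_product_category>
Health &amp; Beauty &gt; Personal Care &gt; Cosmetics
    </g:google_product_category>
EOF

# Anchored version: the start tag is not at the start of its line,
# so the range never opens and nothing is printed.
anchored=$(sed -n '/^<g:google_product_category>/,/^<\/g:google_product_category>/{
        /google_product_category/d; p; }' "$sample")

# Unanchored version: the tags may appear anywhere on their lines.
unanchored=$(sed -n '/<g:google_product_category>/,/<\/g:google_product_category>/{
        /google_product_category/d; p; }' "$sample")

printf 'anchored:   [%s]\n' "$anchored"
printf 'unanchored: [%s]\n' "$unanchored"
rm -f "$sample"
```

The anchored script prints nothing for this input, while the unanchored one prints the text line between the tags.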

Given a composite and extended data file like this:

<g:google_product_category>
      Some text on the next line
</g:google_product_category>
      <g:google_product_category>
    Health &amp; Beauty &gt; Personal Care &gt; Cosmetics
        </g:google_product_category>

    <g:google_category>
        Garbage, trash, and delectable goodies.
    </g:google_category>

The output from the revised script is:

      Some text on the next line
    Health &amp; Beauty &gt; Personal Care &gt; Cosmetics
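
Note that the range script deletes every line containing google_product_category, so it does not fit a layout where the start tag, the value, and the end tag all sit on one line. That layout is not shown in the question; if the real file looks like that, a plain substitution may be enough. A sketch under that assumption, with made-up sample data:

```shell
# Assumed layout (not from the question): tags and value on a single line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<item>
  <g:google_product_category>Health &amp; Beauty</g:google_product_category>
</item>
EOF

# Keep only the text between the tags; -n plus the p flag prints
# just the lines where the substitution succeeded.
out=$(sed -n 's/.*<g:google_product_category>\(.*\)<\/g:google_product_category>.*/\1/p' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

This uses only basic regular expressions, like the scripts above. It would not handle two such elements on the same line, because .* is greedy.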

edited Jul 8, 2015 at 3:53

answered Jul 8, 2015 at 3:03

Jonathan Leffler


4 Comments

Richie Over a year ago

Hi Jonathan, thanks for your answer. The expression you have given me is not producing any output. To test it further I took out the d; so that the lines would not be deleted, and without the d; the expression also did not produce any output. I also thought it might be something to do with the colon in the search expression, but changing that made no difference either. Still no output.

Jonathan Leffler Over a year ago

Works OK for me on Mac OS X with the native sed, and with GNU sed, using the three line fragment of XML you show in the question. Which shell do you use? Did you really copy'n'paste the code, adding a file name after the second single quote, or did you munge it somehow? There's nothing in the script that requires anything more than basic sed; no extended regular expression, or anything tricky. The second semicolon is needed with BSD sed; it isn't needed with GNU sed. It's difficult to see how it can go wrong, really. Note that I assume the tags are at the beginning of the line.

Jonathan Leffler Over a year ago

Try removing the carets ^ and see if that fixes the problem for you. If so, you have blanks or something at the start of the lines. And I see you've changed the sample XML so that the tags are not at the start of the line and the text in the middle is no longer tidy plain text but has XML character entities in it. If you show us an approximation to the real data, you will get an approximation to the real answer.

Richie Over a year ago

Yes sorry about that. I should have been more precise.
