I have an XML file on my AIX system which has the following tag...

  <g:google_product_category>
Health &amp; Beauty &gt; Personal Care &gt; Cosmetics
    </g:google_product_category>

I am trying to use sed to get the value of this element. So far I can only work out an expression that prints the start tag and end tag...

sed -n '/google_product_category/{s/.*<google_product_category>//
        s/<\/google_product_category.*//;p;}' gpf_20150708063022.xml 

Can someone please help me with this?

    In general, it is not a particularly good idea to parse/process XML with sed. For the limited context you present, it will be OK, but be cautious. Commented Jul 8, 2015 at 3:05

2 Answers

Assuming that text was in a file called test.txt, you could use a combination of tr and sed like so:

cat test.txt | tr '\n' ' ' | \
sed -e 's/<g:google_product_category>\(.*\)<\/g:google_product_category>/\1/g'

Result:
 Health &amp; Beauty &gt; Personal Care &gt; Cosmetics
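
If the file holds more than those three lines, the pipeline above passes the surrounding material through as well. One possible variant, sketched under the assumption that the file contains exactly one such element, prints only the captured text and trims the white space around it. The helper name extract_single and the scratch file are hypothetical, and the sketch assumes a sed that supports the POSIX [[:space:]] class:

```shell
# Hypothetical scratch file with material around the element.
sample="${TMPDIR:-/tmp}/gpc_single.$$"
cat > "$sample" <<'EOF'
<item>
  <g:google_product_category>
Health &amp; Beauty &gt; Personal Care &gt; Cosmetics
    </g:google_product_category>
</item>
EOF

# extract_single is a hypothetical helper name.
# tr joins the file into one line; echo supplies a final newline so that
# sed is handed a complete line.
extract_single() {
    { tr '\n' ' ' < "$1"; echo; } |
    sed -n 's/.*<g:google_product_category>[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*<\/g:google_product_category>.*/\1/p'
}

extract_single "$sample"
```

With -n and the p flag, nothing is printed unless the substitution succeeds, so a file without the element produces no output. With several elements in the file, the greedy .* would keep only the last one, which is why the single-element assumption matters.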

2 Comments

The chances are that there's more in the file than just the three lines, and this is going to leave all the other material around the words that were in the product category tags.
You are right; that might be the case and this solution might not work.
Original data

The original sample data was:

<g:google_product_category>
      Some text on the next line
</g:google_product_category>

For that data, this sed command works:

sed -n '/^<g:google_product_category>/,/^<\/g:google_product_category>/{
        /google_product_category/d; p; }'

Don't print by default. Between the lines matching the start and end tags (where the tags are not indented), if the line matches google_product_category, delete it; else print it.
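
To see the effect of the anchors, here is a minimal sketch that runs the same script on two scratch files, one with the tags in column 1 and one with the tags indented. The file paths and the helper name extract_anchored are hypothetical, chosen only for the demonstration:

```shell
# Hypothetical scratch files; any writable paths would do.
flush="${TMPDIR:-/tmp}/gpc_flush.$$"
indented="${TMPDIR:-/tmp}/gpc_indented.$$"

# Tags start in column 1, as in the original sample.
cat > "$flush" <<'EOF'
<g:google_product_category>
      Some text on the next line
</g:google_product_category>
EOF

# Same content, but with blanks in front of the tags.
cat > "$indented" <<'EOF'
  <g:google_product_category>
      Some text on the next line
  </g:google_product_category>
EOF

# The anchored script, wrapped in a hypothetical helper function.
extract_anchored() {
    sed -n '/^<g:google_product_category>/,/^<\/g:google_product_category>/{
        /google_product_category/d; p; }' "$1"
}

extract_anchored "$flush"      # prints the text line, indentation included
extract_anchored "$indented"   # prints nothing: the carets do not match
```

The second call is silent because neither tag line begins with <, so the range never starts.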

Revised data

Since the question has been revised and the new sample data is:

  <g:google_product_category>
Health &amp; Beauty &gt; Personal Care &gt; Cosmetics
    </g:google_product_category>

with leading blanks on the tag lines (and a horribly sloppy layout to boot), then the carets ^ which anchor the match to the start of the line are not appropriate. A revised script, therefore, is:

sed -n '/<g:google_product_category>/,/<\/g:google_product_category>/{
        /google_product_category/d; p; }'

Don't print by default. Between the lines containing the start and end tags (where the tags may be indented, and may be preceded or followed by arbitrary material, which will be ignored), if the line matches google_product_category, delete it; else print it.

Given a composite and extended data file like this:

<g:google_product_category>
      Some text on the next line
</g:google_product_category>
      <g:google_product_category>
    Health &amp; Beauty &gt; Personal Care &gt; Cosmetics
        </g:google_product_category>

    <g:google_category>
        Garbage, trash, and delectable goodies.
    </g:google_category>

The output from the revised script is:

      Some text on the next line
    Health &amp; Beauty &gt; Personal Care &gt; Cosmetics
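
The output keeps whatever indentation the text lines had. If only the bare value is wanted, one option is to strip leading and trailing white space before printing. This is a sketch rather than a definitive solution: it assumes a sed that supports the POSIX [[:space:]] class, and the helper name extract_trimmed and the scratch file are hypothetical.

```shell
# Hypothetical scratch file holding the composite sample data.
sample="${TMPDIR:-/tmp}/gpc_composite.$$"
cat > "$sample" <<'EOF'
<g:google_product_category>
      Some text on the next line
</g:google_product_category>
      <g:google_product_category>
    Health &amp; Beauty &gt; Personal Care &gt; Cosmetics
        </g:google_product_category>

    <g:google_category>
        Garbage, trash, and delectable goodies.
    </g:google_category>
EOF

# extract_trimmed is a hypothetical helper name.
# Within each start/end tag range: drop the tag lines, trim the rest, print.
extract_trimmed() {
    sed -n '/<g:google_product_category>/,/<\/g:google_product_category>/{
        /google_product_category/d
        s/^[[:space:]]*//
        s/[[:space:]]*$//
        p
    }' "$1"
}

extract_trimmed "$sample"
```

The commands are separated by newlines instead of semicolons, which sidesteps the differences between sed implementations over semicolons before the closing brace. The character entities such as &amp; are left exactly as they appear in the file.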

4 Comments

Hi Jonathan, thanks for your answer. The expression you have given me is not producing any output. To test it further, I took out the d; so that the lines would not be deleted, but without the d; the expression still did not produce any output. I also thought it might have something to do with the colon in the search expression, but changing that made no difference either. Still no output.
Works OK for me on Mac OS X with the native sed, and with GNU sed, using the three line fragment of XML you show in the question. Which shell do you use? Did you really copy'n'paste the code, adding a file name after the second single quote, or did you munge it somehow? There's nothing in the script that requires anything more than basic sed; no extended regular expression, or anything tricky. The second semicolon is needed with BSD sed; it isn't needed with GNU sed. It's difficult to see how it can go wrong, really. Note that I assume the tags are at the beginning of the line.
Try removing the carets ^ and see if that fixes the problem for you. If so, you have blanks or something at the start of the lines. And I see you've changed the sample XML so that the tags are not at the start of the line and the text in the middle is no longer tidy plain text but has XML character entities in it. If you show us an approximation to the real data, you will get an approximation to the real answer.
Yes sorry about that. I should have been more precise.
