How do I print the contents of an XML element - from the starting tag to the closing tag - using AWK?

For example, consider the following XML:

<flight>
    <airline>Delta</airline>
    <flightno>22</flightno>
    <origin>Atlanta</origin>
    <destination>Paris</destination>
    <departure>5:40pm</departure>
    <arrival>8:10am</arrival>
</flight>
<city id="AT"> 
       <cityname>Athens</cityname> 
       <state>GA</state>
       <description> Home of the University of Georgia</description>
       <population>100,000</population>
       <location>Located about 60 miles Northeast of Atlanta</location>
       <latitude>33 57' 39" N</latitude>
       <longitude>83 22' 42" W</longitude>
</city>

The desired output would be the contents of the city element, from <city...> to </city>.

2 Answers

Solutions that parse XML with line-oriented tools like awk and sed are imperfect. You cannot rely on XML always having a human-readable layout. For example, some web services omit newlines, so the entire XML document arrives on a single line.
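To make the single-line problem concrete, here is a small sketch. The one-line sample document and the variable names are made up for illustration; it assumes only a POSIX shell and any awk.

```shell
# Illustrative one-line document (made-up sample, not the asker's file)
xml='<data><city id="AT"><cityname>Athens</cityname></city><flight><airline>Delta</airline></flight></data>'

# A line-oriented range pattern can only print whole lines, so the match
# on <city ...> drags the rest of the document along with it.
result=$(printf '%s\n' "$xml" | awk '/<city[ >]/,/<\/city>/')
printf '%s\n' "$result"
```

The flight element is printed too, because awk sees the whole document as one record.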

I would recommend using xmllint, which can select nodes using XPath, a query language designed for XML.

The following command selects all city elements:

xmllint --xpath "//city" data.xml

XPath is extremely useful. It makes every part of the XML document addressable:

xmllint --xpath "string(//city[1]/@id)" data.xml

This returns the string "AT".
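As a further sketch of the same idea, an XPath predicate can pick an element by attribute value. The helper name city_name is hypothetical, and the expression assumes the city/cityname layout of the data.xml shown in this answer.

```shell
# city_name is a hypothetical helper: print the <cityname> of the city
# whose id attribute equals $1, reading the XML file given as $2.
# The id is pasted into the XPath expression, so it must not contain quotes.
city_name() {
    xmllint --xpath "string(//city[@id='$1']/cityname)" "$2"
}
```

With the data.xml from this answer, `city_name DUB data.xml` should print Dublin.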

Poorly formatted XML data

This time, return only the first city element. xmllint can also pretty-print the result:

$ xmllint --xpath "//city[1]" data.xml | xmllint --format -
<?xml version="1.0"?>
<city id="AT">
  <cityname>Athens</cityname>
  <state>GA</state>
  <description> Home of the University of Georgia</description>
  <population>100,000</population>
  <location>Located about 60 miles Northeast of Atlanta</location>
  <latitude>33 57' 39" N</latitude>
  <longitude>83 22' 42" W</longitude>
</city>

data.xml

In this sample data, the first city element appears entirely on one line. This is still valid XML.

<data>
  <flight>
    <airline>Delta</airline>
    <flightno>22</flightno>
    <origin>Atlanta</origin>
    <destination>Paris</destination>
    <departure>5:40pm</departure>
    <arrival>8:10am</arrival>
  </flight>
  <city id="AT"> <cityname>Athens</cityname> <state>GA</state> <description> Home of the University of Georgia</description> <population>100,000</population> <location>Located about 60 miles Northeast of Atlanta</location> <latitude>33 57' 39" N</latitude> <longitude>83 22' 42" W</longitude> </city>
  <city id="DUB">
    <cityname>Dublin</cityname>
    <state>Dub</state>
    <description> Dublin</description>
    <population>1,500,000</population>
    <location>Ireland</location>
    <latitude>NA</latitude>
    <longitude>NA</longitude>
  </city>
</data>

1 Comment

The only issue with this is that you have to fix the XML before you can get data out. For example, I have a stack of files with no quotes around attribute values. Since xmllint's job is to find issues, it goes nuts.

$ awk -v tag='city' '$0~"^<"tag"\\>"{inTag=1} inTag; $0~"^</"tag">"{inTag=0}' file
<city id="AT">
       <cityname>Athens</cityname>
       <state>GA</state>
       <description> Home of the University of Georgia</description>
       <population>100,000</population>
       <location>Located about 60 miles Northeast of Atlanta</location>
       <latitude>33 57' 39" N</latitude>
       <longitude>83 22' 42" W</longitude>
</city>

The command above relies on GNU awk for the \> word-boundary operator. With other awks, use a bracket expression such as [^[:alnum:]_] or similar in its place.
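A minimal sketch of that portable variant follows. Instead of [^[:alnum:]_] it uses a slightly stricter class (whitespace, "/" or ">" after the tag name), which is my own substitution rather than the answer's exact suggestion; the helper name print_element is hypothetical. Like the original, it assumes the opening and closing tags each start a line.

```shell
# print_element is a hypothetical helper: $1 is the tag name, $2 the file.
print_element() {
    awk -v tag="$1" '
        # Opening tag at line start, then whitespace, "/" or ">"
        $0 ~ ("^<" tag "[ \t/>]") { inTag = 1 }
        # Print every line while inside the element
        inTag
        # Closing tag at line start ends the element
        $0 ~ ("^</" tag ">") { inTag = 0 }
    ' "$2"
}
```

Usage: `print_element city file`. Because the character after the name must be whitespace, "/" or ">", a line starting with <cityname> does not count as an opening <city> tag.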

To only print the first occurrence:

$ awk -v tag='city' '$0~"^<"tag"\\>"{inTag=1} inTag{print; if ($0~"^</"tag">") exit}' file
<city id="AT">
       <cityname>Athens</cityname>
       <state>GA</state>
       <description> Home of the University of Georgia</description>
       <population>100,000</population>
       <location>Located about 60 miles Northeast of Atlanta</location>
       <latitude>33 57' 39" N</latitude>
       <longitude>83 22' 42" W</longitude>
</city>
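The line-anchored patterns above assume each tag starts its own line. For input where that does not hold (indented tags, or a whole element on one line), one possible workaround is to read the entire input into a string and cut out the first element by position. This is only a sketch under stated limits, not a real XML parser; the helper name first_element is hypothetical.

```shell
# first_element is a hypothetical helper: $1 is the tag name, $2 the file.
# Limits: no nested same-name elements, no self-closing tags, no comments/CDATA.
first_element() {
    awk -v tag="$1" '
        # Collect the whole input into one string
        { doc = doc $0 "\n" }
        END {
            # Find the first opening tag: name followed by whitespace, "/" or ">"
            if (match(doc, "<" tag "[ \t\n/>]")) {
                rest = substr(doc, RSTART)
                closing = "</" tag ">"
                stop = index(rest, closing)
                # Print from the opening tag through the first closing tag
                if (stop) print substr(rest, 1, stop + length(closing) - 1)
            }
        }
    ' "$2"
}
```

Usage: `first_element city file`. It holds the whole file in memory, which is fine for small documents but not for very large ones.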

2 Comments

If there are two cities, it will print both. I only want the first.
There is a trivial tweak for that, but if the question, the representative sample input, and the expected output you posted do not actually reflect what you want, then update your question appropriately so we're not just spinning our wheels trying to guess what your next requirement change might be.
