Grep for multiple patterns in a file

Question

I'd like to count the number of XML nodes in my XML file (with grep or some other tool).

....
<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
...
<countryCode>CAN</countryCode>
<someNode>USA</someNode>
<countryCode>CAN</countryCode>
<someNode>Otherone</someNode>
<countryCode>GBR</countryCode>
...

How can I get a count for each individual country, like CAN = 3, USA = 1, GBR = 2, without passing in the names of the countries? There might be more countries than these.

Update:

There are other nodes besides countryCode.

Do you know that each line contains exactly one XML element? There are no lines with two elements? No elements that span multiple lines? Do you know that all equivalent country codes are on identical lines?
– Robᵩ, Commented Mar 6, 2012 at 16:27

FatalError · Accepted Answer · 2012-03-06 16:12:40Z

My simple suggestion would be to use sort and uniq -c:

$ echo '<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>GBR</countryCode>' | sort | uniq -c
      3 <countryCode>CAN</countryCode>
      2 <countryCode>GBR</countryCode>
      1 <countryCode>USA</countryCode>

Here you'd pipe in the output of your grep instead of the echo. A more robust solution would be to use XPath. If your XML file looks like this:

<countries>
  <countryCode>GBR</countryCode>
  <countryCode>USA</countryCode>
  <countryCode>CAN</countryCode>
  <countryCode>CAN</countryCode>
  <countryCode>CAN</countryCode>
  <countryCode>GBR</countryCode>
</countries>

Then you could use:

$ xpath -q -e '/countries/countryCode/text()'  countries.xml  | sort | uniq -c
      3 CAN
      2 GBR
      1 USA

I say it's more robust because tools designed for parsing flat text are inherently flaky when dealing with XML. Depending on the structure of the original XML file, a different XPath query might work better, such as this one, which matches countryCode elements anywhere in the document:

$ xpath -q -e '//countryCode/text()'  countries.xml  | sort | uniq -c
      3 CAN
      2 GBR
      1 USA

Kevin · Accepted Answer · 2012-03-06 16:26:52Z

2

grep can give a total count, but it doesn't give a per-pattern count; for that you should use uniq -c:

$ uniq -c <(sort file)
  1 
  1  
  3 <countryCode>CAN</countryCode>
  2 <countryCode>GBR</countryCode>
  1 <countryCode>USA</countryCode>

If you want to get rid of the empty lines and tags, add sed:

$ sed -e '/^[[:space:]]*$/d' -e 's/<.*>\([A-Z]*\)<.*>/\1/g' test | sort | uniq -c
  3 CAN
  2 GBR
  1 USA

To delete lines that don't have a country code, add another command to sed:

$ sed -e '/countryCode/!d' -e '/^[[:space:]]*$/d' -e 's/<.*>\([A-Z]*\)<.*>/\1/g' test | sort | uniq -c
  3 CAN
  2 GBR
  1 USA
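The question asks for output in the form CAN = 3, whereas uniq -c prints the count first. A minimal sketch of one way to swap the columns, reusing the sed expressions above and adding an awk step at the end. The function name count_codes is a hypothetical helper introduced here for illustration, and the sketch assumes one element per line with consistent indentation:

```shell
# count_codes (hypothetical helper): read XML-like lines on stdin and print
# one "CODE = COUNT" line per distinct country code.
count_codes() {
  sed -e '/countryCode/!d' -e 's/<.*>\([A-Z]*\)<.*>/\1/g' |
    sort | uniq -c |
    awk '{ print $2 " = " $1 }'   # uniq -c prints "COUNT CODE"; swap the columns
}

# Sample data taken from the question, including the unrelated <someNode> lines.
count_codes <<'EOF'
<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<someNode>USA</someNode>
<countryCode>CAN</countryCode>
<someNode>Otherone</someNode>
<countryCode>GBR</countryCode>
EOF
```

With this sample the result should be CAN = 3, GBR = 2 and USA = 1, one per line; the USA inside someNode is dropped by the first sed expression.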

edited Mar 6, 2012 at 16:26

answered Mar 6, 2012 at 16:11

Kent · Accepted Answer · 2012-03-06 16:09:57Z

1

Quick and dirty (based only on your example text):

awk -F'>|<' '{a[$3]++;}END{for(x in a)print x,a[x]}' file

test:

kent$  cat t.txt
<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>GBR</countryCode>

kent$  awk -F'>|<' '{a[$3]++;}END{for(x in a)print x,a[x]}' t.txt 
USA 1
GBR 2
CAN 3
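Since the question's update mentions other nodes besides countryCode, the one-liner above would count those values too. A hedged variant, not part of the original answer, that checks the tag name before counting. It assumes one element per line and no attributes on the tag; count_by_tag is a hypothetical function name:

```shell
# count_by_tag (hypothetical helper): read XML-like lines on stdin and count
# the values of <countryCode> elements only.
count_by_tag() {
  awk -F'[<>]' '
    # With "<" and ">" as separators, $2 is the tag name and $3 is its text.
    $2 == "countryCode" { count[$3]++ }
    END { for (code in count) print code, count[code] }
  ' | sort   # awk for-in order is unspecified, so sort for stable output
}

count_by_tag <<'EOF'
<countryCode>GBR</countryCode>
<countryCode>CAN</countryCode>
<someNode>USA</someNode>
  <countryCode>CAN</countryCode>
EOF
```

Leading indentation ends up in $1, so indented lines are counted together with unindented ones.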

answered Mar 6, 2012 at 16:09

ks1322 · Accepted Answer · 2012-03-06 16:15:31Z

1

sed -n 's/.*<countryCode>\(.*\)<\/countryCode>.*/\1/p' file.xml | sort | uniq -c

answered Mar 6, 2012 at 16:15

Teja · Accepted Answer · 2012-03-06 16:11:37Z

0

cat dummy | sort |cut -c14-16 | sort |tail -6 |awk  '{col[$1]++} END {for (i in col) print i, col[i]}'

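Here dummy is your file name; replace the 6 in -6 with n-2, where n is the number of lines in your data file. Note that cut -c14-16 assumes each <countryCode> line starts in column 1 and holds a three-letter code.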

answered Mar 6, 2012 at 16:11

ebutusov · Accepted Answer · 2012-03-06 16:12:26Z

0

Something like this maybe:

grep -e 'regex' file.xml | sort | uniq -c

Of course, you need to provide a regex that matches your needs.
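As one possible regex for the data in the question, here is a sketch that matches a whole countryCode element and prints only the matched part. It relies on grep -o, which GNU and BSD grep both support but POSIX does not require. The function name count_matches is hypothetical:

```shell
# count_matches (hypothetical helper): count identical <countryCode> elements
# in the given files, or on stdin when no file is given.
count_matches() {
  # -o prints each match on its own line; [^<]* keeps a match inside one element.
  grep -o -E '<countryCode>[^<]*</countryCode>' "$@" | sort | uniq -c
}

count_matches <<'EOF'
<countryCode>GBR</countryCode>
<someNode>USA</someNode>
  <countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
EOF
```

Because only the matched text is printed, differences in indentation do not split the counts, and lines holding other elements produce no output.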

answered Mar 6, 2012 at 16:12

Zsolt Botykai · Accepted Answer · 2012-03-06 16:14:20Z

0

If your file is laid out as you have shown us, awk can do it like this:

awk -F '</?countryCode>' 'NF > 1 { a[$2]++ } END { for (e in a) { printf("%s\t%d\n", e, a[e]) } }' INPUTFILE

If there is more than one <countryCode> tag on a line, you can still set up a pipe that puts each tag on its own line, e.g.:

sed 's/<countryCode>/\n<countryCode>/g' INPUTFILE | awk ...

Note that if a <countryCode> element spans multiple lines, this does not work as expected.
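If the layout cannot be trusted, one text-only workaround is to make awk split records on "<" instead of on newlines, so that line breaks stop mattering. This is only a sketch and still not a real XML parser: it assumes the tag has no attributes and that there are no comments or CDATA sections. The function name count_any_layout is hypothetical:

```shell
# count_any_layout (hypothetical helper): count <countryCode> values on stdin
# regardless of how the elements are spread across lines.
count_any_layout() {
  awk -v RS='<' -F'>' '
    # Each record is the text after one "<": for an opening tag, $1 is the
    # tag name and $2 is the text that follows it.
    $1 == "countryCode" {
      code = $2
      gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "", code)   # trim surrounding whitespace
      count[code]++
    }
    END { for (c in count) print c, count[c] }
  ' | sort   # awk for-in order is unspecified, so sort for stable output
}

# Two elements on one line, plus one element spread over three lines.
printf '<countryCode>GBR</countryCode><countryCode>CAN</countryCode>\n<countryCode>\n  CAN\n</countryCode>\n' |
  count_any_layout
```

A single-character RS is portable across awk implementations, which is why the split is on "<" rather than on a multi-character pattern.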

Anyway, I'd recommend using XPath for this kind of task (Perl's XML::XPath module has a CLI utility for this).

answered Mar 6, 2012 at 16:14

Timothy Martens · Accepted Answer · 2012-03-06 18:47:12Z

0

Quick and simple:

grep countryCode ./file.xml | sort | uniq -c

answered Mar 6, 2012 at 18:47
