
Using Bash, I want to take a list of email addresses from a CSV file, run a recursive grep for each one across a bunch of directories (matching only specific metadata XML files), and tally up how many results I find for each address throughout the directory tree (i.e. updating the tally field in the same CSV file).

accounts.csv looks something like this:

(updated to more accurately reflect real-world data)
email,date,bar,URL,"something else",tally
[email protected],21/04/2015,1.2.3.4,https://blah.com/,"blah blah",5
[email protected],17/06/2015,5.6.7.8,https://blah.com/,"lah yah",0
[email protected],7/08/2017,9.10.11.12,https://blah.com/,"wah wah",1

For example, if we put [email protected] in $email from the list, run

grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l

on it and then add that result to the tally column.

At the moment I can get the first column of that CSV file (minus the heading/first line) using

awk -F"," '{print $1}' accounts.csv | tail -n +2

but I'm lost on how to do the looping, and also on writing the result back to the CSV file...

So for instance, with [email protected] if we run

grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l

and the result is say 17, how can I update that line to become:

[email protected],7/08/2017,9.10.11.12,https://blah.com/,"wah wah",17

Is this possible with maybe awk or sed?

This is where I'm up to:

#!/bin/bash

# make temporary list of email addresses
awk -F"," '{print $1}' accounts.csv | tail -n +2 > emails.tmp

# loop over each
while read email; do
    # count how many uploads for current email address
    grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
done < emails.tmp

XML Metadata looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<metadata>
  <identifier>SomeTitleNameGoesHere</identifier>
  <mediatype>audio</mediatype>
  <collection>opensource_movies</collection>
  <description>example &lt;br /&gt;</description>
  <subject>testing</subject>
  <title>Some Title Name Goes Here</title>
  <uploader>[email protected]</uploader>
  <addeddate>2017-05-28 06:20:54</addeddate>
  <publicdate>2017-05-28 06:21:15</publicdate>
  <curation>[curator][email protected][/curator][date]20170528062151[/date][comment]checked for malware[/comment]</curation>
</metadata>
  • The domain-part of an email address may contain a comma (see here), so I don't think you can simply use awk with a comma as field separator. Commented Aug 6, 2021 at 5:24
  • That's a pretty epic edge-case, and doesn't apply to my data, but sure. :+1: Commented Aug 6, 2021 at 7:00

4 Answers


how to do the looping and also the writing of the result back to the CSV file

awk does the looping automatically. You can change any field by assigning to it, so to change the tally field (the 6th in each line) you would write $6 = ....
awk is a great tool for many scenarios. You can probably save a lot of time in the future by investing a few minutes in a short tutorial now.
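For example, here is a minimal, self-contained sketch of field assignment (the CSV row is made up):

```shell
# Build a throwaway CSV, then overwrite the 6th field of every data row.
printf 'email,date,bar,URL,desc,tally\nfoo@example.com,21/04/2015,1.2.3.4,https://blah.com/,"blah",5\n' > demo.csv
awk -F, -v OFS=, 'NR > 1 { $6 = 17 } 1' demo.csv
```

The trailing 1 is awk shorthand for "print the (possibly modified) line"; -F sets the input separator and OFS the output separator, so the rebuilt line keeps its commas.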

The only non-trivial part is getting the output of grep into awk.

The following script sets each tally to the count of *_meta.xml files containing the given email address:

awk -F, -v OFS=, -v q=\' 'NR>1 {
  cmd = "grep -rlFw " q $1 q " --include=\\*_meta.xml | wc -l";
  cmd | getline c;
  close(cmd);
  $6 = c
} 1' accounts.csv 

For simplicity we assume that filenames are free of linebreaks and email addresses are free of '. To reduce possible false positives, I also added the -F and -w options to your grep command.

  • -F searches literal strings; without it, searching for a.b@c would give false positives for things like axb@c and a-b@c.
  • -w matches only whole words; without it, searching for b@c would give a false positive for ab@c. This isn't 100% safe, as a-b@c would still give a false positive, but without knowing more about the structure of your xml files we cannot fix this.
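A quick way to see both flags in action (near_miss.txt and the addresses are made up for the demo):

```shell
# 'axb@c.com' is not the address we want, but an unescaped dot matches the 'x'.
printf 'uploader: axb@c.com\n' > near_miss.txt
grep -c 'a.b@c.com' near_miss.txt           # regex: dot matches 'x', 1 line counted
grep -cFw 'a.b@c.com' near_miss.txt || true # literal whole-word: 0 lines counted
```

(The || true only keeps the demo going, since grep exits non-zero when nothing matches.)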

4 Comments

I see this successfully computes the tallies, but does it also write those changes to accounts.csv? It didn't work for me. Old data stayed the same.
No, it just prints the updated content. Either write that output to a new file, awk -F, ... accounts.csv > newAccounts.csv, which you then rename, mv newAccounts.csv accounts.csv, or use GNU awk's in-place option: gawk -i inplace -F, ... accounts.csv
PS Do you have any good awk tutorials to recommend?
No, sorry. I don't know which tutorials are good.

A pipeline to reduce the number of greps:

grep -rHo --include=\*_meta.xml -f <(awk -F, 'NR > 1 {print $1}' accounts.csv) \
| gawk -F, -v OFS=',' '
    NR == FNR {
      # store the filenames for each email
      if (match($0, /^([^:]+):(.+)/, m)) tally[m[2]][m[1]]
      next
    }
    FNR > 1 {$6 = length(tally[$1])}
    1
  ' - accounts.csv
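If it helps to see what the gawk stage is parsing: grep -rHo emits one filename:match pair per hit. A minimal sketch, with a hypothetical directory tree and address:

```shell
# Two metadata files containing the same (made-up) address.
mkdir -p demo_tree/a demo_tree/b
printf '<uploader>jo@example.com</uploader>\n' > demo_tree/a/x_meta.xml
printf '<uploader>jo@example.com</uploader>\n' > demo_tree/b/y_meta.xml
printf 'jo@example.com\n' > patterns.txt
grep -rHo --include=\*_meta.xml -f patterns.txt demo_tree
```

Each output line looks like demo_tree/a/x_meta.xml:jo@example.com, which is exactly what the match() call above splits into filename and address.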

2 Comments

Brilliant! Just make sure the assignment targets the tally column: $6 = length(tally[$1]).

Here is a solution using a single awk command. It should be highly performant compared to the other solutions because it scans each XML file only once for all the email addresses found in the first column of the CSV file. It also doesn't invoke any external command or spawn a subshell anywhere.

This should work in any version of awk.

cat srch.awk

# function to escape regex meta characters
function esc(s,      tmp) {
   tmp = s
   gsub(/[&+.]/, "\\\\&", tmp)
   return tmp
}
BEGIN {FS=OFS=","}
# while processing csv file
NR == FNR {
   # save escaped email address in array em skipping header row
   if (FNR > 1)
      em[esc($1)] = 0
   # save each row in rec array
   rec[++n] = $0
   next
}
# this block will execute for each XML file
{
   # loop over each email and save the count of matches in array em
   # NB: gsub returns the number of substitutions made
   for (i in em)
      em[i] += gsub(i, "&")
}
END {
   # print header row
   print rec[1]
   # from 2nd row onwards split row into columns using comma
   for (i=2; i<=n; ++i) {
      split(rec[i], a, FS)
      # 6th column is the count of occurrence from array em
      print a[1], a[2], a[3], a[4], a[5], em[esc(a[1])]
   }
}
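The counting trick above relies on gsub() returning the number of substitutions it made, while the replacement & (the match itself) leaves the text unchanged, so em[i] accumulates pure match counts. A one-line sanity check:

```shell
# gsub returns how many times the pattern matched; '&' puts each match back unchanged.
echo 'a@b a@b a@b' | awk '{ print gsub(/a@b/, "&") }'
```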

Use it as:

awk -f srch.awk accounts.csv $(find . -name '*_meta.xml') > tmp && mv tmp accounts.csv
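One caveat with the $(find ...) substitution: it word-splits on whitespace, so it breaks on paths containing spaces. With bash 4+ the shell can expand the tree itself; a sketch (the demo directory and file are made up):

```shell
# globstar makes ** recurse into subdirectories; nullglob drops unmatched patterns.
shopt -s globstar nullglob
mkdir -p demo_dir/sub
touch 'demo_dir/sub/file with spaces_meta.xml'
files=( demo_dir/**/*_meta.xml )   # one array element per file, spaces intact
printf '%s\n' "${files[@]}"
```

With that, the invocation could become awk -f srch.awk accounts.csv ./**/*_meta.xml (assuming the expansion stays under the argument-length limit).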

6 Comments

This will be much more performant than repeatedly recursively grepping a directory tree.
Could you explain what's going on here, please?
And sorry, you may also need to update it to reflect the changes to CSV...
I have added explanation along with updates to my answer to address your changed requirement. Please check and let me know.
Is the big code block above an awk code block? It doesn't seem to be Bash at least... I'm struggling to understand how to implement this. Do I save the contents of the code block as srch.awk or something? I think more detail is needed for this answer.

A script that handles accounts.csv line by line and writes the updated data to accounts.new.csv for comparison.

#! /bin/bash

file_old=accounts.csv
file_new=${file_old/csv/new.csv}

delimiter=","
x=1

# Copy file
cp ${file_old} ${file_new}

while read -r line; do
        # Skip first line
        if [[ $x -gt 1 ]]; then
                # Read data into variables
                IFS=${delimiter} read -r address foo bar tally somethingelse <<< "${line}"

                cnt=$(find . -name '*_meta.xml' -exec grep -lo "${address}" {} \; | wc -l)
                # Reset tally
                tally=$cnt

                # Change line number $x in new file
                sed "${x}s/.*/${address} ${foo} ${bar} ${tally} ${somethingelse}/; ${x}s/ /${delimiter}/g" \
                        -i ${file_new}
        fi

        ((x++))
done < ${file_old}

The input and output:

# Input
$ find . -name '*_meta.xml' -exec cat {} \; | sort | uniq -c
      2 [email protected]
      1 [email protected]
$ cat accounts.csv
email,foo,bar,tally,somethingelse
[email protected],bar1,foo2,-1,blah
[email protected],bar2,foo3,-1,blah
[email protected],bar4,foo5,-1,blah

# output
$ ./test.sh
$ cat accounts.new.csv 
email,foo,bar,tally,somethingelse
[email protected],bar1,foo2,2,blah
[email protected],bar2,foo3,1,blah
[email protected],bar4,foo5,0,blah

