
Using Bash, I want to take a list of email addresses from a CSV file, run a recursive grep for each one across a bunch of directories (matching only specific metadata XML files), and tally up how many results I find for each address throughout the directory tree (i.e. updating the tally field in the same CSV file).

accounts.csv looks something like this:

(updated to more accurately reflect real-world data)
email,date,bar,URL,"something else",tally
[email protected],21/04/2015,1.2.3.4,https://blah.com/,"blah blah",5
[email protected],17/06/2015,5.6.7.8,https://blah.com/,"lah yah",0
[email protected],7/08/2017,9.10.11.12,https://blah.com/,"wah wah",1

For example, if we put [email protected] in $email from the list, run

grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l

on it and then add that result to the tally column.

At the moment I can get the first column of that CSV file (minus the heading/first line) using

awk -F"," '{print $1}' accounts.csv | tail -n +2

but I'm lost on how to do the looping, and also on writing the result back to the CSV file...

So for instance, with [email protected] if we run

grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l

and the result is say 17, how can I update that line to become:

[email protected],7/08/2017,9.10.11.12,https://blah.com/,"wah wah",17

Is this possible with maybe awk or sed?

This is where I'm up to:

#!/bin/bash

# make temporary list of email addresses
awk -F"," '{print $1}' accounts.csv | tail -n +2 > emails.tmp

# loop over each
while read email; do
    # count how many uploads for current email address
    grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
done < emails.tmp

XML Metadata looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<metadata>
  <identifier>SomeTitleNameGoesHere</identifier>
  <mediatype>audio</mediatype>
  <collection>opensource_movies</collection>
  <description>example &lt;br /&gt;</description>
  <subject>testing</subject>
  <title>Some Title Name Goes Here</title>
  <uploader>[email protected]</uploader>
  <addeddate>2017-05-28 06:20:54</addeddate>
  <publicdate>2017-05-28 06:21:15</publicdate>
  <curation>[curator][email protected][/curator][date]20170528062151[/date][comment]checked for malware[/comment]</curation>
</metadata>
  • The domain-part of an email address may contain a comma (see here), so I don't think you can simply use awk with a comma as field separator. Commented Aug 6, 2021 at 5:24
  • That's a pretty epic edge-case, and doesn't apply to my data, but sure. :+1: Commented Aug 6, 2021 at 7:00

4 Answers


how to do the looping and also the writing of the result back to the CSV file

awk does the looping automatically. You can change any field by assigning to it, so to change the tally field (the 6th in each line) you would write $6 = ....
awk is a great tool for many scenarios. You can probably save a lot of time in the future by investing a few minutes in a short tutorial now.
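For example, here is a minimal, self-contained sketch of field assignment (the CSV row is made up):

```shell
# Build a throwaway CSV, then overwrite the 6th field of every data row.
printf 'email,date,bar,URL,desc,tally\nfoo@example.com,21/04/2015,1.2.3.4,https://blah.com/,"blah",5\n' > demo.csv
awk -F, -v OFS=, 'NR > 1 { $6 = 17 } 1' demo.csv
```

The trailing 1 is awk shorthand for "print the (possibly modified) line"; -F sets the input separator and OFS the output separator, so the rebuilt line keeps its commas.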

The only non-trivial part is getting the output of grep into awk.

The following script sets each tally to the count of *_meta.xml files containing the given email address:

awk -F, -v OFS=, -v q=\' 'NR>1 {
  cmd = "grep -rlFw " q $1 q " --include=\\*_meta.xml | wc -l";
  cmd | getline c;
  close(cmd);
  $6 = c
} 1' accounts.csv 

For simplicity we assume that filenames are free of linebreaks and email addresses are free of '. To reduce possible false positives, I also added the -F and -w options to your grep command.

  • -F searches literal strings; without it, searching for a.b@c would give false positives for things like axb@c and a-b@c.
  • -w matches only whole words; without it, searching for b@c would give a false positive for ab@c. This isn't 100% safe, as a-b@c would still give a false positive, but without knowing more about the structure of your xml files we cannot fix this.
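A quick way to see both flags in action (near_miss.txt and the addresses are made up for the demo):

```shell
# 'axb@c.com' is not the address we want, but an unescaped dot matches the 'x'.
printf 'uploader: axb@c.com\n' > near_miss.txt
grep -c 'a.b@c.com' near_miss.txt           # regex: dot matches 'x', 1 line counted
grep -cFw 'a.b@c.com' near_miss.txt || true # literal whole-word: 0 lines counted
```

(The || true only keeps the demo going, since grep exits non-zero when nothing matches.)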

4 Comments

I see this successfully computes the tallies, but does it also write those changes to accounts.csv? It didn't work for me. Old data stayed the same.
No, it just prints the updated content. Either write that output to a new file, awk -F, ... accounts.csv > newAccounts.csv, which you then rename, mv newAccounts.csv accounts.csv, or use GNU awk's in-place option: gawk -i inplace -F, ... accounts.csv
PS Do you have any good awk tutorials to recommend?
No, sorry. I don't know which tutorials are good.

A pipeline to reduce the number of greps:

grep -rHo --include=\*_meta.xml -f <(awk -F, 'NR > 1 {print $1}' accounts.csv) \
| gawk -F, -v OFS=',' '
    NR == FNR {
      # store the filenames for each email
      if (match($0, /^([^:]+):(.+)/, m)) tally[m[2]][m[1]]
      next
    }
    FNR > 1 {$6 = length(tally[$1])}
    1
  ' - accounts.csv
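If it helps to see what the gawk stage is parsing: grep -rHo emits one filename:match pair per hit. A minimal sketch, with a hypothetical directory tree and address:

```shell
# Two metadata files containing the same (made-up) address.
mkdir -p demo_tree/a demo_tree/b
printf '<uploader>jo@example.com</uploader>\n' > demo_tree/a/x_meta.xml
printf '<uploader>jo@example.com</uploader>\n' > demo_tree/b/y_meta.xml
printf 'jo@example.com\n' > patterns.txt
grep -rHo --include=\*_meta.xml -f patterns.txt demo_tree
```

Each output line looks like demo_tree/a/x_meta.xml:jo@example.com, which is exactly what the match() call above splits into filename and address.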

2 Comments

Brilliant! Just make sure the assignment targets the tally column: $6 = length(tally[$1]).

Here is a solution using a single awk command. It should be highly performant compared to the other solutions because it scans each XML file only once for all the email addresses found in the first column of the CSV file. It also doesn't invoke any external command or spawn a subshell anywhere.

This should work in any version of awk.

cat srch.awk

# function to escape regex meta characters
function esc(s,      tmp) {
   tmp = s
   gsub(/[&+.]/, "\\\\&", tmp)
   return tmp
}
BEGIN {FS=OFS=","}
# while processing csv file
NR == FNR {
   # save escaped email address in array em skipping header row
   if (FNR > 1)
      em[esc($1)] = 0
   # save each row in rec array
   rec[++n] = $0
   next
}
# this block will execute for each XML file
{
   # loop over each email and save the count of matches in array em
   # NB: gsub returns the number of substitutions made
   for (i in em)
      em[i] += gsub(i, "&")
}
END {
   # print header row
   print rec[1]
   # from 2nd row onwards split row into columns using comma
   for (i=2; i<=n; ++i) {
      split(rec[i], a, FS)
      # 6th column is the count of occurrence from array em
      print a[1], a[2], a[3], a[4], a[5], em[esc(a[1])]
   }
}
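The counting trick above relies on gsub() returning the number of substitutions it made, while the replacement & (the match itself) leaves the text unchanged, so em[i] accumulates pure match counts. A one-line sanity check:

```shell
# gsub returns how many times the pattern matched; '&' puts each match back unchanged.
echo 'a@b a@b a@b' | awk '{ print gsub(/a@b/, "&") }'
```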

Use it as:

awk -f srch.awk accounts.csv $(find . -name '*_meta.xml') > tmp && mv tmp accounts.csv
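One caveat with the $(find ...) substitution: it word-splits on whitespace, so it breaks on paths containing spaces. With bash 4+ the shell can expand the tree itself; a sketch (the demo directory and file are made up):

```shell
# globstar makes ** recurse into subdirectories; nullglob drops unmatched patterns.
shopt -s globstar nullglob
mkdir -p demo_dir/sub
touch 'demo_dir/sub/file with spaces_meta.xml'
files=( demo_dir/**/*_meta.xml )   # one array element per file, spaces intact
printf '%s\n' "${files[@]}"
```

With that, the invocation could become awk -f srch.awk accounts.csv ./**/*_meta.xml (assuming the expansion stays under the argument-length limit).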

6 Comments

This will be much more performant than repeatedly recursively grepping a directory tree.
Could you explain what's going on here, please?
And sorry, you may also need to update it to reflect the changes to CSV...
I have added explanation along with updates to my answer to address your changed requirement. Please check and let me know.
Is the big code block above an awk code block? It doesn't seem to be Bash at least... I'm struggling to understand how to implement this. Do I save the contents of the code block as srch.awk or something? I think more detail is needed for this answer.

A script that handles accounts.csv line by line and writes the updated data to accounts.new.csv for comparison.

#! /bin/bash

file_old=accounts.csv
file_new=${file_old/csv/new.csv}

delimiter=","
x=1

# Copy file
cp ${file_old} ${file_new}

while read -r line; do
        # Skip first line
        if [[ $x -gt 1 ]]; then
                # Read data into variables
                IFS=${delimiter} read -r address foo bar tally somethingelse <<< "${line}"

                cnt=$(find . -name '*_meta.xml' -exec grep -lo "${address}" {} \; | wc -l)
                # Reset tally
                tally=$cnt

                # Change line number $x in new file
                sed "${x}s/.*/${address} ${foo} ${bar} ${tally} ${somethingelse}/; ${x}s/ /${delimiter}/g" \
                        -i ${file_new}
        fi

        ((x++))
done < ${file_old}

The input and output:

# Input
$ find . -name '*_meta.xml' -exec cat {} \; | sort | uniq -c
      2 [email protected]
      1 [email protected]
$ cat accounts.csv
email,foo,bar,tally,somethingelse
[email protected],bar1,foo2,-1,blah
[email protected],bar2,foo3,-1,blah
[email protected],bar4,foo5,-1,blah

# output
$ ./test.sh
$ cat accounts.new.csv 
email,foo,bar,tally,somethingelse
[email protected],bar1,foo2,2,blah
[email protected],bar2,foo3,1,blah
[email protected],bar4,foo5,0,blah

