
I have a very large CSV file that is too big to open in Excel for this operation.

I need to replace a specific string in approximately 6,000 records out of the 1.5 million in the CSV. The string itself is in comma-separated format, like so:

ABC,FOO.BAR,123456

There are other columns on either side that are of no concern; I only include enough surrounding data to make sure the final field (the numbers) is unique.

I have another file with the string to replace and its replacement, like so (for the above):

"ABC,FOO.BAR,123456","ABC,FOO.BAR,654321"

So in the case above, 123456 is being replaced by 654321. A simple (yet maddeningly slow) way to do this is to open both files in Notepad++, find the first string, and replace it with the second, but with over 6,000 records this isn't practical.

I was hoping someone could give advice on a scripting solution, e.g.:

$file1 = base.csv
$file2 = replace.csv

For each row in $file2 {
  awk '{sub(/$file2($firstcolumn)/, $file2($secondcolumn))}' $file1
}

Though I'm not entirely sure how to adapt awk to do an operation like this.

EDIT: Sorry, I should have been more specific: the data in my replacement CSV is in only two columns; two raw strings!

  • The remaining question: does ABC,FOO.BAR,123456 in your data file (base.csv) represent 3 fields or is it the contents of a single field that is enclosed in "..." in the file? Commented Feb 28, 2017 at 20:43

3 Answers


It would be easier, of course, if your delimiter were not used within the fields...

You can do this in two steps: create a sed script from the lookup file, then run it against the main data file to make the replacements.

For example (this assumes there are no escaped quotes in the fields, which may not hold):

$ awk -F'","' '{print "s/" $1 "\"/\"" $2 "/"}' lookup_file > replace.sed
$ sed -f replace.sed data_file 
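To see what the first step produces, feeding the question's sample lookup row through the awk command yields one `s/old/new/` command per lookup line (the one-row `lookup_file` here is a stand-in for the real replacements file):

```shell
# Sample one-row lookup file, using the question's example strings:
printf '%s\n' '"ABC,FOO.BAR,123456","ABC,FOO.BAR,654321"' > lookup_file

# Generate the sed script: each lookup row becomes one substitution command.
awk -F'","' '{print "s/" $1 "\"/\"" $2 "/"}' lookup_file
# → s/"ABC,FOO.BAR,123456"/"ABC,FOO.BAR,654321"/
```

Splitting on the literal `","` between the two quoted columns leaves the opening quote attached to $1 and the closing quote attached to $2, which is why the print statement only has to re-add the quote between them.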

4 Comments

It's an elegant solution, but comes with a caveat: to make this fully robust, you'd have to escape regex metacharacters in both the search and replacement strings (in the sample input, only . is a concern); sed offers no literal string substitution.
The fields should be quoted in the data file; otherwise, fields containing commas will break the integrity. You're right about escaping all metacharacters in sed, but I'm not sure it's needed for this file. Another problem area is if there are escaped quotes in the fields as well.
Or avoid the tmp file with sed -f <(awk -F'","' '{print "s/" $1 "\"/\"" $2 "/"}' lookup_file) data_file.
I went for this solution in the end, as I hadn't used sed before, so it was good to learn something new. It took a very long time to run due to the sample size, but worked like a charm. Cheers!
awk -F\" '
 NR==FNR { subst[$2]=$4; next }
 { 
   for (s in subst) {
     pos = index($0, s)
     if (pos) {
       $0 = substr($0, 1, pos-1) subst[s] substr($0, pos + length(s))
       break
     }
   }
   print
 }
' "$file2" "$file1"  # > "$file1.$$.tmp" && mv "$file1.$$.tmp" "$file1"

The part after the # shows how you could replace the input data file with the output.

  • The block associated with NR==FNR is only executed for the first input file, the one with the search and replacement strings.

    • subst[$2]=$4 builds an associative array (dictionary): the key is the search string, the value the replacement string.

    • Fields $2 and $4 are the search string and the replacement string, respectively, because Awk was instructed to break the input into fields by " (-F\"); note that this assumes your strings do not contain escaped embedded " chars.

  • The remaining block then processes the data file:

    • For each input line, it loops over the search strings and looks for a match on the current line:

      • Once a match is found, the replacement string is substituted for the search string, and matching stops.
    • print simply prints the (possibly modified) line.

Note that since you want literal string replacements, regex-based functions such as sub() are explicitly avoided in favor of literal string-processing functions index() and substr().

As an aside: since you say there are columns on either side in the data file, consider making the search/replacement strings more robust by placing , on either side of them (this could be done inside the awk script).
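A minimal sketch of that comma-anchoring idea (the file names are placeholders; it also assumes the target never sits in the first or last column of a data line, since those positions lack a comma on one side):

```shell
awk -F\" '
  # Anchor each key and value with commas so only whole field runs match.
  NR==FNR { subst["," $2 ","] = "," $4 ","; next }
  {
    for (s in subst) {
      pos = index($0, s)
      if (pos) {
        $0 = substr($0, 1, pos-1) subst[s] substr($0, pos + length(s))
        break
      }
    }
    print
  }
' replace.csv base.csv
```

With this change, a search string like ABC,FOO.BAR,123456 can no longer match inside a longer run such as XABC,FOO.BAR,1234567, because the surrounding commas must match too.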

1 Comment

I went for the sed solution in the end but thanks for your solution, and thanks even more for the great explanation of the different pieces of it!

I would recommend using a language with a CSV parsing library rather than trying to do this with shell tools. For example, Ruby:

require 'csv'

# Build a hash of search string => replacement string from the lookup file.
replacements = CSV.open('replace.csv', 'r').to_h

# Stream the big file line by line and apply every replacement literally.
File.open('base.csv', 'r').each_line do |line|
  replacements.each do |old, new|
    line.gsub!(old) { new }  # block form avoids \1-style backreference expansion
  end
  puts line
end

Note that Enumerable#to_h requires Ruby v2.1+; replace with this for older Rubys:

replacements = Hash[*CSV.open('replace.csv','r').to_a.flatten]

You only really need CSV for the replacements file; this assumes you can apply the substitutions to the other file as plain text, which speeds things up a bit and avoids having to parse the old/new strings out into fields themselves.

