
I have a very large CSV file that is too big to open in Excel for this operation.

I need to replace a specific string in approximately 6,000 records out of the 1.5 million in the CSV. The string itself is in comma-separated format, like so:

ABC,FOO.BAR,123456

There are other columns on either side that are of no concern; I only include enough surrounding data to make sure the final field (the numbers) is unique.

I have another file with the string to replace and its replacement, like so (for the above):

"ABC,FOO.BAR,123456","ABC,FOO.BAR,654321"

So in the case above, 123456 is being replaced by 654321. A simple (yet maddeningly slow) way to do this is to open both files in Notepad++, find the first string, and replace it with the second, but with over 6,000 records this isn't practical.

I was hoping someone could give advice on a scripting solution, e.g.:

$file1 = base.csv
$file2 = replace.csv

For each row in $file2 {
  awk '{sub(/$file2($firstcolumn)/, $file2($secondcolumn))}' $file1
}

Though I'm not entirely sure how to adapt awk to do an operation like this.

EDIT: Sorry, I should have been more specific: the data in my replacement CSV is in only two columns; two raw strings!

  • The remaining question: does ABC,FOO.BAR,123456 in your data file (base.csv) represent 3 fields or is it the contents of a single field that is enclosed in "..." in the file? Commented Feb 28, 2017 at 20:43

3 Answers


It would be easier, of course, if your delimiter were not used within the fields...

You can do this in two steps: create a sed script from the lookup file, then run it against the main data file to make the replacements.

For example (this assumes there are no escaped quotes in the fields, which may not hold):

$ awk -F'","' '{print "s/" $1 "\"/\"" $2 "/"}' lookup_file > replace.sed
$ sed -f replace.sed data_file 
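To see what the first step produces, feeding the question's sample lookup row through the awk command yields one `s/old/new/` command per lookup line (the one-row `lookup_file` here is a stand-in for the real replacements file):

```shell
# Sample one-row lookup file, using the question's example strings:
printf '%s\n' '"ABC,FOO.BAR,123456","ABC,FOO.BAR,654321"' > lookup_file

# Generate the sed script: each lookup row becomes one substitution command.
awk -F'","' '{print "s/" $1 "\"/\"" $2 "/"}' lookup_file
# → s/"ABC,FOO.BAR,123456"/"ABC,FOO.BAR,654321"/
```

Splitting on the literal `","` between the two quoted columns leaves the opening quote attached to $1 and the closing quote attached to $2, which is why the print statement only has to re-add the quote between them.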

4 Comments

It's an elegant solution, but comes with a caveat: to make this fully robust, you'd have to escape regex metacharacters in both the search and replacement strings (in the sample input, only . is a concern); sed offers no literal string substitution.
The fields should be quoted in the data file; otherwise, fields containing commas will break the integrity. You're right about escaping all metacharacters in sed, but I'm not sure it's needed for this file. Another problem area is if there are escaped quotes in the fields as well.
Or avoid the tmp file with sed -f <(awk -F'","' '{print "s/" $1 "\"/\"" $2 "/"}' lookup_file) data_file.
I went for this solution in the end, as I hadn't used sed before, so it was good to learn something new. It took a very long time to run due to the sample size, but worked like a charm. Cheers!
awk -F\" '
 NR==FNR { subst[$2]=$4; next }
 { 
   for (s in subst) {
     pos = index($0, s)
     if (pos) {
       $0 = substr($0, 1, pos-1) subst[s] substr($0, pos + length(s))
       break
     }
   }
   print
 }
' "$file2" "$file1"  # > "$file1.$$.tmp" && mv "$file1.$$.tmp" "$file1"

The part after the # shows how you could replace the input data file with the output.

  • The block associated with NR==FNR is only executed for the first input file, the one with the search and replacement strings.

    • subst[$2]=$4 builds an associative array (dictionary): the key is the search string, the value the replacement string.

    • Fields $2 and $4 are the search string and the replacement string, respectively, because Awk was instructed to break the input into fields by " (-F\"); note that this assumes your strings do not contain escaped embedded " chars.

  • The remaining block then processes the data file:

    • For each input line, it loops over the search strings and looks for a match on the current line:

      • Once a match is found, the replacement string is substituted for the search string, and matching stops.
    • print simply prints the (possibly modified) line.

Note that since you want literal string replacements, regex-based functions such as sub() are explicitly avoided in favor of literal string-processing functions index() and substr().

As an aside: since you say there are columns on either side in the data file, consider making the search/replacement strings more robust by placing , on either side of them (this could be done inside the awk script).
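A minimal sketch of that comma-anchoring idea (the file names are placeholders; it also assumes the target never sits in the first or last column of a data line, since those positions lack a comma on one side):

```shell
awk -F\" '
  # Anchor each key and value with commas so only whole field runs match.
  NR==FNR { subst["," $2 ","] = "," $4 ","; next }
  {
    for (s in subst) {
      pos = index($0, s)
      if (pos) {
        $0 = substr($0, 1, pos-1) subst[s] substr($0, pos + length(s))
        break
      }
    }
    print
  }
' replace.csv base.csv
```

With this change, a search string like ABC,FOO.BAR,123456 can no longer match inside a longer run such as XABC,FOO.BAR,1234567, because the surrounding commas must match too.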

1 Comment

I went for the sed solution in the end but thanks for your solution, and thanks even more for the great explanation of the different pieces of it!

I would recommend using a language with a CSV parsing library rather than trying to do this with shell tools. For example, Ruby:

require 'csv'

# Build a hash of search string => replacement string from the lookup file.
replacements = CSV.open('replace.csv', 'r').to_h

# Stream the big file line by line and apply every replacement literally.
File.open('base.csv', 'r').each_line do |line|
  replacements.each do |old, new|
    line.gsub!(old) { new }  # block form avoids \1-style backreference expansion
  end
  puts line
end

Note that Enumerable#to_h requires Ruby v2.1+; replace with this for older Rubys:

replacements = Hash[*CSV.open('replace.csv','r').to_a.flatten]

You only really need CSV for the replacements file; this assumes you can apply the substitutions to the other file as plain text, which speeds things up a bit and avoids having to parse the old/new strings out into fields themselves.

