We have a CSV-like file that uses a custom two-character delimiter, '||' (double pipes). The file has 40 columns and is approximately 400 to 500 MB in size.

We need to sort the file based on 2 columns, first on column 4 and then by column 17.

We found the command below, which handles a single column, but we have not been able to find one that sorts on both columns.

Because the delimiter is two characters long, we are using awk to split the fields.

Command:

awk -F \|\| '{print $4}' abc.csv | sort > output.csv
  • This may help: stackoverflow.com/a/42310248/260313 Commented Aug 4, 2021 at 10:44
  • 2
    Please add sample input (no descriptions, no images, no links) and your desired output for that sample input to your question (no comment). Commented Aug 4, 2021 at 10:46
  • 1
    CSV means "Comma-Separated Values" (or "Character-Separated Values" at a stretch). Your data is separated by a 2-char string, not a single char, so it is not CSV by any stretch of the imagination. Why are you using || as the delimiter? Using a regexp metachar like | as your delimiter (and especially using 2 of them!) makes it much harder to do anything with your data (treat it as CSV, read it into a spreadsheet like Excel, match it with regexps, etc.) so - don't do that! Use , (or less usefully ; or tab or some other single, literal char) as the delimiter. Commented Aug 4, 2021 at 12:35
  • 1
    Edit your question to include concise, testable sample input and expected output. Make sure to include fields that contain a single |, quoted fields containing escaped quotes, newlines, etc. if they can occur in your data, or state in your question that they can't. As of now we don't even know if you want to sort the fields as numbers or as strings or as versions or anything else. Commented Aug 4, 2021 at 12:52

4 Answers

2

If your inputs are not too fancy (no newlines in the middle of a record, for instance), the sort utility can almost do what you want, but it supports only one-character field separators. So || would not work. But wait, if you do not have other | characters in your files, we could just consider | as the field separator and account for the extra empty fields:

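sort -t'|' -k7,7 -k33,33 foo.csv

We sort by fields 7 (instead of 4) and then 33 (instead of 17) because of these extra empty fields. The formula that gives the new field number is simply 2*N-1 where N is the original field number. The -k7,7 form limits each key to a single field; a bare -k7 would make the key run from field 7 to the end of the line.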
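
To see where the 2*N-1 mapping comes from, here is a small Python sketch. The sample record and the helper name single_pipe_index are made up for illustration; it splits one line on a single | and checks where each original field ends up:

```python
def single_pipe_index(n):
    """Map a 1-based field number under '||' to its 1-based number under '|'."""
    return 2 * n - 1

# Hypothetical record with five '||'-separated fields.
record = "a||b||c||d||e"
double_fields = record.split("||")
single_fields = record.split("|")  # every other entry is an empty string

for n, value in enumerate(double_fields, start=1):
    # sort(1) counts fields from 1, Python lists from 0, hence the "- 1".
    assert single_fields[single_pipe_index(n) - 1] == value

print(single_pipe_index(4), single_pipe_index(17))  # prints "7 33"
```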

If you do have | characters inside your fields, a simple solution is to replace every || delimiter with a single unused character, sort, and then restore the original ||. Example with tabs:

sed 's/||/\t/g' foo.csv | sort -t$'\t' -k4,4 -k17,17 | sed 's/\t/||/g'

If tabs also occur in your fields, choose any other unused character instead. Form feed (\f) or the file separator (ASCII code 28; that is, replace the 3 \t with \x1c) are good candidates.
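
If you are not sure which character is unused, a quick check along these lines may help. This is only a sketch: the function name pick_unused_char and the sample string are hypothetical, and for a file of several hundred MB you would probably scan it line by line rather than load it in one go.

```python
def pick_unused_char(text, candidates=("\t", "\f", "\x1c")):
    """Return the first candidate character that never occurs in text."""
    for ch in candidates:
        if ch not in text:
            return ch
    raise ValueError("every candidate character occurs in the data")

# Made-up sample: a tab and a single '|' both appear inside fields.
sample = "a\tb||c|d||e\n"
sep = pick_unused_char(sample)            # tab is taken, so form feed is chosen
converted = sample.replace("||", sep)     # single-character delimiter for sort
restored = converted.replace(sep, "||")   # round-trips back to the original
```

Because the chosen character does not occur anywhere in the data, the final replacement restores exactly the original delimiters.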

1

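With gnu-awk you can use PROCINFO["sorted_in"] to sort on a multi-character delimiter. The array is indexed by fields 4 and 17, plus NR so that rows with identical keys do not overwrite each other:

awk -F '\\|\\|' '{a[$4,$17,NR] = $0} END {
PROCINFO["sorted_in"]="@ind_str_asc"; for (i in a) print a[i]}' file.csv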

0

You could try the following awk code, written to follow your attempt. It sets OFS to | (change the OFS value in the program if you want a comma or another output separator) and prints the 17th field along with the 4th. sort then uses fields 1 and 2 as its keys, because the 4th and 17th fields have become the 1st and 2nd fields of awk's output. Note that only these two columns end up in output.csv.

awk -F'\\|\\|' -v OFS='|' '{print $4,$17}' abc.csv | sort -t'|' -k1,1 -k2,2 > output.csv

1 Comment

That would fail if any field contained a single |. If they could rely on fields never containing a |, they could just map the field numbers (N becomes 2N-1) and sort on those using a single | as the sort separator, without using awk at all.
0

The sort command works on physical lines, which may or may not be acceptable. CSV files can contain quoted fields which contain newlines, which will throw off sort (and most other Unix line-oriented utilities; it's hard to write a correct Awk script for this scenario, too).

If you need to be able to manipulate arbitrary CSV files, probably look to a dedicated utility, or use a scripting language with proper CSV support. For example, assume you have a file like this:

Title,Number,Arbitrary text
"He said, ""Hello""",2,"There can be
newlines and
stuff"
No problem,1,Simple undramatic single-line CSV

In case it's not obvious, CSV is fundamentally just a text file, with some restrictions on how it can be formatted. To be valid CSV, every record should be comma-separated; any literal commas or newlines in the data need to be quoted, and any literal quotes need to be doubled. There are many variations; different tools accept slightly different dialects. One common variation is TSV, which uses tabs instead of commas as delimiters.

Here is a simple Python script which sorts the above file on the second field.

import csv
import sys

with open("test.csv", newline="") as csvfile:  # newline="" lets csv handle embedded newlines
    csvdata = csv.reader(csvfile)
    lines = list(csvdata)
    titles = lines.pop(0)  # comment out if you don't have a header

writer = csv.writer(sys.stdout)
writer.writerow(titles)    # comment out if you don't have a header
writer.writerows(sorted(lines, key=lambda x: x[1]))

Writing to sys.stdout is slightly unconventional; adapt it to suit your needs. The Python csv library documentation is not primarily aimed at beginners, but it is workable, and examples of working code are easy to find.

In Python, sorted() returns a new list in sorted order. There is also list.sort(), which sorts a list in place. Both accept an optional key parameter to specify a custom sort order. To sort on the 4th and 17th fields, use

sorted(lines, key=lambda x: (x[3], x[16]))

(Python's indexing is zero-based, so [3] is the fourth element.)

To use | as a delimiter, specify delimiter='|' in the csv.reader() and csv.writer() calls. Unfortunately, the csv module only accepts a single-character delimiter, so you might have to preprocess the data to switch to a single-character delimiter which does not occur in the data, or properly quote the fields which contain the character you selected as your delimiter.
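
If the fields are never quoted and never contain newlines or a literal ||, one option is to skip the csv module and split each line with str.split. The following is a rough sketch under those assumptions; the names sort_key, sort_lines and sort_file are made up here, every line is assumed to have at least 17 fields, and the keys are compared as strings, not numbers.

```python
DELIM = "||"

def sort_key(line):
    """Return (4th field, 17th field) of a '||'-delimited line."""
    fields = line.rstrip("\n").split(DELIM)
    return (fields[3], fields[16])  # zero-based indexes for fields 4 and 17

def sort_lines(lines, has_header=False):
    """Return the lines sorted on fields 4 and 17, keeping a header first."""
    # Make sure the last line also ends with a newline before reordering.
    lines = [l if l.endswith("\n") else l + "\n" for l in lines]
    header = lines[:1] if has_header else []
    body = lines[1:] if has_header else lines
    return header + sorted(body, key=sort_key)

def sort_file(in_path, out_path, has_header=False):
    """Read in_path, sort it in memory, and write the result to out_path."""
    with open(in_path, encoding="utf-8") as infile:
        result = sort_lines(infile, has_header)
    with open(out_path, "w", encoding="utf-8") as outfile:
        outfile.writelines(result)
```

For example, sort_file("abc.csv", "output.csv") would write the sorted copy. The whole file is held in memory, so a 400 to 500 MB input needs noticeably more RAM than its size on disk.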
