Revisions to text processing rows to columns for a block of lines Awk

added 402 characters in body


edited Nov 17, 2022 at 9:49


The column command has a -n option which honours leading and repeated separators, but it is documented as a Debian extension (it also works on my Mint distribution, which is Ubuntu-based). Without -n, empty values are discarded (i.e. delimiters at the start and end of a line are ignored), and multiple adjacent delimiters are merged.

(a) Avoids the bug in the column command, by doing the tabulation internally (also avoiding the extra process, and the memory overhead of having both awk and column store the whole data set in memory).


EDIT Two: This version has more functionality, and spurns the buggy `column` command.


edited Nov 16, 2022 at 20:51

Paul_Pedant


The script as posted does not work. It only outputs the first 4 lines -- the last two in List B are omitted.

The issue is that k counts the lines in each list, but its value is only copied into max when the next list starts, so the length of the last list is never taken into account.

The fix is to repeat the test if(k && k>max) { max=k; } as the first line of the END block, once the last List has been read.
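As a rough illustration of that fix (not the original poster's script, which is not reproduced here), the sketch below reduces the problem to finding the longest list. The names k and max are taken from the description above; the function name longest_list and the sample data are assumptions made for the example.

```shell
# Hypothetical reduction: report the length of the longest List block.
# k and max follow the names used in the answer; the rest is assumed.
longest_list() {
    awk '
    # A new list starts: bank the previous count, then reset it.
    /^List/ { if (k && k > max) { max = k; } k = 0; }
    # Count every non-blank line, including the List header itself.
    NF      { ++k; }
    # Without this repeated test the final list is never compared.
    END     { if (k && k > max) { max = k; } print max; }
    ' "$@"
}

# List A has 4 lines (header plus 3), List B has 6 (header plus 5).
printf '%s\n' 'List A' 1 2 3 '' 'List B' 1 2 3 4 5 | longest_list
```

With the test in the END block this prints 6. If that test is deleted, max stays at 4, which matches the symptom of only the first 4 lines being output.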

That reveals another bug. Those last two lines are not columnised -- they appear in column 1. The issue there appears to be that column does not recognise a zero-length first column: if I force a . at the start of each value, it columnises List B correctly.

Personally, I would columnise in awk -- save the max length of any entry in each column, and space them out with a %-*s width specifier. That might be what the unused variable idx was meant to be for.

EDIT: Yes, definitely a bug in column. The tab is actioned for Four, but ignored for Three.

$ cat -vet foo
One$
 Two$
^IThree$
q^IFour$
$ column -t -s $'\t' foo | cat -vet
One$
 Two$
Three$
q      Four$

EDIT Two: This version has more functionality.

(a) Avoids the bug in the column command, by doing the tabulation internally (also avoiding the extra process).

(b) Accepts multiple file args (or stdin by default, so it works in a pipeline).

(c) Works for an arbitrary number of output columns, not just two.

(d) Fixes the originally posted bug (where the length of the rightmost column was ignored).

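#! /bin/bash

Awk='
BEGIN { Gap = 2; }    # Spaces between output columns.
# Each "List" header line starts a new output column.
/^List/ { ++col; row = 0; }
# Store every non-blank line; track the row count and each column width.
NF { X[++row, col] = $0;
    if (mxrow < row) mxrow = row;
    if (len[col] < length($0)) len[col] = length($0);
}
# Print row by row, padding every column except the last.
function Column (Local, r, c) {
    for (r = 1; r <= mxrow; ++r) {
        for (c = 1; c < col; ++c)
            printf ("%-*s", Gap + len[c], X[r,c]);
        printf ("%s\n", X[r,c]);
    }
}
END { Column( ); }
'
awk "${Awk}" "${@:-}"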
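The script relies on the * width specifier in printf. As a sketch only, in case an awk rejects that specifier, the same tabulation can build the format string by concatenation instead. This swaps in a different formatting technique and is not the version posted above. The function name tabulate_lists and the sample data are assumptions; Gap, col, row, mxrow, len and X follow the script.

```shell
# Sketch: same tabulation, but the printf format is built by string
# concatenation rather than a * width. tabulate_lists is a hypothetical name.
tabulate_lists() {
    awk '
    BEGIN { Gap = 2; }
    /^List/ { ++col; row = 0; }
    NF {
        X[++row, col] = $0;
        if (mxrow < row) mxrow = row;
        if (len[col] < length($0)) len[col] = length($0);
    }
    END {
        for (r = 1; r <= mxrow; ++r) {
            for (c = 1; c < col; ++c) {
                fmt = "%-" (Gap + len[c]) "s";   # For example "%-8s".
                printf fmt, X[r, c];
            }
            print X[r, col];                     # Last column: no padding.
        }
    }
    ' "$@"
}

# Sample data: List A has three entries, List B has five.
printf '%s\n' 'List A' 1 2 3 '' 'List B' 1 2 3 4 5 | tabulate_lists
```

On the sample data the first output line is "List A  List B", and entries 4 and 5 of List B stay in the second column, preceded by eight spaces, because the empty first-column cells are still padded.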


Edited to prove the second bug is indeed in `column`, not in the awk part of the script.


edited Nov 16, 2022 at 10:29

Paul_Pedant

