text file -- how to sort adjacent lines that have the same level of indentation

Question

UPDATE: The root problem has been solved by fixing a number of Sequelize migrations that always run before mysqldump is called, as discussed in the comments below the article linked in the next paragraph. However, the core technical challenge is still interesting.

I have a problem with mysqldump that might be solved by configuring mysqldump differently, but probably will be solved by just piping the output through a shell script.

Basically, mysqldump always outputs the tables in the same order, but it lists all columns (other than id) for each table in random order.

So, the first run might output this...

create TABLE `ONE` ( 
  `id` int NOT NULL AUTO_INCREMENT,
  `column_a` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_b` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_c` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
create TABLE `TWO` ( 
  `id` int NOT NULL AUTO_INCREMENT,
  `column_x` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_y` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_z` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

...and on the second run, it might produce something like this:

create TABLE `ONE` ( 
  `id` int NOT NULL AUTO_INCREMENT,
  `column_c` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_b` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_a` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
create TABLE `TWO` ( 
  `id` int NOT NULL AUTO_INCREMENT,
  `column_y` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_x` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_z` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

I'd like to pipe the result through a shell script that always sorts the lines in the same way. What script would achieve this? The script needs to run on a build agent that runs on Ubuntu, so if it is possible and practical to use standard GNU tools like awk, that would be preferable to using custom tools.

Are you certain about this behavior? I've created a sample database and I ran mysqldump in a loop and it always produces columns in the same order. You can see my test here (this assumes your sample tables ONE and TWO have been loaded into a database named example).
– larsks, Commented Feb 6, 2024 at 4:23
Otherwise, this question is probably a dupe of "Sort alphabetically lines between 2 patterns in Bash", with pattern 1 as ^CREATE TABLE.*( and pattern 2 as ^).
– larsks, Commented Feb 6, 2024 at 4:43
Thanks @larsks. I'm certain about the behaviour, but it might be a side-effect of recreating the DB and rerunning the migrations between each dump. (See commands on the mysql-orientated version of this post.) I'll check out the answer you linked to above.
– steven_noble, Commented Feb 6, 2024 at 4:50
Those solutions are interesting, but all assume there is just one instance of the start pattern and the end pattern 🤔
– steven_noble, Commented Feb 6, 2024 at 4:54
Please update the question with the code you've tried and the (wrong) results generated by said code.
– markp-fuso, Commented Feb 6, 2024 at 13:30

markp-fuso · Accepted Answer · 2024-02-07 14:59:12Z

Assumptions:

the input is guaranteed to have a very specific format that follows exactly the sample provided in the question, namely ...
each clause (of the create table command) resides on a separate line (eg, we won't see 2 columns listed on a single line)
there is only one level of indents
all column clauses contain the string NOT NULL or NULL
we do not sort the column clause containing the string AUTO_INCREMENT
we do not sort the PRIMARY KEY clause
all sortable column clauses show up between the AUTO_INCREMENT column clause and the PRIMARY KEY clause
NOTE: we are not going to address any of the slew of other options available with the mysql 'create table' command, ie, we're not going to build a full-fledged parser

One approach:

if a line matches the format of a 'sortable column clause' then we add it to an array
if a line does not match the format of a 'sortable column clause' then we a) dump anything currently in the array to stdout and then b) print the current line to stdout

One GNU awk implementation of this approach:

awk '
BEGIN                { PROCINFO["sorted_in"]="@val_str_asc"     # sort arrays by value, sorted as string in asc[ending] order
                       delete lines                             # designate variable "lines" as an array
                     }

/^[[:space:]]/    &&                                            # if line starts with white space (ie, it is indented) and ...
!/AUTO_INCREMENT/ &&                                            # line does not contain string "AUTO_INCREMENT" and ...
/ NULL/              { lines[++cnt] = $0                        # line contains a single white space + string "NULL", then save current line in array
                       next                                     # skip to next line of input
                     }

cnt                  { for (i in lines)                         # loop through indices of array in @val_str_asc sorted order
                           print lines[i]                       # print array value to stdout
                       delete lines                             # reset array
                       cnt = 0                                  # reset counter (aka array index)
                     }
1                                                               # print current line of input to stdout
' sample.sql

NOTES:

sample.sql is an exact copy of the 2nd pair of create table commands from OP's question (ie, the example with the out-of-order column_X rows)
requires GNU awk for PROCINFO["sorted_in"] support, which in turn is used to sort the contents of the array; see gawk predefined sorting orders for details

This generates:

create TABLE `ONE` (
  `id` int NOT NULL AUTO_INCREMENT,
  `column_a` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_b` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_c` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
create TABLE `TWO` (
  `id` int NOT NULL AUTO_INCREMENT,
  `column_x` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_y` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_z` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
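
If gawk is not available (on some Ubuntu images `awk` may resolve to mawk, which has no PROCINFO["sorted_in"]), the same buffer-and-flush approach can be sketched in portable awk by sorting the buffered lines by hand. This is only a sketch under the assumptions listed above; the function names sort_column_clauses and emit_sorted are made up for illustration, and the insertion sort is chosen for brevity rather than speed.

```shell
# sort_column_clauses: hypothetical filter name; reads a dump on stdin, writes to stdout.
sort_column_clauses() {
    awk '
    # Indented line without AUTO_INCREMENT that contains " NULL": buffer it.
    # ($0 "" forces a string value so the later comparison is a string comparison.)
    /^[ \t]/ && !/AUTO_INCREMENT/ && / NULL/ { lines[++cnt] = $0 ""; next }

    # Any other line ends the current run: print the buffer in sorted order first.
    cnt { emit_sorted() }
    { print }
    END { if (cnt) emit_sorted() }

    # Insertion sort; adequate for the handful of columns in a typical table.
    function emit_sorted(    i, j, tmp) {
        for (i = 2; i <= cnt; i++) {
            tmp = lines[i]
            for (j = i - 1; j >= 1 && lines[j] > tmp; j--)
                lines[j + 1] = lines[j]
            lines[j + 1] = tmp
        }
        for (i = 1; i <= cnt; i++)
            print lines[i]
        cnt = 0
    }
    '
}
```

It reads stdin, so it can sit at the end of a pipeline, e.g. `sort_column_clauses < sample.sql`, or after a schema-only dump such as `mysqldump --no-data example | sort_column_clauses` (the database name example is an assumption taken from the comments under the question).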

ufopilot · Accepted Answer · 2024-02-07 06:39:00Z

GNU AWK

awk '
    /AUTO_INCREMENT,$/ || /^ *PRIMARY KEY/{blk=!blk;}
    /AUTO_INCREMENT,$/ {print; next}
    blk{a[$0]; next}
    {
        n=asorti(a,b)
        for(i=1; i<=n; i++) print b[i]
        delete a
    }1
' file

Ed Morton · Accepted Answer · 2024-02-07 14:52:18Z

Assuming you don't REALLY want to compare the id and PRIMARY KEY lines against column_a etc. (as you would if you were really sorting all the lines at a given indentation level), since that would be comparing apples to oranges, you could just sort the lines that match some regexp, or the lines that fall between lines matching 2 regexps. For example, using any awk and any sort, we could sort the lines containing COLLATE:

$ awk '/COLLATE/{print | "sort"; next} {close("sort")} 1' file
create TABLE `ONE` (
  `id` int NOT NULL AUTO_INCREMENT,
  `column_a` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_b` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_c` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
create TABLE `TWO` (
  `id` int NOT NULL AUTO_INCREMENT,
  `column_x` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_y` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_z` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

or the lines between each AUTO_INCREMENT ... PRIMARY KEY pair:

$ awk '
    /PRIMARY KEY/    { f=0; close("sort") }
    { if (f) print | "sort"; else print }
    /AUTO_INCREMENT/ { f=1 }
' file
create TABLE `ONE` (
  `id` int NOT NULL AUTO_INCREMENT,
  `column_a` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_b` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_c` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
create TABLE `TWO` (
  `id` int NOT NULL AUTO_INCREMENT,
  `column_x` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_y` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_z` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

With GNU awk for PROCINFO["sorted_in"] if there can be no duplicate lines in a block of COLLATE lines we could do:

$ awk '
    BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }
    /COLLATE/ { vals[$0]; next }
    { for (val in vals) print val; delete vals; print }
' file
create TABLE `ONE` (
  `id` int NOT NULL AUTO_INCREMENT,
  `column_a` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_b` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_c` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
create TABLE `TWO` (
  `id` int NOT NULL AUTO_INCREMENT,
  `column_x` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_y` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_z` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

or with GNU awk for PROCINFO["sorted_in"] if there can be duplicate lines in a block of COLLATE lines:

$ awk '
    BEGIN { PROCINFO["sorted_in"] = "@val_str_asc" }
    /COLLATE/ { vals[length(vals)+1] = $0; next }
    { for (i in vals) print vals[i]; delete vals; print }
' file
create TABLE `ONE` (
  `id` int NOT NULL AUTO_INCREMENT,
  `column_a` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_b` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_c` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
create TABLE `TWO` (
  `id` int NOT NULL AUTO_INCREMENT,
  `column_x` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_y` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `column_z` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

The first two commands spawn a subshell to call sort once per block of lines to be sorted, while the GNU awk versions don't spawn any subshells, so they'll be faster but not as portable.
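
As for why the two GNU awk variants differ on duplicate lines: using the line itself as the array index means a repeated line lands on the same key and is stored once, whereas a running counter gives every line its own index. A small portable sketch of just that difference (the function name count_buffered and the sample lines are made up for illustration):

```shell
# count_buffered: hypothetical helper; prints "<entries keyed by line> <entries keyed by counter>".
count_buffered() {
    awk '
    {
        by_line[$0]              # referencing an element creates it; a repeated line reuses its key
        by_counter[++n] = $0     # a fresh numeric index per line keeps repeats
    }
    END {
        for (k in by_line) distinct++
        print distinct + 0, n + 0
    }
    '
}

# Three input lines, one of them repeated.
printf '%s\n' '  `c` int NULL,' '  `a` int NULL,' '  `c` int NULL,' | count_buffered
# prints: 2 3
```

So the vals[$0] form would silently drop a repeated line from a block, which is harmless for column definitions (a table cannot have two identical ones) but matters if the same filter is reused on other input.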

Daweo · Accepted Answer · 2024-02-06 10:59:34Z

how to sort adjacent lines that have the same level of indentation

I would use GNU AWK for this task in the following way. Let the content of file.txt be

ABLE
  CHARLIE
  BAKER
    123
  EASY
  DOG
  FOX

then

awk 'BEGIN{PROCINFO["sorted_in"]="@val_str_asc";indent=-1}{match($0,/[^[:space:]]/)}RSTART==indent{arr[NR]=$0;next}{indent=RSTART;for(i in arr){print arr[i]};delete arr;arr[NR]=$0}END{for(i in arr){print arr[i]}}' file.txt

gives output

ABLE
  BAKER
  CHARLIE
    123
  DOG
  EASY
  FOX

Explanation: I inform GNU AWK that the scanning order should be values-as-strings, ascending, and set indent to -1 (a value that is never reached). For each line I use the match string function to detect the indent size by finding the position of the first non-white-space character. If the indent is the same as the previous line's, I only add the current line to the array arr, keyed by line number; otherwise I save the indent size in the variable indent, output everything in the array in the predefined order, clear the array (so it becomes empty) and add the current line to the array. After all lines are processed (END), I output the content of the array arr.

(tested in GNU Awk 5.1.0)

5 Comments

markp-fuso Over a year ago

for your sample data it appears that 123 is associated with BAKER so in the sorted output I would expect 123 to remain with BAKER (as opposed to moving to CHARLIE)

markp-fuso Over a year ago

if you run your code against OP's sample data, this code will generate a syntactically invalid create table command, ie, PRIMARY KEY first (it does not belong at the top of the list), then column_a - column_c (correctly sorted), then id (with a trailing comma but no follow-on clause)

markp-fuso Over a year ago

consider reformatting the awk script for easier readability; a quick-n-dirty method: awk -o- 'current_awk_script'

steven_noble Over a year ago

This looks to be very close to a solution. When I run it against my actual dump.sql, it sorts the lines that I mentioned need sorting. This reveals the issue mentioned by @markp-fuso: the trailing commas are now not on the correct lines. Also, it sorts the PRIMARY KEY line, which needs to keep its current location. (BTW I've solved this problem in ways that are more MySQL-focused than text-manipulation-focused, but hopefully this discussion is still interesting and useful.)

Ed Morton Over a year ago

You'd need to implement a recursive-descent parser (or write a convoluted solution with loops, variables, and arrays to mimic the functionality that recursion provides) to solve the problem described in this answer, as you need to iteratively sort parent+descendants blocks from the deepest indent to the shallowest to avoid child 123 moving from its parent BAKER to CHARLIE. There are examples of recursive functions written in awk at stackoverflow.com/a/46063483/1745001, stackoverflow.com/a/42736174/1745001, and stackoverflow.com/a/32020697/1745001 if you're interested.
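
For what it's worth, a minimal recursive sketch along the lines of the comment above, in portable awk. It is not taken from any of the linked answers, and the names sort_indented_tree and emit are made up for illustration. It assumes consistent indentation with spaces or tabs and no blank lines, it reads the whole input into memory, and it sorts all siblings under the same parent (not only adjacent runs), so a child such as 123 travels with its parent BAKER.

```shell
# sort_indented_tree: hypothetical helper; prints stdin with sibling blocks
# sorted and every child block kept under its parent line.
sort_indented_tree() {
    awk '
    {
        match($0, /[^ \t]/)      # RSTART = column of the first non-blank character
        ind[NR] = RSTART
        text[NR] = $0 ""         # force a string so "123" is compared as text
    }
    END { emit(1, NR) }

    # Print lines lo..hi: find the sibling heads, sort them, recurse into each body.
    function emit(lo, hi,    base, n, i, j, tmp, head, last) {
        if (lo > hi) return
        base = ind[lo]
        for (i = lo; i <= hi; i++)
            if (ind[i] <= base) head[++n] = i
        for (i = 1; i <= n; i++)
            last[head[i]] = (i < n) ? head[i + 1] - 1 : hi
        for (i = 2; i <= n; i++) {           # insertion sort of heads by line text
            tmp = head[i]
            for (j = i - 1; j >= 1 && text[head[j]] > text[tmp]; j--)
                head[j + 1] = head[j]
            head[j + 1] = tmp
        }
        for (i = 1; i <= n; i++) {
            print text[head[i]]
            emit(head[i] + 1, last[head[i]])
        }
    }
    '
}
```

Run as `sort_indented_tree < file.txt` on the sample from this answer, it should print ABLE, then BAKER followed by its child 123, then CHARLIE, DOG, EASY and FOX. It is still a plain text sort, so on the SQL from the question it would move the PRIMARY KEY line just like the original one-liner does.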
