0

I am in the process of converting the schema of an existing database to Postgresql. I want to automate as much of this as possible, to avoid manual errors.

The original database uses CLUSTERED indices, however PG does not (really) have clustered indices. I want to write a bash script to replace all occurences of CLUSTERED indices, to a postgresql equivalent.

Essentially, I want to SUBSTITUTE lines like this:

CREATE clustered INDEX idx_foobar ON foobar (f1, f2, f3, f4,f5);

with a 2 line replacement like this:

CREATE INDEX idx_foobar ON foobar (f1, f2, f3, f4,f5); CLUSTER foobar;

I think I have worked out the matching logic, I just need help with the regex, as I am not very familiar. The matching logic that seems to work is as follows:

  1. Find a line that starts with CREATE clustered INDEX (the line may begin with one or more non-newline whitespaces)
  2. store the name of the table (it follows one or more whitespace after the ON keyword)
  3. Remove the word clustered from the line matched in 1 above to create substitute text
  4. Append "\nCLUSTER $tablename" to the substitute text in step 3 above
  5. Replace matched line in step1 with the substitute text (obtained in step 4)

Could someone help me in incorporating this logic into a bash script so I can pass it the file to be processed?

Incidentally, I thought I could possibly use sed to do this, but I don't know if it will be easier (i.e. easier to understand) to write a bash script, instead of attempting to do this as a one liner in sed - but I am open to suggestions.

4 Answers 4

1
sed --posix "/CREATE clustered INDEX/ {
   s/ *clustered */ /
   s/ON *\([^( ]*\) *(.*$/& CLUSTER \1;/
   }"

--posix to be available for non GNU also I make another regex than bob Schuster (very good one) just to have an alternative that allow more modification on the line if needed for other purpose like inserting comment in script.

here is the session on a cygwin bash (version oneline)

$ cat sample.txt
CREATE clustered INDEX idx_foobar ON foobar (f1, f2, f3, f4,f5);
blabla;

$ sed --posix "/CREATE clustered INDEX/ {s/ *clustered */ /;s/ON *\([^( ]*\) *(.*$/& CLUSTER \1;/;}" sample.txt
CREATE INDEX idx_foobar ON foobar (f1, f2, f3, f4,f5); CLUSTER foobar;
blabla;
Sign up to request clarification or add additional context in comments.

5 Comments

Hi NeronLeVelu, thanks for your example; however, I tried it and does not seem to work (the second line containing CLUSTER $tablename is not displayed): echo CREATED clustered INDEX idx_foo ON foobar(f1, f2, f3, f4,f5);' | sed --posix "/CREATE clustered INDEX/ {s/ *clusterd *//;s/ON *([^( ]*) *.*$/&; CLUSTER \1;/;}" CREATED clustered INDEX idx_foo ON foobar(f1, f2, f3, f4,f5);`
error interface keyboard/chair, I forgot the last "e" in clustered and ( so here you are: sed --posix "/CREATE clustered INDEX/ {s/ *clustered *//;s/ON *([^( ]*) *(.*$/&; CLUSTER \1;/;}"
Haha NeronLeVelu, I must have the same error interface - I missed that completely too. However, did you test the corrected command at the command line using echo?. I think you will find that it still doesn't work. I must be doing something wrong, I can't believe that all of the answers I have received so far do not work at the command line?!. Please try at the command line and let me know if you are able to get the desired result, using the command line. Thanks
i just test from a cygwin session (bash by default) and with a cat of a file, so not an echo, maybe your echo is modifying the content.[session here after]$ sed --posix "/CREATE clustered INDEX/ {s/ *clustered *//;s/ON *([^( ]*) *(.*$/&; CLUSTER \1;/;}" sample.txt CREATEINDEX idx_foobar ON foobar (f1, f2, f3, f4,f5);; CLUSTER foobar; blabla; [end session] [sample.txt here after]CREATE clustered INDEX idx_foobar ON foobar (f1, f2, f3, f4,f5); blabla; [end of smaple.txt]
Isn't the Postgres syntax more like cluster table using index?
0

You could try sed, for example:

sed -r 's/^\s*(CREATE\s*)clustered\s*(INDEX.*ON\s*)(\w*)(\s+\(.*;)$/\1\2\3\4\nCLUSTER \3;/gi' original.txt > updated.txt

I followed your guidelines which is why the regex is a bit bulky, but you can revise the regex based on the actual content of your input file and whether you want to preserve extraneous spaces.

One good place to experiment with regex is: http://regex101.com

1 Comment

Hi Bob, thanks for your answer. I tried a sample SQL line on it, but as you can see from below, it appears to have the same problem as NeronLeVelu's example in that the last line (CREATE $tablename) is not generated: ` echo 'CREATED clustered INDEX idx_foo ON foobar(f1, f2, f3, f4, f5);' | sed -r 's/^\s*(CREATE\s*)clustered\s*(INDEX.*ON\s*)(\w*)(\s+(.*;)$/\1\2\3\4\nCLUSTER \3;/gi' CREATED clustered INDEX idx_foo ON foobar(f1, f2, f3, f4, f5); `
0

This might work for you (GNU sed):

sed -r 's/^(\s*CREATE) (cluster)ed(.* (\S+) \(.*\);)\s*$/\1\3\n\U\2 \L\4;/' file

Comments

0

Be wary that clustering in Postgres isn't necessarily the same as in the original database you're using (I presume SQL Server?). Per the docs:

Clustering is a one-time operation: when the table is subsequently updated, the changes are not clustered. That is, no attempt is made to store new or updated rows according to their index order. (If one wishes, one can periodically recluster by issuing the command again. (...))

http://www.postgresql.org/docs/current/static/sql-cluster.html

This means that replacing create clustered index on table (...); with create index on table (...); cluster table; isn't going to work the way you're expecting.

In light of that, stick to removing clustered using sed, or make sure you add the additional using index part. If the latter, you'll also want to add an extra cluster table at the very end of the import, to actually cluster the data.

Methinks you ought to remove the clustered references altogether, and worry about adding them at the very end of your import, either manually or by generating an additional SQL file as part of or prior to the removal script.

3 Comments

Thanks for your input Denis. I was under the impression (from the Postgresql doc) that CLUSTER table is an abbreviated form for CLUSTER table USING index. From the docs: Cluster the table employees on the basis of its index employees_ind: CLUSTER employees USING employees_ind; Cluster the employees table using the same index that was used before: CLUSTER employees; Is my understanding wrong?
It's an abbreviation if -- and only if -- the table is already clustered. You first need to issue a cluster ... using .... Only then can you use the abbreviated form, which is really syntactic sugar to avoid relying the index name over and over again when reclustering.
Also, note that writing this kind of script in ruby, python, or whatever other language is a lot less error prone and readable. Stick to shell if and only if you need unusual levels of portability. (And avoid bashisms, so it conforms to posix /bin/sh, when you do.)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.