-1

I have a large SQL script file generated from XML. In some value insert rows, a incorrect number of values is passed, causing the error Column count doesn't match value count. I'd like to track down possible syntax errors. As the SQL script file contains 300k rows, I'd like to write a script for that.

Is there any way to check the numbers of values within a statement like:

INSERT INTO table (
 one,
 two,
 three
)
VALUES (123, 'lorem', 'ipsum');

Any help is greatly appreciated.

3
  • Don't SQL string values need to be single quoted? Commented Nov 14, 2016 at 18:46
  • Also, does the generated file have \r\n line endings? Commented Nov 14, 2016 at 18:47
  • I had different question but of a similar nature not too long ago, try taking a look and see if you can piece something together from this stackoverflow.com/questions/38807810/… @glennjackman yup, string values should be surrounded by a single quote in SQL. Commented Nov 14, 2016 at 18:53

1 Answer 1

0

SQL is very difficult to parse. If your data is pretty simple and your SQL is pretty regular, you might be able to get away with using awk in the way you're hoping, see next. Personally, I would probably inspect the database for inserted values, and scan the script for them, or vice-versa. Or insert a bunch of print statements and see where the error message is interposed.

Hoping for the best in awk, let's give it the old college try:

$ cat dat
INSERT INTO table (
 one,
 two,
 three
)
VALUES (123, 'lorem', 'ipsum');
INSERT INTO table (
 one,
 three
)
VALUES (123, 'lorem', 'ipsum');

$ tr -d \\n < dat | sed 's/;/&\
/g' | awk -F '[()]' 'split($2, cols, /, /) != split($4, vals, /, /) {print}'
INSERT INTO table ( one, three)VALUES (123, 'lorem', 'ipsum');

With tr, we delete the newlines. With sed, we put each SQL statement (ending with ;) on a line. With awk, we split each line using parentheses as delimiters, so that the columns are in $2 and the values are in $4. The split command returns how many fields each of them has, using the comma as a delimiter in both cases. If they don't match, print the line. The last line displayed is the output, because the column name two is missing.

This could return some false positives, which in your case might not be terrible. If the data has semicolons or commas, the splitting will be wrong. If the INSERT doesn't mention column names, it will be wrong. If there are non-insert statements, you'll have to filter them out, or deal with them differently.

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks a lot! This really helped. I used both of your suggestions. I had to split the INSERT statements as some of them had a size of ~3000 lines.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.