
I have a file containing 40 GB of row data, but it is in raw text format (no SQL statements). Each line holds anywhere from one to an unlimited number of rows separated by ;, and each row is a tuple in () brackets with values separated by ,

examples of lines:

(1,'text',NULL,NULL);(2,'string',NULL,1);
(12,'date',123,NULL);(2,'foo',11,15);

Is there a way to import this data using mysqlimport or a LOAD DATA statement, without parsing the data with a programming language? If not, what are the ways to parse it fast, preferably in a few minutes rather than days? When I edit this dump manually with EmEditor (which is supposed to be a streaming editor, but it lags anyway) it takes half an hour to save even a small change...

I have tried the ENCLOSED BY option, but it does not handle the brackets:

LOAD DATA
    INFILE "path"
    INTO TABLE test
    CHARACTER SET utf8
    FIELDS 
        TERMINATED BY ',' 
        ENCLOSED BY ')'
    LINES
        TERMINATED BY ';'

4 Answers


MySQL has no data import tool that understands this format. LOAD DATA can't parse it.

I agree that trying to edit a 40 GB file is beyond the capacity of almost any editor (emacs users can keep their superiority to themselves at this point).

You could write some code to parse it, but I think an easier solution would be to use sed, which is a true streaming editor (the name sed literally means "stream editor").

sed -e 's/;/;\n/g' myfile.txt |
  sed -e '/;/s/^/INSERT INTO `test` VALUES /g' |
  mysql ...options...

Of course, try this on a small sample file first, to make sure it works the way you expect it to.
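
One way to do that dry run (a sketch; myfile.txt and the `test` table are the names used above, and the byte/line limits are arbitrary) is to cut off the first chunk of the file and inspect the generated statements before anything is sent to MySQL:

# the cut may split the last row in half, so ignore one malformed trailing statement
head -c 1000000 myfile.txt |
  sed -e 's/;/;\n/g' |
  sed -e '/;/s/^/INSERT INTO `test` VALUES /' |
  head -n 20

If the preview looks right, replace the final head with the mysql client as shown above.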




You could use a tool like gawk.

gawk -F; "{ for(i=1;i<=NF;i++) if($i!=\"\") print \"INSERT INTO table VALUES \"$i\";\" }NR%1000==1{ print \"COMMIT; START TRANSACTION;\"}" export.txt >import.sql

This will create an output file like:

INSERT INTO table VALUES (1,'text',NULL,NULL);
INSERT INTO table VALUES (2,'string',NULL,1);
COMMIT; START TRANSACTION;
INSERT INTO table VALUES (12,'date',123,NULL);
INSERT INTO table VALUES (2,'foo',11,15);

A COMMIT; START TRANSACTION; is emitted every 1000 input lines (increase that number when needed).

gawk can be downloaded (e.g.) here: http://gnuwin32.sourceforge.net/packages/gawk.htm

  • The only downside on Windows is that you need to escape the " characters with a \ (a Unix-shell variant is sketched below the list).
  • I cannot guess whether this is quick enough; increasing the 1000 might influence import time.
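
For completeness, the same command with Unix-shell quoting might look like this (a sketch; table and export.txt are placeholders exactly as in the command above):

gawk -F';' '
  # one INSERT per non-empty (...) tuple on the line
  { for (i = 1; i <= NF; i++) if ($i != "") print "INSERT INTO table VALUES " $i ";" }
  # batch commits every 1000 input lines
  NR % 1000 == 1 { print "COMMIT; START TRANSACTION;" }
' export.txt > import.sql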



Do it in 2 steps: Use some shell script to turn ; into \n and toss the parens. That should generate a decent file for LOAD DATA to work with.

It might be as simple as

 tr -d '()'  <input.txt  |  tr ';' '\n'  >load.csv

Then use LOAD DATA INFILE with load.csv.

Test:

echo "(1,'text',NULL,NULL);(2,'string',NULL,1);" | tr -d '()' | tr ';' '\n'
1,'text',NULL,NULL
2,'string',NULL,1
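
From there, a LOAD DATA statement along these lines should be able to pick up load.csv (a sketch, reusing the test table and utf8 character set from the question; the file path is a placeholder, and whether the bare word NULL comes through as SQL NULL depends on the ENCLOSED BY setting, so check a small sample first):

LOAD DATA
    INFILE '/path/to/load.csv'
    INTO TABLE test
    CHARACTER SET utf8
    FIELDS
        TERMINATED BY ','
        OPTIONALLY ENCLOSED BY '\''
    LINES
        TERMINATED BY '\n'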

The two previous answers will be perhaps 10 times as slow because they use one row per INSERT. Also, doing it in 2 steps gives you a good chance to look at the intermediate file before throwing it at MySQL. Also, tools like awk and sed are inherently "line-oriented" and would need to suck in all 40GB as a single line; this may cause them to croak. Meanwhile, tr simply fiddles with bytes; even needing to run tr twice should not cause the CPU to break a sweat, nor use more than a trivial amount of RAM.



@BillKarwin: You sure it's all that terrible?

% time ( echo; for a in {1..7}; do
         nice gcat "${m3t}"; done | pvE0 \
  | mawk2 'BEGIN {
       RS = "^$"
       FS = ORS
     } $!-_ = $-_ = \
       sprintf("\n\t%.0s rows::\f%.f\fbytes::\f%.f%.0s\n\n",
               __ = length($(_<_)),
               NF - ("" == $NF), __, FS = RS)' )


      in0: 12.9GiB 0:00:04 [2.86GiB/s] [2.86GiB/s] [                                        <=>    ]

     rows::
               87459925
                       bytes::
                              13884812851


( echo; for a in {1..7}; do; nice gcat "${m3t}"; done | pvE 0.1 in0 | mawk2 ;   

3.49s user 8.24s system 111% cpu 10.487 total

40 GB might be a bit challenging, but 12.9 GB at a time sure isn't much of a problem, and 10.5 secs per 12.9 GB chunk is at least semi-respectable, hopefully.

As for the OP's original ask, something like this already suffices:

echo "(1,'text',NULL,NULL);(2,'string',NULL,1);" \
\
| mawk2 NF=NF FS='[()]+' OFS='' RS=';' 

1,'text',NULL,NULL
2,'string',NULL,1

Tested and confirmed to work on gawk 5.1.1, mawk 1.3.4, mawk 1.996, and nawk
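
To run it over the real dump, something like this should be all it takes (a sketch with hypothetical file names; the grep . drops the blank lines left over from the original line breaks):

mawk2 NF=NF FS='[()]+' OFS='' RS=';' export.txt | grep . > load.csv

The result can then be fed to LOAD DATA as in the previous answer.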

The 4Chan Teller

