
I have a file containing 40 GB of row data, but it is in raw text format (no SQL statements). Each line holds anywhere from one to an unlimited number of rows separated by ;, and each row is a tuple in () brackets with values separated by ,

examples of lines:

(1,'text',NULL,NULL);(2,'string',NULL,1);
(12,'date',123,NULL);(2,'foo',11,15);

Is there a way to import this data using mysqlimport or a LOAD DATA statement, without parsing the data with a programming language? If not, what are the ways to parse it fast, preferably in a few minutes rather than days? When I edit this dump manually with EmEditor (which is supposed to be a streaming editor, but it lags anyway) it takes half an hour to save even a small change...

I have tried the ENCLOSED BY option, but it does not handle the brackets:

LOAD DATA
    INFILE "path"
    INTO TABLE test
    CHARACTER SET utf8
    FIELDS 
        TERMINATED BY ',' 
        ENCLOSED BY ')'
    LINES
        TERMINATED BY ';'

4 Answers


MySQL has no data import tool that understands this format. LOAD DATA can't parse it.

I agree that trying to edit a 40 GB file is beyond the capacity of almost any editor (emacs users can keep their superiority to themselves at this point).

You could write some code to parse it, but I think an easier solution would be to use sed, which is a true streaming editor (the name sed literally means "stream editor").

sed -e 's/;/;\n/g' myfile.txt |
  sed -e '/;/s/^/INSERT INTO `test` VALUES /g' |
  mysql ...options...

Of course, try this on a small sample file first, to make sure it works the way you expect it to.
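
One way to do that dry run (a sketch; myfile.txt and the `test` table are the names used above, and the byte/line limits are arbitrary) is to cut off the first chunk of the file and inspect the generated statements before anything is sent to MySQL:

# the cut may split the last row in half, so ignore one malformed trailing statement
head -c 1000000 myfile.txt |
  sed -e 's/;/;\n/g' |
  sed -e '/;/s/^/INSERT INTO `test` VALUES /' |
  head -n 20

If the preview looks right, replace the final head with the mysql client as shown above.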




You could use a tool like gawk.

gawk -F; "{ for(i=1;i<=NF;i++) if($i!=\"\") print \"INSERT INTO table VALUES \"$i\";\" }NR%1000==1{ print \"COMMIT; START TRANSACTION;\"}" export.txt >import.sql

This will create an output file like:

INSERT INTO table VALUES (1,'text',NULL,NULL);
INSERT INTO table VALUES (2,'string',NULL,1);
COMMIT; START TRANSACTION;
INSERT INTO table VALUES (12,'date',123,NULL);
INSERT INTO table VALUES (2,'foo',11,15);

A COMMIT; START TRANSACTION; is emitted every 1000 input lines (increase that number when needed).

gawk can be downloaded (e.g.) here: http://gnuwin32.sourceforge.net/packages/gawk.htm

  • The only downside on Windows is that you need to escape the " characters with a \ (a Unix-shell variant is sketched below the list).
  • I cannot guess whether this is quick enough; increasing the 1000 might influence import time.
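
For completeness, the same command with Unix-shell quoting might look like this (a sketch; table and export.txt are placeholders exactly as in the command above):

gawk -F';' '
  # one INSERT per non-empty (...) tuple on the line
  { for (i = 1; i <= NF; i++) if ($i != "") print "INSERT INTO table VALUES " $i ";" }
  # batch commits every 1000 input lines
  NR % 1000 == 1 { print "COMMIT; START TRANSACTION;" }
' export.txt > import.sql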



Do it in 2 steps: Use some shell script to turn ; into \n and toss the parens. That should generate a decent file for LOAD DATA to work with.

It might be as simple as

 tr -d '()'  <input.txt  |  tr ';' '\n'  >load.csv

Then use LOAD DATA INFILE with load.csv.

Test:

echo "(1,'text',NULL,NULL);(2,'string',NULL,1);" | tr -d '()' | tr ';' '\n'
1,'text',NULL,NULL
2,'string',NULL,1
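
From there, a LOAD DATA statement along these lines should be able to pick up load.csv (a sketch, reusing the test table and utf8 character set from the question; the file path is a placeholder, and whether the bare word NULL comes through as SQL NULL depends on the ENCLOSED BY setting, so check a small sample first):

LOAD DATA
    INFILE '/path/to/load.csv'
    INTO TABLE test
    CHARACTER SET utf8
    FIELDS
        TERMINATED BY ','
        OPTIONALLY ENCLOSED BY '\''
    LINES
        TERMINATED BY '\n'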

The two previous answers will be perhaps 10 times as slow because they use one row per INSERT. Also, doing it in 2 steps gives you a good chance to look at the intermediate file before throwing it at MySQL. Also, tools like awk and sed are inherently "line-oriented" and would need to suck in all 40GB as a single line; this may cause them to croak. Meanwhile, tr simply fiddles with bytes; even needing to run tr twice should not cause the CPU to break a sweat, nor use more than a trivial amount of RAM.



@BillKarwin: You sure it's all that terrible?

% time ( echo; for a in {1..7}; do
         nice gcat "${m3t}"; done | pvE0 \
  | mawk2 'BEGIN {
       RS = "^$"
       FS = ORS
     } $!-_ = $-_ = \
       sprintf("\n\t%.0s rows::\f%.f\fbytes::\f%.f%.0s\n\n",
               __ = length($(_<_)),
               NF - ("" == $NF), __, FS = RS)' )


      in0: 12.9GiB 0:00:04 [2.86GiB/s] [2.86GiB/s] [                                        <=>    ]

     rows::
               87459925
                       bytes::
                              13884812851


( echo; for a in {1..7}; do; nice gcat "${m3t}"; done | pvE 0.1 in0 | mawk2 ;   

3.49s user 8.24s system 111% cpu 10.487 total

40 GB might be a bit challenging, but 12.9 GB at a time sure isn't much of a problem, and 10.5 secs per 12.9 GB chunk is at least semi-respectable, hopefully.

As for the OP's original ask, something like this already suffices:

echo "(1,'text',NULL,NULL);(2,'string',NULL,1);" \
\
| mawk2 NF=NF FS='[()]+' OFS='' RS=';' 

1,'text',NULL,NULL
2,'string',NULL,1

Tested and confirmed to work on gawk 5.1.1, mawk 1.3.4, mawk 1.996, and nawk
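
To run it over the real dump, something like this should be all it takes (a sketch with hypothetical file names; the grep . drops the blank lines left over from the original line breaks):

mawk2 NF=NF FS='[()]+' OFS='' RS=';' export.txt | grep . > load.csv

The result can then be fed to LOAD DATA as in the previous answer.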

The 4Chan Teller

