I have a very large CSV file containing three columns. I want to load these columns into a MATLAB matrix as fast as possible.

Currently I do this:

    fid = fopen(inputfile, 'rt');
    g = textscan(fid,'%s','delimiter','\r\n');
    tdata = g{1};
    fclose(fid);
    
    results = zeros([numel(tdata)-4], 3);
    tic
    display('start reading data...');
    for r = 4:numel(tdata)
        if ~mod(r, 100) 
            display(['data row: ' num2str(r) ' / ' num2str(numel(tdata))]);
        end
        entries = strsplit(tdata{r}, ',');
        results(r-3,1) = str2double(strrep(entries{1},',', '.'));
        results(r-3,2) = str2double(strrep(entries{2},',', '.'));
        results(r-3,3) = str2double(strrep(entries{3},',', '.'));
    end

However, this takes ~30 seconds for 200 000 lines, i.e. about 150 µs per line, which is really slow. The loop is also not accepted by parfor.

I would like to know what causes the bottleneck in the for loop and how I can speed it up.

Here are the measured times:

    str2double  578253 calls  29.631 s
    strsplit    192750 calls  13.388 s

EDIT: The content of the file has this structure:

  0.000000,  -0.00271,   5394147
  0.000667,  -0.00271,   5394148
  0.001333,  -0.00271,   5394149
  0.002000,  -0.00271,   5394150
  • Have you looked at code profiling? You can also use the "Run and Time" option in the GUI to determine the slow step. Commented Apr 25, 2017 at 14:03
  • I added the results to the post Commented Apr 25, 2017 at 14:09
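
For reference, the profiling suggested in the comment can also be done from the command line. This is only a minimal sketch: the loop is a hypothetical stand-in for the real parsing loop, and the fields read from the result (FunctionName, NumCalls, TotalTime) are the ones I would expect in the profiler's function table.

```matlab
% Profile a small stand-in workload (hypothetical; replace with the real loop).
profile on
x = zeros(1000, 1);
for k = 1:1000
    x(k) = str2double(sprintf('%d', k));
end
profile off

% Collect the results and list the most expensive functions first.
p = profile('info');
ft = p.FunctionTable;
[~, idx] = sort([ft.TotalTime], 'descend');
for k = idx(1:min(5, numel(idx)))
    fprintf('%-40s %8d calls %10.3f s\n', ...
        ft(k).FunctionName, ft(k).NumCalls, ft(k).TotalTime);
end
```

Calling `profile viewer` instead of reading the table opens the same data in the GUI.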

2 Answers

I think a lot can be improved by calling textscan differently.

You do this:

    g = textscan(fid,'%s','delimiter','\r\n');

and then take tdata = g{1}; and split and convert every line yourself.

If textscan is given a numeric format, it already splits all your data and returns it as numbers.

Try this:

    g = textscan(fid,'%f,%f,%f','delimiter','\r\n');

It should return a 1-by-3 cell array in which each cell holds one column of your values. To convert it to a matrix you can use:

    g = cell2mat(g);

I imported 200k lines in 0.12 seconds.

Your code also seems to contain some other workarounds. You start at r=4, so apparently there are 3 lines that you don't want to read. After fopen you can call

    [~] = fgetl(fid);

three times to get to the interesting part of your file.
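
As an alternative to calling fgetl three times, textscan can skip the leading lines itself via its 'HeaderLines' option. The following is a minimal sketch, not the asker's actual setup: the temporary file and the text of its three header lines are made up for illustration, and the data rows are the ones shown in the question.

```matlab
% Build a small sample file: 3 hypothetical header lines plus the rows from the question.
inputfile = [tempname '.csv'];
fid = fopen(inputfile, 'wt');
fprintf(fid, 'header line 1\nheader line 2\nheader line 3\n');
fprintf(fid, '  0.000000,  -0.00271,   5394147\n');
fprintf(fid, '  0.000667,  -0.00271,   5394148\n');
fprintf(fid, '  0.001333,  -0.00271,   5394149\n');
fprintf(fid, '  0.002000,  -0.00271,   5394150\n');
fclose(fid);

% Read three numeric columns, skipping the header lines in the same call.
fid = fopen(inputfile, 'rt');
g = textscan(fid, '%f%f%f', 'Delimiter', ',', 'HeaderLines', 3);
fclose(fid);

% Each cell of g is one column vector; join them into an N-by-3 matrix.
results = cell2mat(g);
delete(inputfile);
disp(size(results));
```

Here the comma is passed as the delimiter instead of being written into the format string; both forms should parse this kind of file.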

You also first split each line with ',' as the separator, but then replace all ',' by '.'. That does nothing: all ',' are already gone, since they were used as separators.
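
To illustrate the point about strrep, here is a small sketch using one row from the question. The variable names are made up for the example. It also shows two ways to convert a whole line in a single call, which would reduce the number of str2double calls if you had to stay with a line-by-line loop.

```matlab
% One sample row, as shown in the question.
txt = '  0.000667,  -0.00271,   5394148';

% After splitting on ',' no piece contains a comma any more,
% so a later strrep(entry, ',', '.') has nothing to replace.
entries = strsplit(txt, ',');
hasComma = any(cellfun(@(s) any(s == ','), entries));

% str2double accepts a cell array, so one call converts all three entries.
row = str2double(entries);

% sscanf parses the whole line directly, without strsplit.
rowFast = sscanf(txt, '%f,%f,%f').';

disp(hasComma);
disp(row);
disp(rowFast);
```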

3 Comments

That does not work. g=textscan(fid,'%f,%f,%f\r\n') gives back only a single line.
It works with g=textscan(fid,'%f,%f,%f','delimiter','\r\n'); dataOutput=cell2mat(g); Now the resulting time is 0.5 seconds. A speed up of 60 times.
That's quite an improvement! I corrected the answer.

If you used csvread you wouldn't need to use str2double or strsplit, which you say are the slow lines... it's likely much quicker for a csv.

You would be able to replace all the above code by:

    results = csvread(inputfile);
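
Since the loop in the question starts at r = 4, the file presumably has 3 leading lines that are not data. In that case csvread needs a row offset. A minimal sketch, assuming exactly 3 header lines; the temporary file and header text are made up for illustration:

```matlab
% Build a small sample file: 3 hypothetical header lines plus the rows from the question.
inputfile = [tempname '.csv'];
fid = fopen(inputfile, 'wt');
fprintf(fid, 'header line 1\nheader line 2\nheader line 3\n');
fprintf(fid, '  0.000000,  -0.00271,   5394147\n');
fprintf(fid, '  0.000667,  -0.00271,   5394148\n');
fprintf(fid, '  0.001333,  -0.00271,   5394149\n');
fprintf(fid, '  0.002000,  -0.00271,   5394150\n');
fclose(fid);

% Row and column offsets are zero-based: start reading at row 3, column 0.
results = csvread(inputfile, 3, 0);
delete(inputfile);
disp(size(results));
```

In newer MATLAB releases csvread is marked as not recommended, and readmatrix with the 'NumHeaderLines' option appears to be the intended replacement.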
