matlab: speed up for loop with string analysis

Question

I have a very large csv file containing three columns. Now I want to load these columns into a matlab matrix as fast as possible.

Currently, this is what I do:

    fid = fopen(inputfile, 'rt');
    g = textscan(fid,'%s','delimiter','\r\n');
    tdata = g{1};
    fclose(fid);
    
    results = zeros([numel(tdata)-4], 3);
    tic
    display('start reading data...');
    for r = 4:numel(tdata)
        if ~mod(r, 100) 
            display(['data row: ' num2str(r) ' / ' num2str(numel(tdata))]);
        end
        entries = strsplit(tdata{r}, ',');
        results(r-3,1) = str2double(strrep(entries{1},',', '.'));
        results(r-3,2) = str2double(strrep(entries{2},',', '.'));
        results(r-3,3) = str2double(strrep(entries{3},',', '.'));
    end

However, this takes ~30 seconds for 200 000 lines, i.e. 150 µs per line, which is really slow. The code is also not accepted by parfor.

Now I would like to know what causes the bottleneck in the for loop and how I can speed it up.

Here are the measured times:

    str2double   578253 calls   29.631 s
    strsplit     192750 calls   13.388 s

EDIT: The content of the file has this structure:

  0.000000,  -0.00271,   5394147
  0.000667,  -0.00271,   5394148
  0.001333,  -0.00271,   5394149
  0.002000,  -0.00271,   5394150

Have you looked at code profiling? You can also use the "Run and Time" option in the GUI to determine the slow step.
– qbzenker, Commented Apr 25, 2017 at 14:03

Gelliant · Accepted Answer · 2017-04-26 11:54:10Z

1

I think a lot can be improved by calling textscan differently.

You do this:

    g = textscan(fid,'%s','delimiter','\r\n');

and then take tdata = g{1}; so every line comes back as a single string that you have to split and convert yourself.

If textscan is given a numeric format, it should already split all your data and give it back as numbers.

Try this:

    g = textscan(fid,'%f,%f,%f','delimiter','\r\n');

It should give you back a cell array with three cells, one per column, each holding the values of that column as a numeric vector. To convert to a matrix you can use:

    g = cell2mat(g);

I imported 200k lines in 0.12 seconds.
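To check a timing like this on your own machine, something along these lines could work. It is only a sketch under assumptions: the file is synthetic (the values merely mimic the sample rows in the question), the names n, data and elapsed are made up here, and it uses the format '%f%f%f' with ',' as the delimiter, which is a variant I am swapping in for this layout rather than the exact call above.

```matlab
% Generate a synthetic three-column file (assumed layout, no header lines).
n = 200000;                                   % row count from the question
data = [(0:n-1)'/1500, -0.00271*ones(n,1), 5394147 + (0:n-1)'];
inputfile = [tempname '.csv'];                % hypothetical temporary file
fid = fopen(inputfile, 'wt');
fprintf(fid, '%10.6f,%10.5f,%10d\n', data');  % one row per line
fclose(fid);

% Time a single textscan call that parses all three numeric columns.
fid = fopen(inputfile, 'rt');
tic
g = textscan(fid, '%f%f%f', 'Delimiter', ',');
elapsed = toc;
fclose(fid);

results = cell2mat(g);                        % n-by-3 double matrix
delete(inputfile);
fprintf('read %d rows in %.3f s\n', size(results, 1), elapsed);
```

The elapsed time will depend on the machine and MATLAB release, so treat the numbers quoted in this thread as rough guidance only.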

Your code also seems to contain some other workarounds. The loop starts at r=4, so it seems you have 3 header lines that you don't want to read. After fopen you can call

    [~] = fgetl(fid);

three times to get to the interesting part of your file.
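As a rough sketch of the header skipping, here are two ways that should give the same result: calling fgetl three times by hand, or passing the 'HeaderLines' option to textscan. The three header lines are an assumption taken from the loop starting at r = 4, and the file contents, the variable names a and b, and the '%f%f%f' format with ',' as delimiter are made up for illustration.

```matlab
% Write a tiny sample file: three header lines (assumed), then data rows.
inputfile = [tempname '.csv'];            % hypothetical temporary file
fid = fopen(inputfile, 'wt');
fprintf(fid, 'header 1\nheader 2\nheader 3\n');
fprintf(fid, '  0.000000,  -0.00271,   5394147\n');
fprintf(fid, '  0.000667,  -0.00271,   5394148\n');
fclose(fid);

% Option 1: discard the header lines by hand, then scan the rest.
fid = fopen(inputfile, 'rt');
for k = 1:3
    fgetl(fid);
end
a = cell2mat(textscan(fid, '%f%f%f', 'Delimiter', ','));
fclose(fid);

% Option 2: let textscan skip them via its 'HeaderLines' option.
fid = fopen(inputfile, 'rt');
b = cell2mat(textscan(fid, '%f%f%f', 'Delimiter', ',', 'HeaderLines', 3));
fclose(fid);

delete(inputfile);
```

Option 2 keeps everything in one call, which is probably the tidier choice if the number of header lines is fixed.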

You also first split each line with ',' as the separator, and then replace all ',' by '.'. That replacement will not do anything: all ',' are already gone, since they were used as separators.
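A quick way to convince yourself of this, using one of the sample rows from the question (the variable names row, anyComma, unchanged and values are just for illustration):

```matlab
row = '  0.000667,  -0.00271,   5394148';    % sample row from the question
entries = strsplit(row, ',');                % 1-by-3 cell array of pieces

% No piece contains a comma any more, so strrep has nothing to replace.
anyComma  = any(cellfun(@(s) any(s == ','), entries));
unchanged = isequal(strrep(entries, ',', '.'), entries);

% str2double accepts a cell array, so one call converts all three pieces.
values = str2double(entries);
```

So if you do stay with the loop, the three strrep calls could be dropped and the three str2double calls merged into one, although that will likely still be much slower than a single textscan call.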

edited Apr 26, 2017 at 11:54

answered Apr 25, 2017 at 14:56


3 Comments

Matthias Pospiech Over a year ago

That does not work. g=textscan(fid,'%f,%f,%f\r\n') gives back only a single line.

Matthias Pospiech Over a year ago

It works with g=textscan(fid,'%f,%f,%f','delimiter','\r\n'); dataOutput=cell2mat(g); Now the resulting time is 0.5 seconds. A speed up of 60 times.

Gelliant Over a year ago

That's quite an improvement! I corrected the answer.

Wolfie · Accepted Answer · 2017-04-25 14:20:36Z

1

If you used csvread you wouldn't need to use str2double or strsplit, which you say are the slow lines... it's likely much quicker for a csv.

You would be able to replace all the above code by:

    results = csvread(inputfile);
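One caveat: csvread only handles numeric content, so if the file really starts with three text header lines (an assumption based on the loop in the question starting at r = 4), the plain call will probably fail. csvread takes zero-based row and column offsets that can skip them. A minimal sketch with a made-up sample file:

```matlab
% Write a tiny sample file: three header lines (assumed), then data rows.
inputfile = [tempname '.csv'];            % hypothetical temporary file
fid = fopen(inputfile, 'wt');
fprintf(fid, 'header 1\nheader 2\nheader 3\n');
fprintf(fid, '  0.000000,  -0.00271,   5394147\n');
fprintf(fid, '  0.000667,  -0.00271,   5394148\n');
fclose(fid);

% Start reading at row offset 3, column offset 0 (both zero-based).
results = csvread(inputfile, 3, 0);

delete(inputfile);
```

In newer MATLAB releases (R2019a and later) csvread is marked as not recommended in favor of readmatrix, so that may be worth a look if you are on a recent version.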

answered Apr 25, 2017 at 14:20
