
I have a large set of data (~1 million entries), stored as a cell with a number of columns and many, many rows. My issue is that I need to identify entries that occur at the same time, and then manipulate other columns so as to remove the rows with repeated dates without losing all of the information.

An example of a subset of such data could be initialized like so:

data = {'10:30', 100; '10:30', 110; '10:31', 115;'10:32', 110}

That is, I have a cell with one column of strings (representing time), and another column (many in the real data) of doubles.

My code should notice the repeated 10:30 (there could be many such repeats), then take the corresponding doubles (100 and 110) as inputs to some function, f(100,110), and then remove the repeated row from the data.

I.e. if the function were, say, to average, I should have an output that looks something like

data =
       '10:30' [105]
       '10:31' [115]
       '10:32' [110]

This would be fairly simple if loops were fast enough, but with my data set, there is no point in even attempting a solution involving looping through.

I have gotten as far as

[uniqueElements, firstUniquePosition, commonSets] = unique(data(:,1));

after much fiddling around, which yields some information that appears useful,

uniqueElements = 

    '10:30'
    '10:31'
    '10:32'

firstUniquePosition =

     1
     3
     4

commonSets =

     1
     1
     2
     3

but I can't quite figure out how to make a vectorised statement that allows me to manipulate the elements with common dates.

I imagine it will involve cellfun at some point, but I don't know enough of MATLAB's functionality to implement it yet without a push in the right direction.

  • Which Matlab version are you using? If it's one of the recent versions there are convenient functions that can be used. Commented Apr 21, 2015 at 19:10
  • How many unique dates do you have? Commented Apr 21, 2015 at 19:10
  • @Setsu, I'm using R2013a. Is that recent enough for these functions? I have this issue sorted, but there may be more later. Commented Apr 22, 2015 at 7:12
  • @Cecilia I'm not sure, really; the data is so large that I can't just look at it to summarise its nature. I know that in a subset of about 120000 entries, there were about 55000 unique dates. Pairs and triples were quite common. Commented Apr 22, 2015 at 7:12

2 Answers


That is a job for accumarray:

[times,~,subs] = unique(data(:,1));
idx = 1:size(data,1);
meanOfCommonTimes = accumarray(subs(:),idx(:),[],@(x) mean( [data{x,2}] ))

output = [times num2cell(meanOfCommonTimes)]

output = 

    '10:30'    [105]
    '10:31'    [115]
    '10:32'    [110]

Talking about 1 million elements and performance: consider storing your time data as numeric values using the datenum function.

times = datenum(data(:,1),'HH:MM');

and also keep your data in double arrays:

vals = cell2mat(data(:,2));

The calculations will then be up to 10 times faster!

[~,~, subs] = unique(times);
meanOfCommonTimes = accumarray(subs(:),vals(:),[],@mean);

But be aware that the conversion itself also takes quite some time. If you do a lot of calculations later on, it could be worth it.
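
For example, here is a minimal sketch of converting once up front and then reusing the numeric arrays for several aggregations (the variable names meanPerTime, maxPerTime and countPerTime are just placeholders):

% one-time conversion: time strings -> serial date numbers, cell column -> double column
times = datenum(data(:,1), 'HH:MM');
vals  = cell2mat(data(:,2));

% compute the group index once and reuse it for every statistic
[uniqueTimes, ~, subs] = unique(times);

meanPerTime  = accumarray(subs, vals, [], @mean);
maxPerTime   = accumarray(subs, vals, [], @max);
countPerTime = accumarray(subs, 1);   % number of rows per time stamp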


Benchmark

function [t] = bench()
    data = {'10:30', 100; '10:30', 110; '10:31', 115;'10:32', 110};
    data = repmat(data, 200000, 1); % replicate the 4 example rows to 800,000 rows

    % functions to compare
    fcns = {
        @() thewaywewalk(data);
        @() Cecilia(data);
    };

    thewayw = timeit(fcns{1})
    Ceci = timeit(fcns{2})
end

function Z = Cecilia(data)
    [uniqueElements, ~, commonSets] = unique(data(:,1));

    num_unique = length(uniqueElements);
    Z = zeros(num_unique, 1);
    for ii = 1:num_unique
        Z(ii) = mean([data{commonSets==ii, 2}]);
    end
end
function Z = thewaywewalk(data)
    [~,~,subs] = unique(data(:,1));
    idx = 1:size(data,1);
    Z = accumarray(subs(:),idx(:),[],@(x) mean( [data{x,2}] ));
end

The results are almost equal for an array with 800000 rows.

thewayw =  1.1483
Ceci = 1.0957

But again, accumarray would benefit greatly from a prior conversion to double, while the performance of the loop should stay roughly the same in this case.
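
For reference, a rough sketch of what such an all-double variant could look like (the function name allDouble is made up here; the conversion is deliberately done outside the timed function so that only the aggregation is measured):

function Z = allDouble(times, vals)
    % times and vals are plain double columns, converted beforehand
    % with datenum and cell2mat
    [~, ~, subs] = unique(times);
    Z = accumarray(subs, vals, [], @mean);
end

It could be timed the same way as the other candidates, e.g. (assuming data is the cell array from the benchmark above):

times   = datenum(data(:,1), 'HH:MM');
vals    = cell2mat(data(:,2));
tDouble = timeit(@() allDouble(times, vals));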


3 Comments

Nice solution. Just a note, the first output row's second element should be 105.
Great solution with accumarray, though there is one gotcha the OP should note about. This solution assumes that the function f() can be safely applied to all rows, including those that are unique. mean works here because it is a no-op on scalars, but depending on the function the OP wants to apply this may not be the case.
Excellent; this solves my issue simply and efficiently. I do have many calculations to do in the weeks to come, so I may very well convert the data beforehand, as you suggest; your benchmarks to that end are much appreciated. It's amazing how simple this is with accumarray; reading the documentation made it immediately clear it was the way to go, if only I had known of its existence. Thanks so much for your prompt and thorough response.

Method

How badly a loop performs depends on how many unique dates you have, not on how many data points. If your number of unique dates is small, you can do the following:

data = {'10:30', 100; '10:30', 110; '10:31', 115;'10:32', 110};
[uniqueElements, firstUniquePosition, commonSets] = unique(data(:,1));

num_unique = length(uniqueElements);
mean_of_times = zeros(num_unique, 1);
for ii = 1:num_unique
    mean_of_times(ii) = mean([data{commonSets==ii, 2}]);
end

output = [uniqueElements num2cell(mean_of_times)]

output = 

    '10:30'    [105]
    '10:31'    [115]
    '10:32'    [110]

Benchmarking

So how bad is a for loop? I tested up to 20,000 unique dates with 100 times as many rows, for a total of 2,000,000 rows. Here's the plot from my experiment. It might be a bit tricky to make out but the accumarray points are all down by the x-axis.

[Plot: number of unique dates vs. runtime]

Here's the experiment code

figure; hold on;
kk = 100; %Make 100 times as many rows as dates
for jj = 5000:5000:20000
    dates = 1:jj;
    times = rand(jj*kk, 1);
    % I use a matrix rather than a cell array for the simplicity of randomly generating example data
    data = [repmat(dates, 1, kk)' times];
    data = data(randperm(jj*kk), :); %Shuffle data rows

    [uniqueElements,~,commonSets] = unique(data(:,1));

    %thewaywewalk's solution using accumarray
    tic;
    idx = 1:size(data,1);
    accumarray(commonSets(:),idx(:),[],@(x) mean( [data(x,2)] ));
    stopwatch = toc;
    plot(jj, stopwatch, 'b.'); 

    %my solution using a for loop
    tic;
    num_unique = length(uniqueElements);
    mean_of_times = zeros(num_unique, 1);
    for ii = 1:num_unique
        mean_of_times(ii) = mean([data(commonSets==ii, 2)]);
    end
    stopwatch = toc;
    plot(jj, stopwatch, 'r.'); 
end

This experiment only tests a ratio of 1% unique dates to rows. The for loop would be even slower with more unique dates. thewaywewalk's answer benchmarks against a dataset with only 3 unique dates. Even with such a small number of unique dates, accumarray and the for loop have similar runtimes. Conclusion? Use accumarray.
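
If you want to probe that claim yourself, a rough variation of the experiment above is to sweep the rows-per-date factor kk instead of fixing it at 100 (the values below are arbitrary); a smaller kk means a higher fraction of unique dates:

jj = 20000;                              % number of unique dates
for kk = [2 10 100]                      % rows per unique date
    dates = 1:jj;
    times = rand(jj*kk, 1);
    data = [repmat(dates, 1, kk)' times];
    data = data(randperm(jj*kk), :);     % shuffle rows
    % ... time the accumarray and for-loop versions exactly as above ...
end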

2 Comments

I tend to say there is something wrong with your benchmark. "I use a matrix rather than a cell array for the simplicity of randomly generating example data" is a rather unfair assumption, as accumarray would be much faster without cell arrays. Anyway, have a look at my benchmark.
@thewaywewalk My benchmarking code had a major error in it. After correction, 'accumarray' is far better than the for loop, even without using a cell array. I think your benchmark doesn't show a large time difference because the example dataset you use only has 3 unique times. Interesting question all around.
