
I have a large set of data (~1 million entries), stored as a cell with a number of columns and many, many rows. My issue is that I need to identify entries that occur at the same time, and then manipulate other columns so as to remove the rows with repeated dates without losing all of the information.

An example of a subset of such data could be initialized like so:

data = {'10:30', 100; '10:30', 110; '10:31', 115;'10:32', 110}

That is, I have a cell with one column of strings (representing time), and another column (many in the real data) of doubles.

My code should notice the repeated 10:30 (there could be many such repeats), then take the corresponding doubles (100 and 110) as inputs to some function, f(100,110), and then remove the repeated row from the data.

I.e. if the function were, say, to average, I should have an output that looks something like

data =
       '10:30' [105]
       '10:31' [115]
       '10:32' [110]

This would be fairly simple if loops were fast enough, but with my data set, there is no point in even attempting a solution involving looping through.

I have gotten as far as

[uniqueElements, firstUniquePosition, commonSets] = unique(data(:,1));

after much fiddling around, which yields some information that appears useful,

uniqueElements = 

    '10:30'
    '10:31'
    '10:32'

firstUniquePosition =

     1
     3
     4

commonSets =

     1
     1
     2
     3

but I can't quite figure out how to make a vectorised statement that allows me to manipulate the elements with common dates.

I imagine it will involve cellfun at some point, but I don't know enough of MATLAB's functionality to implement it yet without a push in the right direction.

  • Which Matlab version are you using? If it's one of the recent versions there are convenient functions that can be used. Commented Apr 21, 2015 at 19:10
  • How many unique dates do you have? Commented Apr 21, 2015 at 19:10
  • @Setsu, I'm using R2013a. Is that recent enough for these functions? I have this issue sorted, but there may be more later. Commented Apr 22, 2015 at 7:12
  • @Cecilia I'm not sure, really; the data is so large that I can't just look at it to summarise its nature. I know that in a subset of about 120000 entries, there were about 55000 unique dates. Pairs and triples were quite common. Commented Apr 22, 2015 at 7:12

2 Answers


That is a job for accumarray:

[times,~,subs] = unique(data(:,1));
idx = 1:size(data,1);
meanOfCommonTimes = accumarray(subs(:),idx(:),[],@(x) mean( [data{x,2}] ))

output = [times num2cell(meanOfCommonTimes)]

output = 

    '10:30'    [105]
    '10:31'    [115]
    '10:32'    [110]

Talking about 1 million elements and performance: consider storing your time data as numeric values using the datenum function.

times = datenum(data(:,1),'HH:MM');

and also keep your data in double arrays:

vals = cell2mat(data(:,2));

The calculations will then be up to 10 times faster!

[~,~, subs] = unique(times);
meanOfCommonTimes = accumarray(subs(:),vals(:),[],@mean);

But be aware that the conversion itself also takes quite some time. If you do a lot of calculations later on, it could be worth it.
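
For example, here is a minimal sketch of converting once up front and then reusing the numeric arrays for several aggregations (the variable names meanPerTime, maxPerTime and countPerTime are just placeholders):

% one-time conversion: time strings -> serial date numbers, cell column -> double column
times = datenum(data(:,1), 'HH:MM');
vals  = cell2mat(data(:,2));

% compute the group index once and reuse it for every statistic
[uniqueTimes, ~, subs] = unique(times);

meanPerTime  = accumarray(subs, vals, [], @mean);
maxPerTime   = accumarray(subs, vals, [], @max);
countPerTime = accumarray(subs, 1);   % number of rows per time stamp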


Benchmark

function [t] = bench()
    data = {'10:30', 100; '10:30', 110; '10:31', 115;'10:32', 110};
    data = repmat(data, 200000, 1); % replicate the 4 example rows to 800,000 rows

    % functions to compare
    fcns = {
        @() thewaywewalk(data);
        @() Cecilia(data);
    };

    thewayw = timeit(fcns{1})
    Ceci = timeit(fcns{2})
end

function Z = Cecilia(data)
    [uniqueElements, ~, commonSets] = unique(data(:,1));

    num_unique = length(uniqueElements);
    Z = zeros(num_unique, 1);
    for ii = 1:num_unique
        Z(ii) = mean([data{commonSets==ii, 2}]);
    end
end
function Z = thewaywewalk(data)
    [~,~,subs] = unique(data(:,1));
    idx = 1:size(data,1);
    Z = accumarray(subs(:),idx(:),[],@(x) mean( [data{x,2}] ));
end

The results are almost equal for an array with 800000 rows.

thewayw =  1.1483
Ceci = 1.0957

But again, accumarray would benefit greatly from a prior conversion to double, while the performance of the loop should stay roughly the same in this case.
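
For reference, a rough sketch of what such an all-double variant could look like (the function name allDouble is made up here; the conversion is deliberately done outside the timed function so that only the aggregation is measured):

function Z = allDouble(times, vals)
    % times and vals are plain double columns, converted beforehand
    % with datenum and cell2mat
    [~, ~, subs] = unique(times);
    Z = accumarray(subs, vals, [], @mean);
end

It could be timed the same way as the other candidates, e.g. (assuming data is the cell array from the benchmark above):

times   = datenum(data(:,1), 'HH:MM');
vals    = cell2mat(data(:,2));
tDouble = timeit(@() allDouble(times, vals));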


3 Comments

Nice solution. Just a note, the first output row's second element should be 105.
Great solution with accumarray, though there is one gotcha the OP should note about. This solution assumes that the function f() can be safely applied to all rows, including those that are unique. mean works here because it is a no-op on scalars, but depending on the function the OP wants to apply this may not be the case.
Excellent; this solves my issue simply and efficiently. I do have many calculations to do in the weeks to come, so I may very well convert the data beforehand, as you suggest; your benchmarks to that end are much appreciated. It's amazing how simple this is with accumarray; reading the documentation made it immediately clear it was the way to go, if only I had known of its existence. Thanks so much for your prompt and thorough response.

Method

How badly a loop performs depends on how many unique dates you have, not on how many data points. If your number of unique dates is small, you can do the following:

data = {'10:30', 100; '10:30', 110; '10:31', 115;'10:32', 110};
[uniqueElements, firstUniquePosition, commonSets] = unique(data(:,1));

num_unique = length(uniqueElements);
mean_of_times = zeros(num_unique, 1);
for ii = 1:num_unique
    mean_of_times(ii) = mean([data{commonSets==ii, 2}]);
end

output = [uniqueElements num2cell(mean_of_times)]

output = 

    '10:30'    [105]
    '10:31'    [115]
    '10:32'    [110]

Benchmarking

So how bad is a for loop? I tested up to 20,000 unique dates with 100 times as many rows, for a total of 2,000,000 rows. Here's the plot from my experiment. It might be a bit tricky to make out but the accumarray points are all down by the x-axis.

[Plot: number of unique dates vs. runtime]

Here's the experiment code

figure; hold on;
kk = 100; %Make 100 times as many rows as dates
for jj = 5000:5000:20000
    dates = 1:jj;
    times = rand(jj*kk, 1);
    % I use a matrix rather than a cell array for the simplicity of randomly generating example data
    data = [repmat(dates, 1, kk)' times];
    data = data(randperm(jj*kk), :); %Shuffle data rows

    [uniqueElements,~,commonSets] = unique(data(:,1));

    %thewaywewalk's solution using accumarray
    tic;
    idx = 1:size(data,1);
    accumarray(commonSets(:),idx(:),[],@(x) mean( [data(x,2)] ));
    stopwatch = toc;
    plot(jj, stopwatch, 'b.'); 

    %my solution using a for loop
    tic;
    num_unique = length(uniqueElements);
    mean_of_times = zeros(num_unique, 1);
    for ii = 1:num_unique
        mean_of_times(ii) = mean([data(commonSets==ii, 2)]);
    end
    stopwatch = toc;
    plot(jj, stopwatch, 'r.'); 
end

This experiment only tests a ratio of 1% unique dates to rows. The for loop would be even slower with more unique dates. thewaywewalk's answer benchmarks against a dataset with only 3 unique dates. Even with such a small number of unique dates, accumarray and the for loop have similar runtimes. Conclusion? Use accumarray.
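
If you want to probe that claim yourself, a rough variation of the experiment above is to sweep the rows-per-date factor kk instead of fixing it at 100 (the values below are arbitrary); a smaller kk means a higher fraction of unique dates:

jj = 20000;                              % number of unique dates
for kk = [2 10 100]                      % rows per unique date
    dates = 1:jj;
    times = rand(jj*kk, 1);
    data = [repmat(dates, 1, kk)' times];
    data = data(randperm(jj*kk), :);     % shuffle rows
    % ... time the accumarray and for-loop versions exactly as above ...
end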

2 Comments

I tend to say there is something wrong with your benchmark. "I use a matrix rather than a cell array for the simplicity of randomly generating example data" is a rather unfair assumption, as accumarray would be much faster without cell arrays. Anyway, have a look at my benchmark.
@thewaywewalk My benchmarking code had a major error in it. After correction, 'accumarray' is far better than the for loop, even without using a cell array. I think your benchmark doesn't show a large time difference because the example dataset you use only has 3 unique times. Interesting question all around.
