MATLAB find number of replicates of substring array, in cell array of strings

Question

I have a MATLAB cell array of strings and a second array with partial strings:

base = {'a','b','c','d'}
all2 = {'a1','b1','c1','d1','a2','b2','c2','d2','q8','r15'}

The output is:

base = 

    'a'    'b'    'c'    'd'


all2 = 

    'a1'    'b1'    'c1'    'd1'    'a2'    'b2'    'c2'    'd2'    'q8'    'r15'

Problem/Requirement

If any of 'a1','b1','c1','d1' AND any of 'a2','b2','c2','d2' are present in the all2 array, then return a variable numb=2.

If any of 'a1','b1','c1','d1' AND any of 'a2','b2','c2','d2' AND any of 'a3','b3','c3','d3' are present in the all2 array, then return a variable numb=3.

Attempts

1.

Based on strfind (this approach), I tried matches = strfind(all2,base); but I got this error:

`Error using strfind`

`Input strings must have one row.`
....

2.

This other approach using strfind seemed better, but it just gave me:

fun = @(s)~cellfun('isempty',strfind(all2,s));
out = cellfun(fun,base,'UniformOutput',false)
idx = all(horzcat(out{:}));
idx(1,1) 

out = 

[1x10 logical]    [1x10 logical]    [1x10 logical]    [1x10 logical]


ans =

     0

Neither of these attempts has worked. I think my logic is incorrect.

3.

This answer finds the indices of all elements in an array of strings that match any entry in an array of partial strings:

base = regexptranslate('escape', base);
matches = false(size(all2));
for k = 1:numel(all2)
    matches(k) = any(~cellfun('isempty', regexp(all2{k}, base)));
end
matches

Output:

matches =

     1     1     1     1     1     1     1     1     0     0

My problem with this approach: How do I use the output matches to calculate numb=2? I am not sure if this is the most relevant logic for my specific question since it only gives matching indices.

Question

Is there a way to do this in MATLAB?

EDIT

Additional Information:

The array all2 WILL always be contiguous. A scenario of all2 = {'a1','b1','c1','d1','a3','b3','c3','d3','q8','r15'} is not possible.

What should happen when numbers aren't contiguous? Like all2 = {'a1' 'a3' 'a4'}; Should that return numb = 3?
– gnovice, Commented Apr 19, 2017 at 20:17
@gnovice I assume you meant all2 = {'a1', 'a3', 'a4'};. If so, then you are correct. If all2 = {'a1', 'a3', 'a4'}, then the return should be numb=3. Using my example in the OP: If any of 'a1',... AND any of 'a2',... AND any of 'a3',... are present in the all2 array, then return a variable numb=3.
– edesz, Commented Apr 19, 2017 at 20:23
Yes, there is no a2. However, there are still a3 and a4. So both contiguous and non-contiguous are required.
– edesz, Commented Apr 19, 2017 at 20:25

sco1 · Accepted Answer · 2017-04-19 21:05:15Z


Using a regex to find the unique suffixes to the base elements:

base = {'a','b','c','d'};
all2 = {'a1','b1','c1','d1','a2','b2','c2','d2', 'a4', 'q8','r15'};

% Use sprintf to build the expression so we can concatenate all the values
% of base into a single string; this is the [c1c2c3] metacharacter.
% Assumes the values of base are going to be one character
%
% This regex looks for one or more digits preceded by a character from
% base and returns only the digits that match this criterion.
regexstr = sprintf('(?<=[%s])(\\d+)', [base{:}]);

% Use once to eliminate a cell array level
test = regexp(all2, regexstr, 'match', 'once');

% Convert the digits to a double array
digits = str2double(test);

% Return the number of unique digits. With isnan() we can use logical indexing
% to ignore the NaN values
num = numel(unique(digits(~isnan(digits))));

Which returns:

>> num

num =

     3
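
To see where the 3 comes from, it can help to look at the intermediate values. The following is a minimal sketch that reruns the same steps on the same inputs; the variable name valid is introduced here only for illustration and is not part of the code above.

```matlab
base = {'a','b','c','d'};
all2 = {'a1','b1','c1','d1','a2','b2','c2','d2','a4','q8','r15'};

regexstr = sprintf('(?<=[%s])(\\d+)', [base{:}]);

% With 'once', entries that do not match come back as empty char arrays
test = regexp(all2, regexstr, 'match', 'once');

% str2double turns the empty entries (here 'q8' and 'r15') into NaN
digits = str2double(test);

% Keep only the real suffixes (valid is an illustrative name)
valid = digits(~isnan(digits));   % [1 1 1 1 2 2 2 2 4]

% Three distinct suffixes: 1, 2 and 4
num = numel(unique(valid));
```

Because 'q8' and 'r15' do not start with a character from base, they drop out at the isnan step and do not affect the count.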

If you need contiguous digits, then something like this should work:

base = {'a','b','c','d'};
all2 = {'a1','b1','c1','d1','a2','b2','c2','d2', 'a4', 'q8','r15'};

regexstr = sprintf('(?<=[%s])(\\d+)', [base{:}]);
test = regexp(all2, regexstr, 'match', 'once');
digits = str2double(test);

% Find the unique digits, with isnan() we can use logical indexing to ignore the
% NaN values
unique_digits = unique(digits(~isnan(digits)));

% Because unique returns sorted values, we can use this to find where the
% first difference between digits is greater than 1. Append Inf at the end to
% handle the case where all values are continuous.
num = find(diff([unique_digits Inf]) > 1, 1);  % Thanks @gnovice :)

Which returns:

>> num

num =

     2
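
One caveat, assuming the requirement in the question means the count has to begin at suffix 1: the find/diff line returns the length of the first contiguous run wherever that run starts. The sketch below contrasts the two readings. The input, the name run_len, and the while loop are illustrative assumptions, not part of the original answer.

```matlab
base = {'a','b','c','d'};
all2 = {'a2','b2','a3','q8'};   % hypothetical input with no suffix 1

regexstr = sprintf('(?<=[%s])(\\d+)', [base{:}]);
digits = str2double(regexp(all2, regexstr, 'match', 'once'));
unique_digits = unique(digits(~isnan(digits)));   % [2 3]

% Length of the first contiguous run, wherever it starts (as in the answer)
run_len = find(diff([unique_digits Inf]) > 1, 1);   % 2 for this input

% Hypothetical stricter variant: count 1, 2, 3, ... until a suffix is missing
numb = 0;
while ismember(numb + 1, unique_digits)
    numb = numb + 1;
end
% numb stays 0 here because suffix 1 never appears
```

For the all2 used in the answer, both readings give 2, so the distinction only matters if the data can start at a suffix other than 1.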

Breaking down the regexp and sprintf lines: Because we know that base only consists of single characters, we can use the [c1c2c3] metacharacter, which will match any character inside the brackets. So if we have '[rp]ain' we'll match 'rain' or 'pain', but not 'gain'.

base{:} returns what MATLAB calls a comma-separated list. Adding the brackets concatenates the result into a single character array.

Without brackets:

>> base{:}

ans =

    'a'


ans =

    'b'


ans =

    'c'


ans =

    'd'

With brackets:

>> [base{:}]

ans =

    'abcd'

Which we can insert into our expression string with sprintf. This gives us (?<=[abcd])(\d+), which matches one or more digits preceded by one of a, b, c, or d.
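
If base could contain multi-character entries, the bracket expression no longer applies, because brackets match a single character. One possible alternative, sketched here under that assumption, joins the escaped entries with | and anchors the pattern to the whole string. The inputs and the names alts, tok, and hit are hypothetical.

```matlab
base = {'ab','c','xy'};                       % hypothetical multi-character prefixes
all2 = {'ab1','c1','xy2','q8','r15','zab3'};  % hypothetical data

% Escape any regex metacharacters, then join the prefixes with | (alternation)
alts = strjoin(regexptranslate('escape', base), '|');

% ^ and $ require the whole string to be one prefix followed by digits
regexstr = sprintf('^(?:%s)(\\d+)$', alts);

% With 'tokens' and 'once', each entry is a cell holding the captured digits,
% or an empty cell where the string does not match
tok = regexp(all2, regexstr, 'tokens', 'once');
hit = ~cellfun('isempty', tok);

% Convert the captured digits of the matching entries to numbers
digits = cellfun(@(t) str2double(t{1}), tok(hit));

% Number of distinct suffixes among the matching entries
num = numel(unique(digits));
```

Anchoring with ^ also rejects strings such as 'zab3', which the lookbehind version would accept because it only checks the character immediately before the digits.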

edited Apr 19, 2017 at 21:05

answered Apr 19, 2017 at 20:34

sco1


6 Comments

edesz Over a year ago

I think this solution works. I am still thinking about the contiguous/non-contiguous part. I updated the OP but I may need to delete that. Still thinking....

sco1 Over a year ago

I've added a solution with a continuous restriction

edesz Over a year ago

Yes, it needs to be contiguous. Ok, thanks for separating these out. This is very specific and I had not initially thought of the 2 cases. This works for me.

edesz Over a year ago

Ok, except for the first 3 lines, I seem to understand the other lines....they seem clear. Just the first 3 are a little confusing. Mainly line 1...could you explain how you used sprintf to assemble the expression required by regexp?

sco1 Over a year ago

I've added a breakdown of the regex, hopefully it is helpful.
