MATLAB - Extracting numbers from a cell array of strings

Question

I want to extract a number from a text file. First I read the file and import it as a cell array of the form:

A = {
        '1   0   0   0   -   0:  0.000741764'
        '2   0   0   0   -   0:          100'
        '3   0   0   0   -   0:          100'
        '4   0   0   0   -   0:          100'
        '5   0   0   0   -   0:   0.00124598'
        '6   0   0   0   -   0:  0.000612725'
        '7   0   0   0   -   0:  0.000188365'
        '8   0   0   0   -   0:            0'
        '9   0   0   0   -   0:            0'
        '10   0   0   0   -   0:            0'
        '11   0   0   0   -   0:            0'
        '12   0   0   0   -   0:            0'};

I need to get the number on the right, based on the value of the left integers. For example I need to know the values corresponding to 3 and 6 (100 and 0.000612725):

'3   0   0   0   -   0:          100'
'6   0   0   0   -   0:  0.000612725'

This is my code:

clear all
close all
clc

A = {
        '1   0   0   0   -   0:  0.000741764'
        '2   0   0   0   -   0:          100'
        '3   0   0   0   -   0:          100'
        '4   0   0   0   -   0:          100'
        '5   0   0   0   -   0:   0.00124598'
        '6   0   0   0   -   0:  0.000612725'
        '7   0   0   0   -   0:  0.000188365'
        '8   0   0   0   -   0:            0'
        '9   0   0   0   -   0:            0'
        '10   0   0   0   -   0:            0'
        '11   0   0   0   -   0:            0'
        '12   0   0   0   -   0:            0'};

THREE = 3;
SIX = 6;

M  = cellfun(@str2num, A, 'UniformOutput', false);
Values = cell2mat(M);

Index_3 = find(Values(:,1) == THREE);
Index_6 = find(Values(:,1) == SIX);

sp_3 = strsplit(A{Index_3},':');
sp_6 = strsplit(A{Index_6},':');

VALUE_3 = str2double(sp_3(end));
VALUE_6 = str2double(sp_6(end));

But I get an ERROR:

Error using cat
Dimensions of matrices being concatenated are not consistent.
Error in cell2mat (line 84)
            m{n} = cat(1,c{:,n});
Error in test (line 23)
Values = cell2mat(M);

because:

M = 
    [1x4   double]
    [1x104 double]
    [1x104 double]
    [1x104 double]
    [1x4   double]
    [1x4   double]
    [1x4   double]
    [1x4   double]
    [1x4   double]
    [1x4   double]
    [1x4   double]
    [1x4   double]

I tried:

str2double

instead, but I get NaN for all values in M.

rayryeng · Accepted Answer · 2015-06-01 14:54:18Z

This is a perfect case for using regular expressions. Regular expressions are powerful tools that seek patterns in text. In your case, what you'd want to find first are the numbers that begin the string. Next, you want to find the corresponding numbers at the end of the string. You also mentioned in your comments that you may get numbers in exponential notation (something like 2.50652e-007). That can also easily be handled, and I'm going to add this as another entry in your cell array to demonstrate that this works.
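As a side note on why the original attempts fail, here is a minimal sketch using one row from the question (the variable names `row`, `v`, `n` and `d` are mine, for illustration only). str2num evaluates the text as a MATLAB expression, so `0 - 0:          100` is read as the colon range `(0-0):100`, which expands to 101 values. str2double, on the other hand, expects the whole string to be a single scalar number and returns NaN otherwise.

```matlab
% One row from the question where the trailing value is 100
row = '2   0   0   0   -   0:          100';

% str2num evaluates the text, so "0 - 0: 100" becomes the range 0:100
v = str2num(row);
n = numel(v);          % 3 leading values + 101 range values = 104

% str2double wants exactly one number in the string, so this is NaN
d = str2double(row);
```

This matches the 1x104 entries shown in the question, and it is why neither function is a good fit for these lines as they stand.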

How I'm going to proceed is that I'm going to process the entire cell array. I'm doing this because I'm sure you'll need to look at other numbers, not just the third and sixth entry so if we do this first, then it'll be quite easy for you to get other things you need.

We can extract both the beginning and ending values with two regexp calls, like so:

%// Your code to define A and also new entry with exponential notation
A = {
        '1   0   0   0   -   0:  0.000741764'
        '2   0   0   0   -   0:          100'
        '3   0   0   0   -   0:          100'
        '4   0   0   0   -   0:          100'
        '5   0   0   0   -   0:   0.00124598'
        '6   0   0   0   -   0:  0.000612725'
        '7   0   0   0   -   0:  0.000188365'
        '8   0   0   0   -   0:            0'
        '9   0   0   0   -   0:            0'
        '10   0   0   0   -   0:            0'
        '11   0   0   0   -   0:            0'
        '12   0   0   0   -   0:            0'
        '13   0   0   0   -   0: 2.50652e-007'};

%// Begin new code
beginStr = regexp(A, '^\d+', 'match');
endStr = regexp(A, '(\d*\.?\d+(e-\d+)?)$', 'match');

Looks a bit complicated, but easy to explain. regexp takes in two parameters by default: A string or cell array of strings (such as your case) and a pattern to search for. I also chose the flag 'match' because I want the actual strings returned. By default, regexp returns the indices of where a match occurred.

The first regexp call looks for a sequence of digits at the beginning of the string. \d+ means one or more digits and ^ anchors the match to the beginning of the string, so together they match a run of digits at the very start. I'm assuming that the string begins with an integer, so we can get away with this. What is returned is a cell array where each entry is itself a cell array of matches. If this works out, we should get a cell array of 1 x 1 cells, each holding the number at the beginning.

The second regexp call looks for a sequence such that there is optionally a bunch of digits (\d*), followed by an optional decimal point (\.?), followed by at least one digit (\d+), and then optionally an e character, a - character and another bunch of digits (\d+). Take note that this last part is grouped together via (e-\d+)?, which means that the exponential stuff is optional. As written, it only matches a negative exponent. Also, this entire pattern must appear at the end of the string, hence the parentheses grouping all of these tokens together and the trailing $, which anchors the match to the end of the string. The * character means zero or more occurrences and the ? character means zero or one occurrence. Also, to be consistent, the + character means one or more occurrences.

Take note that the . character in regular expressions means a wildcard or any character. If you explicitly want to match with the decimal point, you need to add a \ before the . character. Therefore, the regular expression is to find patterns at the end of the string where we may optionally have a bunch of numbers before an optional decimal point and then there is at least one number that follows these two optional things. This will be like the output of the first regexp call but having the numbers at the end of the string.
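To see the two patterns on a single string before running them on the whole cell array, here is a small sketch. The 'once' option makes regexp return the match as a plain string instead of a cell. The second test string `s2` and the broader pattern `widePat` (optional sign, upper- or lower-case e, signed exponent) are my own assumptions about what other inputs might look like; they are not something the original data shows.

```matlab
% The exponential-notation row used in this answer
s = '13   0   0   0   -   0: 2.50652e-007';

idStr  = regexp(s, '^\d+', 'match', 'once');                  % '13'
valStr = regexp(s, '(\d*\.?\d+(e-\d+)?)$', 'match', 'once');  % '2.50652e-007'

% Hypothetical row with a signed value and a positive, upper-case exponent
s2 = '14   0   0   0   -   0: -1.5E+03';

% Assumed broader pattern: optional sign, e or E, optional exponent sign
widePat = '[-+]?\d*\.?\d+([eE][-+]?\d+)?$';
wideStr = regexp(s2, widePat, 'match', 'once');                % '-1.5E+03'
wideVal = str2double(wideStr);                                 % -1500
```

If your file never contains signed values or positive exponents, the narrower pattern from the answer is enough.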

Let's double-check using celldisp:

>> format compact
>> celldisp(beginStr)
beginStr{1}{1} =
1
beginStr{2}{1} =
2
beginStr{3}{1} =
3
beginStr{4}{1} =
4
beginStr{5}{1} =
5
beginStr{6}{1} =
6
beginStr{7}{1} =
7
beginStr{8}{1} =
8
beginStr{9}{1} =
9
beginStr{10}{1} =
10
beginStr{11}{1} =
11
beginStr{12}{1} =
12
beginStr{13}{1} =
13
>> celldisp(endStr)
endStr{1}{1} =
0.000741764
endStr{2}{1} =
100
endStr{3}{1} =
100
endStr{4}{1} =
100
endStr{5}{1} =
0.00124598
endStr{6}{1} =
0.000612725
endStr{7}{1} =
0.000188365
endStr{8}{1} =
0
endStr{9}{1} =
0
endStr{10}{1} =
0
endStr{11}{1} =
0
endStr{12}{1} =
0
endStr{13}{1} =
2.50652e-007

Looks fine to me! Now you have the final task of converting the numbers into double. We can use a cellfun call like what you've done to do that for us:

beginNumbers = cellfun(@(x) str2double(x{1}), beginStr);
endNumbers = cellfun(@(x) str2double(x{1}), endStr);
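If you prefer to skip the nested cells, a more compact variant is sketched below. It relies on two things: regexp with the 'once' option returns one string per input cell, and str2double accepts a cell array of strings directly. The three-row `A` and the names `ids` and `vals` are only for illustration.

```matlab
% Small stand-in for the full cell array
A = {'3   0   0   0   -   0:          100'
     '6   0   0   0   -   0:  0.000612725'
     '13   0   0   0   -   0: 2.50652e-007'};

% 'once' gives a cell array of strings, which str2double converts in one go
ids  = str2double(regexp(A, '^\d+', 'match', 'once'));
vals = str2double(regexp(A, '(\d*\.?\d+(e-\d+)?)$', 'match', 'once'));
```

A row with no match yields an empty string and therefore NaN, which makes malformed lines easy to spot. Either form gives the same column vectors.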

beginNumbers and endNumbers will contain our converted numbers for us. Let's put these into a matrix and show what this looks like:

out = [beginNumbers endNumbers];
format long g;

I use format long g to show as many significant digits as possible. This is what we get:

>> out

out =

                         1               0.000741764
                         2                       100
                         3                       100
                         4                       100
                         5                0.00124598
                         6               0.000612725
                         7               0.000188365
                         8                         0
                         9                         0
                        10                         0
                        11                         0
                        12                         0
                        13               2.50652e-07

Cool! Now if you want the third and sixth numbers, just do:

>> third = out(3,:)

third =

     3   100

>> sixth = out(6,:)

sixth =

                         6               0.000612725

The above gets the entire line for you, but if you specifically want the corresponding numbers that go with the ID, just do:

>> third = out(3,2)

third =

   100

>> sixth = out(6,2)

sixth =

               0.000612725
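Indexing with out(3,2) works here only because the ID in the first column happens to equal the row number. If your IDs may be out of order or have gaps, a lookup on the first column is safer. Here is a minimal sketch, using a small stand-in matrix and names of my own (`wantedIDs`, `found`, `rows`, `values`):

```matlab
% Small stand-in for the out matrix: column 1 = ID, column 2 = value
out = [3 100; 6 0.000612725; 13 2.50652e-07];

% Single ID: logical mask on the first column
sixth = out(out(:,1) == 6, 2);

% Several IDs at once: ismember returns the matching row for each ID
wantedIDs = [3 6];
[found, rows] = ismember(wantedIDs, out(:,1));
values = out(rows(found), 2);
```

IDs that are not present simply come back with found equal to false, so they are left out of values rather than causing an indexing error.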

I wouldn't have added this much information to all my answers combined :D Well explained! +1
@SanthanSalai - Thanks :) I feel that regular expressions always need a lot of explanation. It's sort of a black art that many people don't quite understand (including me!) and so questions that use them need explanation because those two regexp calls are quite powerful, but the pattern to use for each call looks like a really hard foreign language. I feel that if you can explain what's happening, it not only benefits me by reinforcing the concept but also everyone else, so that they can get a glimpse at how powerful regular expressions are. Thanks for the vote :)
@ViharChervenkov - Done. Take a look. It required a slight modification to the second regular expression. I also added in an example to show that it works.
