I have a fairly simple regular expression, but for some reason it's not capturing all the instances.

My dataframe looks like this (I'm including all 74 rows because I don't know where the problem occurs):

Name
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M

If I pass

In [57]: len(df['Name'])

I get

Out [57]: 74

I compiled a regex as follows:

p = re.compile('_[A-z][0-9][0-9][0-9]_')

I want to create a column whose values are the snippets that look like '_A122_' or '_A100_' etc. I want to use regex because I later want to apply this code to a larger dataset where the snippet does not always appear at the same position.
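As an aside on the pattern itself: a minimal sketch (using a sample name from the data above, plus a made-up string for illustration) of what this character class actually matches. `[A-z]` spans the whole ASCII range from 'A' to 'z', which also covers `[ \ ] ^ _` and the backtick:

```python
import re

# The original, overly loose class vs. the letters-only class.
loose = re.compile(r'_[A-z][0-9][0-9][0-9]_')
strict = re.compile(r'_[A-Za-z][0-9][0-9][0-9]_')

sample = "P0824AK03.VAK03_TK02_QE_A122_M"
print(loose.findall(sample))   # ['_A122_']
print(strict.findall(sample))  # ['_A122_']

# The difference only shows up on inputs like '_[123_' (hypothetical),
# which [A-z] accepts because '[' sits between 'Z' and 'a' in ASCII:
print(loose.findall("X_[123_Y"))   # ['_[123_']
print(strict.findall("X_[123_Y"))  # []
```

On the data shown here both patterns behave identically, so this is not the cause of the missing rows; it's just a latent pitfall.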

When I use the following command, the result is a list of the form I was looking for:

In [55]: p.findall(str(df['Name']))
Out[55]: 
['_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_']

The problem is that this list is "too short": len(p.findall(str(df['Name']))) returns 60, so 14 rows are unaccounted for, and I cannot see which ones are missing!

I'm not used to regular expressions, so maybe it's a super obvious mistake, but I'd really appreciate any help.

(I guess I could do a for loop and create the new column cell by cell, but I'd really rather avoid that since I will apply this code to bigger datasets later and don't want it to take a million years to run.)

  • Are you using pandas? If yes, add the tag to the question. Commented May 15, 2017 at 11:45
  • Yes, sorry I should have said. I'm using pandas dataframe Commented May 15, 2017 at 11:46
  • BTW, [A-z] matches more than just letters; you need to use [A-Za-z]. Commented May 15, 2017 at 11:47
  • Thanks Wiktor; however, in this case that doesn't change the result — I still don't get 74 items in my list. Commented May 15, 2017 at 11:52
  • I got TypeError: expected string or bytes-like object when I tried that. But the question has been answered now and the problem is solved :) Commented May 15, 2017 at 12:11

1 Answer

IIUC, you can use .str.extract() to extract the substring that matches your regex:

In [55]: df.Name.str.extract(r'(_[a-zA-Z]\d{3}_)', expand=False)
Out[55]:
0     _A122_
1     _A122_
2     _A122_
3     _A122_
4     _A122_
5     _A122_
6     _A122_
7     _A122_
8     _A122_
9     _A122_
       ...
64    _A100_
65    _A100_
66    _A100_
67    _A100_
68    _A100_
69    _A100_
70    _A100_
71    _A100_
72    _A100_
73    _A100_
Name: Name, dtype: object

PS: you should NOT use str(df['Name']), because the string representation of a Pandas DataFrame gets shortened:

In [58]: pd.options.display.max_rows = 4

In [59]: df
Out[59]:
                              Name
0   P0824AK03.VAK03_TK02_QE_A122_M
1   P0824AK03.VAK03_TK02_QE_A122_M
..                             ...
72  P0824AK03.VAK03_TK02_QE_A100_M
73  P0824AK03.VAK03_TK02_QE_A100_M

[74 rows x 1 columns]

In [60]: str(df['Name'])
Out[60]: '0     P0824AK03.VAK03_TK02_QE_A122_M\n1     P0824AK03.VAK03_TK02_QE_A122_M\n                   ...              \n72
    P0824AK03.VAK03_TK02_QE_A100_M\n73    P0824AK03.VAK03_TK02_QE_A100_M\nName: Name, dtype: object'
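To tie this back to the original goal of creating a new column: a minimal sketch (building a tiny two-row frame with names from the question, and a hypothetical column name 'Snippet') showing that the vectorized extract operates row by row, so every row gets a value and nothing is silently dropped the way the truncated repr drops it:

```python
import pandas as pd

# Small stand-in for the question's 74-row frame.
df = pd.DataFrame({'Name': ['P0824AK03.VAK03_TK02_QE_A122_M',
                            'P0824AK03.VAK03_TK02_QE_A100_M']})

# One extracted value per row; rows without a match would get NaN.
df['Snippet'] = df['Name'].str.extract(r'(_[A-Za-z]\d{3}_)', expand=False)
print(df['Snippet'].tolist())  # ['_A122_', '_A100_']
```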

2 Comments

It worked, thank you. Do you think the problem was my faulty use of str(df['Name'])?
@J.Dahlgren, you are welcome. Yes — we usually want to use Pandas/NumPy/SciPy vectorized functions when working with Pandas dataframes. As I've shown in my answer, Pandas shortens str(df['Name']) in order NOT to flood your screen with data. Imagine a DF with a few billion rows: printing it all would take far too long, and str(df['Name']) could end up with a MemoryError...
