Pandas: select rows based on a condition applied to string

Question

I am working with a dictionary of dataframes: each key is an integer 0, ..., 999, and each value is a dataframe like this:

     A         B
1    10010001  17
2    10020001  5
3    10020002  11
4    10020003  2
5    10030001  86
...

I need to iterate through the entire dictionary and collect, in a new dataframe, all rows whose 3rd and 4th digits in column A equal 02. In my example, only rows 2, 3, and 4 would form the new dataframe. All values of column A are strings.

What could be the most efficient way of doing this within pandas?

No, they don't. At the moment they represent n different dataframes, but at the end of this task there will be just one "selected" dataframe.
– FaCoffee, commented Nov 21, 2016 at 16:21

wflynny · Accepted Answer · 2016-11-21 16:31:12Z

2

How about something like the following, where d is your dict (on Python 2 you can use d.itervalues() instead of d.values()):

pd.concat(v[v.A.str[2:4] == '02'] for v in d.values())

With a dict consisting of your sample dataframe repeated 3 times, with keys 0-2,

d = dict(zip(range(3), [df]*3))

this yields:

          A   B
2  10020001   5
3  10020002  11
4  10020003   2
2  10020001   5
3  10020002  11
4  10020003   2
2  10020001   5
3  10020002  11
4  10020003   2

This should be more memory efficient than creating a list of rows or using a list comprehension because it uses a generator expression instead. It also should be faster than using regex due to direct indexing (assuming your data values are standardized).
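For anyone who wants to try the one-liner end to end, here is a minimal, self-contained sketch. It assumes Python 3 (hence d.values()), rebuilds the sample frame from the question by hand, and uses a small 3-key dict as a stand-in for the real 1000-key one.

```python
import pandas as pd

# Sample frame from the question; column A holds strings.
df = pd.DataFrame(
    {
        "A": ["10010001", "10020001", "10020002", "10020003", "10030001"],
        "B": [17, 5, 11, 2, 86],
    },
    index=[1, 2, 3, 4, 5],
)

# Stand-in dict with keys 0..2 (assumption: the real dict has keys 0..999).
d = {k: df.copy() for k in range(3)}

# .str[2:4] slices the 3rd and 4th characters of every string in column A;
# comparing with '02' gives a boolean mask that selects the matching rows.
selected = pd.concat(v[v["A"].str[2:4] == "02"] for v in d.values())

print(selected)
```

The original index labels (2, 3, 4) are kept and repeat once per source frame, which matches the output shown above.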

If you don't like the index of the combined dataframe, you can always call reset_index(). For example:

y = pd.concat(v[v.A.str[2:4] == '02'] for v in d.values())
y.reset_index().drop('index', axis=1)

          A   B
0  10020001   5
1  10020002  11
2  10020003   2
3  10020001   5
4  10020002  11
5  10020003   2
6  10020001   5
7  10020002  11
8  10020003   2
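As a side note, the old index can also be discarded in one step. The sketch below shows two options that I believe are equivalent here: reset_index(drop=True) on the result, or ignore_index=True passed to pd.concat. The two small frames are made-up stand-ins for the already filtered pieces.

```python
import pandas as pd

# Hypothetical filtered pieces, each keeping its original index labels.
parts = [
    pd.DataFrame({"A": ["10020001", "10020002"], "B": [5, 11]}, index=[2, 3]),
    pd.DataFrame({"A": ["10020003"], "B": [2]}, index=[4]),
]

# Option 1: concatenate, then throw away the old index.
via_reset = pd.concat(parts).reset_index(drop=True)

# Option 2: let concat number the rows 0..n-1 directly.
via_concat = pd.concat(parts, ignore_index=True)
```

Either way the result is indexed 0, 1, 2, so there is no leftover 'index' column to drop.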

edited Nov 21, 2016 at 16:31

answered Nov 21, 2016 at 16:24

3 Comments

FaCoffee Over a year ago

Uhm...nice snippet, but this repeats the rows - something I don't need.

wflynny Over a year ago

By "repeats the rows", do you mean that the index values are repeated or that the actual rows of the dataframe are repeated? For the former, use reset_index(). For the latter, the rows are repeated because I just copied your sample dataframe 3 times, so they should be repeated.

FaCoffee Over a year ago

Oh yes, sorry, I was only referring to reset_index().

cggarvey · Accepted Answer · 2016-11-21 16:47:00Z

2

The first line builds a boolean indexer by checking the 3rd and 4th characters of each value in column A: True where they are "02", False otherwise.

The second line creates a new dataframe from the original after applying that indexer.

indexer = df['A'].apply(lambda x: x[2:4] == '02')
results = df.loc[indexer]
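To see what the indexer looks like, here is a small sketch on the sample data from the question. It also builds the same mask with the vectorized .str accessor, which should give an identical result as long as every value in A is a string.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["10010001", "10020001", "10020002", "10020003", "10030001"],
        "B": [17, 5, 11, 2, 86],
    },
    index=[1, 2, 3, 4, 5],
)

# Row-by-row check with a Python lambda, as in the answer.
mask_apply = df["A"].apply(lambda x: x[2:4] == "02")

# Same check with the vectorized string accessor.
mask_str = df["A"].str[2:4] == "02"

# Either mask can be handed to .loc to keep the matching rows.
results = df.loc[mask_apply]
```

Both masks are True only for the rows labelled 2, 3 and 4.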

Edit: Here's the solution above adapted to a dictionary of dataframes.

frames = []
for df in dictionary.values():
    # True where the 3rd and 4th characters of A are '02'
    indexer = df['A'].apply(lambda x: x[2:4] == '02')
    results = df.loc[indexer]
    frames.append(results)
# combine the matching rows from every dataframe
output = pd.concat(frames)
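A possible variation, not from the answer itself: pd.concat also accepts a dict directly, so the frames can be stacked first and filtered once. This is only a sketch with a made-up 2-key dict. Whether it is faster than filtering each frame before concatenating will depend on the data, so treat it as something to benchmark rather than a recommendation.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["10010001", "10020001", "10020002", "10020003", "10030001"],
        "B": [17, 5, 11, 2, 86],
    },
    index=[1, 2, 3, 4, 5],
)

# Hypothetical stand-in for the real dictionary of dataframes.
dictionary = {0: df.copy(), 1: df.copy()}

# Passing the dict to concat uses its keys as the outer index level,
# so every row remembers which dataframe it came from.
combined = pd.concat(dictionary)

# One vectorized filter over the stacked frame.
output = combined[combined["A"].str[2:4] == "02"]
```

The result carries a two-level index of (dictionary key, original row label). Calling output.reset_index(drop=True) flattens it if the origin is not needed.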

edited Nov 21, 2016 at 16:47

answered Nov 21, 2016 at 16:38

3 Comments

SpiderWasp42 Over a year ago

I think you should extend your code to iterate through all the dataframes in the dictionary. Otherwise this code simply stores the results from the last dataframe explored in the dictionary when iterated upon.

cggarvey Over a year ago

Yep, thanks. Forgot about that until after I posted. See the edit above that addresses that part of the spec.

SpiderWasp42 Over a year ago

No problem! Looks perfect now. Cheers!

Sam · Accepted Answer · 2016-11-21 16:22:25Z

1

Try this:

keep = []  # hold all the rows you want to keep
for key in frame_dict.keys():
    frame = frame_dict[key]
    keep.append(
        frame[frame['A'].astype(str).str.contains(r'^\d\d02', regex=True)].copy()
    )  # append the rows matching the regex: start of string (^), digit (\d), digit (\d), 02
final = pd.concat(keep)  # concatenate the matching rows
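A quick sketch of why the ^ anchor matters in that pattern. The three values are made up for illustration: the second one has "02" in it, but not as its 3rd and 4th characters. The pattern is written as a raw string so that \d reaches the regex engine untouched.

```python
import pandas as pd

# Illustrative values only; "10010201" has '02' at positions 5-6, not 3-4.
s = pd.Series(["10020001", "10010201", "10030001"])

# Anchored search: two digits at the start of the string, then '02'.
anchored = s.str.contains(r"^\d\d02", regex=True)

# str.match only matches from the start of each string, so no '^' is needed.
matched = s.str.match(r"\d\d02")

# Without the anchor, '02' anywhere in the string counts as a hit.
unanchored = s.str.contains("02")
```

Only the first value passes the anchored checks, while the unanchored search also accepts the second one.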

answered Nov 21, 2016 at 16:22
