0

Working with a dictionary of dataframes, each key is an integer 0, ..., 999, and each value is a dataframe like this:

     A         B
1    10010001  17
2    10020001  5
3    10020002  11
4    10020003  2
5    10030001  86
...

I need to iterate through the entire dictionary and collect, in a new dataframe, all rows whose 3rd and 4th digits in column A equal 02. In my example, only rows 2, 3, and 4 would form the new dataframe. All values of column A are strings.

What could be the most efficient way of doing this within pandas?

2
  • Do the dictionary keys have any role to play in this? Commented Nov 21, 2016 at 16:20
  • No they don't. At the moment they represent n different dataframes, but at the end of this task there will be just one "selected" dataframe. Commented Nov 21, 2016 at 16:21

3 Answers

2

How about something like the following, where d is your dict:

pd.concat((v[v.A.str[2:4] == '02'] for v in d.values()))

With a dict consisting of your sample dataframe repeated 3 times under keys 0-2

d = dict(zip(range(3), [df]*3))

this yields:

          A   B
2  10020001   5
3  10020002  11
4  10020003   2
2  10020001   5
3  10020002  11
4  10020003   2
2  10020001   5
3  10020002  11
4  10020003   2

This should be more memory efficient than building a list of rows or using a list comprehension, because it uses a generator expression instead. It should also be faster than a regex, since it slices the strings directly (assuming your data values are standardized).
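To make the slicing-versus-regex comparison concrete, here is a minimal sketch. The two small frames and the names by_slice and by_regex are made up for illustration; it only checks that both filters select the same rows and does not measure speed.

```python
import pandas as pd

# Hypothetical sample data: two small frames keyed by integers.
# Column A holds strings, as stated in the question.
d = {
    0: pd.DataFrame({"A": ["10010001", "10020001", "10020002"], "B": [17, 5, 11]}),
    1: pd.DataFrame({"A": ["10020003", "10030001"], "B": [2, 86]}),
}

# Direct slicing: characters at positions 2 and 3 are the 3rd and 4th digits.
by_slice = pd.concat(v[v["A"].str[2:4] == "02"] for v in d.values())

# Regex equivalent: str.match anchors at the start of each string,
# so the pattern reads "any two characters, then 02".
by_regex = pd.concat(v[v["A"].str.match(r"..02")] for v in d.values())

print(by_slice)
print(by_slice.equals(by_regex))
```

If the values are not all the same length or may contain missing entries, both filters would need extra handling that this sketch leaves out.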


If you don't like the index of the combined dataframe, you can always call reset_index(). For example:

y = pd.concat((v[v.A.str[2:4] == '02'] for v in d.values()))
y.reset_index(drop=True)  # drop=True discards the old index instead of keeping it as a column

          A   B
0  10020001   5
1  10020002  11
2  10020003   2
3  10020001   5
4  10020002  11
5  10020003   2
6  10020001   5
7  10020002  11
8  10020003   2
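As an alternative to resetting the index afterwards, pd.concat accepts ignore_index=True, which numbers the combined rows from 0 in one step. A minimal sketch, assuming the sample dataframe from the question repeated under keys 0-2 (the construction of df here is an assumption for illustration):

```python
import pandas as pd

# Hypothetical reconstruction of the sample frame, indexed 1..5 as in the question.
df = pd.DataFrame(
    {
        "A": ["10010001", "10020001", "10020002", "10020003", "10030001"],
        "B": [17, 5, 11, 2, 86],
    },
    index=range(1, 6),
)
d = {k: df for k in range(3)}

# ignore_index=True discards the per-frame index labels and renumbers from 0.
y = pd.concat((v[v["A"].str[2:4] == "02"] for v in d.values()), ignore_index=True)
print(y.index.tolist())
```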

3 Comments

Uhm...nice snippet, but this repeats the rows - something I don't need.
By "repeats the rows", do you mean that the index values are repeated or that the actual rows of the dataframe are repeated? For the former, use reset_index(). For the latter, the rows are repeated because I just copied your sample dataframe 3 times, so repetition is expected.
Oh yes, sorry, I was only referring to reset_index().
2

The first line creates an indexer that checks the 3rd and 4th characters of the A column and returns a boolean Series that is True wherever they equal "02" and False elsewhere.

The second line creates a new dataframe from the original after applying that indexer.

indexer = df['A'].apply(lambda x: x[2:4] == '02')
results = df.loc[indexer]

Edit: Here's the solution above adapted to a dictionary of dataframes.

frames = list()
for k in dictionary.keys():
    df = dictionary[k]
    # Boolean mask: True where the 3rd and 4th characters of A are '02'
    indexer = df['A'].apply(lambda x: x[2:4] == '02')
    results = df.loc[indexer]
    frames.append(results)
output = pd.concat(frames)

3 Comments

I think you should extend your code to iterate through all the dataframes in the dictionary. Otherwise this code simply stores the results from the last dataframe explored in the dictionary when iterated upon.
Yep, thanks. Forgot about that until after I posted. See the edit above that addresses that part of the spec.
No problem! Looks perfect now. Cheers!
1

Try this:

keep = []  # hold all the rows you want to keep
for key in frame_dict.keys():
    frame = frame_dict[key]
    keep.append(
        # raw string so that \d reaches the regex engine unchanged
        frame[frame['A'].astype(str).str.contains(r'^\d\d02', regex=True)].copy()
    )  # append the rows matching regex for start of string (^), digit (\d), digit (\d), 02
final = pd.concat(keep)  # concatenate the matching rows
