Splitting data in Pandas/Python

Question

I'm new to Python and Pandas so bear with me.

I have a large dataset that looks like:

1       E                    1                  NaN
2       T  2004-09-21 01:15:53                  NaN
3       U                   30                  NaN
4       N                   32                  NaN
5       V                    1  2004-09-14 16:26:00
6       V                   -1  2004-09-14 16:53:00
7       V                    1  2004-09-14 17:08:00
...................................................
18      E                    1                  NaN
19      T  2004-10-21 02:13:43                  NaN
20      U                   35                  NaN
21      N                   40                  NaN
22      V                    1  2004-10-19 14:50:00
23      V                    1  2004-10-20 15:31:00
24      V                    1  2004-10-21 13:49:00
25      V                    1  2004-10-21 20:57:00
26      V                    1  2004-10-21 22:11:00
...................................................

How can I split this into individual smaller data sets, let's say x(i), where i=0,...,N, so that, for example, x(0) looks like:

 1       E                    1                  NaN
 2       T  2004-09-21 01:15:53                  NaN
 3       U                   30                  NaN
 4       N                   32                  NaN
 5       V                    1  2004-09-14 16:26:00
 6       V                   -1  2004-09-14 16:53:00
 7       V                    1  2004-09-14 17:08:00
 ...................................................
 17      V                    1  2004-09-16 12:38:01

I guess I should use some loop command for going from E to E, but I'm not quite sure how to divide it into individual sets.

Your message makes it very unclear what split condition you want. Is it every N rows? Do you want a new group starting each time you see an E? Is the "...." where you want to separate, or is that just your way of saying "I'm skipping rows here"? Etc.
– DSM, Commented Oct 10, 2015 at 18:20
@DSM I'm sorry, you are totally right. I want individual sets starting with 'E'. So, one set starts with 'E' and ends with the line before the next 'E'. As you can see, x(0) starts with 'E' and ends with the 17th line, just before the next 'E', and so on.
– VlS, Commented Oct 10, 2015 at 19:00

DSM · Accepted Answer · 2015-10-10 19:28:45Z

1

You can use groupby here, using the compare-cumsum-groupby pattern (here let's say that the column with the Es is called "letter"):

>>> grouped = df.groupby((df["letter"] == "E").cumsum())
>>> frames = [g for k,g in grouped]
>>> for frame in frames:
...     print(frame)
...     print("--")
...     
  letter
0      E
1      T
2      U
--
  letter
3      E
4      M
--
  letter
5      E
--
  letter
6      E
--

This works because we can compare everything to E, creating a Series of booleans:

>>> df["letter"] == "E"
0     True
1    False
2    False
3     True
4    False
5     True
6     True
Name: letter, dtype: bool

and then if we take the cumulative sum of that we get

>>> (df["letter"] == "E").cumsum()
0    1
1    1
2    1
3    2
4    2
5    3
6    4
Name: letter, dtype: int32

where each new group has its own number. Reading the split-apply-combine section of the documentation is probably a good idea: you might not even need to break everything up into subframes if the operation you want to perform on the groups is already supported by groupby.
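As a self-contained sketch of the pattern, the following builds a small frame shaped roughly like the one in the question, splits it at each "E", and then counts the "V" rows per block directly on the groupby object, without building subframes. The column names `letter` and `value` and the sample rows are assumptions for illustration, not the asker's actual data.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions, not from the question.
df = pd.DataFrame({
    "letter": ["E", "T", "U", "N", "V", "V", "E", "T", "U", "N", "V"],
    "value": ["1", "2004-09-21 01:15:53", "30", "32", "1", "-1",
              "1", "2004-10-21 02:13:43", "35", "40", "1"],
})

# Each "E" bumps the running total, so every row carries the number of its block.
block_id = (df["letter"] == "E").cumsum()

# Option 1: materialize one sub-frame per block.
frames = [g for _, g in df.groupby(block_id)]

# Option 2: stay inside groupby, e.g. count the "V" rows of each block.
v_counts = (df["letter"] == "V").groupby(block_id).sum()

print(len(frames))        # 2
print(v_counts.tolist())  # [2, 1]
```

Note that any rows appearing before the first "E" would get block number 0 and end up in a group of their own.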


1 Comment

VlS Over a year ago

Wow, thanks. That's it! I have everything in 'frames'. Your help means a lot, have a nice day.

hellpanderr · Answer · 2015-10-10 18:17:30Z

0

You can first find indexes of rows with that value, then you can iterate through them and get slices of the dataframe:

def break_df_by(df, column, value):
    ret = []
    # positions (not labels) of the matching rows: [2, 5, 10] in the example below
    starts = [pos for pos, hit in enumerate(df[column] == value) if hit]
    if not starts:  # no match: nothing to split
        return [df]
    if starts[0] > 0:  # keep the rows before the first 'E'
        ret.append(df.iloc[:starts[0]])
    # each slice runs from one 'E' up to the next; the last one runs to the end
    for start, end in zip(starts, starts[1:] + [len(df)]):
        ret.append(df.iloc[start:end])
    return ret

An example call and its result:

break_df_by(df, 1, 'E')

[   1
 0  T
 1  V,    1
 2  E
 3  T
 4  V,    1
 5  E
 6  T
 7  U
 8  N
 9  V,     1
 10  E
 11  T
 12  U
 13  N]
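As a rough cross-check, the sketch below rebuilds the input from the rows visible in the result above (a single column labeled 1), cuts it at the positions of the 'E' rows, and compares the pieces with what the compare-cumsum-groupby pattern from the other answer produces. The variable names are illustrative only.

```python
import pandas as pd

# Input reconstructed from the rows shown in the result: one column labeled 1.
df = pd.DataFrame({1: list("TVETVETUNVETUN")})
is_start = (df[1] == "E").to_numpy()

# Index-based: cut the frame at the position of every matching row.
starts = [pos for pos, hit in enumerate(is_start) if hit]
bounds = sorted({0, *starts, len(df)})
by_slicing = [df.iloc[a:b] for a, b in zip(bounds, bounds[1:])]

# cumsum-based: number the blocks, then let groupby do the splitting.
by_groupby = [g for _, g in df.groupby((df[1] == "E").cumsum())]

print([len(p) for p in by_slicing])  # [2, 3, 5, 4]
print(all(a.equals(b) for a, b in zip(by_slicing, by_groupby)))  # True
```

Both approaches yield the same four pieces here, including the leading rows that come before the first 'E'.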


1 Comment

VlS Over a year ago

Thanks for your reply. Finding indexes of rows is great idea. I'll take a look and try on my dataset. I really appreciate your help.
