1

I'm new to Python and Pandas so bear with me.

I have a big data that looks like:

1       E                    1                  NaN
2       T  2004-09-21 01:15:53                  NaN
3       U                   30                  NaN
4       N                   32                  NaN
5       V                    1  2004-09-14 16:26:00
6       V                   -1  2004-09-14 16:53:00
7       V                    1  2004-09-14 17:08:00
...................................................
18      E                    1                  Nan
19      T  2004-10-21 02:13:43                  Nan
20      U                   35                  Nan
21      N                   40                  Nan
22      V                    1  2004-10-19 14:50:00
23      V                    1  2004-10-20 15:31:00
24      V                    1  2004-10-21 13:49:00
25      V                    1  2004-10-21 20:57:00
26      V                    1  2004-10-21 22:11:00
...................................................

How can I split this into individual little data sets, lets say x(i) , where i=0,...,N, and for example x(0) looks like:

 1       E                    1                  NaN
 2       T  2004-09-21 01:15:53                  NaN
 3       U                   30                  NaN
 4       N                   32                  NaN
 5       V                    1  2004-09-14 16:26:00
 6       V                   -1  2004-09-14 16:53:00
 7       V                    1  2004-09-14 17:08:00
 ...................................................
 17      V                    1  2004-09-16 12:38:01

I guess I should use some loop command for going from E to E, but I'm not quite sure how to divide it into individual sets.

2
  • 1
    Your message makes it very unclear what the split condition you want is. Is it every N rows? Do you want a new group starting each time you see an E? Is the "...." where you want to separate, or is that just your way of saying "I'm skipping rows here"? Etc. Commented Oct 10, 2015 at 18:20
  • @DSM I'm sorry, you are totally right. I want individual sets starting with 'E'. So, one set starts with 'E' and ends with line before another 'E'. As you can see x(0) starts with 'E' and ends with 17th line, just before another 'E' and so on. Commented Oct 10, 2015 at 19:00

2 Answers 2

1

You can use groupby here, using the compare-cumsum-groupby pattern (here let's say that the column with the Es is called "letter"):

>>> grouped = df.groupby((df["letter"] == "E").cumsum())
>>> frames = [g for k,g in grouped]
>>> for frame in frames:
...     print(frame)
...     print("--")
...     
  letter
0      E
1      T
2      U
--
  letter
3      E
4      M
--
  letter
5      E
--
  letter
6      E
--

This works because we can compare everything to E, creating a Series of booleans:

>>> df["letter"] == "E"
0     True
1    False
2    False
3     True
4    False
5     True
6     True
Name: letter, dtype: bool

and then if we take the cumulative sum of that we get

>>> (df["letter"] == "E").cumsum()
0    1
1    1
2    1
3    2
4    2
5    3
6    4
Name: letter, dtype: int32

where each new group has its own number. Reading the split-apply-combine section of the documentation is probably a good idea-- you might not even need to break everything up into subframes if the operation you want to perform on the groups is already supported.

Sign up to request clarification or add additional context in comments.

1 Comment

Wow, thanks. That's it! I have everything in 'frames'. Your help means alot, have a nice day.
0

You can first find indexes of rows with that value, then you can iterate through them and get slices of the dataframe:

def break_df_by(df, column, value):  
    ret = []
    indexes = df[df.ix[:, column] == value].index #[2, 5, 10] in the example below
    for n,i in enumerate(indexes):
        if n == 0: #if we want values before first 'E'
            ret.append(df[:i])
        elif n == len(indexes) - 1: #if we want values after last 'E'
            ret.append(df[i:])
        elif n > 0:
            ret.append(df[indexes[n-1]:i])

    return ret

An example, dataframe is:

    1
0   T
1   V
2   E
3   T
4   V
5   E
6   T
7   U
8   N
9   V
10  E
11  T
12  U
13  N

Result:

break_df_by(df, 1, 'E')

[   1
 0  T
 1  V,    1
 2  E
 3  T
 4  V,    1
 5  E
 6  T
 7  U
 8  N
 9  V,     1
 10  E
 11  T
 12  U
 13  N]

1 Comment

Thanks for your reply. Finding indexes of rows is great idea. I'll take a look and try on my dataset. I really appreciate your help.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.