
I have a table like so after performing some data scraping on a pdf:

index colA           colB                     colC colD colE colF colG
-------------------------------------------------------------------
 1    ABCD  veryLongTextThatShouldNotCutOff   12   x    x    x    x
 2    ABCD  veryLongText                      14   x    x    x    x
 4    NaN   ThatShouldNotCutOff               NaN  NaN  NaN  NaN  NaN
 5    ABCD  veryLongTextThatShouldNotCutOff   12   x    x    x    x
 6    ABCD  veryLongTextThatShouldNotCutOff   x    19   x    x    x
 7    ABCD  veryLongText                      21   x    x    x    x
 9    NaN   ThatShouldNotCutOff               NaN  NaN  NaN  NaN  NaN
 10   ABCD  veryLongTextThatShouldNotCutOff   13   x    x    x    x
...

I need to clean up the data so that split rows, such as the rows with index 2 and 4, are merged together as shown below, with the index renumbered consecutively:

index colA           colB                     colC colD colE colF colG
-------------------------------------------------------------------
 1    ABCD  veryLongTextThatShouldNotCutOff   12   x    x    x    x
 2    ABCD  veryLongTextThatShouldNotCutOff   14   x    x    x    x
 3    ABCD  veryLongTextThatShouldNotCutOff   12   x    x    x    x
 4    ABCD  veryLongTextThatShouldNotCutOff   x    19   x    x    x
 5    ABCD  veryLongTextThatShouldNotCutOff   21   x    x    x    x
 6    ABCD  veryLongTextThatShouldNotCutOff   13   x    x    x    x
...

Essentially, each row where colA is NaN needs to be "concatenated" with the row before it.

How do I do it using pandas?

1 Answer


You could use the non-NaN values in colA to define groups, concatenate colB within each group, and then drop the rows where colA is NaN:

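# each non-NaN colA starts a new group; NaN rows join the previous group
group = df['colA'].notna().cumsum()
# 'sum' on string values concatenates them within each group
df['colB'] = df.groupby(group)['colB'].transform('sum')

# drop the continuation rows, whose text is now merged into the row above
df = df.dropna(subset=['colA'])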

output:

   index  colA                             colB  colC colD colE colF colG
0      1  ABCD  veryLongTextThatShouldNotCutOff  12.0    x    x    x    x
1      2  ABCD  veryLongTextThatShouldNotCutOff  14.0    x    x    x    x
3      5  ABCD  veryLongTextThatShouldNotCutOff  12.0    x    x    x    x
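
The output above still carries the old row labels and the old `index` values, while the question asks for consecutive numbering. Here is a minimal end-to-end sketch of the same grouping idea that also renumbers the rows. The sample frame is rebuilt from the first rows of the question, with all cells as strings. The lambda with `''.join` replaces `'sum'`, and the name `result` is mine; both are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question (first five rows, fewer columns).
df = pd.DataFrame({
    "index": [1, 2, 4, 5, 6],
    "colA": ["ABCD", "ABCD", np.nan, "ABCD", "ABCD"],
    "colB": [
        "veryLongTextThatShouldNotCutOff",
        "veryLongText",
        "ThatShouldNotCutOff",
        "veryLongTextThatShouldNotCutOff",
        "veryLongTextThatShouldNotCutOff",
    ],
    "colC": ["12", "14", np.nan, "12", "x"],
    "colD": ["x", "x", np.nan, "x", "19"],
})

# Each non-NaN colA starts a new group; NaN rows inherit the previous group id.
group = df["colA"].notna().cumsum()

# Concatenate the text fragments within each group and broadcast to every row.
df["colB"] = df.groupby(group)["colB"].transform(lambda s: "".join(s))

# Drop continuation rows, then renumber the row labels and the index column.
result = df.dropna(subset=["colA"]).reset_index(drop=True)
result["index"] = np.arange(1, len(result) + 1)
print(result)
```

Joining explicitly assumes colB holds only strings; a NaN in colB would need to be filled (for example with an empty string) before joining.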


5 Comments

colB now has NaN, but he needed concatenation instead of that NaN.
thanks @GedasMiksenas that was a mistake of mine, I needed transform
Is there a way to do this without groupby? I don't think it is the right approach. I've updated my question for more clarity.
@mozway sorry, check again. I feel that a sliding window approach is more appropriate in my case, but I'm not sure how to do that in pandas
@yameroyo I don't see a fundamental difference (except having a few more rows)
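
For the groupby-free variant asked about in this thread, one possible sketch uses `shift` to look one row ahead. It assumes each split row has at most one continuation row, as in the sample data; the names `is_cont`, `has_cont`, `merged`, and `out` are made up for illustration, and longer runs of NaN rows would still need the grouping approach.

```python
import numpy as np
import pandas as pd

# Small sample in the shape of the question's table (colA and colB only).
df = pd.DataFrame({
    "colA": ["ABCD", "ABCD", np.nan, "ABCD"],
    "colB": [
        "veryLongTextThatShouldNotCutOff",
        "veryLongText",
        "ThatShouldNotCutOff",
        "veryLongTextThatShouldNotCutOff",
    ],
})

# Rows whose colA is NaN are continuations of the row directly above.
is_cont = df["colA"].isna()

# A row "has a continuation" when the next row is a continuation row.
has_cont = is_cont.shift(-1, fill_value=False).astype(bool)

# Where a continuation follows, append the next row's text; otherwise keep colB.
merged = df["colB"].where(~has_cont, df["colB"] + df["colB"].shift(-1))

# Keep only the primary rows and renumber them.
out = df.assign(colB=merged)[~is_cont].reset_index(drop=True)
print(out)
```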
