Python - Row calculation using Pandas

Question

I am working on large table using python (using pandas library).

I would like to perform various kind of vector operations such as Correlation with each rows of the table.

It might be a simple problem, but for me it is difficult to deal with the DataFrame structure. I do not have a good idea about how to convert each row (or column) into lists (or numpy arrays).

Even counting the number of rows does not seem to be a simple problem, because function like df.count() seems to ignore null data.

Simple data table and the expected result table are like below. In this case, I would like to calculate sum of each row pairs.

The size of real table is much bigger (more than 1000 rows and columns) and contains some null values.

Data.csv:

Label Col1 Col2
Row1 1 2
Row2 3 4
Row3 5 6

Output.csv:

Label Col3
Row1,Row2 4,6
Row1,Row3 6,8
Row2,Row3 8,10

What do you mean with null values? Is there an empty or a NaN value or is the value just equal to zero. What do you want the output to be like if there is such a null value? — albert
– albert, Commented Nov 25, 2015 at 9:05
@albert Sorry for my poor explanation. You may think it as NaN value. As my real dataset is converted from image(to float value), there are some values. But in this case it does not matter because I remove them as a preprocessing. — ToBeSpecific
– ToBeSpecific, Commented Nov 25, 2015 at 9:11
You could get number of rows with shape method: df.shape[0] will be amount of rows. — Anton Protopopov
– Anton Protopopov, Commented Nov 25, 2015 at 9:30
@Anton Protopopov Thank you for your advice. I confirmed that shape[0] returned the number of rows(except label) including null values. Also, shape[1] returned the number of columns, including lables and null values. — ToBeSpecific
– ToBeSpecific, Commented Nov 25, 2015 at 9:38

Community · Accepted Answer · 2017-05-23 11:44:41Z

1

Pandas is a lot faster and more natural when working with columns. Thus, I would propose to transpose DF first, and then just sum columns

Link: Invert index and columns in a pandas DataFrame

edited May 23, 2017 at 11:44

CommunityBot

11 silver badge

answered Nov 25, 2015 at 13:38

Severin Pappadeux

20.4k4 gold badges45 silver badges71 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

ToBeSpecific Over a year ago

Thank you for your advice. I knew that transposing table can be easily done by T function, but I did not know that working with columns is much better. I'd better find solution using DataFrame.

Severin Pappadeux Over a year ago

@ToBeSpecific yes, pandas DataFrame is basically collection/list of columns of different type, as opposed to relational DB which is collection of rows. Each column is fast to operate on for typical stat functions like sum(), mean(), variance() etc. Another advice for computational purposes coerce columns into native numeric type (like float64 or int). Check stackoverflow.com/questions/18434208/…

ToBeSpecific Over a year ago

Though my goal is to use a little more complex functions, such as euclidean distance between each vectors, it seems to be applied in the same way. In fact, I thought that working with rows is more natural before. Thank you very much for your tips.

Anton Protopopov · Accepted Answer · 2015-11-25 10:10:02Z

1

Part of the solution, because you'll have duplicated rows with slightly different names so you couldn't apply drop_duplicates method of dataframes:

import pandas as pd
from io import StringIO

data = """
Label Col1 Col2
Row1 1 2
Row2 3 4
Row3 5 6
"""

df1 = pd.DataFrame()

for row in range(df.shape[0]):
   df1 = pd.concat([df1, df.ix[row,:] + df[df['Label'] != df.Label[row]]])

df1.reset_index(drop=True, inplace=True)

In [103]: df1
Out[103]:
      Label Col1 Col2
0  Row1Row2    4    6
1  Row1Row3    6    8
2  Row2Row1    4    6
3  Row2Row3    8   10
4  Row3Row1    6    8
5  Row3Row2    8   10

answered Nov 25, 2015 at 10:10

Anton Protopopov

31.9k13 gold badges93 silver badges96 bronze badges

3 Comments

ToBeSpecific Over a year ago

I thought that I have to convert DataFrame into lists, but it seems that it can be done directly with DataFrame. Although some more work would be needed, thank you very much for your answer.

Anton Protopopov Over a year ago

@ToBeSpecific show your code when you'll finish that.

ToBeSpecific Over a year ago

At present, I could only do this by making lists from DataFrame and run double for-loop for each rows, which seems like a bruteforce method. I would like to find the solution using DataFrame itself.

Collectives™ on Stack Overflow

Python - Row calculation using Pandas

2 Answers 2

3 Comments

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related