
I am working on a large table in Python using the pandas library.

I would like to perform various kinds of vector operations, such as correlation, between the rows of the table.

It might be a simple problem, but I find the DataFrame structure difficult to work with. I do not know how to convert each row (or column) into a list (or NumPy array).

Even counting the number of rows does not seem simple, because a function like df.count() ignores null data.

A sample data table and the expected result are shown below. In this case, I would like to calculate the sum of each pair of rows.

The real table is much bigger (more than 1000 rows and columns) and contains some null values.


Data.csv:

Label Col1 Col2
Row1 1 2
Row2 3 4
Row3 5 6

Output.csv:

Label Col3
Row1,Row2 4,6
Row1,Row3 6,8
Row2,Row3 8,10
  • What do you mean by null values? Is the cell empty, NaN, or just equal to zero? What should the output look like when such a null value is present? Commented Nov 25, 2015 at 9:05
  • @albert Sorry for my poor explanation. You may think of it as a NaN value. As my real dataset is converted from an image (to float values), there are some such values. But in this case it does not matter, because I remove them in a preprocessing step. Commented Nov 25, 2015 at 9:11
  • You can get the number of rows from the shape attribute: df.shape[0] is the number of rows. Commented Nov 25, 2015 at 9:30
  • @Anton Protopopov Thank you for your advice. I confirmed that shape[0] returns the number of rows (excluding the label row), including rows with null values. Also, shape[1] returns the number of columns, including the labels and null values. Commented Nov 25, 2015 at 9:38
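
To illustrate the point made in these comments, here is a minimal sketch on a hypothetical three-row frame with one missing value (the data is made up for illustration). It contrasts df.shape, which counts every row and column, with df.count(), which counts non-null entries per column, and shows one way to get rows out as NumPy arrays.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value (not the asker's real data)
df = pd.DataFrame(
    {
        "Label": ["Row1", "Row2", "Row3"],
        "Col1": [1.0, np.nan, 5.0],
        "Col2": [2.0, 4.0, 6.0],
    }
)

# shape counts every row and column, whether or not cells are NaN
n_rows, n_cols = df.shape

# count() returns the number of non-null entries in each column
non_null = df.count()

# Numeric part of the table as a 2-D NumPy array; each values[i] is one row
values = df[["Col1", "Col2"]].to_numpy()
first_row = values[0]
```

Here n_rows is 3 even though Col1 holds a NaN, while non_null["Col1"] is 2.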

2 Answers

Pandas is a lot faster and more natural when working with columns. Thus, I would propose transposing the DataFrame first and then just summing the columns.

Link: Invert index and columns in a pandas DataFrame
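
A minimal sketch of this suggestion, assuming the sample Data.csv from the question is read with Label as the index (that choice is mine, not part of the answer). After transposing, each original row becomes a column, so a pair of rows can be summed with ordinary column arithmetic, and DataFrame.corr() gives the pairwise correlation between the original rows.

```python
import pandas as pd
from io import StringIO

data = """Label Col1 Col2
Row1 1 2
Row2 3 4
Row3 5 6
"""

# Use the Label column as the index so only numeric columns remain
df = pd.read_csv(StringIO(data), sep=r"\s+", index_col="Label")

# After transposing, the columns are Row1, Row2, Row3
dft = df.T

# Sum of one pair of original rows, computed as a column operation
pair_sum = dft["Row1"] + dft["Row2"]

# Pairwise correlation between the original rows;
# corr() excludes NaN values pair by pair
corr = dft.corr()
```

For the sample data, pair_sum holds 4 for Col1 and 6 for Col2, matching the first row of the expected Output.csv.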


3 Comments

Thank you for your advice. I knew that a table can easily be transposed with the T attribute, but I did not know that working with columns is much better. I will try to find a solution that uses the DataFrame directly.
@ToBeSpecific Yes, a pandas DataFrame is basically a collection of columns of different types, as opposed to a relational DB, which is a collection of rows. Each column is fast to operate on for typical statistical functions like sum(), mean(), var(), etc. Another piece of advice for computational purposes: coerce columns into a native numeric type (like float64 or int). Check stackoverflow.com/questions/18434208/…
My goal is to use slightly more complex functions, such as the Euclidean distance between each pair of vectors, but it seems they can be applied in the same way. In fact, I used to think that working with rows was more natural. Thank you very much for your tips.
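
For the Euclidean distance mentioned in the last comment, one possible sketch uses NumPy broadcasting on the numeric values. It assumes the NaN values have already been removed (as the asker says they do in preprocessing) and uses the small sample table; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample table from the question, with the labels as the index
df = pd.DataFrame(
    {"Col1": [1, 3, 5], "Col2": [2, 4, 6]},
    index=["Row1", "Row2", "Row3"],
)

# Coerce to a native float array, as suggested in the comments
x = df.to_numpy(dtype=float)

# diff[i, j] is the vector row_i - row_j (shape: n_rows x n_rows x n_cols)
diff = x[:, None, :] - x[None, :, :]

# Euclidean distance between every pair of rows
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Wrap the result so it can be looked up by row label
dist_df = pd.DataFrame(dist, index=df.index, columns=df.index)
```

Note that the intermediate diff array has n_rows × n_rows × n_cols elements, so for a table with more than 1000 rows and columns it would probably have to be computed in chunks.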

This is only part of the solution: each pair appears twice under slightly different labels (for example Row1Row2 and Row2Row1), so you cannot simply use the DataFrame's drop_duplicates method to remove the repeats:

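import pandas as pd
from io import StringIO

data = """
Label Col1 Col2
Row1 1 2
Row2 3 4
Row3 5 6
"""

# Parse the whitespace-separated sample table
df = pd.read_csv(StringIO(data), sep=r"\s+")
value_cols = df.columns.drop("Label")

frames = []
for row in range(df.shape[0]):
    current = df.iloc[row]
    # Every row except the current one
    others = df[df["Label"] != current["Label"]]
    # Add the current row's values to each of the other rows
    summed = others[value_cols] + pd.to_numeric(current[value_cols])
    # Build the combined label, e.g. "Row1" + "Row2" -> "Row1Row2"
    summed.insert(0, "Label", current["Label"] + others["Label"])
    frames.append(summed)

# Stack the per-row results and renumber the index from 0
df1 = pd.concat(frames, ignore_index=True)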

In [103]: df1
Out[103]:
      Label Col1 Col2
0  Row1Row2    4    6
1  Row1Row3    6    8
2  Row2Row1    4    6
3  Row2Row3    8   10
4  Row3Row1    6    8
5  Row3Row2    8   10

3 Comments

I thought that I had to convert the DataFrame into lists, but it seems this can be done directly with the DataFrame. Although some more work is needed, thank you very much for your answer.
@ToBeSpecific Show your code when you have finished it.
At present, I can only do this by making lists from the DataFrame and running a double for-loop over the rows, which seems like a brute-force method. I would like to find a solution that uses the DataFrame itself.
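
One possible way to avoid both the duplicated pairs and the double for-loop, offered as a sketch and not as the answerer's method: np.triu_indices yields each index pair (i, j) with i < j exactly once, and NumPy fancy indexing then sums all pairs in one step. The comma-joined label format follows the Output.csv in the question.

```python
import numpy as np
import pandas as pd

# Sample table from the question
df = pd.DataFrame(
    {
        "Label": ["Row1", "Row2", "Row3"],
        "Col1": [1, 3, 5],
        "Col2": [2, 4, 6],
    }
)

value_cols = ["Col1", "Col2"]
values = df[value_cols].to_numpy()
labels = df["Label"].to_numpy()

# Index pairs (i, j) with i < j, so each pair of rows appears once
i, j = np.triu_indices(len(df), k=1)

# Sum the two rows of every pair in a single vectorized step
result = pd.DataFrame(values[i] + values[j], columns=value_cols)
result.insert(0, "Label", [f"{a},{b}" for a, b in zip(labels[i], labels[j])])
```

For the sample data this gives the three rows Row1,Row2 (4, 6), Row1,Row3 (6, 8) and Row2,Row3 (8, 10). If NaN values are present, the sums for the affected pairs are NaN as well.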
