I'm new to Python. Does anybody have an idea of what a good approach would be? I could write the script by hand, but it's probably faster to use a package.

I have this .csv file (several gigabytes in size):

name,   value,  time
A,   1, 10
B,   2, 10
C,   3, 10
C,   3, 10 (should ignore duplicates, or incomplete (A,B,C) entries)
A,   4, 12 (should be sorted by time, this entry should be at the end, after time==11)
B,   5, 12
C,   6, 12
B,   7, 11 (order of A,B,C might be different)
C,   8, 11
A,   9, 11

I want to convert it to a new .csv file containing:

time,   A,  B,  C
10, 1,  2,  3
11, 9,  7,  8
12, 4,  5,  6
  • What OS are you working on? Commented Apr 10, 2018 at 13:08
  • What's your code so far? Commented Apr 10, 2018 at 13:08
  • A good approach would be to research how you can parse CSV with python, and figure out an algorithm that will do what you want. Hope this helps! Commented Apr 10, 2018 at 13:10
  • Aside from filtering, the operation you're trying to do is converting long-form data to wide. Commented Apr 10, 2018 at 13:12

2 Answers


I think you need drop_duplicates with pivot:

# newer pandas versions require keyword arguments for pivot
df = df.drop_duplicates().pivot(index='time', columns='name', values='value')
print (df)
name  A  B  C
time         
10    1  2  3
11    9  7  8
12    4  5  6
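
For completeness, here is a minimal sketch of the whole round trip, from reading the CSV to writing the result. It assumes the file has the layout shown in the question (blanks after the commas, no trailing notes). The inline sample text and the io.StringIO buffers are stand-ins for real file paths, which the question does not give.

```python
import io

import pandas as pd

# Sample text standing in for the real file (assumption: same layout as the question).
raw = """name,   value,  time
A,   1, 10
B,   2, 10
C,   3, 10
C,   3, 10
A,   4, 12
B,   5, 12
C,   6, 12
B,   7, 11
C,   8, 11
A,   9, 11
"""

# skipinitialspace=True strips the blanks that follow each comma,
# so the columns are named "value" and "time" rather than "   value".
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Drop exact duplicate rows, then reshape from long to wide form.
wide = (
    df.drop_duplicates()
      .pivot(index="time", columns="name", values="value")
)

# Write the result as CSV; pass a file path instead of the buffer for real use.
out = io.StringIO()
wide.to_csv(out)
csv_text = out.getvalue()
print(csv_text)
```

Note that drop_duplicates only removes rows that are identical in every column. If the same name appears twice for one time with different values, pivot raises a ValueError, so such rows would need to be resolved first.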

3 Comments

Thanks, very useful! I was stuck at just reading the csv file. Together with the addition of @divyang this solved my question.
Very useful command this pivot(), the documentation has a very similar example pandas.pydata.org/pandas-docs/stable/generated/…
@Sheldon Glad I could help!

Since I can't comment, I would like to add to @jezrael's answer that you would also want to drop incomplete or NaN values, using df.dropna:

import numpy as np
import pandas as pd
A = 'a'
B = 'b'
C = 'c'
df = pd.DataFrame([[A,   1, 10],
                [B,   2, 10],
                [C,   3, 10],
                [C,   3, 10],
                [A,   4, 12],
                [B,   5, 12],
                [C,   6, 12],
                [B,   7, 11],
                [C,   8, 11],
                [A,   9, 11],
                [np.nan, 10, 0]], columns = ["name","value", "time"])
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
df = df.pivot(index='time', columns='name', values='value')
print(df)
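
One caveat: calling dropna before the pivot only removes rows that have a missing field. A time that lacks one of A, B, or C still survives the pivot, as a row containing NaN. A minimal sketch of filtering those out afterwards, using made-up data in which time 13 has no C entry:

```python
import pandas as pd

# Hypothetical data: time 13 is incomplete (no C entry).
df = pd.DataFrame(
    [["A", 1, 10], ["B", 2, 10], ["C", 3, 10],
     ["A", 4, 13], ["B", 5, 13]],
    columns=["name", "value", "time"],
)

wide = df.drop_duplicates().pivot(index="time", columns="name", values="value")

# After the pivot, a missing A/B/C entry shows up as NaN in that row,
# so dropna() here discards every incomplete time.
complete = wide.dropna()

# NaN forces the affected columns to float; cast back once the gaps are gone.
complete = complete.astype(int)
print(complete)
```

Whether incomplete times should be dropped or kept with empty cells depends on what the output file is used for; the question asks for them to be ignored, which is what this sketch does.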
