Python on data analysis from CSV file

Question

I'm a Python beginner, and I was inspired by some Python courses. Below is an example CSV file.

Name	Location	Number
Andrew Platt Andrew	A B C	100
Steven Thunder Andrew	A B C	50
Jeff England Steven	A B C	30
Andrew England Jeff	A B C	30

I want to get a result like this:

['Andrew', 180
'Platt', 100
'Steven', 50
'Jeff', 60
'England', 60
'Andrew Platt', 100
'Platt Andrew', 100
'Steven Thunder', 50
'Thunder Andrew', 50
........]

Logic:

One-word name: e.g. 'Andrew' appears in rows 1, 2 and 4, so the result is 180 (100+50+30)
Two-word name: e.g. 'Andrew Platt' appears in row 1 only, so the result is 100
Export the result to a new CSV file

import csv
#from itertools import chain

#find one-word
filename=open('sample.csv', 'r')
file = csv.DictReader(filename)
one_word=[]
for col in file:
    one_word.append(col['Name'].split()) #find one-word
print(one_word)
#list(chain.from_iterable(one_word)) #this is another code I learned

#get result
#find two-word
#get result
#combine
#sorted by value
#export to a new CSV file

My problem is how to get the values, e.g. 180. Do I need to match each word, then get the 'Number' of every row that contains it and sum them all?

Note: the Location column is not used; this is just coding practice.

Updated: Maybe make 2 lists, i.e. one-word and two-word, then zip them

Zach Young · Accepted Answer · 2022-10-18 16:21:01Z

1

Looking at your expected result, I'm not sure how you get:

'Andrew Platt', 100
'Platt Andrew', 50

I see "Andrew Platt" and "Platt Andrew" in the first row, but both two-word combos should have the same value of 100, yes?

import csv
from collections import Counter
from itertools import combinations
from pprint import pprint

one_words = Counter()
two_words = Counter()

with open("input.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        items = row["Name"].split(" ")

        # Unique one-word
        for item in set(items):
            one_words[item] += int(row["Number"])

        for two_word in combinations(items, 2):
            # Skip combos like [Andrew Andrew]
            if len(set(two_word)) == 1:
                continue

            print(f"row is {type(row)}")
            print(f"two_word is {type(two_word)}")
            print(f"two_words is {type(two_words)}")

            two_words[" ".join(two_word)] += int(row["Number"])


pprint(one_words)
pprint(two_words)

I got:

Counter({'Andrew': 180,
         'Platt': 100,
         'Steven': 80,
         'England': 60,
         'Jeff': 60,
         'Thunder': 50})
Counter({'Andrew Platt': 100,
         'Platt Andrew': 100,
         'Steven Thunder': 50,
         'Steven Andrew': 50,
         'Thunder Andrew': 50,
         'Jeff England': 30,
         'Jeff Steven': 30,
         'England Steven': 30,
         'Andrew England': 30,
         'Andrew Jeff': 30,
         'England Jeff': 30})

My debug-print statements print:

row is <class 'dict'>
two_word is <class 'tuple'>
two_words is <class 'collections.Counter'>
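The question also asks to combine the two results, sort them by value and export them to a new CSV file. Here is a minimal sketch of those last steps, assuming the two Counters have already been filled as above. The totals are copied (and shortened) from the output above, the header names "Name" and "Total" are my own choice, and I write to an in-memory buffer so the snippet runs without touching the disk.

```python
import csv
import io
from collections import Counter

# Totals copied from the output above (shortened for brevity).
one_words = Counter({"Andrew": 180, "Platt": 100, "Steven": 80})
two_words = Counter({"Andrew Platt": 100, "Steven Thunder": 50})

# One-word and two-word keys never collide, so adding the Counters merges them.
# Note that Counter addition drops entries whose total is zero or negative.
combined = one_words + two_words

# most_common() returns (name, total) pairs sorted by total, highest first.
rows = combined.most_common()

# In-memory buffer for the demo; to write a real file, use something like
# open("result.csv", "w", newline="") instead (file name is hypothetical).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Name", "Total"])  # hypothetical header names
writer.writerows(rows)

print(buffer.getvalue())
```

If you would rather keep the one-word and two-word results in separate files, call most_common() on each Counter and write them out one at a time.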

edited Oct 18, 2022 at 16:21

answered Oct 15, 2022 at 16:28


11 Comments

Peter Over a year ago

'Platt Andrew', 100. That was a typo, my mistake. Corrected.

Peter Over a year ago

I tried running it on another data set. It shows: tuple indices must be integers or slices, not str. Data types: Name: object, Number: int64

Zach Young Over a year ago

@Peter I bet that’s because you’re trying to access a column by name, but you’re using the regular reader (not DictReader), which means you need to access columns by their 0-based position. Take a look, and if that isn’t it, please include the line number of the exception.
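To illustrate the difference between the two readers, here is a small sketch using an in-memory stand-in for the CSV file (the comma-separated layout is an assumption). One caveat: csv.reader yields lists, not tuples, so if the message really says "tuple", the rows may be coming from somewhere else and this may not be the actual cause.

```python
import csv
import io

# In-memory stand-in for the CSV file (comma-separated, which is an assumption).
data = "Name,Location,Number\nAndrew Platt Andrew,A B C,100\n"

# csv.DictReader: each row is a dict keyed by the header names.
dict_rows = list(csv.DictReader(io.StringIO(data)))
name_by_key = dict_rows[0]["Name"]

# csv.reader: each row is a plain list, so columns are accessed by 0-based position.
list_rows = list(csv.reader(io.StringIO(data)))
name_by_position = list_rows[1][0]  # list_rows[0] is the header row

# Indexing a list row with a column name raises a TypeError.
try:
    list_rows[1]["Name"]
    raised_type_error = False
except TypeError:
    raised_type_error = True

print(name_by_key, name_by_position, raised_type_error)
```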

Peter Over a year ago

I got an error message: string indices must be integers. It points at two_word[" ".]........

Zach Young Over a year ago

In my code, two_word is a two-item sequence. two_words (with an “s”) is the Counter that can be accessed by key (like a dict). But… I’m not sure what’s going on because that error message says two_word (no “s”) is a string… did you redefine that variable?


endive1783 · Accepted Answer · 2022-10-14 06:49:45Z

0

You need to get the unique names, and find combinations of two names. Then you can find if each name (1 or 2 words) is included in the first column.

import pandas as pd
import numpy as np
import itertools
#this is your data
df = pd.DataFrame([['Andrew Platt Andrew', 'Steven Thunder Andrew', 'Jeff England Steven',
              'Andrew England Jeff'], [100,50,30,30]] ).transpose()
df.columns = ['names','x']

#get the unique names that appear in the columns
names = df.names.apply(lambda x : x.split(' '))
one_words = np.unique(names.sum())

#get all combinations of two names
two_words = [a+' '+b for a,b in itertools.combinations(one_words, 2)]


#fill the dictionaries with the values
d_1 = {w : df.loc[df.names.str.contains(w),'x'].sum() for w in one_words}
d_2 = {w : df.loc[df.names.str.contains(w),'x'].sum() for w in two_words}

d = d_1 | d_2 #merge the dictionaries (the | operator needs Python 3.9+)

The output:

{'Andrew': 180,
 'England': 60,
 'Jeff': 60,
 'Platt': 100,
 'Steven': 80,
 'Thunder': 50,
 'Andrew England': 30,
 'Andrew Jeff': 0,
 'Andrew Platt': 100,
 'Andrew Steven': 0,
 'Andrew Thunder': 0,
 'England Jeff': 30,
 'England Platt': 0,
 'England Steven': 30,
 'England Thunder': 0,
 'Jeff Platt': 0,
 'Jeff Steven': 0,
 'Jeff Thunder': 0,
 'Platt Steven': 0,
 'Platt Thunder': 0,
 'Steven Thunder': 50}
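The zeros in this output have two causes. combinations over the sorted unique names produces each pair in alphabetical order only, so 'Platt Andrew' is never generated. And str.contains looks for a contiguous substring, so 'Andrew Jeff' scores 0 even though both words are in the row "Andrew England Jeff". If a two-word name is meant to be two adjacent words in a row (that is my reading of the expected result in the question, not something the question states), a plain-Python sketch could look like this, with the rows copied from the sample data:

```python
from collections import Counter

# Rows copied from the sample data: (Name, Number).
rows = [
    ("Andrew Platt Andrew", 100),
    ("Steven Thunder Andrew", 50),
    ("Jeff England Steven", 30),
    ("Andrew England Jeff", 30),
]

two_words = Counter()
for name, number in rows:
    words = name.split()
    # zip(words, words[1:]) pairs each word with its right-hand neighbour.
    # Collecting into a set means a pair repeated within one row counts once.
    pairs = {" ".join(pair) for pair in zip(words, words[1:])}
    for pair in pairs:
        two_words[pair] += number

print(two_words.most_common())
```

This only creates keys for pairs that actually occur, so there are no zero entries to filter out afterwards.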

answered Oct 14, 2022 at 6:49


4 Comments

Peter Over a year ago

Thanks for your answer, but what if names is a float? How should I modify the code in that case?

Shaig Hamzaliyev Over a year ago

names cannot be float, I mean if it is a normal name :)

Peter Over a year ago

I tried with another data set, but it shows an error: nothing to repeat at position 0

endive1783 Over a year ago

A quick fix is to use names = df.names.apply(lambda x : str(x).split(' ')). But feel free to edit the question with your actual data if this does not solve it!
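As for "nothing to repeat at position 0": that is the message Python's re module gives when a pattern starts with a quantifier such as * or +. Series.str.contains treats its argument as a regular expression by default, so a name beginning with such a character would trigger it. Whether that is what happens with the other data set is a guess on my part. The sketch below reproduces the message with the standard library only, using a made-up name "*Andrew". In pandas, the analogous options are passing regex=False to str.contains or wrapping the word in re.escape.

```python
import re

word = "*Andrew"            # hypothetical name starting with a regex quantifier
text = "Jeff *Andrew Steven"

# Used as a raw pattern, the leading "*" has nothing to repeat.
try:
    re.search(word, text)
    error_message = None
except re.error as exc:
    error_message = str(exc)

# Option 1: escape the word so every character is matched literally.
escaped_match = re.search(re.escape(word), text) is not None

# Option 2: skip regular expressions and do a plain substring test.
literal_match = word in text

print(error_message, escaped_match, literal_match)
```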
