Python loop through list of csv and check for value?

Question

I have five .csv files with the same fields in the same order that need to be processed as follows:

Get list of files
Make each file into a dataframe
Check whether a column of letter-number combinations contains a specific value (different for each file), e.g. check whether PT333 is in column1 of the file named data1:

column1   column2   column3
PT389     LA        image.jpg
PT372     NY        image2.jpg

If the column has the specific value, print which value it has and the filename/variable name that I've assigned to that file, and then rename that dataframe to output1

I tried to do this, but I don't know how to make it loop and do the same thing for each file. At the moment it returns the number, but I also want it to return the data frame name, and I also want it to loop through all the files (a to e) to check for all the values in the numbers list.

This is what I have:

import os
import glob
import pandas as pd
from glob import glob
from os.path import expanduser

home = expanduser("~")
os.chdir(home + f'/files/')

data = glob.glob('data*.csv')
data

# If you have tips on how to loop through these rather than 
# have a line for each one, open to feedback
a = pd.read_csv(data[0], encoding='ISO-8859-1', error_bad_lines=False)
b = pd.read_csv(data[1], encoding='ISO-8859-1', error_bad_lines=False)
c = pd.read_csv(data[2], encoding='ISO-8859-1', error_bad_lines=False)
d = pd.read_csv(data[3], encoding='ISO-8859-1', error_bad_lines=False)
e = pd.read_csv(data[4], encoding='ISO-8859-1', error_bad_lines=False)
filenames = [a,b,c,d,e]
filelist= ['a','b','c','d','e']

# I am aware that this part is repetitive. Unsure how to fix this,
# I keep getting errors
# Any help appreciated
numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']
def type():
    for i in a.column1:
        if i == numbers[0]:
            print(numbers[0])
        elif i == numbers[1]:
            print(numbers[1])
        elif i == numbers[2]:
            print(numbers[2])
        elif i == numbers[3]:
            print(numbers[3])
        elif i == numbers[4]:
            print(numbers[4])
type()

Also happy to take any constructive criticism as to how to repeat less code and make things smoother. TIA

If all you need is to check whether that particular string exists in the file, it might be easier to just read the file content and return the file name if the target is in the content. Also, in terms of critique, look into DRY (Don't Repeat Yourself). There are lots of opportunities to put your code into loops or nested loops.
– r.ook, Commented May 29, 2020 at 16:38

r.ook · Accepted Answer · 2020-05-29 17:56:09Z

1

Give this a try

import glob
import pandas as pd

# numbers is the list defined in the question
for file in glob.glob('data*.csv'):       # loop through each file
    df = pd.read_csv(file,                # create the DataFrame of the file
                     encoding='ISO-8859-1',
                     on_bad_lines='skip')  # pandas >= 1.3; older versions use error_bad_lines=False
    result = (df.where(df.isin(numbers))  # keep only the cells that contain these numbers
                .melt()['value']          # melt the DF into a single series of 'value'
                .dropna()                 # remove any NaNs (non-matches)
                .unique().tolist())       # return the unique values as a list
    if result:                            # if there are any results
        print(file, ', '.join(result))    # print the file name and the results

If you copy and paste the code, remove the comments and trailing spaces from the result line in case you run into a SyntaxError.
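
To make the chained expression easier to follow, here is a minimal sketch that runs the same steps one at a time on a small in-memory DataFrame. The sample rows are made up for illustration; only the column names and the numbers list come from the question.

```python
import pandas as pd

numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']

# Hypothetical sample data standing in for one of the CSV files
df = pd.DataFrame({
    'column1': ['PT389', 'PT333', 'PT333'],
    'column2': ['LA', 'NY', 'TX'],
    'column3': ['image.jpg', 'image2.jpg', 'PT121'],
})

mask = df.isin(numbers)      # True wherever a cell equals one of the numbers
kept = df.where(mask)        # cells that do not match become NaN
long_form = kept.melt()      # stack every column into 'variable' / 'value'
result = long_form['value'].dropna().unique().tolist()
print(result)
```

Because isin is applied to the whole DataFrame, a match in any column is reported, not only matches in column1.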

As mentioned, you should be able to do the same without a DataFrame as well:

for file in glob.glob('data*.csv'):
    # file is only a path string, so open it before reading
    with open(file, encoding='ISO-8859-1') as f:
        data = f.read()
    for num in numbers:
        if num in data:
            print(file, num)
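
Note that a plain substring test matches the value anywhere in the file, in any column. If only the first column should count, a sketch with the standard csv module could look like the following. It assumes comma-separated files with a header row and column1 as the first field; the function names first_column_hits and report are hypothetical.

```python
import csv
import glob

numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']


def first_column_hits(path, targets, encoding='ISO-8859-1'):
    """Return the target values found in the first column, in file order."""
    wanted = set(targets)
    found = []
    with open(path, newline='', encoding=encoding) as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        for row in reader:
            # ignore empty rows and report each value only once
            if row and row[0] in wanted and row[0] not in found:
                found.append(row[0])
    return found


def report(pattern='data*.csv'):
    """Print each file name next to every target value in its first column."""
    for path in sorted(glob.glob(pattern)):
        for num in first_column_hits(path, numbers):
            print(path, num)
```

Calling report() from the directory that holds the files prints one line per file and matching value.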

edited May 29, 2020 at 17:56

answered May 29, 2020 at 17:18

r.ook


1 Comment

r.ook Over a year ago

Share the love, pay it forward :)

Community · Accepted Answer · 2020-06-20 09:12:55Z

1

Also happy to take any constructive criticism as to how to repeat less code and make things smoother.

I hope you don't mind that I started by restructuring the code. It makes explaining the next steps easier.

Loading the Files Array

Using a list comprehension lets us iterate through the files and load them into a list in one line. It is also usually a little faster than appending to a list inside an explicit loop.

# on_bad_lines='skip' needs pandas >= 1.3; older versions use error_bad_lines=False
files = [pd.read_csv(entry, encoding='ISO-8859-1', on_bad_lines='skip') for entry in data]

Type Function

First, the function needs a parameter so that we can call it for any given file. Combined with the list of files, this lets us loop over them with a for-each loop.
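
A minimal sketch of such a parameterized function is shown here. The name print_matches is hypothetical (the answer keeps the question's name type, which shadows Python's built-in), and the sample DataFrame is made up to stand in for one loaded file.

```python
import pandas as pd

numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']


def print_matches(df, targets=numbers):
    """Print and return every value in df['column1'] that is one of the targets."""
    matches = [value for value in df['column1'] if value in targets]
    for value in matches:
        print(value)
    return matches


# Tiny in-memory example standing in for one loaded CSV file
sample = pd.DataFrame({'column1': ['PT389', 'PT333'], 'column2': ['LA', 'NY']})
print_matches(sample)  # prints PT333
```

Returning the matches as well as printing them is a design choice that makes the function easier to reuse and test.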

Calling the Type Function on Multiple Files

We use a for-each loop again here:

for file in files:
    type(file)

Result


import os
import pandas as pd
from glob import glob
from os.path import expanduser

home = expanduser("~")
os.chdir(home + '/files/')

# Please note that I am using glob instead of glob.glob here,
# because "from glob import glob" imports the function directly.
data = glob('data*.csv')
# on_bad_lines='skip' needs pandas >= 1.3; older versions use error_bad_lines=False
files = [pd.read_csv(entry, encoding='ISO-8859-1', on_bad_lines='skip') for entry in data]

numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']

# Note: naming this function "type" shadows Python's built-in type()
def type(file):
    for value in file.column1:
        if value in numbers:
            print(value)

for file in files:
    type(file)
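
To also report which file each value came from, one option is to keep the DataFrames in a dict keyed by file name instead of a plain list. This is only a sketch under that assumption; load_frames and find_categories are hypothetical names, not part of the original answer.

```python
import glob

import pandas as pd

numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']


def load_frames(pattern='data*.csv'):
    """Map each file name to its DataFrame so the name travels with the data."""
    return {path: pd.read_csv(path, encoding='ISO-8859-1')
            for path in sorted(glob.glob(pattern))}


def find_categories(frames, targets=numbers):
    """Return {name: [target values present in column1]} for frames with a match."""
    found = {}
    for name, df in frames.items():
        hits = df.loc[df['column1'].isin(targets), 'column1'].unique().tolist()
        if hits:
            found[name] = hits
    return found


# Usage, run from the directory that contains the CSV files:
# frames = load_frames()
# for name, hits in find_categories(frames).items():
#     print(name, ', '.join(hits))
```

The returned dict can then drive the per-category processing, since each key identifies the file and each value lists the numbers found in it.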

edited Jun 20, 2020 at 9:12

CommunityBot


answered May 29, 2020 at 17:08

W-B


3 Comments

Henrietta Shalansky Over a year ago

This is super comprehensive and great- thanks. One thing - do you know how I can print the data frame name along with the value from the 'numbers' list it returns? @CWB

W-B Over a year ago

@HenriettaShalansky Glad to help. Before I change the code, I'd like to know what your goal is. Are you trying to count how many times a value is in each csv's column1?

Henrietta Shalansky Over a year ago

Your solution helped, and I figured out how to do it. And no, I'm trying to check if any of the values in 'numbers' is in column1. If it is, it means the dataframe is of a certain category, and I then need to process each of them differently.

user5138047 · Accepted Answer · 2020-05-29 16:51:39Z

0

I would suggest changing the type function, and calling it slightly differently

    def type(x):
        for i in x.column1:
            if i == numbers[0]:
                print(i, numbers[0])
            elif i == numbers[1]:
                print(i, numbers[1])
            elif i == numbers[2]:
                print(i, numbers[2])
            elif i == numbers[3]:
                print(i, numbers[3])
            elif i == numbers[4]:
                print(i, numbers[4])

    for j in filenames:
        type(j)

answered May 29, 2020 at 16:51

user5138047


1 Comment

Henrietta Shalansky Over a year ago

Thank you, this helped. How do I also print the dataframe name that the 'number' belongs to?
