2

I have five CSV files that have the same fields in the same order and need to be processed as follows:

  • Get list of files
  • Make each file into a dataframe
  • Check if a column of letter-number combinations has a specific value (different for each file), e.g. check if the number PT333 is in column1 for the file named data1:
column1   column2   column3
PT389     LA        image.jpg
PT372     NY        image2.jpg
  • If the column has a specific value, print which value it has and the filename/variable name that I've assigned to that file, and then rename that dataframe to output1

I tried to do this, but I don't know how to make it loop and do the same thing for each file. At the moment it returns the number, but I also want it to return the data frame name, and I also want it to loop through all the files (a to e) to check for all the values in the numbers list.

This is what I have:

import os
import glob
import pandas as pd
from glob import glob
from os.path import expanduser

home = expanduser("~")
os.chdir(home + f'/files/')

data = glob.glob('data*.csv')
data

# If you have tips on how to loop through these rather than 
# have a line for each one, open to feedback
a = pd.read_csv(data[0], encoding='ISO-8859-1', error_bad_lines=False)
b = pd.read_csv(data[1], encoding='ISO-8859-1', error_bad_lines=False)
c = pd.read_csv(data[2], encoding='ISO-8859-1', error_bad_lines=False)
d = pd.read_csv(data[3], encoding='ISO-8859-1', error_bad_lines=False)
e = pd.read_csv(data[4], encoding='ISO-8859-1', error_bad_lines=False)
filenames = [a,b,c,d,e]
filelist= ['a','b','c','d','e']

# I am aware that this part is repetitive. Unsure how to fix this,
# I keep getting errors
# Any help appreciated
numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']
def type():
    for i in a.column1:
        if i == numbers[0]:
            print(numbers[0])
        elif i == numbers[1]:
            print(numbers[1])
        elif i == numbers[2]:
            print(numbers[2])
        elif i == numbers[3]:
            print(numbers[3])
        elif i == numbers[4]:
            print(numbers[4])
type()

Also happy to take any constructive criticism as to how to repeat less code and make things smoother. TIA

1
  • If all you need is to check whether that particular string exists in the file, it might be easier to just read the file content and return the file name if the target is in the content. Also, in terms of critique, look into DRY (don't repeat yourself). There are lots of opportunities to put your code into loops or nested loops. Commented May 29, 2020 at 16:38

3 Answers 3

1

Give this a try

for file in glob.glob('data*.csv'):       # loop through each file
    df = pd.read_csv(file,                # create the DataFrame of the file
                     encoding='ISO-8859-1',
                     error_bad_lines=False)
    result = (df.where(df.isin(numbers))  # keep only the cells that match these numbers
                .melt()['value']          # melt the DF to be a series of 'value'
                .dropna()                 # remove any NaNs (non-matches)
                .unique().tolist())       # return the unique values as a list
    if result:                            # if there are any results
        print(file, ', '.join(result))    # print the file name and the results

The chained calls are wrapped in parentheses rather than joined with backslashes, so the inline comments will not cause a SyntaxError when you copy and paste the code.

As mentioned you should be able to do the same without DataFrame as well:

for file in glob.glob('data*.csv'):
    with open(file, encoding='ISO-8859-1') as f:   # file is a path string, so open it first
        data = f.read()
    for num in numbers:
        if num in data:
            print(file, num)
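
Note that the substring check above matches a value anywhere in the file, including inside longer strings such as PT3331 or in other columns. If the match should be limited to exact values in column1, a minimal stdlib-only sketch could look like the following. It assumes every file has a header row containing column1, and the helper name `values_in_column` is made up for illustration.

```python
import csv
import glob


def values_in_column(path, targets, column="column1", encoding="ISO-8859-1"):
    """Return the targets that appear as exact values in `column` of the CSV at `path`."""
    wanted = set(targets)
    found = set()
    with open(path, newline="", encoding=encoding) as handle:
        for row in csv.DictReader(handle):
            # Short rows yield None for missing fields, so fall back to "".
            value = (row.get(column) or "").strip()
            if value in wanted:
                found.add(value)
    # Keep the order of `targets` so the output is predictable.
    return [t for t in targets if t in found]


if __name__ == "__main__":
    numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']
    for file in sorted(glob.glob('data*.csv')):
        matches = values_in_column(file, numbers)
        if matches:
            print(file, ', '.join(matches))
```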

1 Comment

Share the love, pay it forward :)
1

Also happy to take any constructive criticism as to how to repeat less code and make things smoother.

I hope you don't mind that I started with a code restructure. It makes explaining the next steps easier.

Loading the Files Array

Using a list comprehension allows us to iterate through the files and load them into a list in one line. It is also usually a little faster than appending to a list inside an explicit loop.

files = [pd.read_csv(entry, encoding='ISO-8859-1', error_bad_lines=False) for entry in data]

more on comprehension

Type Function

First, we need an argument so that we can call this function for any given file. With the values in the numbers list, we can loop over the column with a for-each loop and check each value against that list.

def type(file):
    for value in file.column1:
        if value in numbers:
            print(value)

Calling the Type Function on Multiple Files

We use a for-each loop again here:

for file in files:
    type(file)

more on python for loops

Result


import os
import pandas as pd
from glob import glob
from os.path import expanduser

home = expanduser("~")
os.chdir(home + '/files/')

# Note: glob is imported as a function above, so call glob(...) rather than glob.glob(...).
data = glob('data*.csv')
files = [pd.read_csv(entry, encoding='ISO-8859-1', error_bad_lines=False) for entry in data]


numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']

def type(file):
    for value in file.column1:
        if value in numbers:
            print(value)

for file in files:
    type(file)
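
To also report which file each match came from (the follow-up raised in the comments), one option is to keep the DataFrames in a dict keyed by filename instead of a plain list. The sketch below is an illustration under stated assumptions rather than part of the code above: `load_frames` and `find_matches` are made-up helper names, and `on_bad_lines='skip'` is the replacement for `error_bad_lines=False` in pandas 1.3 and later.

```python
from glob import glob

import pandas as pd

numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']


def load_frames(pattern='data*.csv'):
    """Read every CSV matching `pattern` into a dict of {filename: DataFrame}."""
    return {
        path: pd.read_csv(path, encoding='ISO-8859-1', on_bad_lines='skip')
        for path in sorted(glob(pattern))
    }


def find_matches(frames, targets, column='column1'):
    """Map each filename to the targets found in `column`; files with no match are omitted."""
    matches = {}
    for name, frame in frames.items():
        # Normalise the column to stripped strings before comparing.
        present = set(frame[column].dropna().astype(str).str.strip())
        hits = [t for t in targets if t in present]
        if hits:
            matches[name] = hits
    return matches


if __name__ == "__main__":
    for name, hits in find_matches(load_frames(), numbers).items():
        print(name, ', '.join(hits))
```

Because the dict keeps each DataFrame reachable by name, the matched file can be looked up directly for its category-specific processing, which may remove the need to rename variables such as output1.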

3 Comments

This is super comprehensive and great- thanks. One thing - do you know how I can print the data frame name along with the value from the 'numbers' list it returns? @CWB
@HenriettaShalansky Glad to help. Before I change the code, I'd like to know what your goal is. Are you trying to count how many times a value is in each csv's column1?
Your solution helped, I figured out how to do it, and no - I'm trying to check whether any of the values in 'numbers' is in column1. If it is, it means the dataframe is of a certain category, and I then need to process each one differently
0

I would suggest changing the type function, and calling it slightly differently

    def type(x):
        for i in x.column1:
            if i == numbers[0]:
                print(i, numbers[0])
            elif i == numbers[1]:
                print(i, numbers[1])
            elif i == numbers[2]:
                print(i, numbers[2])
            elif i == numbers[3]:
                print(i, numbers[3])
            elif i == numbers[4]:
                print(i, numbers[4])

    for j in filenames:
        type(j)
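
To print the name alongside each match, the question's `filelist` and `filenames` lists can be paired with `zip`. This is only a sketch: `named_matches` is a hypothetical helper, and it assumes each object exposes a `column1` attribute the way the DataFrames in the question do.

```python
numbers = ['PT333', 'PT121', 'PT111', 'PT211', 'PT222']


def named_matches(names, frames, targets):
    """Return (name, value) pairs for every column1 value found in `targets`."""
    pairs = []
    # zip pairs each label ('a', 'b', ...) with the DataFrame at the same position.
    for name, frame in zip(names, frames):
        for value in frame.column1:
            if value in targets:
                pairs.append((name, value))
    return pairs


# With the question's variables this would be called as:
# for name, value in named_matches(filelist, filenames, numbers):
#     print(name, value)
```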

1 Comment

Thank you, this helped. How do I also print the dataframe name that the 'number' belongs to?
