
I have a column of values, which are part of a dataframe df.

Value 
6.868061881
6.5903628020000005
6.472865833999999
6.427754219
6.40081742
6.336348032
6.277545389
6.250755132

These values have been put together from several CSV files. Now I'm trying to backtrack and find the original CSV file that contains each value. This is my code. The problem is that each row of the CSV files contains alphanumeric entries, while I'm comparing only numeric ones (the values above), so the comparison never matches.

import csv

for item in df['Value']:
    for file in dirs:
        with open(file) as f:
            for row in csv.reader(f):
                for column in row:
                    if str(column) == str(item):
                        print(file)

Plus, I'm trying to reduce the number of loops. How do I approach this?

3 Comments

  • "isn't working"? I suppose you're getting a type mismatch error due to alphanumeric / numeric? What if you simply cast both to string: if str(column) == str(item)? Or, you could check types before doing the comparison: if type(column) is type(item) and column == item: — that way you're only comparing like types. Commented May 10, 2019 at 17:42
  • As David Zemens asks, what's the specific problem you're having? Also, do you care about finding all these values or just one of them? Commented May 10, 2019 at 17:43
  • @DavidZemens: Typecasting did it! Also, can we vectorize the loops? Commented May 10, 2019 at 17:51
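A minimal sketch of the string-cast comparison suggested in the comments (the values here are made up for illustration):

```python
# csv.reader always yields strings, while df['Value'] holds floats;
# casting both sides to str makes the comparison meaningful
item = 6.868061881          # a float, as it would come from df['Value']
column = "6.868061881"      # a cell, as csv.reader would yield it

print(str(column) == str(item))
```

Note this relies on the float printing back exactly as it appeared in the CSV, which holds for values round-tripped through pandas but not for arbitrarily formatted numbers.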

3 Answers


Assuming dirs is a list of file paths to CSV files:

import pandas as pd

csv_dfs = {file: pd.read_csv(file) for file in dirs}
csv_df = pd.concat(csv_dfs)

If you're just looking in the 'Values' column, this is pretty straightforward:

print(csv_df[csv_df['Values'].isin(df['Values'])])

Because we made the dataframe from a dictionary of the files, where the keys are filenames, the printed values will have the original filename in the index.
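As a small, self-contained illustration of why that works (the file names and numbers here are invented), the dictionary keys become the first level of the concatenated index:

```python
import pandas as pd

# Hypothetical stand-ins for the CSV files' contents, keyed by filename
csv_dfs = {
    "a.csv": pd.DataFrame({"Values": [6.868061881, 1.0]}),
    "b.csv": pd.DataFrame({"Values": [2.0, 6.336348032]}),
}
csv_df = pd.concat(csv_dfs)

df = pd.DataFrame({"Values": [6.868061881, 6.336348032]})
matches = csv_df[csv_df["Values"].isin(df["Values"])]

# The originating filename is the first index level of each matched row
print(matches.index.get_level_values(0).tolist())
# → ['a.csv', 'b.csv']
```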


In a comment, you asked how to just get the filenames. Because of the way we constructed the dataframe's index, the following should work to get a series of the filenames:

csv_df[csv_df['Values'].isin(df['Values'])].reset_index()['level_0']
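For example (with invented file names), after `reset_index()` the first index level surfaces as the `level_0` column, which carries the filenames:

```python
import pandas as pd

# Dict keys stand in for filenames; values for the parsed CSVs
csv_df = pd.concat({
    "a.csv": pd.DataFrame({"Values": [6.868061881]}),
    "b.csv": pd.DataFrame({"Values": [2.0]}),
})
df = pd.DataFrame({"Values": [6.868061881]})

matched = csv_df[csv_df["Values"].isin(df["Values"])]
filenames = matched.reset_index()["level_0"]
print(filenames.tolist())
# → ['a.csv']
```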

Note, if you're not sure which column in the CSVs you're matching on, you can loop over the columns:

for col in csv_df.columns:
    print(csv_df[csv_df[col].isin(df['Values'])])

3 Comments

I get my required information as 36x196. How do I extract only the first column? It does the job otherwise!
What do you mean by "extract only the first column"? Extract the first column of what? Do you mean you only want to perform the search on one column that's in all the CSVs?
Exactly. The Values column also exists in all the CSVs. I want to get back only the filename. Extract the first column of the output (which equals the filename)

A few suggestions:

Make sure you're comparing like types, e.g.:

if str(column) == str(item):

Or, you could check types before doing the comparison:

if type(column) is type(item) and column == item:

Or, dump your CSV into a DataFrame. This approach reduces the number of loops since you don't need to iterate the rows/lines in the file, just the columns:

from pandas import read_csv

for item in df['Value']:
    for file in dirs:
        csv_frame = read_csv(file)
        for column in csv_frame.columns:
            # .values matters here: `in` on a Series checks the index, not the data
            if item in csv_frame[column].values:
                print(file)
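A self-contained sketch of this DataFrame approach, using a throwaway CSV file in place of the real `dirs` (file name and numbers are invented):

```python
import os
import tempfile

import pandas as pd

# Write one throwaway CSV to stand in for an entry of `dirs`
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.csv")
pd.DataFrame({"Name": ["a", "b"], "Value": [6.868061881, 1.5]}).to_csv(path, index=False)
dirs = [path]

df = pd.DataFrame({"Value": [6.868061881]})
found = []
for item in df["Value"]:
    for file in dirs:
        csv_frame = pd.read_csv(file)
        for column in csv_frame.columns:
            # .values matters: `in` on a bare Series checks the index labels
            if item in csv_frame[column].values:
                found.append(file)
print(found)
```

Mixed alphanumeric columns (like `Name` here) are handled naturally: the membership test simply comes back False for them.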

3 Comments

I tried your reduced code, with a counter below the print(file). The counter reads 4886 upon completion. The items in df['Value'] are all distinct, as in they exist in only one CSV file. So, the counter should return the length of the Values column? When I run my edited code with a counter, it gives 21.
I suppose the counter should return 4886 if all the items exist in one of the files, yes.
For more diagnostics, print('item {0} was found in column {1} of file {2}'.format(item, column, file)).

File I/O generally takes more time than processing data in memory. So, if you want to optimize your code, it is better to loop through the CSV files once, rather than once for every item in your dataframe. I suggest the following:

import numpy as np
import pandas as pd

val_list = df['Values'].values
for file in dirs:
    csv_df = pd.read_csv(file)
    df_contains = csv_df.isin(val_list)
    if np.any(df_contains.values):
        print(file)
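As a runnable sketch of this one-pass approach, with in-memory frames standing in for the `pd.read_csv(file)` results (file names and numbers are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the parsed CSV files, keyed by filename
frames = {
    "a.csv": pd.DataFrame({"Name": ["x", "y"], "Values": [6.868061881, 1.0]}),
    "b.csv": pd.DataFrame({"Name": ["z"], "Values": [2.0]}),
}
df = pd.DataFrame({"Values": [6.868061881, 6.336348032]})

val_list = df["Values"].values
hits = []
for file, csv_df in frames.items():
    # isin tests every cell of the frame against val_list in one shot
    df_contains = csv_df.isin(val_list)
    if np.any(df_contains.values):
        hits.append(file)
print(hits)
# → ['a.csv']
```

Each file is read and scanned exactly once, regardless of how many values are in `df`.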

