
I have a column of values, which are part of a dataframe df.

Value 
6.868061881
6.5903628020000005
6.472865833999999
6.427754219
6.40081742
6.336348032
6.277545389
6.250755132

These values have been put together from several CSV files. Now I'm trying to backtrack and find the original CSV file that contains each value. This is my code. The problem is that each row of the CSV files contains alphanumeric entries, while I'm comparing only numeric ones (the values above), so the comparison never matches.

import csv

for item in df['Value']:
    for file in dirs:
        with open(file) as f:
            for row in csv.reader(f):
                for column in row:
                    if str(column) == str(item):
                        print(file)

Plus, I'm trying to reduce the number of loops. How do I approach this?

3 Comments

  • "isn't working"? I suppose you're getting a type mismatch error due to alphanumeric / numeric? What if you simply cast both to string: if str(column) == str(item)? Or, you could check types before doing the comparison: if type(column) is type(item) and column == item: — that way you're only comparing like types. Commented May 10, 2019 at 17:42
  • As David Zemens asks, what's the specific problem you're having? Also, do you care about finding all these values or just one of them? Commented May 10, 2019 at 17:43
  • @DavidZemens: Typecasting did it! Also, can we vectorize the loops? Commented May 10, 2019 at 17:51
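A minimal sketch of the string-cast comparison suggested in the comments (the values here are made up for illustration):

```python
# csv.reader always yields strings, while df['Value'] holds floats;
# casting both sides to str makes the comparison meaningful
item = 6.868061881          # a float, as it would come from df['Value']
column = "6.868061881"      # a cell, as csv.reader would yield it

print(str(column) == str(item))
```

Note this relies on the float printing back exactly as it appeared in the CSV, which holds for values round-tripped through pandas but not for arbitrarily formatted numbers.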

3 Answers


Assuming dirs is a list of file paths to CSV files:

import pandas as pd

csv_dfs = {file: pd.read_csv(file) for file in dirs}
csv_df = pd.concat(csv_dfs)

If you're just looking in the 'Values' column, this is pretty straightforward:

print(csv_df[csv_df['Values'].isin(df['Values'])])

Because we made the dataframe from a dictionary of the files, where the keys are filenames, the printed values will have the original filename in the index.
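As a small, self-contained illustration of why that works (the file names and numbers here are invented), the dictionary keys become the first level of the concatenated index:

```python
import pandas as pd

# Hypothetical stand-ins for the CSV files' contents, keyed by filename
csv_dfs = {
    "a.csv": pd.DataFrame({"Values": [6.868061881, 1.0]}),
    "b.csv": pd.DataFrame({"Values": [2.0, 6.336348032]}),
}
csv_df = pd.concat(csv_dfs)

df = pd.DataFrame({"Values": [6.868061881, 6.336348032]})
matches = csv_df[csv_df["Values"].isin(df["Values"])]

# The originating filename is the first index level of each matched row
print(matches.index.get_level_values(0).tolist())
# → ['a.csv', 'b.csv']
```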


In a comment, you asked how to just get the filenames. Because of the way we constructed the dataframe's index, the following should work to get a series of the filenames:

csv_df[csv_df['Values'].isin(df['Values'])].reset_index()['level_0']
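For example (with invented file names), after `reset_index()` the first index level surfaces as the `level_0` column, which carries the filenames:

```python
import pandas as pd

# Dict keys stand in for filenames; values for the parsed CSVs
csv_df = pd.concat({
    "a.csv": pd.DataFrame({"Values": [6.868061881]}),
    "b.csv": pd.DataFrame({"Values": [2.0]}),
})
df = pd.DataFrame({"Values": [6.868061881]})

matched = csv_df[csv_df["Values"].isin(df["Values"])]
filenames = matched.reset_index()["level_0"]
print(filenames.tolist())
# → ['a.csv']
```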

Note, if you're not sure which column in the CSVs you're matching on, you can loop over the columns:

for col in csv_df.columns:
    print(csv_df[csv_df[col].isin(df['Values'])])

3 Comments

I get my required information as 36x196. How do I extract only the first column? It does the job otherwise!
What do you mean by "extract only the first column"? Extract the first column of what? Do you mean you only want to perform the search on one column that's in all the CSVs?
Exactly. The Values column also exists in all the CSVs. I want to get back only the filename. Extract the first column of the output (which equals the filename)

A few suggestions:

Make sure you're comparing like types, e.g.:

if str(column) == str(item):

Or, you could check types before doing the comparison:

if type(column) is type(item) and column == item:

Or, dump your CSV into a DataFrame. This approach reduces the number of loops since you don't need to iterate the rows/lines in the file, just the columns:

from pandas import read_csv

for item in df['Value']:
    for file in dirs:
        csv_frame = read_csv(file)
        for column in csv_frame.columns:
            # .values matters here: `in` on a Series checks the index, not the data
            if item in csv_frame[column].values:
                print(file)
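A self-contained sketch of this DataFrame approach, using a throwaway CSV file in place of the real `dirs` (file name and numbers are invented):

```python
import os
import tempfile

import pandas as pd

# Write one throwaway CSV to stand in for an entry of `dirs`
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.csv")
pd.DataFrame({"Name": ["a", "b"], "Value": [6.868061881, 1.5]}).to_csv(path, index=False)
dirs = [path]

df = pd.DataFrame({"Value": [6.868061881]})
found = []
for item in df["Value"]:
    for file in dirs:
        csv_frame = pd.read_csv(file)
        for column in csv_frame.columns:
            # .values matters: `in` on a bare Series checks the index labels
            if item in csv_frame[column].values:
                found.append(file)
print(found)
```

Mixed alphanumeric columns (like `Name` here) are handled naturally: the membership test simply comes back False for them.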

3 Comments

I tried your reduced code, with a counter below the print(file). The counter reads 4886 upon completion. The items in df['Value'] are all distinct, as in they exist in only one CSV file. So, the counter should return the length of the Values column? When I run my edited code with a counter, it gives 21.
I suppose the counter should return 4886 if all the items exist in one of the files, yes.
For more diagnostics, print('item {0} was found in column {1} of file {2}'.format(item, column, file)).

File I/O generally takes more time than processing data in memory. So, if you want to optimize your code, it is better to loop through the CSV files once, rather than once for every item in your dataframe. I suggest the following:

import numpy as np
import pandas as pd

val_list = df['Values'].values
for file in dirs:
    csv_df = pd.read_csv(file)
    df_contains = csv_df.isin(val_list)
    if np.any(df_contains.values):
        print(file)
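As a runnable sketch of this one-pass approach, with in-memory frames standing in for the `pd.read_csv(file)` results (file names and numbers are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the parsed CSV files, keyed by filename
frames = {
    "a.csv": pd.DataFrame({"Name": ["x", "y"], "Values": [6.868061881, 1.0]}),
    "b.csv": pd.DataFrame({"Name": ["z"], "Values": [2.0]}),
}
df = pd.DataFrame({"Values": [6.868061881, 6.336348032]})

val_list = df["Values"].values
hits = []
for file, csv_df in frames.items():
    # isin tests every cell of the frame against val_list in one shot
    df_contains = csv_df.isin(val_list)
    if np.any(df_contains.values):
        hits.append(file)
print(hits)
# → ['a.csv']
```

Each file is read and scanned exactly once, regardless of how many values are in `df`.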

