
This is a conceptual question, so no code or reproducible example.

I am processing data pulled from a database containing records from automated processes. A regular record has 14 fields: a unique ID and 13 metric fields, such as the creation date, the execution time, the customer ID, the job type, and so on. The database accumulates records at a rate of dozens a day, a couple of thousand per month.

Sometimes the processes fail, which produces malformed rows. Here is an example:

id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13 /*regular record, no error, 14 fields*/
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed" /*error in column 14*/
id3,m01,m02,"NO SUCH JOB error, failed" /*error in column 4*/
id4,m01,m02,m03,m04,m05,m06,"JOB failed, no time recorded" /*error in column 7*/

The requirements are to (1) populate a dashboard from the metrics, and (2) catalog the types of errors. The ideal solution uses read_csv with on_bad_lines set to some function that returns a dataframe. My hacky solution is to munge the data by hand, row by row, and create two data frames from the output. The presence of the bad lines can be reliably detected by the use of the keyword "failed." I have written the logic that collects the "failed" messages and produces a stacked bar chart by date. It works, but I'd rather use a total Pandas solution.
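For what it's worth, my hacky row-by-row munging looks roughly like this (a minimal sketch on mock data; the file contents and variable names are placeholders):

```python
import csv
import io

import pandas as pd

# Mock data shaped like the records above (the real file is proprietary).
raw = """id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed"
id3,m01,m02,"NO SUCH JOB error, failed"
"""

good_rows, bad_rows = [], []
for row in csv.reader(io.StringIO(raw)):
    # Bad lines are reliably flagged by the keyword "failed" in the last field.
    (bad_rows if "failed" in row[-1] else good_rows).append(row)

df_good = pd.DataFrame(good_rows)  # regular 14-field records for the dashboard
df_bad = pd.DataFrame({"id": [r[0] for r in bad_rows],
                       "error": [r[-1] for r in bad_rows]})  # error catalog
```

csv.reader handles the quoted, comma-bearing error messages correctly, which is why I don't just split on commas.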

Is it possible to use pd.read_csv() to return 2 dataframes? If so, how would this be done? Can you point me to any example code? Or am I totally off base? Thanks.

  • would you mind sharing a minimal [mock] data sample? Commented Jun 27, 2022 at 15:27
  • What's wrong with the on_bad_lines solution for you? Commented Jun 27, 2022 at 15:33
  • Sorry, the data is very, VERY proprietary. Just imagine that it contains cloud account numbers, usernames, passwords, database names (e.g., Postgres, SQL Server, Oracle), SQL commands (INSERT, DELETE, UPDATE, SELECT), contract IDs, cloud regions, created datetimes, access datetimes, running times, and so on. The domain is circumscribed and the data is very regular (including the failed processes). Commented Jun 27, 2022 at 15:36
  • What's wrong with on_bad_lines()? I'm stuck automating the process of extracting the malformed records and shoving them into a dashboard. I can do it by hand, and it works, but I don't know how to do it with Pandas. Commented Jun 27, 2022 at 15:38
  • This is not possible with just pd.read_csv; it is likely easier to use csv to separate the data into two CSVs, then, if you want, use pandas on those. Commented Jun 27, 2022 at 15:51

2 Answers


You can load your CSV file into a DataFrame and apply a filter:

import pandas as pd

df = pd.read_csv("your_file.csv", header=None)
# Flag rows where any field contains the keyword "failed"
df_filter = df.apply(lambda row: row.astype(str).str.contains('failed').any(), axis=1)

df_fail = df[df_filter]   # dataframe of "failed" rows
df_good = df[~df_filter]  # dataframe of "non failed" rows

You need to make sure that the keyword does not appear in your regular data.

PS: There might be more optimized ways to do it.
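As an aside: in pandas 1.4 and later, on_bad_lines also accepts a callable (python engine only). It is called for each bad line with the split fields as a list; returning None drops the line, so you can capture the failures on the side while read_csv builds the good DataFrame. One caveat: it only fires for lines with too many fields, while short rows are silently padded with NaN, so it may not catch every malformed row in this data. A minimal sketch on mock data:

```python
import io

import pandas as pd

# Mock data: regular rows have 3 fields; the bad row has extra fields.
data = io.StringIO(
    "id1,a,b\n"
    "id2,a,b,extra,failed\n"
)

bad_lines = []  # side channel collecting the malformed rows

def collect(fields):
    bad_lines.append(fields)
    return None  # returning None tells read_csv to skip the line

# A callable on_bad_lines requires engine="python" (pandas >= 1.4)
df_good = pd.read_csv(data, header=None, engine="python", on_bad_lines=collect)
df_bad = pd.DataFrame(bad_lines)
```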


2 Comments

This gives an error on the bad lines. I have to read the entire file without errors, keeping both the good lines and the bad lines, and process the two sets directly.
ANSWER: When all else fails, check to see if the power is turned on! I'm in a big corporate environment. My Pandas is woefully out of date, and updating it will require involvement of numerous committees, security, change config, division heads, etc. Of course it didn't work (but on my home machine it worked perfectly).

This approach reads the entire CSV into a single column, then uses a mask identifying the failed rows to split it into good and failed dataframes.

Read the entire CSV into a single column

import io

import pandas as pd

# widths=[999999] makes read_fwf treat each whole line as one wide field
dfs = pd.read_fwf(sim_csv, widths=[999999], header=None)

Build a mask identifying the failed rows

fail_msk = dfs[0].str.contains('failed')

Use that mask to split out and build separate dataframes

df_good = pd.read_csv(io.StringIO('\n'.join(dfs[~fail_msk].squeeze())), header=None)
df_fail = pd.read_csv(io.StringIO('\n'.join(dfs[fail_msk].squeeze())), header=None)
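A self-contained run on rows shaped like the question's sample (sim_csv here is a StringIO standing in for the real file; I added a second good row so squeeze() yields a Series rather than a scalar):

```python
import io

import pandas as pd

sim_csv = io.StringIO(
    "id0,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13\n"
    "id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13\n"
    'id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed"\n'
    'id3,m01,m02,"NO SUCH JOB error, failed"\n'
)

# Read each whole line as a single wide field
dfs = pd.read_fwf(sim_csv, widths=[999999], header=None)
fail_msk = dfs[0].str.contains('failed')

# Re-parse each subset as proper CSV; short failed rows are padded with NaN
df_good = pd.read_csv(io.StringIO('\n'.join(dfs[~fail_msk].squeeze())), header=None)
df_fail = pd.read_csv(io.StringIO('\n'.join(dfs[fail_msk].squeeze())), header=None)
```

Since the failed lines go back through read_csv, the quoted error messages (with their embedded commas) come out as single fields.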

1 Comment

Yes, this would work. However, I would very much like to have two data frames after the process without further data mangling. Second best would be to have the main file (with the vast majority of records) as a data frame, and process the failed lines by hand.
