
This is a conceptual question, so no code or reproducible example.

I am processing data pulled from a database containing records from automated processes. A regular record has 14 fields: a unique ID and 13 metric fields, such as the creation date, the execution time, the customer ID, the job type, and so on. The database accumulates records at a rate of dozens a day, a couple of thousand per month.

Sometimes the processes fail, which produces malformed rows. Here is an example:

id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13 /*regular record, no error, 14 fields*/
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed" /*error in column 14*/
id3,m01,m02,"NO SUCH JOB error, failed" /*error in column 4*/
id4,m01,m02,m03,m04,m05,m06,"JOB failed, no time recorded" /*error in column 7*/

The requirements are to (1) populate a dashboard from the metrics, and (2) catalog the types of errors. The ideal solution uses read_csv with on_bad_lines set to some function that returns a dataframe. My hacky solution is to munge the data by hand, row by row, and create two data frames from the output. The presence of the bad lines can be reliably detected by the use of the keyword "failed." I have written the logic that collects the "failed" messages and produces a stacked bar chart by date. It works, but I'd rather use a total Pandas solution.
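For what it's worth, my hacky row-by-row munging looks roughly like this (a minimal sketch on mock data; the file contents and variable names are placeholders):

```python
import csv
import io

import pandas as pd

# Mock data shaped like the records above (the real file is proprietary).
raw = """id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed"
id3,m01,m02,"NO SUCH JOB error, failed"
"""

good_rows, bad_rows = [], []
for row in csv.reader(io.StringIO(raw)):
    # Bad lines are reliably flagged by the keyword "failed" in the last field.
    (bad_rows if "failed" in row[-1] else good_rows).append(row)

df_good = pd.DataFrame(good_rows)  # regular 14-field records for the dashboard
df_bad = pd.DataFrame({"id": [r[0] for r in bad_rows],
                       "error": [r[-1] for r in bad_rows]})  # error catalog
```

csv.reader handles the quoted, comma-bearing error messages correctly, which is why I don't just split on commas.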

Is it possible to use pd.read_csv() to return 2 dataframes? If so, how would this be done? Can you point me to any example code? Or am I totally off base? Thanks.

  • would you mind sharing a minimal [mock] data sample? Commented Jun 27, 2022 at 15:27
  • What's wrong with the on_bad_lines solution for you? Commented Jun 27, 2022 at 15:33
  • Sorry, the data is very, VERY proprietary. Just imagine that it contains cloud account numbers, usernames, passwords, database names (e.g., Postgres, SQL Server, Oracle), SQL commands (INSERT, DELETE, UPDATE, SELECT), contract IDs, cloud regions, created datetimes, access datetimes, running times, and so on. The domain is circumscribed and the data is very regular (including the failed processes). Commented Jun 27, 2022 at 15:36
  • What's wrong with on_bad_lines()? I'm stuck automating the process of extracting the malformed records and shoving them into a dashboard. I can do it by hand, and it works, but I don't know how to do it with Pandas. Commented Jun 27, 2022 at 15:38
  • This is not possible with just pd.read_csv; it is likely easier to use csv to separate the data into two CSVs, then, if you want, use pandas on those. Commented Jun 27, 2022 at 15:51

2 Answers


You can load your CSV file into a DataFrame and apply a filter:

import pandas as pd

df = pd.read_csv("your_file.csv", header=None)
# Flag rows where any field contains the keyword "failed"
df_filter = df.apply(lambda row: row.astype(str).str.contains('failed').any(), axis=1)

df_fail = df[df_filter]   # dataframe of "failed" rows
df_good = df[~df_filter]  # dataframe of "non failed" rows

You need to make sure that the keyword does not appear in your regular data.

PS: There might be more optimized ways to do it.
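As an aside: in pandas 1.4 and later, on_bad_lines also accepts a callable (python engine only). It is called for each bad line with the split fields as a list; returning None drops the line, so you can capture the failures on the side while read_csv builds the good DataFrame. One caveat: it only fires for lines with too many fields, while short rows are silently padded with NaN, so it may not catch every malformed row in this data. A minimal sketch on mock data:

```python
import io

import pandas as pd

# Mock data: regular rows have 3 fields; the bad row has extra fields.
data = io.StringIO(
    "id1,a,b\n"
    "id2,a,b,extra,failed\n"
)

bad_lines = []  # side channel collecting the malformed rows

def collect(fields):
    bad_lines.append(fields)
    return None  # returning None tells read_csv to skip the line

# A callable on_bad_lines requires engine="python" (pandas >= 1.4)
df_good = pd.read_csv(data, header=None, engine="python", on_bad_lines=collect)
df_bad = pd.DataFrame(bad_lines)
```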


2 Comments

This gives an error on the bad lines. I have to read the entire file without errors, keeping both the good lines and the bad lines, and process the two sets directly.
ANSWER: When all else fails, check to see if the power is turned on! I'm in a big corporate environment. My Pandas is woefully out of date, and updating it will require involvement of numerous committees, security, change config, division heads, etc. Of course it didn't work (but on my home machine it worked perfectly).

This approach reads the entire CSV into a single column, then uses a mask identifying the failed rows to split it into good and failed dataframes.

Read the entire CSV into a single column

import io

import pandas as pd

# widths=[999999] makes read_fwf treat each whole line as one wide field
dfs = pd.read_fwf(sim_csv, widths=[999999], header=None)

Build a mask identifying the failed rows

fail_msk = dfs[0].str.contains('failed')

Use that mask to split out and build separate dataframes

df_good = pd.read_csv(io.StringIO('\n'.join(dfs[~fail_msk].squeeze())), header=None)
df_fail = pd.read_csv(io.StringIO('\n'.join(dfs[fail_msk].squeeze())), header=None)
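A self-contained run on rows shaped like the question's sample (sim_csv here is a StringIO standing in for the real file; I added a second good row so squeeze() yields a Series rather than a scalar):

```python
import io

import pandas as pd

sim_csv = io.StringIO(
    "id0,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13\n"
    "id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13\n"
    'id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed"\n'
    'id3,m01,m02,"NO SUCH JOB error, failed"\n'
)

# Read each whole line as a single wide field
dfs = pd.read_fwf(sim_csv, widths=[999999], header=None)
fail_msk = dfs[0].str.contains('failed')

# Re-parse each subset as proper CSV; short failed rows are padded with NaN
df_good = pd.read_csv(io.StringIO('\n'.join(dfs[~fail_msk].squeeze())), header=None)
df_fail = pd.read_csv(io.StringIO('\n'.join(dfs[fail_msk].squeeze())), header=None)
```

Since the failed lines go back through read_csv, the quoted error messages (with their embedded commas) come out as single fields.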

1 Comment

Yes, this would work. However, I would very much like to have two data frames after the process without further data mangling. Second best would be to have the main file (with the vast majority of records) as a data frame, and process the failed lines by hand.
