One doable solution is to read the fixed-width formatted lines into a DataFrame with the pandas.read_fwf() method, assigning a temporary column name col1, and then split that column apart with a regex.
Your raw data:
$ cat source_data.csv
CODE,Name,Value
1,ABC (CEF) CO.,XYZ,500
2,GOOD VALUE CO., XYZ,20
DataFrame:
>>> df = pd.read_fwf('source_data.csv', names=['col1'])
>>> df
col1
0 CODE,Name,Value
1 1,ABC (CEF) CO.,XYZ,500
2 2,GOOD VALUE CO., XYZ,20
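To reproduce this step without the file on disk, the same trick can be sketched with an in-memory buffer (io.StringIO standing in for source_data.csv is an assumption for the sketch; widths=[100] explicitly forces a single fixed-width field per line, rather than relying on width inference):

```python
import io
import pandas as pd

# Raw lines standing in for source_data.csv (hypothetical inline copy)
raw = """CODE,Name,Value
1,ABC (CEF) CO.,XYZ,500
2,GOOD VALUE CO., XYZ,20"""

# widths=[100] treats each whole line (up to 100 chars) as one field,
# so every row lands in the single temporary column col1; because
# names= is passed, the header line is kept as an ordinary data row
df = pd.read_fwf(io.StringIO(raw), widths=[100], names=['col1'])
print(df)
```

Passing widths explicitly avoids any surprises from read_fwf's column-width inference on data that is not really fixed-width.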
Solution:
So, when you use str.extract, you will see NaN values in the rows that do not match (the header row), which you can drop with dropna(). Then use rename to assign the desired column names, since the extracted columns are just integers.
>>> df.col1.str.extract(r'(\d+)\,(\D+)\,(\d+)')
0 1 2
0 NaN NaN NaN
1 1 ABC (CEF) CO.,XYZ 500
2 2 GOOD VALUE CO., XYZ 20
Desired:
>>> df.col1.str.extract(r'(\d+)\,(\D+)\,(\d+)').dropna().rename(columns={0:'CODE', 1:'Name', 2:'Value'})
CODE Name Value
1 1 ABC (CEF) CO.,XYZ 500
2 2 GOOD VALUE CO., XYZ 20
OR
In case you prefer to define the column names in a dict first, then try:
>>> cols={0:'CODE', 1:'Name', 2:'Value'}
>>> df.col1.str.extract(r'(\d+)\,(\D+)\,(\d+)').dropna().rename(columns=cols)
CODE Name Value
1 1 ABC (CEF) CO.,XYZ 500
2 2 GOOD VALUE CO., XYZ 20
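As a side note, the rename step can be avoided entirely with named capture groups, which str.extract turns directly into column names. A minimal sketch (the single-column frame is rebuilt inline so the example runs on its own):

```python
import pandas as pd

# Rebuild the single-column frame from the raw lines shown above
df = pd.DataFrame({'col1': ['CODE,Name,Value',
                            '1,ABC (CEF) CO.,XYZ,500',
                            '2,GOOD VALUE CO., XYZ,20']})

# Named groups (?P<...>) become the column names, so no rename() is needed;
# the header row matches nothing, becomes all-NaN, and dropna() removes it
result = df['col1'].str.extract(r'(?P<CODE>\d+),(?P<Name>\D+),(?P<Value>\d+)').dropna()
print(result)
```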
Regex Explanation:
'(\d+)\,(\D+)\,(\d+)'
1st Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\, matches the character , literally (case sensitive)
2nd Capturing Group (\D+)
\D+ matches any character that's not a digit (equal to [^0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\, matches the character , literally (case sensitive)
3rd Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
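The same pattern can be checked with the plain re module, independently of pandas (str.extract applies the equivalent search to each row):

```python
import re

# The pattern from the answer; the escaped \, simply matches a literal comma
pattern = re.compile(r'(\d+)\,(\D+)\,(\d+)')

# The header row contains no digits, so there is no match at all
assert pattern.search('CODE,Name,Value') is None

# A data row yields the three capture groups described above; the greedy
# \D+ swallows the embedded commas inside the name
m = pattern.search('2,GOOD VALUE CO., XYZ,20')
print(m.groups())  # → ('2', 'GOOD VALUE CO., XYZ', '20')
```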
Hope this will help.