Comma values in a column of CSV - not working code

Question

I have a CSV file with the data below.

**Source data:**

CODE,Name,Value
1,ABC (CEF) CO.,XYZ,500
2,GOOD VALUE  CO., XYZ,20

Expected output:

However, pandas does not read the data properly because of the commas inside the Name column. I tried the code from the link below, but it does not work: XYZ is truncated from ABC (CEF) CO.,XYZ.

pattern = '[:;\?\.<\'/]' # I use \ to ignore characters that are used in regex :)
df['Name_Clean'] = df['Name'].str.replace(pattern, '').str.strip()

Unable to remove special characters ;:??/?<

Is the "Source data" you have shown a column? And what would your desired output be? Can you show that?
– Karn Kumar, Commented Sep 11, 2019 at 4:12
"Source data" is the file name; Name is the column name. Source value = ABC (CEF) CO.,XYZ and target value = ABC (CEF) CO.,XYZ, but XYZ is missing from Name.
– PythonWizard, Commented Sep 11, 2019 at 4:17
Okay, so what if you do df = pd.read_csv('source_data.csv')? It is a comma-delimited file, so it should work.
– Karn Kumar, Commented Sep 11, 2019 at 5:27

Karn Kumar · Accepted Answer · 2019-09-11 06:52:39Z

One workable solution is to read each line whole into a single-column DataFrame with pandas.read_fwf(), which reads a table of fixed-width formatted lines, and to give that column a temporary name, col1.

Your raw data:

$ cat source_data.csv
CODE,Name,Value
1,ABC (CEF) CO.,XYZ,500
2,GOOD VALUE  CO., XYZ,20

DataFrame:

>>> import pandas as pd
>>> df = pd.read_fwf('source_data.csv', names=['col1'])
>>> df
                        col1
0            CODE,Name,Value
1    1,ABC (CEF) CO.,XYZ,500
2  2,GOOD VALUE  CO., XYZ,20

Solution:

When you apply str.extract, any row that does not match the pattern (here, the header row) comes back as NaN. Drop those rows with dropna(), then use rename to assign the desired column names, because the extracted columns are labelled only with integers.

>>> df.col1.str.extract(r'(\d+)\,(\D+)\,(\d+)')
     0                     1    2
0  NaN                   NaN  NaN
1    1     ABC (CEF) CO.,XYZ  500
2    2  GOOD VALUE  CO., XYZ   20

Desired:

>>> df.col1.str.extract(r'(\d+)\,(\D+)\,(\d+)').dropna().rename(columns={0: 'CODE', 1: 'Name', 2: 'Value'})
  CODE                  Name Value
1    1     ABC (CEF) CO.,XYZ   500
2    2  GOOD VALUE  CO., XYZ    20

Alternatively, if you prefer to keep the column names in a dict, try:

>>> cols = {0: 'CODE', 1: 'Name', 2: 'Value'}
>>> df.col1.str.extract(r'(\d+)\,(\D+)\,(\d+)').dropna().rename(columns=cols)
  CODE                  Name Value
1    1     ABC (CEF) CO.,XYZ   500
2    2  GOOD VALUE  CO., XYZ    20
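The same steps can be collected into a self-contained script. This is only a minimal sketch and differs from the session above in two ways that are assumptions of mine: it reads the lines with plain Python instead of read_fwf, and it uses named groups so that str.extract labels the columns directly and no rename is needed. The sample data is inlined through io.StringIO as a stand-in for source_data.csv.

```python
import io

import pandas as pd

# Inline stand-in for source_data.csv (assumption: same layout as the question).
raw = io.StringIO(
    "CODE,Name,Value\n"
    "1,ABC (CEF) CO.,XYZ,500\n"
    "2,GOOD VALUE  CO., XYZ,20\n"
)

# One string per line; no CSV parsing happens at this point.
lines = pd.Series(raw.read().splitlines())

# Named groups become the column names of the extracted frame.
pattern = r"(?P<CODE>\d+),(?P<Name>\D+),(?P<Value>\d+)"

# The header line does not match and yields NaN, so dropna() removes it.
result = lines.str.extract(pattern).dropna().reset_index(drop=True)

# str.extract returns strings; convert the numeric columns explicitly.
result = result.astype({"CODE": int, "Value": int})
print(result)
```

As with the original pattern, this relies on the Name field containing no digits.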

Regex Explanation:

'(\d+)\,(\D+)\,(\d+)'


1st Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

\, matches the character , literally (case sensitive)

2nd Capturing Group (\D+)
\D+ matches any character that's not a digit (equal to [^0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

\, matches the character , literally (case sensitive)

3rd Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
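To watch the three groups work outside pandas, the pattern can be tried with Python's built-in re module. This is a small sketch using the sample rows from the question; the last row is a hypothetical one I added to show the limitation that \D+ cannot span a name containing a digit.

```python
import re

pattern = re.compile(r"(\d+),(\D+),(\d+)")

sample_lines = [
    "CODE,Name,Value",
    "1,ABC (CEF) CO.,XYZ,500",
    "2,GOOD VALUE  CO., XYZ,20",
    "3,ABC 3 CO.,XYZ,7",  # hypothetical row: the name contains a digit
]

results = []
for line in sample_lines:
    match = pattern.search(line)
    # search() returns None when the line does not fit the pattern,
    # which is what shows up as NaN in str.extract.
    results.append(match.groups() if match else None)

for line, groups in zip(sample_lines, results):
    print(repr(line), "->", groups)
```

The greedy \D+ first swallows the trailing comma as well, then gives it back so that the final `,(\d+)` can match; that is why the embedded comma stays inside the second group.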

Hope this will help.

Hi Karn Kumar, I have 25 columns containing dates, text, numbers and alphanumerics. I am confused by the \d and \D parts. Please advise.

aborruso · Accepted Answer · 2019-09-16 14:44:17Z

I think that the best solution is to correct your CSV. Starting from

CODE,Name,Value
1,ABC (CEF) CO.,XYZ,500
2,GOOD VALUE  CO., XYZ,20

and applying

<./input.csv sed -r 's/([0-9]+),(.+),([0-9]+)/\1,"\2",\3/g' >./output.csv

you will have a properly formatted CSV

CODE,Name,Value
1,"ABC (CEF) CO.,XYZ",500
2,"GOOD VALUE  CO., XYZ",20

Some notes about the command:

sed is a command-line utility that parses and transforms text (it is available on most operating systems);
<./input.csv sed to send the content of your input file to sed;
s/([0-9]+),(.+),([0-9]+)/\1,"\2",\3/g is the search and replace via regex https://regex101.com/r/WRzcEW/1 (in the upper right part you find the explanation);
>./output.csv to save the output
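If sed is not at hand, the same substitution can be sketched in Python. The example below mirrors the sed expression with re.sub and then checks the result with the standard csv module; as an assumption for illustration, the file contents are inlined rather than read from input.csv.

```python
import csv
import io
import re

# Inline stand-in for input.csv.
source = (
    "CODE,Name,Value\n"
    "1,ABC (CEF) CO.,XYZ,500\n"
    "2,GOOD VALUE  CO., XYZ,20\n"
)

# Same search-and-replace as the sed expression: quote the middle field.
fix = re.compile(r"([0-9]+),(.+),([0-9]+)")
fixed = "\n".join(fix.sub(r'\1,"\2",\3', line) for line in source.splitlines())
print(fixed)

# A standard CSV reader now sees exactly three fields per row.
rows = list(csv.reader(io.StringIO(fixed)))
```

The repaired text can then be written to output.csv and loaded with pd.read_csv, which treats double quotes as the quote character by default.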

– aborruso, answered Sep 11, 2019 at 8:18, edited Sep 16, 2019 at 14:44

Hi all, I did not understand the code <input.csv sed -r 's/([0-9]+),(.+),([0-9]+)/\1,"\2",\3/g'. Please help me.
– PythonWizard, Commented over a year ago
Hi @PythonWizard, I have added some final notes for you in my reply.
– aborruso, Commented over a year ago
