I have a CSV file with the data below.

**Source data:**

CODE,Name,Value
1,ABC (CEF) CO.,XYZ,500
2,GOOD VALUE  CO., XYZ,20

Expected output:
[image not available]

But pandas does not read the data correctly because of the commas inside the Name column: in ABC (CEF) CO.,XYZ, the XYZ part is cut off. I also tried the snippet below from a linked answer, but it does not work.

pattern = r"[:;?.<'/]"  # inside [...] these characters need no backslash escapes
# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching
df['Name_Clean'] = df['Name'].str.replace(pattern, '', regex=True).str.strip()

With this I am unable to remove the special characters ; : ? / <

  • Is "Source data" a column in the data you have shown? Can you also show what your desired output would be? Commented Sep 11, 2019 at 4:12
  • "Source data" is the file name. Name is the column name. Source value = ABC (CEF) CO.,XYZ and target value = ABC (CEF) CO.,XYZ, but XYZ is missing from Name. Commented Sep 11, 2019 at 4:17
  • What is the expected output? Commented Sep 11, 2019 at 4:23
  • Hi @KarnKumar, I have updated the post. Please check. Thank you. Commented Sep 11, 2019 at 4:46
  • Okay, so what if you do df = pd.read_csv('source_data.csv')? It is a comma-delimited file, so it should work. Commented Sep 11, 2019 at 5:27

2 Answers


One workable solution is to read the file with pandas.read_fwf(), which reads a table of fixed-width formatted lines into a DataFrame. Giving it a single temporary column name, col1, makes each whole line land in that one column.

Your raw data:

$ cat source_data.csv
CODE,Name,Value
1,ABC (CEF) CO.,XYZ,500
2,GOOD VALUE  CO., XYZ,20

DataFrame:

>>> df =  pd.read_fwf('source_data.csv', names=['col1'])
>>> df
                        col1
0            CODE,Name,Value
1    1,ABC (CEF) CO.,XYZ,500
2  2,GOOD VALUE  CO., XYZ,20

Solution:

When you apply str.extract, the header line does not match the pattern, so its row comes back as NaN; you can drop it with dropna(). The extracted columns are named with plain integers, so use rename to assign the desired column names.

>>> df.col1.str.extract(r'(\d+)\,(\D+)\,(\d+)')
     0                     1    2
0  NaN                   NaN  NaN
1    1     ABC (CEF) CO.,XYZ  500
2    2  GOOD VALUE  CO., XYZ   20

Desired:

>>> df.col1.str.extract(r'(\d+)\,(\D+)\,(\d+)').dropna().rename(columns={0:'CODE', 1:'Name', 2:'Value'})
  CODE                  Name Value
1    1     ABC (CEF) CO.,XYZ   500
2    2  GOOD VALUE  CO., XYZ    20

OR

If you prefer to define the column names in a dict first, try:

>>> cols={0:'CODE', 1:'Name', 2:'Value'}
>>> df.col1.str.extract(r'(\d+)\,(\D+)\,(\d+)').dropna().rename(columns=cols)
  CODE                  Name Value
1    1     ABC (CEF) CO.,XYZ   500
2    2  GOOD VALUE  CO., XYZ    20
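
If the Name column can itself contain digits, (\D+) will no longer match. A different technique from the regex above, sketched here under the assumption that only the middle column contains stray commas, is to split each line on its first and last comma. The helper name split_row is hypothetical, and the sketch uses plain Python rather than pandas.

```python
def split_row(line):
    """Split a 3-field line where only the middle field may contain commas."""
    code, rest = line.split(',', 1)    # the first comma ends CODE
    name, value = rest.rsplit(',', 1)  # the last comma starts Value
    return {'CODE': code, 'Name': name, 'Value': value}

lines = [
    'CODE,Name,Value',
    '1,ABC (CEF) CO.,XYZ,500',
    '2,GOOD VALUE  CO., XYZ,20',
]

# Skip the header line; the keys above already carry the column names.
rows = [split_row(line) for line in lines[1:]]
```

The resulting list of dicts can be passed to pd.DataFrame(rows). With many columns this only works if a single, known column holds the stray commas.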

Regex Explanation:

'(\d+)\,(\D+)\,(\d+)'


1st Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

\, matches the character , literally (case sensitive)

2nd Capturing Group (\D+)
\D+ matches any character that's not a digit (equal to [^0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

\, matches the character , literally (case sensitive)

3rd Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
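
To see the three groups in action outside pandas, here is a minimal sketch using Python's re module with the same pattern. It uses re.search because str.extract takes the first match in each string. The helper name parse and the third test line ('3,ACME 24 LTD,70') are hypothetical, added only to show where the pattern stops working.

```python
import re

# Same pattern as above, written as a raw string.
PATTERN = re.compile(r'(\d+)\,(\D+)\,(\d+)')

def parse(line):
    """Return the three captured groups, or None when the line does not match."""
    match = PATTERN.search(line)
    return match.groups() if match else None

lines = [
    'CODE,Name,Value',
    '1,ABC (CEF) CO.,XYZ,500',
    '2,GOOD VALUE  CO., XYZ,20',
]
results = [parse(line) for line in lines]

# The header has no digits, so it does not match (the NaN row in pandas).
# A name containing digits also fails, because \D+ cannot cross a digit.
with_digits = parse('3,ACME 24 LTD,70')
```

This is why the pattern suits the sample data but would need adjusting for columns that mix digits and text.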

Hope this will help.


1 Comment

Hi Karn Kumar, I have 25 columns containing dates, text, numbers, and alphanumerics. I am confused by \d and \D. Please advise.

I think that the best solution is to correct your CSV. Starting from

CODE,Name,Value
1,ABC (CEF) CO.,XYZ,500
2,GOOD VALUE  CO., XYZ,20

and applying

<./input.csv sed -r 's/([0-9]+),(.+),([0-9]+)/\1,"\2",\3/g' >./output.csv

you will have a properly formatted CSV

CODE,Name,Value
1,"ABC (CEF) CO.,XYZ",500
2,"GOOD VALUE  CO., XYZ",20
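
As a quick check that the quoted file now parses as three columns, here is a minimal sketch using Python's standard csv module on the corrected text. The variable names are illustrative only.

```python
import csv
import io

# The corrected CSV from above, held in a string for the example.
fixed = '''CODE,Name,Value
1,"ABC (CEF) CO.,XYZ",500
2,"GOOD VALUE  CO., XYZ",20
'''

# DictReader treats commas inside double quotes as part of the field.
rows = list(csv.DictReader(io.StringIO(fixed)))
names = [row['Name'] for row in rows]
```

pandas.read_csv applies the same double-quote rule by default, so pd.read_csv('output.csv') should keep each Name value intact.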

Some notes about the command:

  • sed is a command-line utility that parses and transforms text (it ships with Linux and macOS, and is available on Windows through tools such as WSL or Git Bash); -r enables extended regular expressions in GNU sed, while BSD/macOS sed uses -E instead;
  • <./input.csv sed sends the content of your input file to sed;
  • s/([0-9]+),(.+),([0-9]+)/\1,"\2",\3/g is the search and replace via regex https://regex101.com/r/WRzcEW/1 (in the upper right part you find the explanation);
  • >./output.csv saves the output.
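
If sed is not at hand, the same substitution can be sketched in Python with re.sub. This is an approximation of the command above, not a drop-in replacement: the ^ and $ anchors are an addition, the helper name quote_name is hypothetical, and a Name that already contains a double quote would need extra escaping.

```python
import re

# Digits, comma, anything, comma, digits: the same shape as the sed expression.
ROW = re.compile(r'^([0-9]+),(.+),([0-9]+)$')

def quote_name(line):
    """Wrap the middle field in double quotes; non-matching lines pass through."""
    return ROW.sub(r'\1,"\2",\3', line)

raw = '''CODE,Name,Value
1,ABC (CEF) CO.,XYZ,500
2,GOOD VALUE  CO., XYZ,20
'''

# The header has no leading digits, so it is left unchanged.
fixed = '\n'.join(quote_name(line) for line in raw.splitlines()) + '\n'
```

Writing fixed to a file gives the same content as output.csv above.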

2 Comments

Hi all, I did not understand the command <input.csv sed -r 's/([0-9]+),(.+),([0-9]+)/\1,"\2",\3/g'. Please help me.
Hi @PythonWizard I have added some final notes for you in my reply
