2

I have a csv which at some point becomes like this:

  57926,57927,"79961', 'dsfdfdf'",fdfdfdfd,0.40997048,5 x fdfdfdfd,
57927,57928,"fb0ec52878b165aa14ae302e6064aa636f9ca11aa11f5', 'fdfd'",fdfdfd,1.64948454,20 fdfdfdfd,"



                         US 



                "
57928,57929,"f55bf599dba600550de724a0bec11166b2c470f98aa06', 'fdfdf'",fdfdfd,0.81300813,10 fdfdfdfd,"



                         US 







                "
57929,57930,"82e6b', 'reetrtrt'",trtretrtr,0.79783365,fdfdfdf,"



                         NL 

I want to get rid of this empty lines. So far I tried the following script :

df = pd.read_csv("scedon_etoimo.csv")

df = df.replace(r'\\n',' ', regex=True)   

and

df=df.replace(r'\r\r\r\r\n\t\t\t\t\t\t', '',regex=True) 

as this is the error I am getting. So far I haven't manage to clean my file and do the stuff I want to do. I am not sure if I am using the correct approach. I am using pandas to process my dataset. Any help?
"

2
  • the format looks invalid to me (different amount of cols perrow). is this the actual file? Commented Sep 26, 2018 at 19:33
  • @Gerasimos, you need to add an additional " to the end of your sample file. Right now the open quotes will break attempts to read your snippet. Commented Sep 26, 2018 at 19:48

3 Answers 3

2

I would first open and preprocess the file's data, and just then pass to pandas

lines = []
with open('file.csv') as f:
    for line in f:
        if line.strip(): lines.append(line.strip())

df = pd.read_csv(io.StringIO("\n".join(lines)))
Sign up to request clarification or add additional context in comments.

2 Comments

Another option is to read the file into a dataframe, and then post-process with the str.strip function using the dataframe apply or assignment as in the accepted answer for: stackoverflow.com/questions/40950310/…
I also tried df.replace('\s+', '',regex=True,inplace=True) and worked. Anyway, thank you all for the answers!
0

Based on the file snippet you provided, here is how you can replace those empty lines Pandas is storing as NaNs with a blank string.

import numpy as np
df = pd.read_csv("scedon_etoimo.csv")
df = df.replace(np.nan, "", regex=True)

This will allow you to do everything on the base Pandas DataFrame without reading through your file(s) more than once. That being said, I would also recommend preprocessing your data before loading it in as that is often times a much safer way to handle data in non-uniform layouts.

Comments

0

Try:

df.replace(to_replace=r'[\n\r\t]', value='', regex=True, inplace=True)

This instruction replaces each \n, \r and Tab with nothing. Due to inplace argument, no need to substitute the result to df again.

Alternative: Use to_replace=r'\s' to eliminate also spaces, maybe in selected columns only.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.