How to get rid of NaN values in csv file? Python

Question

First of all, I know there are already answers on this topic, but none of them has worked for me so far. I would still like to hear your answers, even if they repeat a solution I have already tried.

I have a csv file called mbti_datasets.csv. The label of the first column is type and the second column is called description. Each row represents a new personality type (with its respective type and description).

TYPE        | DESCRIPTION
 a          | This personality likes to eat apples...\nThey look like monkeys...\nIn fact, are strong people...
 b          | b.description
 c          | c.description
 d          | d.description
...16 types | ...

In the following code, I'm trying to duplicate each personality type's row whenever its description contains \n.

Code:

import pandas as pd

# Reading the file
path_root = 'gdrive/My Drive/Colab Notebooks/MBTI/'
root_fn = path_root + 'mbti_datasets.csv'
df = pd.read_csv(root_fn, sep = ',', quotechar = '"', usecols = [0, 1])

# split the column where there are new lines and turn it into a series
serie = df['description'].str.split('\n').apply(pd.Series).stack()

# remove the second index for the DataFrame and the series to share indexes
serie.index = serie.index.droplevel(1)

# give it a name to join it to the DataFrame
serie.name = 'description'

# remove original column
del df['description']

# join the series with the DataFrame, based on the shared index
df = df.join(serie)

# New file name and writing the new csv file
root_new_fn = path_root + 'mbti_new.csv'

df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn)

print(new_df)

EXPECTED OUTPUT:

TYPE | DESCRIPTION
 a   | This personality likes to eat apples... 
 a   | They look like monkeys...
 a   | In fact, are strong people...
 b   | b.description
 b   | b.description
 c   | c.description
...  | ...

CURRENT OUTPUT:

TYPE | DESCRIPTION
 a   | This personality likes to eat apples...
 a   | They look like monkeys...NaN
 a   | NaN
 a   | In fact, are strong people...NaN
 b   | b.description...NaN
 b   | NaN
 b   | b.description
 c   | c.description
...  | ...

I'm not 100% sure, but I think the NaN value is \r.

Files uploaded to github as requested: CSV FILES

Using @YOLO's solution: CSV YOLO FILE. Examples of where it fails:

2 INTJ  Existe soledad en la cima y-- siendo # adds -- in blank random blank spaces
3 INTJ  -- y las mujeres # adds -- in the beginning
3 INTJ  (...) el 0--8-- de la poblaci # doesnt end the word 'población'
10 INTJ icos-- un conflicto que parecer--a imposible. # starts letters randomly
12 INTJ c #adds just 1 letter

English translation of the samples above:

2 INTJ There is loneliness at the top and-- being # adds -- in blank spaces
3 INTJ -- and women # adds - in the beginning
3 INTJ (...) on 0--8-- of the popula-- # doesnt end the word 'population'
10 INTJ icos-- a conflict that seems--to impossible. # starts letters randomly
12 INTJ c #adds just 1 letter

When I check whether there are any NaN values, and what type they are:

print(type(new_df.iloc[7, 1]))
print(new_df['descripcion'].isnull())

<class 'float'>
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9      True
10    False
11     True
continue...

How about using .replace('\r','') to get rid of \r first?
– Matthew Son, Commented Feb 26, 2020 at 18:09
@MatthewSon I already tried that. As I said before, I'm not 100% sure that this NaN value is \r.
– Y4RD13, Commented Feb 26, 2020 at 18:12
Then please provide a Minimal, Reproducible Example. Otherwise all we can do is guess where the NaN values come from. If we actually had the file, or a sample of it, it would be easier to help.
– LeoE, Commented Feb 26, 2020 at 18:19
@LeoE I just uploaded the files to github and shared the link in the description.
– Y4RD13, Commented Feb 26, 2020 at 18:30

YOLO · Accepted Answer · 2020-02-26 19:10:39Z

2

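Here's one way to do it. I had to find a workaround to replace the \n character, because for some reason the straightforward approach wasn't working:

# regex=True is needed on recent pandas, where str.replace matches literally by default
df['DESCRIPTION'] = df['DESCRIPTION'].str.replace(r'[^a-zA-Z0-9\s.]', '--', regex=True).str.split('--n')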

df = df.explode('DESCRIPTION')

print(df)

           TYPE                               DESCRIPTION
0   a             This personality likes to eat apples...
0   a                           They look like monkeys...
0   a                      In fact-- are strong people...
1   b                                       b.description
2   c                                       c.description
3   d                                       d.description
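
If the goal is only to split on the line-break marker, a variant that leaves punctuation and accented letters untouched might look like the sketch below. It assumes the cells hold the two-character sequence backslash + n rather than a real newline (this is an assumption about the file, and the sample rows are made up for illustration).

```python
import pandas as pd

# Made-up sample; "\\n" is a literal backslash followed by the letter n.
df = pd.DataFrame({
    "TYPE": ["a", "b"],
    "DESCRIPTION": [
        "Likes apples, mostly...\\nIn fact, are strong people...",
        "b.description",
    ],
})

# Turn the literal backslash-n pair into a real newline (regex=False means
# plain-text matching), then split on newlines so commas stay intact.
pieces = (
    df["DESCRIPTION"]
    .str.replace("\\n", "\n", regex=False)
    .str.split("\n")
)

# One output row per piece, with TYPE repeated for each piece.
exploded = (
    df.assign(DESCRIPTION=pieces)
    .explode("DESCRIPTION")
    .reset_index(drop=True)
)
print(exploded)
```

Using regex=False avoids having the backslash interpreted as part of a regular expression, which may be why the direct replacement appeared not to work.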

answered Feb 26, 2020 at 19:10


4 Comments

Y4RD13 Over a year ago

It does get rid of the NaN values, but it breaks the syntax of the sentences: words and sentences are left incomplete, e.g. --This p--rsonalit is .... I think it's because of the accented characters (áéíóú); the descriptions are in Spanish and English.

Y4RD13 Over a year ago

Also, I don't completely understand how the pattern '[^a-zA-Z0-9\s.]' works; maybe fully understanding this part would lead me to an accurate solution.

Y4RD13 Over a year ago

The result with your code: github.com/GUNTERMAXIMUS/mbti/blob/master/mbti_new%20(1).csv

YOLO Over a year ago

Can you update the question with some more samples where this is failing? [^a-zA-Z0-9\s.] basically matches everything that is not an ASCII letter, a digit, whitespace, or a dot, and the code replaces each match with --.
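
To illustrate the pattern discussed in these comments, here is a small sketch using Python's re module. The sample sentence is a guess at what one of the failing fragments originally looked like (an assumption, not taken from the file). It shows that accented letters fall outside a-zA-Z and get replaced, while a \w-based class keeps them.

```python
import re

# The character class from the answer: anything that is NOT an ASCII letter,
# a digit, whitespace, or a dot.
ASCII_ONLY = r"[^a-zA-Z0-9\s.]"

# Hypothetical sample; \u00f3 is the accented letter o-acute.
sample = "el 0,8% de la poblaci\u00f3n"

# The comma, the percent sign and the accented letter are all replaced.
mangled = re.sub(ASCII_ONLY, "--", sample)
print(mangled)

# In Python 3, \w on str patterns is Unicode-aware, so accented letters
# survive (note that \w also keeps underscores).
UNICODE_AWARE = r"[^\w\s.]"
kept = re.sub(UNICODE_AWARE, "--", sample)
print(kept)
```

Punctuation such as commas is still replaced by either pattern, so this only explains the broken words; it does not make the replace-then-split workaround lossless.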

Y4RD13 · Accepted Answer · 2020-02-27 00:14:20Z

The problem comes from the description cells: some of them contain two consecutive newlines with nothing between them, so splitting produces empty strings, and those are read back from the csv as NaN.

I just used .dropna() when reading the newly created csv, and then rewrote the file without the NaN values. Repeating the write/read cycle is probably not the best way, but it works as a straightforward solution.

df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn).dropna()

new_df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn)

print(type(new_df.iloc[7, 1]))  # row where there was a NaN value
print(new_df['descripcion'].isnull())

<class 'str'>
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
and continues...
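
The cause described in this answer can be reproduced on a small made-up frame, and the blank pieces can also be filtered out before the file is written, which avoids the second write/read pass. This is only a sketch: the column names and sample text are illustrative, and DataFrame.explode requires pandas 0.25 or later.

```python
import io

import pandas as pd

# Made-up data: the first description has two consecutive newlines.
df = pd.DataFrame({
    "type": ["a", "b"],
    "description": ["first line\n\nsecond line", "only line"],
})

# Splitting on "\n" yields an empty string between the two newlines.
df["description"] = df["description"].str.split("\n")
long_df = df.explode("description")

# Round trip through CSV: the empty field is read back as NaN.
buffer = io.StringIO()
long_df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
nan_count = int(round_trip["description"].isna().sum())
print("NaN rows after round trip:", nan_count)

# Drop blank pieces (strip also removes a stray "\r") before writing.
is_blank = long_df["description"].str.strip() == ""
cleaned = long_df[~is_blank].reset_index(drop=True)
print(cleaned)
```

Writing cleaned with to_csv should then give a file with no empty description fields, so no dropna() pass is needed when reading it back.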
