First of all, I know there are existing answers on this topic, but none of them has worked for me so far. I would still like to hear your suggestions, even if they involve a solution I have already tried.

I have a CSV file called mbti_datasets.csv. The first column is labeled type and the second is labeled description. Each row represents one personality type (with its respective type and description).

TYPE        | DESCRIPTION
 a          | This personality likes to eat apples...\nThey look like monkeys...\nIn fact, are strong people...
 b          | b.description
 c          | c.description
 d          | d.description
...16 types | ...

In the following code, I'm trying to repeat the row of each personality type once per line whenever its description contains \n.

Code:

import pandas as pd

# Reading the file: path_root is the folder, root_fn is the full path to the CSV
path_root = 'gdrive/My Drive/Colab Notebooks/MBTI/'
root_fn = path_root + 'mbti_datasets.csv'
df = pd.read_csv(root_fn, sep = ',', quotechar = '"', usecols = [0, 1])

# split the column where there are new lines and turn it into a series
serie = df['description'].str.split('\n').apply(pd.Series).stack()

# remove the second index for the DataFrame and the series to share indexes
serie.index = serie.index.droplevel(1)

# give it a name to join it to the DataFrame
serie.name = 'description'

# remove original column
del df['description']

# join the series with the DataFrame, based on the shared index
df = df.join(serie)

# New file name and writing the new csv file
root_new_fn = path_root + 'mbti_new.csv'

df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn)

print(new_df)

EXPECTED OUTPUT:

TYPE | DESCRIPTION
 a   | This personality likes to eat apples... 
 a   | They look like monkeys...
 a   | In fact, are strong people...
 b   | b.description
 b   | b.description
 c   | c.description
...  | ...

CURRENT OUTPUT:

TYPE | DESCRIPTION
 a   | This personality likes to eat apples...
 a   | They look like monkeys...NaN
 a   | NaN
 a   | In fact, are strong people...NaN
 b   | b.description...NaN
 b   | NaN
 b   | b.description
 c   | c.description
...  | ...

I'm not 100% sure, but I think the NaN value is \r.

Files uploaded to github as requested: CSV FILES

Using @YOLO's solution (CSV YOLO FILE), here are some examples of where it fails:

2 INTJ  Existe soledad en la cima y-- siendo # adds -- in random blank spaces
3 INTJ  -- y las mujeres # adds -- at the beginning
3 INTJ  (...) el 0--8-- de la poblaci # doesn't finish the word 'población'
10 INTJ icos-- un conflicto que parecer--a imposible. # starts in the middle of a word
12 INTJ c # adds just 1 letter

Translation, for full understanding:

2 INTJ There is loneliness at the top and-- being # adds -- in random blank spaces
3 INTJ -- and women # adds -- at the beginning
3 INTJ (...) on 0--8-- of the popula-- # doesn't finish the word 'population'
10 INTJ icos-- a conflict that seems--to impossible. # starts in the middle of a word
12 INTJ c # adds just 1 letter

When I check whether there are any NaN values, and what type they have:

print(type(new_df.iloc[7, 1]))  # row 7 holds a NaN value
print(new_df['descripcion'].isnull())

<class 'float'>
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9      True
10    False
11     True
and continues...
  • How about using .replace('\r','') to get rid of \r first? Commented Feb 26, 2020 at 18:09
  • @MatthewSon I already tried that. As I said before, I'm not 100% sure that this NaN value is \r. Commented Feb 26, 2020 at 18:12
  • Then please provide a Minimal, Reproducible Example. Otherwise all we can do is guess where the NaN values come from. If we actually had the file, or a sample of it, it would be easier to help. Commented Feb 26, 2020 at 18:19
  • @LeoE I just uploaded the files to github and shared the link in the description. Commented Feb 26, 2020 at 18:30

2 Answers

Here's one way to do it. I had to find a workaround to replace the \n character; for some reason it wasn't working in the straightforward manner:

# raw string for the pattern; regex=True is required on recent pandas versions
df['DESCRIPTION'] = df['DESCRIPTION'].str.replace(r'[^a-zA-Z0-9\s.]', '--', regex=True).str.split('--n')

df = df.explode('DESCRIPTION')

print(df)

           TYPE                               DESCRIPTION
0   a             This personality likes to eat apples...
0   a                           They look like monkeys...
0   a                      In fact-- are strong people...
1   b                                       b.description
2   c                                       c.description
3   d                                       d.description

4 Comments

It gets rid of the NaN values, but it destroys the syntax of the sentences, leaving words or sentences incomplete. E.g.: --This p--rsonalit is .... I think it's because of the accented characters (áéíóú); the descriptions are in Spanish and English.
Also, I don't completely understand how the string '[^a-zA-Z0-9\s.]' works; maybe fully understanding this part would lead me to an accurate solution.
Can you update the question with some more samples where this is failing? [^a-zA-Z0-9\s.] basically replaces everything that is not an ASCII letter, a digit, whitespace, or a dot.
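
To illustrate the character class discussed in these comments, here is a minimal sketch using Python's standard `re` module. The sample string is hypothetical, chosen to resemble the failing rows shown in the question; it is not taken from the actual file. It suggests why accented words get cut: `ó` is not in `a-zA-Z`, so it is replaced by `--`, and the following `n` then looks like the `--n` split marker.

```python
import re

# Anything that is NOT an ASCII letter, a digit, whitespace, or a dot.
pattern = re.compile(r'[^a-zA-Z0-9\s.]')

# Hypothetical sample; \u00f3 is the accented letter "ó".
sample = 'el 0,8% de la poblaci\u00f3n'

# The comma, the percent sign, and the accented letter are all replaced.
replaced = pattern.sub('--', sample)

# Splitting on '--n' now cuts the word, because "ón" became "--n".
pieces = replaced.split('--n')

print(replaced)
print(pieces)
```

Under these assumptions, any accented vowel followed by `n` would be treated as a line break, which is consistent with the truncated word in the question.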

The problem can be attributed to the description cells: some of them contain two consecutive newlines with nothing between them.

I just used .dropna() when reading the newly created CSV, and then rewrote the file without the NaN values. I don't think repeating this write-and-read process is the best approach, but it works as a straightforward solution.

df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn).dropna()

new_df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn)

print(type(new_df.iloc[7, 1]))  # where there was a NaN value
print(new_df['descripcion'].isnull())

<class 'str'>
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
and continues...
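
As a possible alternative to the write, read, and `.dropna()` round trip, the empty pieces could be dropped before the file is written. An empty string is written to CSV as an empty field, which `pd.read_csv` reads back as NaN by default, so filtering blank lines first avoids the NaN values. The sketch below is only an illustration: the sample data and the helper `split_nonempty_lines` are hypothetical, and it assumes the descriptions contain real line breaks (`\n` or `\r\n`) rather than a literal backslash followed by `n`.

```python
import io

import pandas as pd


def split_nonempty_lines(text):
    """Hypothetical helper: split on line breaks and keep non-blank lines."""
    if not isinstance(text, str):
        return []  # e.g. a NaN cell; explode() turns this into a NaN row
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical sample standing in for mbti_datasets.csv
df = pd.DataFrame({
    'type': ['a', 'b'],
    'description': ['first line\r\n\r\nsecond line', 'only line'],
})

# Turn each description into a list of non-blank lines, then one row per line
df['description'] = df['description'].apply(split_nonempty_lines)
result = (
    df.explode('description')
      .dropna(subset=['description'])
      .reset_index(drop=True)
)

# Round trip through CSV (in memory here) to check that no NaN comes back
buffer = io.StringIO()
result.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded)
```

Because `str.splitlines()` treats `\r\n` as a single line break, this also covers the `\r` suspected in the question without a separate replace step.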
