Pandas dataframe replaces strings with NaN using pd.concat

Question

I have a pandas DataFrame consisting of strings, e.g. 'P1', 'P2', 'P3', ..., null.

When I try to concatenate this data frame with another, all of the strings get replaced with 'NaN'.

See my code below:

import operator

import pandas as pd

descriptions = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/short_desc.json')
descriptions = descriptions.reset_index(drop=True)
# Take the first entry of each record, then its 'what' field
descriptions['desc'] = descriptions.short_desc.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
f1 = pd.DataFrame(descriptions['desc'])

bugPrior = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/priority.json')
bugPrior = bugPrior.reset_index(drop=True)
bugPrior['priority'] = bugPrior.priority.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
f2 = pd.DataFrame(bugPrior['priority'])

df = pd.concat([f1, f2])
print(df.head())

The output is as follows:

                                                desc priority
0    Usability issue with external editors (1GE6IRL)      NaN
1             API - VCM event notification (1G8G6RR)      NaN
2  Would like a way to take a write lock on a tea...      NaN
3  getter/setter code generation drops "F" in ".....      NaN
4  Create Help Index Fails with seemingly incorre...      NaN

Any ideas as to how I might stop this from happening?

Ultimately, my goal is to have everything in a single data frame so that I can remove all rows with "null" values. It would also help later on in the code.

Thanks.

cs95 · Accepted Answer · 2017-08-29 16:02:02Z

2

Assuming you want to concatenate those columns horizontally, you'll need to pass axis=1 to pd.concat, because by default, concatenation is vertical.

df = pd.concat([f1,f2], axis=1)
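
To see why the default produces NaN, here is a minimal sketch using small made-up frames. The data is hypothetical and only stands in for f1 and f2 from the question. Stacking vertically takes the union of the columns, so every row is missing one of them.

```python
import pandas as pd

# Hypothetical stand-ins for f1 and f2: two single-column frames sharing an index.
f1 = pd.DataFrame({"desc": ["first bug", "second bug", "third bug"]})
f2 = pd.DataFrame({"priority": ["P3", None, "P1"]})

# Default axis=0 stacks the rows. The result has both columns, so rows that
# came from f1 have no priority and rows that came from f2 have no desc.
vertical = pd.concat([f1, f2])

# axis=1 places the frames side by side, aligning rows on the index.
horizontal = pd.concat([f1, f2], axis=1)

print(vertical.shape)
print(horizontal.shape)
```

Because axis=1 aligns on the index, both frames need matching index labels, which is what the reset_index calls in the question provide.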

To drop the rows that contain NaN, use df.dropna, then call df.reset_index(drop=True) so the index is renumbered from 0.

df = pd.concat([f1, f2], axis=1)
df = df.dropna().reset_index(drop=True)
print(df.head(10))
                                                desc priority
0  Create Help Index Fails with seemingly incorre...       P3
1  Internal compiler error when compiling switch ...       P3
2  Default text sizes in org.eclipse.jface.resour...       P3
3  [Presentations] [ViewMgmt] Holding mouse down ...       P3
4  Parsing of function declarations in stdio.h is...       P2
5  CCE in RenameResourceAction while renaming ele...       P3
6  Option to prevent cursor from moving off end o...       P3
7        Tasks section in the user doc is very stale       P3
8  Importing existing project with different case...       P3
9  Workspace in use --> choose new workspace but ...       P3
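
As a small illustration of the dropna step, the sketch below uses a made-up frame (the values are hypothetical, not taken from the Eclipse dataset). It also shows the subset parameter, which may be useful if only missing priorities should cause a row to be dropped.

```python
import pandas as pd

# Hypothetical frame with missing values in both columns.
df = pd.DataFrame({
    "desc": ["first bug", "second bug", None, "fourth bug"],
    "priority": ["P3", None, "P1", "P2"],
})

# Drop every row that has a missing value in any column, then renumber.
cleaned = df.dropna().reset_index(drop=True)

# Drop only the rows where priority is missing; rows lacking desc are kept.
by_priority = df.dropna(subset=["priority"]).reset_index(drop=True)

print(cleaned)
print(by_priority)
```

Without reset_index(drop=True), the surviving rows would keep their original labels, leaving gaps in the index.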

Printing out df.priority.unique(), we see there are 5 unique priorities:

print(df.priority.unique())
array(['P3', 'P2', 'P4', 'P1', 'P5'], dtype=object)

edited Aug 29, 2017 at 16:02

answered Aug 29, 2017 at 15:53

cs95


1 Comment

Walter U. Over a year ago

Thank you for your help, this dataset is driving me nuts already, and this is just the data import!

jezrael · Accepted Answer · 2017-08-29 16:28:50Z

2

I think it is best not to create separate DataFrames from the columns, and to work with the Series directly instead:

import operator

import pandas as pd

descriptions = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/short_desc.json')
descriptions = descriptions.reset_index(drop=True)

# Extract the descriptions as a Series, f1
f1 = descriptions.short_desc.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
print(f1.head())

bugPrior = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/priority.json')
bugPrior = bugPrior.reset_index(drop=True)

# Extract the priorities as a Series, f2
f2 = bugPrior.priority.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
print(f2.head())

Then use the same solution as in cᴏʟᴅsᴘᴇᴇᴅ's answer:

df = pd.concat([f1,f2], axis=1).dropna().reset_index(drop=True)
print(df.head())
                                          short_desc priority
0  Create Help Index Fails with seemingly incorre...       P3
1  Internal compiler error when compiling switch ...       P3
2  Default text sizes in org.eclipse.jface.resour...       P3
3  [Presentations] [ViewMgmt] Holding mouse down ...       P3
4  Parsing of function declarations in stdio.h is...       P2
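
Note that the first column above is named short_desc, because concatenating Series along axis=1 uses each Series' name as the column name. If a different name such as desc is preferred, one option is the keys parameter of pd.concat. The sketch below uses made-up Series in place of f1 and f2.

```python
import pandas as pd

# Hypothetical stand-ins for the Series f1 and f2.
s1 = pd.Series(["first bug", "second bug"], name="short_desc")
s2 = pd.Series(["P3", "P2"], name="priority")

# By default the Series names become the column names.
df = pd.concat([s1, s2], axis=1)
print(list(df.columns))

# With keys, the supplied labels are used as the column names instead.
renamed = pd.concat([s1, s2], axis=1, keys=["desc", "priority"])
print(list(renamed.columns))
```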

edited Aug 29, 2017 at 16:28

answered Aug 29, 2017 at 16:03

jezrael


5 Comments

cs95 Over a year ago

This is exactly my answer. :)

cs95 Over a year ago

It's okay. You didn't have to make that edit, but thanks, I appreciate it.

Walter U. Over a year ago

@jezrael Thanks for your answer. I think I might apply your recommendation and just create columns.

cs95 Over a year ago

@jezrael There are no problems. I actually upvoted your answer.

jezrael Over a year ago

@cᴏʟᴅsᴘᴇᴇᴅ Thanks ;)
