How to ignore null values in a CSV column with pandas while processing text?

Question

I have a CSV file in which each word of a sentence is stored in its own cell, with a null cell between sentences.

My problem is with the run_id column. After loading the CSV file with pandas, I separate the sentences using the function get_sents_from_df. I also have an assertion that double-checks that each sentence has exactly one unique run_id, but it fails because the null rows are treated as a "null sentence".

Below is a snippet of my code. I hope you can help.

Note: I am working with T="test_RE".

import pandas as pd

def load_dataset(fn, T):
    if T == "test_RE":
        df = pd.read_csv(fn,
                         sep=";",
                         header=0,
                         keep_default_na=False)
        # drop columns whose name contains "unnamed"
        df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1, inplace=True)
        # empty cells become <NA> in the nullable Int64 columns
        df.word_id = pd.to_numeric(df.word_id, errors='coerce').astype('Int64')
        df.run_id = pd.to_numeric(df.run_id, errors='coerce').astype('Int64')
        df.sent_id = pd.to_numeric(df.sent_id, errors='coerce').astype('Int64')
        df.head_pred_id = pd.to_numeric(df.head_pred_id, errors='coerce').astype('Int64')
    else:
        df = pd.read_csv(fn,
                         sep="\t",
                         header=0,
                         keep_default_na=False)
    print(df.dtypes)

    if T == "train":
        # encoder is not defined in this snippet
        encoder.fit(df.label.values)
        print('this is the IF cond')
        print('df.label.values. shape', df.label.values.shape)

    sents = get_sents_from_df(df)

    print('shape of sents 0', sents[0].shape)
    print('sents[0]', sents[0])
    print('shape of sents 1', sents[1].shape)
    print('sents[1]', sents[1])

    # make sure that all sents agree on run_id
    assert all(len(set(sent.run_id.values)) == 1
               for sent in sents)  # ERROR HERE

The function:

def get_sents_from_df(df):
    # Split a data frame by rows according to the sentences
    return [df[df.run_id == run_id]
            for run_id
            in sorted(set(df.run_id.values))]

The shape of sents[0] is (10, 8), which is correct, and the content of sents[0] is correct too.

But the shape of sents[1] is (0, 8), and of course sents[1] prints nothing because it is empty. I expected sents[1] to have shape (6, 8). Any help?

Image of Output of print statements:

[check image below]: there is no image, and besides that it's always better to post a sample of your input data.
– Sebastien D, Commented Jun 20, 2019 at 8:17
In the first place, always prefer code to screenshots. Second, what is your code supposed to do? What is the desired output?
– Sebastien D, Commented Jun 20, 2019 at 8:32

Sebastien D · Accepted Answer · 2019-06-20 09:12:53Z

To skip the blank rows (which contain both None values and empty strings), why not just do:

df = df[df.word.apply(lambda x: len(x) > 0)]
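As a hedged sketch of the same idea applied to the question's setup: the blank separator rows can also be dropped on run_id itself once it has been coerced to a nullable integer, which avoids calling len() on a value that might not be a string. The sample CSV text, its reduced set of columns, and the helper name split_sentences are assumptions made for illustration; they are not taken from the asker's data.

```python
import io

import pandas as pd

# Hypothetical sample: one word per row, an empty row between sentences.
CSV_TEXT = """word_id;word;run_id;sent_id
1;Hello;1;1
2;world;1;1
;;;
1;Good;2;2
2;morning;2;2
3;all;2;2
"""


def split_sentences(df):
    """Drop separator rows, then return one DataFrame per run_id."""
    df = df.copy()
    # Empty cells cannot be parsed as numbers, so they are coerced to <NA>.
    df["run_id"] = pd.to_numeric(df["run_id"], errors="coerce").astype("Int64")
    # Remove the rows whose run_id is missing (the sentence separators).
    df = df.dropna(subset=["run_id"])
    # groupby yields the groups in sorted run_id order.
    return [group for _, group in df.groupby("run_id", sort=True)]


df = pd.read_csv(io.StringIO(CSV_TEXT), sep=";", header=0, keep_default_na=False)
sents = split_sentences(df)

# Every sentence now agrees on a single run_id.
assert all(sent["run_id"].nunique() == 1 for sent in sents)
```

With the separator rows removed before splitting, no group is built for the missing run_id, so the assertion from load_dataset should no longer trip on an empty sentence.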
