I am trying to run a for loop on a long dataframe and count the number of English and non-English words in a given text (each text is a new row).

+-------+--------+----+
| Index |  Text  | ID |
+-------+--------+----+
|     1 | Text 1 |  1 |
|     2 | Text 2 |  2 |
|     3 | Text 3 |  3 |
+-------+--------+----+
     

This is my code

c = 0
for text in df_letters['Text_clean']:
    # Counters
    CTEXT= text
    c +=1
    eng_words = 0
    non_eng_words = 0
    text = " ".join(text.split())
    # For every word in text
    for word in text.split(' '):
      # Check if it is english
      if english_dict.check(word) == True:
        eng_words += 1
      else:
        non_eng_words += 1
    # Print the result
    # NOTE that these results are discarded each new text
    df_letters.at[text, 'eng_words'] = eng_words
    df_letters.at[text, 'non_eng_words'] = non_eng_words
    df_letters.at[text, 'Input'] = CTEXT
    #print('Index: {}; EN: {}; NON-EN: {}'.format(c, eng_words, non_eng_words))

But instead of getting the same dataframe I used as input, with 3 new columns,

+-------+--------+----+---------+-------------+---------+
| Index |  Text  | ID | English | Non-English |  Input  |
+-------+--------+----+---------+-------------+---------+
|     1 | Text 1 |  1 |       1 |           0 | Text 1  |
|     2 | Text 2 |  2 |       1 |           0 | Text 2  |
|     3 | Text 3 |  3 |       0 |           1 | Text 3  |
+-------+--------+----+---------+-------------+---------+

the dataframe doubles in length: a new row is added for each text, like this:

+--------+--------+-----+---------+-------------+--------+
| Index  |  Text  | ID  | English | Non-English | Input  |
+--------+--------+-----+---------+-------------+--------+
| 1      | Text 1 | 1   | nan     | nan         | nan    |
| 2      | Text 2 | 2   | nan     | nan         | nan    |
| 3      | Text 3 | 3   | nan     | nan         | nan    |
| Text 1 | nan    | nan | 1       | 0           | Text 1 |
| text 2 | nan    | nan | 1       | 0           | Text 2 |
| Text 3 | nan    | nan | 0       | 1           | Text 3 |
+--------+--------+-----+---------+-------------+--------+

What am I doing wrong here?

1 Answer

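DataFrame.at looks up a row by its index label, not by the content of a cell. The index of your DataFrame is [1, 2, 3], not [Text 1, Text 2, Text 3], so every assignment to df_letters.at[text, ...] uses a label that does not exist yet, and pandas adds a new row for it instead of filling the existing one. I think the best solution is to replace your loop with one like this: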

# Series.iteritems() was removed in pandas 2.0; Series.items() is the equivalent
for index, text in df_letters['Text_clean'].items():

where index is the index label of the current row. Then you can do:

df_letters.at[index, 'eng_words'] = eng_words
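
For reference, here is a minimal end-to-end sketch of the corrected loop. The sample rows are made up, and ENGLISH_WORDS / is_english are hypothetical stand-ins for your english_dict.check(word) call so that the snippet runs without a spell-checking library; swap your real dictionary back in.

```python
import pandas as pd

# Hypothetical stand-in for english_dict.check(word); not part of the original code.
ENGLISH_WORDS = {"hello", "world", "the", "cat"}

def is_english(word):
    """Return True if the word is in the tiny demo word list."""
    return word.lower() in ENGLISH_WORDS

# Made-up sample data with the same index labels as in the question: [1, 2, 3].
df_letters = pd.DataFrame(
    {"Text_clean": ["hello world", "hola  mundo", "the gato"], "ID": [1, 2, 3]},
    index=[1, 2, 3],
)

# Create the result columns up front so they keep an integer dtype.
df_letters["eng_words"] = 0
df_letters["non_eng_words"] = 0

for index, text in df_letters["Text_clean"].items():
    # split() with no argument also collapses repeated whitespace.
    words = text.split()
    eng_words = sum(is_english(word) for word in words)
    # Address the row by its index label, not by the text itself.
    df_letters.at[index, "eng_words"] = eng_words
    df_letters.at[index, "non_eng_words"] = len(words) - eng_words

print(df_letters)
```

The DataFrame keeps its three original rows, and the counts land in the rows they belong to. The separate Input column from the question is no longer needed, because the original text stays next to its counts.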