I have more than 1,000 rows of POS-tagged sentences. I want to remove the words that are tagged "RB", "IN", "PRP", "CC", "PR", or "DT".
Here is my data. The "pos_tag" column shows the data as it is now, and the "pos_tag_clean" column shows what I would like to get after removing those words.
| pos_tag | pos_tag_clean |
|---|---|
| [(semoga, SC), (saja, RB), (di, IN), (sini, PR), (bisa, MD), (cepat, JJ), (cair, NN), (semoga, NN), (saja, RB), (ini, PR), (beneran, NN), (ada, VB), (nya, NN), (bantuan, NN), (buat, JJ), (butuh, VB), (banget, NN)] | [(semoga, SC), (bisa, MD), (cepat, JJ), (cair, NN), (semoga, NN), (beneran, NN), (ada, VB), (nya, NN), (bantuan, NN), (buat, JJ), (butuh, VB), (banget, NN)] |
| [(kak, VB), (kenapa, WH), (perbaikan, NN), (sistem, NN), (nya, PRP), (tidak, NEG), (selesai, VB)] | [(kak, VB), (kenapa, WH), (perbaikan, NN), (sistem, NN), (tidak, NEG), (selesai, VB)] |
| [(sangat, RB), (baik, JJ)] | [(baik, JJ)] |
I tried the code below, but it does not work for looping across the rows.

    df['pos_tag'].pop(df['pos_tag'].index(('The', 'DT')))

    invalid_tuples = []
    for i, t in df['pos_tag']:
        if t[1] in ("RB", "IN", "PRP", "CC", "PR", "DT", "CC"):
            invalid_tuples.append(i)
    for i in invalid_tuples:
        del df['pos_tag'][i]
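
For reference, here is a minimal sketch of the per-row filtering I am after. It assumes each cell of "pos_tag" holds a real Python list of `(word, tag)` tuples rather than a string; `REMOVE_TAGS`, `drop_tags`, and `rows` are hypothetical names used only for illustration. Building a new list, instead of deleting items from the list being looped over, avoids the index shifting that `del` causes.

```python
# Tags to drop: the set from the question, with the duplicate "CC" removed.
REMOVE_TAGS = {"RB", "IN", "PRP", "CC", "PR", "DT"}


def drop_tags(tagged, remove=REMOVE_TAGS):
    """Return a new list without the (word, tag) pairs whose tag is in `remove`."""
    return [(word, tag) for word, tag in tagged if tag not in remove]


# Two rows from the sample table, written as lists of (word, tag) tuples.
rows = [
    [("kak", "VB"), ("kenapa", "WH"), ("perbaikan", "NN"), ("sistem", "NN"),
     ("nya", "PRP"), ("tidak", "NEG"), ("selesai", "VB")],
    [("sangat", "RB"), ("baik", "JJ")],
]

# Filter every row independently; the original rows are left unchanged.
cleaned = [drop_tags(row) for row in rows]
print(cleaned)
```

If the column really does contain lists of tuples, the same helper should apply to every row with `df['pos_tag_clean'] = df['pos_tag'].apply(drop_tags)`.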