I have more than 1,000 rows of POS-tagged sentences. I want to remove every word tagged "RB", "IN", "PRP", "CC", "PR", or "DT".

Here is my data. The "pos_tag" column shows the data as it is now; the "pos_tag_clean" column shows what I would like to get after removing those words.

| pos_tag | pos_tag_clean |
|:--------|:--------------|
| [(semoga, SC), (saja, RB), (di, IN), (sini, PR), (bisa, MD), (cepat, JJ), (cair, NN), (semoga, NN), (saja, RB), (ini, PR), (beneran, NN), (ada, VB), (nya, NN), (bantuan, NN), (buat, JJ), (butuh, VB), (banget, NN)] | [(semoga, SC), (bisa, MD), (cepat, JJ), (cair, NN), (semoga, NN), (beneran, NN), (ada, VB), (nya, NN), (bantuan, NN), (buat, JJ), (butuh, VB), (banget, NN)] |
| [(kak, VB), (kenapa, WH), (perbaikan, NN), (sistem, NN), (nya, PRP), (tidak, NEG), (selesai, VB)] | [(kak, VB), (kenapa, WH), (perbaikan, NN), (sistem, NN), (tidak, NEG), (selesai, VB)] |
| [(sangat, RB), (baik, JJ)] | [(baik, JJ)] |

I tried the code below, but it does not work for looping over the rows.

df['pos_tag'].pop(df['pos_tag'].index(('The', 'DT')))

invalid_tuples = []
for i, t in df['pos_tag']:
    if t[1] in ("RB", "IN", "PRP", "CC", "PR", "DT", "CC"):
        invalid_tuples.append(i)
for i in invalid_tuples:
    del df['pos_tag'][i]

1 Answer

Try:

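# Tags to drop; a set makes the membership test fast.
forbidden = {"RB", "IN", "PRP", "CC", "PR", "DT"}

# Rebuild each row's list, keeping only the tuples whose tag is not forbidden.
df["pos_tag_clean"] = df["pos_tag"].apply(
    lambda x: [(v, tag) for v, tag in x if tag not in forbidden]
)
print(df.to_markdown(index=False))  # to_markdown needs the tabulate package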

Prints:

| pos_tag | pos_tag_clean |
|:--------|:--------------|
| [('semoga', 'SC'), ('saja', 'RB'), ('di', 'IN'), ('sini', 'PR'), ('bisa', 'MD'), ('cepat', 'JJ'), ('cair', 'NN'), ('semoga', 'NN'), ('saja', 'RB'), ('ini', 'PR'), ('beneran', 'NN'), ('ada', 'VB'), ('nya', 'NN'), ('bantuan', 'NN'), ('buat', 'JJ'), ('butuh', 'VB'), ('banget', 'NN')] | [('semoga', 'SC'), ('bisa', 'MD'), ('cepat', 'JJ'), ('cair', 'NN'), ('semoga', 'NN'), ('beneran', 'NN'), ('ada', 'VB'), ('nya', 'NN'), ('bantuan', 'NN'), ('buat', 'JJ'), ('butuh', 'VB'), ('banget', 'NN')] |
| [('kak', 'VB'), ('kenapa', 'WH'), ('perbaikan', 'NN'), ('sistem', 'NN'), ('nya', 'PRP'), ('tidak', 'NEG'), ('selesai', 'VB')] | [('kak', 'VB'), ('kenapa', 'WH'), ('perbaikan', 'NN'), ('sistem', 'NN'), ('tidak', 'NEG'), ('selesai', 'VB')] |
| [('sangat', 'RB'), ('baik', 'JJ')] | [('baik', 'JJ')] |
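
If you want to run this end to end, here is a minimal self-contained sketch of the same filtering idea. It assumes each cell of "pos_tag" already holds a Python list of (word, tag) tuples rather than a string; the names `FORBIDDEN` and `drop_tags` are only illustrative and do not come from the question. Building a new list per row avoids the problem in the question's loop, which deletes items from a sequence while indexing into it.

```python
import pandas as pd

# Tags to drop (illustrative name). A set ignores the duplicate "CC".
FORBIDDEN = {"RB", "IN", "PRP", "CC", "PR", "DT"}


def drop_tags(tagged, forbidden=FORBIDDEN):
    """Return a new list with the (word, tag) pairs whose tag is not forbidden."""
    return [(word, tag) for word, tag in tagged if tag not in forbidden]


# Two of the sample rows from the question, as lists of tuples.
df = pd.DataFrame({
    "pos_tag": [
        [("kak", "VB"), ("kenapa", "WH"), ("perbaikan", "NN"), ("sistem", "NN"),
         ("nya", "PRP"), ("tidak", "NEG"), ("selesai", "VB")],
        [("sangat", "RB"), ("baik", "JJ")],
    ]
})

# apply() calls drop_tags once per row; the original column is left untouched.
df["pos_tag_clean"] = df["pos_tag"].apply(drop_tags)
print(df["pos_tag_clean"].tolist())
```

Using a named function instead of a lambda is just a readability choice here; it also makes the filter easy to test on a single row before applying it to the whole column.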