
I am new to this, but I am trying to split the text in a pandas DataFrame into individual rows, one per token, along with each token's POS and IOB tag. For example:

            Text
   1        Police officers arrest teen.
   2        Man agrees to help.

What I am trying to achieve here is:

Sentence#  Token     POS   Tag
   1       Police    NNS   B-NP
           officers  NNS   I-NP
           arrest    VBP   B-VP
           teen      NN    B-NP
   2       Man       NNP   B-NP
           agrees    VBZ   B-VP
           to        TO    B-VP
           help      VB    B-VP
  • What are you counting as a token? For example, what if a word ends with a colon/semicolon? Do you want the colon/semicolon to be treated as a separate token? Commented Apr 10, 2022 at 14:36
  • @oda In this case semicolons/colons and "." will be replaced with spaces before tokenizing the text. That reminds me, I have to edit my question and remove the "." Commented Apr 10, 2022 at 15:01
  • @oda Yes, thank you very much! Is it also possible to add the tree tags, like B-NP, I-NP, B-VP? Commented Apr 10, 2022 at 16:20
  • @oda What do you mean by random things in my desired output? I'm looking for the Tag column to be, for example, Police officers arrest teen -> B-NP I-NP B-VP B-NP. If I'm not mistaken that is the correct tree tag; I'm not too sure myself. Commented Apr 10, 2022 at 17:53
  • Were you able to sort everything out? Commented Apr 14, 2022 at 3:22

1 Answer


The nltk module can help you do what you want. The code below uses nltk to build a new DataFrame similar to your desired output. To get tags that match your desired output exactly, you will likely need to supply your own chunk grammar; I am no expert in POS and IOB tagging.

import pandas as pd
from nltk import word_tokenize, pos_tag, tree2conlltags, RegexpParser

# orig data
d = {'Text': ["Police officers arrest teen.", "Man agrees to help."]}
# orig DataFrame
df = pd.DataFrame(data = d)

# new data
new_d = {'Sentence': [], 'Token': [], 'POS': [], 'Tag': []}

# grammar taken from nltk.org
grammar = r"NP: {<[CDJNP].*>+}"
parser = RegexpParser(grammar)

for idx, row in df.iterrows():
    # tokenize, POS-tag, chunk, then flatten the tree to (token, POS, IOB tag) triples
    temp = tree2conlltags(parser.parse(pos_tag(word_tokenize(row["Text"]))))
    new_d['Token'].extend(i[0] for i in temp)
    new_d['POS'].extend(i[1] for i in temp)
    new_d['Tag'].extend(i[2] for i in temp)
    new_d['Sentence'].extend([idx + 1] * len(temp))

# new DataFrame
new_df = pd.DataFrame(data = new_d)

print(f"***Original DataFrame***\n\n {df}\n")
print(f"***New DataFrame***\n\n {new_df}")

Output:

***Original DataFrame***

                            Text
0  Police officers arrest teen.
1           Man agrees to help.

***New DataFrame***

    Sentence     Token  POS   Tag
0         1    Police  NNP  B-NP
1         1  officers  NNS  I-NP
2         1    arrest  VBP     O
3         1      teen   NN  B-NP
4         1         .    .     O
5         2       Man   NN  B-NP
6         2    agrees  VBZ     O
7         2        to   TO     O
8         2      help   VB     O
9         2         .    .     O
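If you also want the sentence number printed only once per sentence, as in the layout shown in the question, one cosmetic option is to blank out repeated values before printing. A small sketch (using a hard-coded stand-in for the new_df built above):

```python
import pandas as pd

# Hard-coded stand-in for the new_df produced by the answer's code
new_df = pd.DataFrame({
    "Sentence": [1, 1, 1, 2, 2],
    "Token": ["Police", "officers", "arrest", "Man", "agrees"],
    "POS": ["NNP", "NNS", "VBP", "NN", "VBZ"],
    "Tag": ["B-NP", "I-NP", "O", "B-NP", "O"],
})

# Blank out repeated sentence numbers, purely for display
display = new_df.copy()
display.loc[display["Sentence"].duplicated(), "Sentence"] = ""
print(display.to_string(index=False))
```

This only changes how the frame prints; keep the original new_df around if you need the sentence numbers for grouping later.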

Note: after pip-installing nltk, and before the above code can run, you will likely have to call nltk.download a few times. The error messages tell you exactly what to fetch. For example, you will likely need to execute this:

>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')
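The grammar above only chunks noun phrases, which is why the verbs come out as O. As a sketch of the "supply your own chunk grammar" idea, adding a VP rule gets closer to the tags shown in the question. The patterns below are illustrative assumptions, not a definitive grammar; the (token, POS) pairs are hard-coded so the snippet needs no nltk.download calls:

```python
from nltk import RegexpParser, tree2conlltags

# Illustrative grammar: an NP rule plus a simple VP rule.
# The regexes are assumptions -- tune them to your own data.
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}        # optional determiner, adjectives, one or more nouns
  VP: {<MD>?<VB.*><TO>?<VB.*>*}  # optional modal, a verb, optional "to", more verbs
"""
parser = RegexpParser(grammar)

# (token, POS) pairs as pos_tag would produce them
tagged = [("Police", "NNP"), ("officers", "NNS"),
          ("arrest", "VBP"), ("teen", "NN")]

for token, pos, tag in tree2conlltags(parser.parse(tagged)):
    print(token, pos, tag)
```

For this input, the NP rule chunks "Police officers" and "teen" while the VP rule catches "arrest", giving B-NP I-NP B-VP B-NP as in the question.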