4

I have a string of words, each tagged with its own token (e.g. NN, NNP, JJ). I want to extract the words tagged NNP, keeping consecutive NNP words together. My code so far:

import re

sentence = "Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ long/JJ golden/JJ hair/NN in Yasir/NNP"

tes = re.findall(r'(\w+)/NNP', sentence)
print(tes)

The result of the code:

['Rapunzel', 'Sheila', 'Yasir']

As we can see, three words are tagged NNP: Rapunzel/NNP and Sheila/NNP (which appear next to each other) and Yasir/NNP (which is separated from the other NNP words by several other words). My problem is that I need to separate the run of consecutive NNP words from the rest. My expected result looks like this:

['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']

What is the best way to perform this task? Thanks.

3
  • Are you sure you need ['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP'] and not ['Rapunzel', 'Sheila'], ['Yasir']? You set a capturing group in your pattern around \w+ - is it a "typo"? Commented Apr 12, 2017 at 11:45
  • @WiktorStribiżew Yes, I actually need to keep the token (NNP) for further processing. The \w+ is not a typo; I think it is meant to match any word characters before /NNP. Correct me if I am wrong. Thanks. Commented Apr 12, 2017 at 11:54
  • I meant the parentheses. Then use Tim's suggestion. Commented Apr 12, 2017 at 11:54

3 Answers

4

Match the groups as simple strings, and then split them:

>>> [m.split() for m in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*", sentence)]
[['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]
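
For a version that runs on its own, the one-liner above could be wrapped as in the sketch below. The function name group_tagged_runs and its tag parameter are made up for illustration, and the \b after the tag is an addition not present in the original pattern: it is there on the assumption that the input may also contain tags such as NNPS, whose first three letters would otherwise be matched as NNP.

```python
import re

sentence = "Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ long/JJ golden/JJ hair/NN in Yasir/NNP"

def group_tagged_runs(text, tag="NNP"):
    # Hypothetical helper, not part of the original answer.
    # One word/tag pair; the \b keeps "NNP" from matching the start of "NNPS".
    pair = r"\w+/" + re.escape(tag) + r"\b"
    # A run is one pair followed by any number of whitespace-separated pairs.
    run = pair + r"(?:\s+" + pair + r")*"
    # Each match is one run as a single string; split() turns it into a list.
    return [m.split() for m in re.findall(run, text)]

print(group_tagged_runs(sentence))
```

Because the pattern contains only a non-capturing group, re.findall returns each whole run as one string, which is what makes the split() step work.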

3

You can get very close to your expected outcome using a different capture group.

>>> re.findall(r'((?:\w+/NNP\s*)+)', sentence)
['Rapunzel/NNP Sheila/NNP ', 'Yasir/NNP']

The capture group ((?:\w+/NNP\s*)+) groups consecutive \w+/NNP matches together, along with any whitespace that follows each one, which is why the first result ends with a trailing space.
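
To get from these strings to the nested lists asked for in the question, one option is to split each match on whitespace. This is only a sketch of that extra step (the name groups is arbitrary); the outer parentheses are dropped here because re.findall already returns the whole match when the pattern has no capturing group.

```python
import re

sentence = "Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ long/JJ golden/JJ hair/NN in Yasir/NNP"

# Each match is a run of NNP words, possibly with trailing whitespace;
# str.split() with no arguments discards that whitespace while splitting.
groups = [m.split() for m in re.findall(r'(?:\w+/NNP\s*)+', sentence)]
print(groups)
```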

1

Here's an alternative without any regex. It uses groupby and split():

from itertools import groupby

string = "Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ long/JJ golden/JJ hair/NN in Yasir/NNP"
words = string.split()

def get_token(word):
    return word.split('/')[-1]

print([list(ws) for token, ws in groupby(words, get_token) if token == "NNP"])
# [['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]
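
To see why this works, it may help to look at every run that groupby produces before the NNP filter is applied. The sketch below only prints those intermediate groups; the variable name runs is arbitrary.

```python
from itertools import groupby

string = "Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ long/JJ golden/JJ hair/NN in Yasir/NNP"

def get_token(word):
    # Text after the last '/'; an untagged word such as "in" is returned unchanged.
    return word.split('/')[-1]

# Materialize each group immediately: groupby's sub-iterators are
# invalidated as soon as the outer iterator advances.
runs = [(token, list(ws)) for token, ws in groupby(string.split(), get_token)]

for token, ws in runs:
    print(token, ws)
```

groupby only merges adjacent items with the same key, so the two NNP runs stay separate. Note that the untagged word "in" becomes its own key, so an untagged word that happened to be spelled "NNP" would be treated as tagged.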
