Extract list of words with specific character from string using regex in Python

Question

I have a string containing words, each of which has its own token (e.g. NN, NNP, JJ). I want to extract the words that carry the NNP token, keeping the ones that appear consecutively grouped together. My code so far:

import re

sentence = "Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ long/JJ golden/JJ hair/NN in Yasir/NNP"

tes = re.findall(r'(\w+)/NNP', sentence)
print(tes)

The result of the code:

['Rapunzel', 'Sheila', 'Yasir']

As we can see, three words carry the NNP token: Rapunzel/NNP and Sheila/NNP (which appear next to each other) and Yasir/NNP (separated from the other NNP words by several words). My problem is that I need to separate the consecutive NNP words from the others. My expected result is:

['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']

What is the best way to perform this task? Thanks.

Are you sure you need ['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP'] and not ['Rapunzel', 'Sheila'], ['Yasir']? You set a capturing group in your pattern around \w+ - is it a "typo"?
– Wiktor Stribiżew, Commented Apr 12, 2017 at 11:45
@WiktorStribiżew Yes, I actually need to keep the token (NNP) for further processing. The \w+ is not a typo; I think it is meant to match any word characters before /NNP. Correct me if I am wrong. Thanks.
– ytomo, Commented Apr 12, 2017 at 11:54

Tim Pietzcker · Accepted Answer · 2017-04-12 11:45:38Z

4

Match the groups as simple strings, and then split them:

>>> [m.split() for m in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*", sentence)]
[['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]
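
As a self-contained sketch of the same idea, the one-liner can be wrapped in a helper so the tag can vary. The name group_consecutive and its tag parameter are illustrative assumptions, not part of the original answer. The (?!\S) lookahead is also an addition: it is meant to stop a shorter tag such as NN from matching the start of NNP.

```python
import re

sentence = ("Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ "
            "long/JJ golden/JJ hair/NN in Yasir/NNP")

def group_consecutive(sentence, tag="NNP"):
    """Return runs of adjacent word/tag items as lists (hypothetical helper)."""
    t = re.escape(tag)
    # One tagged word; (?!\S) requires whitespace or end of string after the tag.
    token = rf"\w+/{t}(?!\S)"
    # A run is one tagged word followed by any number of whitespace-separated ones.
    pattern = rf"{token}(?:\s+{token})*"
    # Each match is a plain string such as "Rapunzel/NNP Sheila/NNP"; split it.
    return [m.split() for m in re.findall(pattern, sentence)]

print(group_consecutive(sentence))
# [['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]
print(group_consecutive(sentence, "JJ"))
# [['long/JJ', 'golden/JJ']]
```

Because the pattern has no capturing group, re.findall returns the whole match for each run, which is why the tags survive in the output.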


anubhava · Accepted Answer · 2017-04-12 11:45:18Z

3

You can get very close to your expected outcome using a different capture group.

>>> re.findall(r'((?:\w+/NNP\s*)+)', sentence)
['Rapunzel/NNP Sheila/NNP ', 'Yasir/NNP']

Capture group ((?:\w+/NNP\s*)+) will group all the \w+/NNP patterns together with optional spaces in between.
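
To go from these flat strings to the nested lists asked for in the question, one possible last step is to split each captured run on whitespace. This is a sketch, not part of the original answer; the variable names runs and groups are illustrative.

```python
import re

sentence = ("Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ "
            "long/JJ golden/JJ hair/NN in Yasir/NNP")

# Each element is one run of NNP words; the first keeps a trailing space
# because \s* also consumes the whitespace after the last word of the run.
runs = re.findall(r'((?:\w+/NNP\s*)+)', sentence)

# str.split() with no argument ignores leading and trailing whitespace,
# so the trailing space does not produce an empty item.
groups = [run.split() for run in runs]
print(groups)
# [['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]
```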


Eric Duminil · Accepted Answer · 2017-04-12 11:56:56Z

1

Here's an alternative without any regex. It uses groupby and split():

from itertools import groupby

string = "Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ long/JJ golden/JJ hair/NN in Yasir/NNP"
words = string.split()

def get_token(word):
    return word.split('/')[-1]

print([list(ws) for token, ws in groupby(words, get_token) if token == "NNP"])
# [['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]
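
To see why this works, it may help to print every run that groupby produces before the filter is applied. groupby only merges adjacent items with the same key, so NNP shows up as two separate runs. The sketch below reuses get_token from above; the variable names sentence and runs are illustrative.

```python
from itertools import groupby

sentence = ("Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ "
            "long/JJ golden/JJ hair/NN in Yasir/NNP")

def get_token(word):
    # Text after the last '/'; a word without '/' (such as "in") is returned whole.
    return word.split('/')[-1]

# One (token, words) pair per run of adjacent words sharing the same token.
runs = [(token, list(ws)) for token, ws in groupby(sentence.split(), get_token)]

for token, ws in runs:
    print(token, ws)
# The first line printed is: NNP ['Rapunzel/NNP', 'Sheila/NNP']
# The last line printed is:  NNP ['Yasir/NNP']
```

Filtering these pairs on token == "NNP" gives the two lists from the question.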
