I want to extract the words from a string starting at a word that contains a specific tag (/IN) and ending at a word that contains another specific tag (/NNP). My code so far (it does not work yet):

import re

sentence = "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP :/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur/NNP dan/NN di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP"

tes = re.findall(r'((?:\S+/IN\s\w+/NNP\s*)+)', sentence)
print(tes)

The sentence contains the word sequences di/IN Jln/NN Terusan/NNP Borobudur/NNP and di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP, which I would like to extract. The expected result:

['di/IN Jln/NN Terusan/NNP Borobudur/NNP', 'di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP']

What is the best way to do this task? Thanks.

1 Answer

Use

r'\S+/IN\b(?:(?!\S+/IN\b).)+\S+/NNP\b'

See the regex demo

Details

  • \S+ - 1+ non-whitespace symbols
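  • /IN\b - a /IN substring followed by a word boundary (so a longer tag such as /INX does not match)
  • (?:(?!\S+/IN\b).)+ - 1+ chars other than line break chars, each consumed only if the \S+/IN\b pattern does not match at that position, so a match cannot run into the next /IN token (use re.DOTALL to match line breaks, too)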
  • \S+/NNP\b - 1+ non-whitespaces, /NNP as a whole word (since \b is a word boundary)
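
To make the breakdown above concrete, here is a minimal sketch that runs the pattern on the sentence from the question using re.findall. It assumes Python 3; the variable names pattern and matches are illustrative and not part of the original answer.

```python
import re

# Sentence from the question, split across lines only for readability.
sentence = (
    "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP "
    ":/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur/NNP dan/NN "
    "di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP"
)

# Start at a token tagged /IN, consume characters as long as no new /IN
# token begins at the current position, then end at a token tagged /NNP.
pattern = re.compile(r'\S+/IN\b(?:(?!\S+/IN\b).)+\S+/NNP\b')

# The pattern has no capturing groups, so findall returns whole matches.
matches = pattern.findall(sentence)
print(matches)
# ['di/IN Jln/NN Terusan/NNP Borobudur/NNP', 'di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP']
```

Because the middle part is greedy, each match extends to the last /NNP token that appears before the next /IN token, which is why both Terusan/NNP and Borobudur/NNP end up in the first result.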

14 Comments

Try: sentence = "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP :/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur d/NN /NNP dan/NN di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP" (I added d/NN between the first two NNP tokens).
Still doesn't work. Besides that case, it also doesn't cover the case where there is no string between /IN and /NNP.
@Kasramvd: There must be something between them, there will be at least a space, thus the + quantifier is correct.
I mean with space. It doesn't match that case. Test with sentence = "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP :/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur d/IN /NNP dan/NN di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP"
Can a token be empty and classified as a noun? That is weird, but [^\s/]*/IN\b(?:(?![^\s/]*/IN\b).)+[^\s/]*/NNP\b would cover that.
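
As a rough illustration of the difference discussed in these comments, the sketch below runs both patterns on the test sentence from the previous comment (the one containing d/IN followed by an empty /NNP token). It assumes Python 3; the names comment_sentence, strict, and allow_empty are made up for this example.

```python
import re

# Test sentence from the comment above: note "d/IN /NNP" with an empty token.
comment_sentence = (
    "Entah/RB kenapa/NN ini/DT bayik/NN suka/VBI banget/JJ :/: )/CP :/: )/CP "
    ":/: )/CP berenang/VBI di/IN Jln/NN Terusan/NNP Borobudur d/IN /NNP dan/NN "
    "di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP"
)

# Pattern from the answer: \S+ requires at least one character before the tag.
strict = re.compile(r'\S+/IN\b(?:(?!\S+/IN\b).)+\S+/NNP\b')

# Pattern from the comment: [^\s/]* also accepts an empty word before the tag.
allow_empty = re.compile(
    r'[^\s/]*/IN\b(?:(?![^\s/]*/IN\b).)+[^\s/]*/NNP\b'
)

strict_matches = strict.findall(comment_sentence)
empty_ok_matches = allow_empty.findall(comment_sentence)

print(strict_matches)
# ['di/IN Jln/NN Terusan/NNP', 'di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP']
print(empty_ok_matches)
# ['di/IN Jln/NN Terusan/NNP', 'd/IN /NNP', 'di/IN Jalan/NN Perempatan/ADJ Malioboro/NNP']
```

Only the second pattern picks up d/IN /NNP, because the bare /NNP has no word in front of it for \S+ to consume.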