8

I want to extract a portion of a large string. There's a target word and an upper bound on the number of words before and after it. The extracted substring must therefore contain the target word along with up to that many words before and after it. The before and after parts can contain fewer words if the target word is close to the beginning or end of the text.

Example string:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

Target word: laboris

words_before: 5

words_after: 2

Should return ['veniam, quis nostrud exercitation ullamco laboris nisi ut']

I thought of a couple of possible patterns, but none of them worked. I guess it can also be done by simply traversing the string backwards and forwards from the target word. However, a regex would definitely make things easier. Any help would be appreciated.

1
  • Thanks for all the answers. All of them work as desired. The regex one is most convenient for me, since the strings I have are filled with non-alphabetical characters! Commented Oct 6, 2015 at 19:09

3 Answers

5

If you are happy to work with a list of words, you can use split() together with slice(). For example:

>>> text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.".split()

>>> n = text.index('laboris')
>>> s = slice(max(0, n - 5), n + 3)  # clamp at 0 so a target near the start does not wrap around

>>> text[s]
['veniam,', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut']
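The split-and-slice idea above can be wrapped in a small function that returns a string and copes with a target near either end of the text. This is only a sketch: the name `words_around` and the punctuation set passed to `strip()` are assumptions made for illustration, not part of the answer.

```python
def words_around(text, target, words_before, words_after):
    """Return the target word with up to the given number of words on each side."""
    words = text.split()
    # Compare against punctuation-stripped copies so "veniam," still matches "veniam".
    # The set of stripped characters is an assumption; adjust it for your data.
    stripped = [w.strip(".,;:!?\"'()") for w in words]
    if target not in stripped:
        return None
    n = stripped.index(target)
    start = max(0, n - words_before)  # a negative start would wrap around the list
    # Slicing past the end of a list is safe, so only the start needs clamping.
    return " ".join(words[start:n + words_after + 1])


sample = ("Ut enim ad minim veniam, quis nostrud exercitation ullamco "
          "laboris nisi ut aliquip ex ea commodo consequat.")
print(words_around(sample, "laboris", 5, 2))
print(words_around(sample, "enim", 5, 2))
```

The second call asks for five words before "enim" but only one exists, so the window simply starts at the beginning of the text.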

3
If you still want regex....

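import re

def find_context(word_, n_before, n_after, string_):
    # Raw strings keep the regex escapes intact: exactly n_before words
    # before the target and exactly n_after words after it.
    b = r'\w+\W+' * n_before
    a = r'\W+\w+' * n_after
    # re.escape() keeps any regex metacharacters in the target literal.
    pattern = '(' + b + re.escape(word_) + a + ')'

    print(re.search(pattern, string_).group(1))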


# st is assumed to hold the example string from the question
find_context('laboris', 5, 2, st)

veniam, quis nostrud exercitation ullamco laboris nisi ut

find_context('culpa', 2, 2, st)

sunt in culpa qui officia
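The pattern above requires exactly the requested number of words on each side, so it finds nothing when the target sits closer to the start or end of the text than that. A possible variant, sketched here under the assumption that bounded quantifiers are acceptable, treats the counts as upper limits. The name `find_context_bounded` is hypothetical, and the function returns the match instead of printing it.

```python
import re


def find_context_bounded(word, n_before, n_after, text):
    """Return the target with at most n_before words before and n_after words after it."""
    # {0,n} makes each count an upper bound instead of an exact requirement.
    before = r"(?:\w+\W+){0,%d}" % n_before
    after = r"(?:\W+\w+){0,%d}" % n_after
    # \b stops the match from starting or ending in the middle of a word;
    # re.escape() keeps regex metacharacters in the target literal.
    pattern = r"\b" + before + re.escape(word) + r"\b" + after
    match = re.search(pattern, text)
    return match.group(0) if match else None


sample = ("Ut enim ad minim veniam, quis nostrud exercitation ullamco "
          "laboris nisi ut aliquip ex ea commodo consequat.")
print(find_context_bounded("laboris", 5, 2, sample))
print(find_context_bounded("Ut", 5, 2, sample))
```

Because `re.search` returns the leftmost match and the quantifiers are greedy, the window is as wide as the limits allow. Trailing punctuation after the last word is not included, since the "after" group only consumes separators that are followed by another word.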

2 Comments

This seems like it will always give 5 before and 2 after. I think OP wants arbitrary number for before and after. Or is it in fact just 5 or 2?
@idjaw I made a change, now it is a function, and one can enter parameter values.
2

You can also approach it with nltk and its "concordance" method, inspired by Calling NLTK's concordance - how to get text before/after a word that was used?:

A concordance view shows us every occurrence of a given word, together with some context.

import nltk


def get_neighbors(input_text, word, before, after):
    text = nltk.Text(nltk.tokenize.word_tokenize(input_text))

    concordance_index = nltk.ConcordanceIndex(text.tokens)
    offset = next(offset for offset in concordance_index.offsets(word))

    return text.tokens[offset - before - 1: offset] + text.tokens[offset: offset + after + 1]

text = u"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."  
print(get_neighbors(text, 'laboris', 5, 2))

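Prints the tokens around the target word. Note that word_tokenize treats punctuation as separate tokens, so the six tokens before the target here are five words plus a comma, followed by the target and 2 tokens after it: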

[u'veniam', u',', u'quis', u'nostrud', u'exercitation', u'ullamco', u'laboris', u'nisi', u'ut']
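A concordance normally lists every occurrence, while the function above only returns the first one. As a rough standard-library illustration (this is not NLTK's API, and the name `all_contexts` is made up for the example), the same windowing can be applied to each occurrence, counting only words and ignoring punctuation:

```python
import re


def all_contexts(text, word, before, after):
    """Yield a window of word tokens around every occurrence of word."""
    tokens = re.findall(r"\w+", text)  # words only; punctuation is dropped
    for i, token in enumerate(tokens):
        if token == word:
            # Clamp the start so a hit near the beginning does not wrap around.
            yield tokens[max(0, i - before):i + after + 1]


sample = ("Duis aute irure dolor in reprehenderit in voluptate velit esse "
          "cillum dolore eu fugiat nulla pariatur.")
for window in all_contexts(sample, "in", 2, 1):
    print(window)
```

Dropping punctuation at the tokenizing step means the before and after counts refer to words, which avoids the situation where a comma uses up one of the slots.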
