Python regex to extract a portion of string

Question

I want to extract a portion of a large string. There's a target word and an upper bound on the number of words before and after it. The extracted substring must therefore contain the target word along with up to that many words before and after it. The before and after parts can contain fewer words if the target word is close to the beginning or end of the text.

Example string

"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

Target word: laboris

words_before: 5

words_after: 2

Should return ['veniam, quis nostrud exercitation ullamco laboris nisi ut']

I thought of a couple of possible patterns, but none of them worked. I guess it can also be done by simply traversing the string forwards and backwards from the target word. However, a regex would definitely make things easier. Any help would be appreciated.

Thanks for all the answers. All of them work as desired. The regex one is most convenient for me, since the strings I have are filled with non-alphabetical characters!
– user2963623, Commented Oct 6, 2015 at 19:09

Remi Guan · Accepted Answer · 2015-10-04 01:25:17Z

5

If splitting the text on whitespace is enough, you can use str.split() together with a slice() object. For example:

>>> text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.".split()

>>> n = text.index('laboris')  # position of the target in the list of words
>>> s = slice(max(0, n - 5), n + 3)  # max() keeps the start from going negative near the beginning

>>> text[s]
['veniam,', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut']
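The same idea can be wrapped in a function. This is only a sketch: the name context_by_split and the choice to strip punctuation before comparing are my own assumptions, added so that a target followed by a comma or period is still found, and so that the result comes back as a one-element list like the one in the question.

```python
import string


def context_by_split(text, target, words_before, words_after):
    """Return up to words_before/words_after whitespace-separated words around target."""
    words = text.split()
    # Compare with surrounding punctuation stripped so "laboris," would still match.
    stripped = [w.strip(string.punctuation) for w in words]
    if target not in stripped:
        return []
    n = stripped.index(target)  # first occurrence only
    start = max(0, n - words_before)  # clamp so a negative start does not wrap around
    return [" ".join(words[start:n + words_after + 1])]


sample = ("Ut enim ad minim veniam, quis nostrud exercitation ullamco "
          "laboris nisi ut aliquip ex ea commodo consequat.")

print(context_by_split(sample, "laboris", 5, 2))
# ['veniam, quis nostrud exercitation ullamco laboris nisi ut']
print(context_by_split(sample, "enim", 5, 2))
# ['Ut enim ad minim']
```

A slice end past the end of the list is harmless in Python, so only the start needs clamping.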

edited Oct 4, 2015 at 1:25

answered Oct 4, 2015 at 1:08

Remi Guan


LetzerWille · Accepted Answer · 2015-10-04 01:49:23Z

3

If you still want regex....

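def find_context(word_, n_before, n_after, string_):
    import re

    # {0,n} makes each count an upper bound, so a target near the start or end still matches
    b = r'(?:\w+\W+){0,%d}' % n_before
    a = r'(?:\W+\w+){0,%d}' % n_after
    # re.escape guards against regex metacharacters in the target word
    pattern = '(' + b + r'\b' + re.escape(word_) + r'\b' + a + ')'

    match = re.search(pattern, string_)
    if match:
        print(match.group(1))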


find_context('laboris', 5, 2, st)  # st holds the example string from the question

veniam, quis nostrud exercitation ullamco laboris nisi ut

find_context('culpa', 2, 2, st)

sunt in culpa qui officia
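If you need a list back, as in the question, or want every occurrence rather than only the first, something along these lines might work. It is a sketch, not a definitive version: find_all_contexts is a name I made up, the counts are treated as upper bounds via {0,n}, and re.findall only reports non-overlapping matches, so two occurrences closer together than the window come back as one context.

```python
import re


def find_all_contexts(word, n_before, n_after, text):
    """Return each non-overlapping context window around word as a list of strings."""
    pattern = re.compile(
        r"(?:\w+\W+){0,%d}\b%s\b(?:\W+\w+){0,%d}"
        % (n_before, re.escape(word), n_after)
    )
    # No capturing groups are used, so findall returns the full matched substrings.
    return pattern.findall(text)


sample = ("Ut enim ad minim veniam, quis nostrud exercitation ullamco "
          "laboris nisi ut aliquip ex ea commodo consequat.")

print(find_all_contexts("laboris", 5, 2, sample))
# ['veniam, quis nostrud exercitation ullamco laboris nisi ut']
print(find_all_contexts("enim", 5, 2, sample))
# ['Ut enim ad minim']
print(find_all_contexts("x", 1, 1, "a b x c d x e"))
# ['b x c', 'd x e']
```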

edited Oct 4, 2015 at 1:49

answered Oct 4, 2015 at 1:13

LetzerWille


2 Comments

idjaw Over a year ago

This seems like it will always give 5 before and 2 after. I think OP wants arbitrary number for before and after. Or is it in fact just 5 or 2?

LetzerWille Over a year ago

@idjaw I made a change, now it is a function, and one can enter parameter values.

Community · Accepted Answer · 2017-05-23 12:00:46Z

You can also approach it with nltk and its "concordance" functionality, inspired by Calling NLTK's concordance - how to get text before/after a word that was used?:

A concordance view shows us every occurrence of a given word, together with some context.

import nltk


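def get_neighbors(input_text, word, before, after):
    # word_tokenize splits punctuation into separate tokens (needs NLTK's tokenizer data installed)
    text = nltk.Text(nltk.tokenize.word_tokenize(input_text))

    concordance_index = nltk.ConcordanceIndex(text.tokens)
    # offset of the first occurrence; next() raises StopIteration if the word is absent
    offset = next(offset for offset in concordance_index.offsets(word))

    # max() stops a negative start from wrapping around to the end of the list
    start = max(0, offset - before - 1)
    return text.tokens[start: offset + after + 1]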

text = u"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."  
print(get_neighbors(text, 'laboris', 5, 2))

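Prints the tokens around the target word. Punctuation counts as a token and the left slice takes before + 1 tokens, so here the comma after veniam is included alongside the 5 words before the target, followed by the 2 words after it: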

[u'veniam', u',', u'quis', u'nostrud', u'exercitation', u'ullamco', u'laboris', u'nisi', u'ut']
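Because punctuation marks are tokens of their own, a fixed-width slice only lines up with the word count when the number of punctuation tokens happens to match. One way around that is to walk outward from the target and count only word tokens. The sketch below shows the idea without NLTK: the regex tokenizer is a rough stand-in I am assuming for illustration (it is not word_tokenize and will split some text differently), and neighbors_counting_words is a made-up name.

```python
import re


def neighbors_counting_words(input_text, word, before, after):
    """Return tokens around word, counting only word tokens towards the limits."""
    # Stand-in tokenizer (an assumption): runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", input_text)
    if word not in tokens:
        return []
    offset = tokens.index(word)  # first occurrence only

    def is_word(token):
        return re.fullmatch(r"\w+", token) is not None

    # Walk left until `before` word tokens have been collected or the text starts.
    start, remaining = offset, before
    while start > 0 and remaining > 0:
        start -= 1
        if is_word(tokens[start]):
            remaining -= 1

    # Walk right until `after` word tokens have been collected or the text ends.
    end, remaining = offset, after
    while end < len(tokens) - 1 and remaining > 0:
        end += 1
        if is_word(tokens[end]):
            remaining -= 1

    return tokens[start:end + 1]


sample = ("Ut enim ad minim veniam, quis nostrud exercitation ullamco "
          "laboris nisi ut aliquip ex ea commodo consequat.")

print(neighbors_counting_words(sample, "laboris", 5, 2))
# ['veniam', ',', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut']
```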
