
I tried to manage this myself, but I couldn't...

I have this text:

{Łatwe|Proste} szukanie mieszkania {Sprawdź|Wypróbuj juz dziś}, znalezienie {wcale|w ogóle}

I want to put the single words from the sentence, or the whole expressions in {}, into a list. So the list should look like this:

  • {Łatwe|Proste}
  • szukanie
  • mieszkania
  • {Sprawdź|Wypróbuj juz dziś}
  • znalezienie ...

I use the split() method, but it produces, for example:

  • {Sprawdź|Wypróbuj
  • juz
  • dziś}

But it should be one item. I don't want to break up expressions in {}.
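A minimal reproduction of the problem, assuming split() is called with no arguments so that it splits on whitespace (the variable name `text` is only illustrative):

```python
# Assumption: plain str.split() with no separator, Python 3.
text = '{Łatwe|Proste} szukanie mieszkania {Sprawdź|Wypróbuj juz dziś}, znalezienie {wcale|w ogóle}'

parts = text.split()

# Whitespace inside the braces is treated like any other whitespace,
# so the {...} expressions are broken apart.
print(parts)
# ['{Łatwe|Proste}', 'szukanie', 'mieszkania', '{Sprawdź|Wypróbuj',
#  'juz', 'dziś},', 'znalezienie', '{wcale|w', 'ogóle}']
```

Note that the comma also stays attached to 'dziś},' because split() only looks at whitespace.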

Any help?:)

  • And what do you split with? Commented Jan 12, 2013 at 10:45

1 Answer

Python 2.x solution:

>>> import re
>>> re.findall(r'{[^}]*}|\b\w+\b', u'{Łatwe|Proste} szukanie mieszkania {Sprawdź|Wypróbuj juz dziś}, znalezienie {wcale|w ogóle}', re.U)
[u'{\u0141atwe|Proste}', u'szukanie', u'mieszkania', u'{Sprawd\u017a|Wypr\xf3buj juz dzi\u015b}', u'znalezienie', u'{wcale|w og\xf3le}']

The re.U flag is necessary because, by default in Python 2, \b, \w, and a few others (\d, \s, and their negated counterparts) match only ASCII characters.

Python 3.x solution:

re.findall(r'{[^}]*}|\b\w+\b', '{Łatwe|Proste} szukanie mieszkania {Sprawdź|Wypróbuj juz dziś}, znalezienie {wcale|w ogóle}')

In Python 3.x, \b, \w, \d, \s, and their negated counterparts match Unicode characters by default. The re.U flag still exists for backward compatibility, but specifying it is redundant.
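As a rough sketch, the Python 3 one-liner above could be wrapped in a small helper. The names `TOKEN_RE` and `tokenize` are hypothetical and not part of the original answer; the pattern is the same one, with the braces escaped for readability:

```python
import re

# Same pattern as in the answer: either a whole {...} group,
# or a run of word characters delimited by word boundaries.
TOKEN_RE = re.compile(r'\{[^}]*\}|\b\w+\b')


def tokenize(text):
    """Return {...} groups and standalone words, in order of appearance."""
    return TOKEN_RE.findall(text)


text = '{Łatwe|Proste} szukanie mieszkania {Sprawdź|Wypróbuj juz dziś}, znalezienie {wcale|w ogóle}'
tokens = tokenize(text)
print(tokens)
# ['{Łatwe|Proste}', 'szukanie', 'mieszkania',
#  '{Sprawdź|Wypróbuj juz dziś}', 'znalezienie', '{wcale|w ogóle}']
```

The brace alternative is listed first, so a `{...}` group is consumed as a whole before the word alternative gets a chance to match inside it. Punctuation such as the comma matches neither alternative and is dropped. This sketch assumes the braces are never nested.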


1 Comment

Note that it will fail when the text outside {} contains diacritics. You need to specify the re.U flag.
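To illustrate the failure this comment describes, here is a small Python 3 sketch. It uses the re.ASCII flag to imitate ASCII-only matching (the Python 2 default without re.U); the sample string is made up for the demonstration:

```python
import re

sample = 'Łatwe dziś'  # words with diacritics outside any {...} group
pattern = r'{[^}]*}|\b\w+\b'

# ASCII-only matching: Ł and ś are not word characters,
# so the words are cut at the diacritics.
ascii_tokens = re.findall(pattern, sample, re.ASCII)
print(ascii_tokens)    # ['atwe', 'dzi']

# Unicode matching (the Python 3 default for str patterns).
unicode_tokens = re.findall(pattern, sample)
print(unicode_tokens)  # ['Łatwe', 'dziś']
```

Text inside {} is unaffected either way, because [^}] matches any character other than the closing brace.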
