I have a large string and I want to get all sub-strings of format [[someword]] from it.
That is, I want a list of all the words that are wrapped in double opening and closing square brackets.

Now, one way to do this is to split the string on spaces and then filter the resulting list, but the problem is that sometimes [[someword]] does not stand alone as a word: it might have a comma, a space, or a period right before or after it.
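
To make the problem concrete, here is a minimal Python sketch comparing a whitespace split with a regex scan. The sample text and the variable names are made up for illustration and are not from my real data.

```python
import re

# Hypothetical sample: two of the three bracketed words touch punctuation.
text = "See [[alpha]], then [[beta]]. Also [[gamma]] here"

# Naive approach: split on whitespace, keep tokens that look like [[word]].
tokens = text.split()
naive = [t for t in tokens if t.startswith("[[") and t.endswith("]]")]

# Regex approach: find every [[...]] regardless of neighbouring characters.
matched = re.findall(r"\[\[[^\[\]]+\]\]", text)

print(naive)    # only the token with no punctuation attached survives
print(matched)  # all three bracketed words are found
```

The split-and-filter version misses any token with a comma or period attached, which is exactly the case I need to handle.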

What is the best way to do this?

I will appreciate a solution in Scala but as this is more of a programming problem, I will convert your solution to Scala if it's in some other language I know e.g. Python.

This question is different from the marked duplicate because the regex needs to be able to accommodate characters other than English characters in between the brackets.

  • Can you post sample strings from which you want to extract your words? The regex you can use to match and extract words like the ones you described is \[{2}[^[\]]+\]{2} (Demo). Commented Apr 1, 2019 at 11:39
  • Possible duplicate of Scala regexps: how to return matches as array or list Commented Apr 1, 2019 at 11:50

2 Answers

You can use this (?<=\[{2})[^[\]]+(?=\]{2}) regex to match and extract all the words you need that are contained in double square brackets.

Here is a Python solution,

import re

s = 'some text [[someword]] some [[some other word]]other text '
print(re.findall(r'(?<=\[{2})[^[\]]+(?=\]{2})', s))

Prints,

['someword', 'some other word']
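
As a hedged alternative, the same extraction can be sketched with a capture group instead of lookarounds. This is just another way to write it, not a correction; the names pattern, words and with_brackets are illustrative.

```python
import re

s = 'some text [[someword]] some [[some other word]]other text '

# Match the brackets, but capture only what sits between them.
pattern = re.compile(r'\[\[([^\[\]]+)\]\]')

# With exactly one group, findall returns the group contents.
words = pattern.findall(s)

# group(0) is the whole match, so this keeps the brackets.
with_brackets = [m.group(0) for m in pattern.finditer(s)]

print(words)
print(with_brackets)
```

The capture-group form makes it easy to switch between "with brackets" (group 0) and "without brackets" (group 1) without touching the regex.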

I have never worked in Scala, but here is a solution in Java. Since Scala runs on the JVM and can use the same java.util.regex classes, this may help.

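import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The lookbehind and lookahead keep the brackets out of the match itself
String s = "some text [[someword]] some [[some other word]]other text ";
Pattern p = Pattern.compile("(?<=\\[{2})[^\\[\\]]+(?=\\]{2})");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}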

Prints,

someword
some other word

Let me know if this is what you were looking for.


4 Comments

Okay, let me see if I can find re.findall alternative in Scala.
I want some word and some other word only in my list, how to get rid of brackets in a good way?
Sorry I thought you wanted brackets too. It's very easy. Just change the regex to (?<=\\[{2})[^\\[\\]]+(?=\\]{2}) Let me update my answer.
@saadi: Updated my answer. Hopefully this is what you needed. Let me know if you face any issues.

Scala solution:

val text = "[[someword1]] test [[someword2]] test 1231"

val pattern = "\\[\\[(\\p{L}+)\\]\\]".r // match the double-bracketed word; group 1 captures the letters inside
val values = pattern
   .findAllIn(text)
   .matchData
   .map(_.group(1)) //get 1st group
   .toList

println(values)

7 Comments

Hi, thanks for your response. I believe the \\w+ in your comment only accommodates English words, because I tried it with Arabic and it failed. Can you update this regex to accommodate non-English words and letters?
@saadi: Try using \\p{L} (this will capture not just English but characters from other languages represented in Unicode) instead of \\w
@PushpeshKumarRajwanshi It did not work with this string: و[[لبنان]] بما فيها مدينة [[القدس]]، بعد أن هزم جيش [[مملكة بيت المقدس|بيت المقدس]] هزيمة. I tried putting your regex there, but it was invalid; I guess it was Python-specific or something. Can you help me find a less English-specific solution, please? Scala code is here.
@saadi: \\p{L} only represents one character so you need to write \\p{L}+
@saadi: You can follow the regex given in my answer to allow any text within those double brackets. Use this regex: (?<=\\[{2})[^\\[\\]]+(?=\\]{2}) Also, since this regex doesn't use any capture groups, make sure to write group(0) instead of group(1).
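
The difference discussed in these comments can be sketched in Python. Two assumptions are worth stating: Python's built-in re module has no \p{L}, and re.ASCII is used here only to mimic Java's default ASCII-only \w. The variable names are illustrative.

```python
import re

# The sample string from the comment above.
text = "و[[لبنان]] بما فيها مدينة [[القدس]]، بعد أن هزم جيش [[مملكة بيت المقدس|بيت المقدس]] هزيمة"

# ASCII-only \w (assumption: stands in for Java's default \w) finds nothing.
ascii_only = re.findall(r"\[\[(\w+)\]\]", text, flags=re.ASCII)

# "Anything except brackets" is script-agnostic and also allows spaces and '|'.
any_text = re.findall(r"\[\[([^\[\]]+)\]\]", text)

print(ascii_only)
print(any_text)
```

Note that the third match keeps the | and both halves of the piped text, since the negated character class only excludes square brackets.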