I have text with this structure:

1.6.1 Members................................................................ 12
1.6.2 Accessibility.......................................................... 13
1.6.3 Type parameters........................................................ 13
1.6.4 The T generic type aka <T>............................................. 13

I need to create JS objects :

{ 
  num:"1.6.1",
  txt:"Members"
},
{ 
  num:"1.6.2",
  txt:"Accessibility"
} ...

That's not a problem.

The problem is that I want to extract the values with a regex split that uses a positive lookahead:

Split at the first space whose next character is a letter.

What I have tried:

'1.6.1 Members........... 12'.split(/\s(?=(?:[\w\. ])+$)/i)

This works fine:

["1.6.1", "Members...........", "12"] // I don't care about the 12.

But if the title has two or more words:

'1.6.3 Type parameters................ 13'.split(/\s(?=(?:[\w\. ])+$)/i)

The result is :

["1.6.3", "Type", "parameters................", "13"] //again I don't care about 13.

Of course I can join them, but I want the words to stay together.

Question:

How can I enhance my regex so that it does NOT split the words?

Desired result :

["1.6.3", "Type parameters"]

or

["1.6.3", "Type parameters........"] // I will remove extras later

or

["1.6.3", "Type parameters........13"]// I will remove extras later

NB

I know I can split on " " or use another, simpler solution, but (purely for the sake of knowledge) I am looking for an enhancement of my own solution, which splits using a positive lookahead.

NB 2:

The text can also contain capital letters in the middle.

  • Hey Royi, did any of the solutions work for you, or do you need any tweaks? Please note that we gave you Match instead of Split solutions because Match All and Split are two sides of the same coin, you get the same array but in this case matching is much easier. Commented Jul 16, 2014 at 21:50

3 Answers

You can use this regex:

/^(\d+(?:\.\d+)*) (\w+(?: \w+)*)/gm

And get your desired matches using matched group #1 and matched group #2.
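As a rough sketch, groups #1 and #2 could be turned into the objects from the question like this. The names `toc` and `entries` are illustrative only, the sample text is shortened, and `matchAll` assumes an ES2020 runtime:

```javascript
// Shortened sample of the table of contents (illustrative data).
const toc = [
  '1.6.1 Members................ 12',
  '1.6.3 Type parameters........ 13',
].join('\n');

// Group 1: the dotted section number. Group 2: one or more words separated by single spaces.
const re = /^(\d+(?:\.\d+)*) (\w+(?: \w+)*)/gm;

// Map every match to a { num, txt } object.
const entries = Array.from(toc.matchAll(re), (m) => ({ num: m[1], txt: m[2] }));

console.log(entries);
```

Note that `\w` does not match `<` or `>`, so a title such as The T generic type aka <T> would be cut off after aka with this pattern.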

Update: For String#split you can use this regex:

/ +(?=[A-Z\d])/g
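A minimal sketch of that split on one of the lines from the question (the variable names are illustrative). It relies on the assumption that only the first word of the title starts with a capital letter:

```javascript
const line = '1.6.3 Type parameters................ 13';

// Split on spaces that are followed by an uppercase letter or a digit.
// The space before "parameters" is followed by a lowercase letter, so it is kept.
const parts = line.split(/ +(?=[A-Z\d])/);
console.log(parts);
// → ["1.6.3", "Type parameters................", "13"]

// The trailing dots can be stripped afterwards.
const title = parts[1].replace(/\.+$/, '');
console.log(title);
// → "Type parameters"
```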

Update 2: Since chapter names can also contain capital letters, the following more complex regex is needed:

var re = /(\D +(?=[a-z]))| +(?=[a-z\d])/gmi; 
var str = '1.6.3 Type Foo Bar........................................................ 13';
var m = str.split( re );
console.log(m[0], ',', m.slice(1, -1).join(''), ',', m.pop() );

//=> 1.6.3 , Type Foo Bar........................................................ , 13
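For illustration, the same split could be wrapped in a small helper that returns the object shape from the question. `parseTocLine` is a hypothetical name, and stripping the dot leaders with a second regex is an added assumption, not part of the snippet above:

```javascript
// Hypothetical helper built around the "Update 2" regex.
function parseTocLine(line) {
  // The capture group returns the consumed "letter + space" pieces in the
  // result array, so nothing is lost when the pieces are joined again.
  const re = /(\D +(?=[a-z]))| +(?=[a-z\d])/i;
  const m = line.split(re);
  const num = m[0];
  // Drop the first piece (number) and the last piece (page), rejoin the rest.
  // join('') treats the undefined capture slots as empty strings.
  const txt = m.slice(1, -1).join('').replace(/\.+$/, '');
  return { num, txt };
}

console.log(parseTocLine('1.6.3 Type Foo Bar.......... 13'));
// → { num: '1.6.3', txt: 'Type Foo Bar' }
```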
12 Comments

Thank you for the reply. But I mentioned something in my NB (again, for pure knowledge regarding positive lookahead splits).
Not what OP was asking for ("which uses positive lookahead split"), and overly complicated too.
But I am not using String#split. This regex is for String#match
The updated answer is brilliant, but hopefully no word in the middle of the title starts with an uppercase letter.
Ah, so you can have capital letters in between as well! I will need to find a new way to split it. By the way, my earlier regex works fine with this case too. Give me some time to try something different for String#split (if it is possible at all).

EDIT: Since you added 1.6.1 The .net 4.5 framework.... to the requirements, we can tweak the answer to this:

^([\d.]+) ((?:[^.]|\.(?!\.))+)

And if you want to allow sequences of up to three dots in the title, as in 1.6.1 She said... Boo!..........., it's an easy tweak from there ({3} quantifier):

^([\d.]+) ((?:[^.]|\.(?!\.{3}))+)

Original:

^([\d.]+) ([^.]+)

To retrieve Groups 1 and 2, something like:

var myregex = /^([\d.]+) ((?:[^.]|\.(?!\.))+)/mg;
var theMatchObject = myregex.exec(yourString);
while (theMatchObject != null) {
    // the numbers: theMatchObject[1]
    // the title: theMatchObject[2]
    theMatchObject = myregex.exec(yourString);
}

OUTPUT

Group 1     Group 2
1.6.1       Members
1.6.2       Accessibility
1.6.3       Type parameters
1.6.4       The T generic type aka <T>
1.6.1       The .net 4.5 framework

Explanation

  • ^ asserts that we are at the beginning of the line
  • The parentheses in ([\d.]+) capture digits and dots to Group 1
  • The parentheses in ((?:[^.]|\.(?!\.))+) capture to Group 2...
  • [^.] one char that is not a dot, | OR...
  • \.(?!\.) a dot that is not followed by a dot...
  • + one or more times
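To make the explanation concrete, here is a small sketch comparing the original pattern with the tweaked one on the .net example. The variable names are illustrative and the sample line is shortened:

```javascript
const sampleLine = '1.6.1 The .net 4.5 framework.... 12';

// Original pattern: Group 2 stops at the very first dot.
const original = /^([\d.]+) ([^.]+)/.exec(sampleLine);
console.log(JSON.stringify(original[2]));
// → "The "

// Tweaked pattern: a single dot is allowed, a dot followed by another dot is not.
const tweaked = /^([\d.]+) ((?:[^.]|\.(?!\.))+)/.exec(sampleLine);
console.log(tweaked[1]);
// → 1.6.1
console.log(tweaked[2]);
// → The .net 4.5 framework
```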

5 Comments

It's indeed the most simple way.
That's an easy tweak... Done. :)
Also, added an additional tweak in case you'd like to allow sequences of up to three dots in the title, as in 1.6.1 She said... Boo!..........., as well as full explanation.
Casimir's solution is the shortest one, but since I explicitly asked for a positive lookahead, I chose anubhava's solution. PS: +1
You choose whatever you like, but right now mine is the only one that works with 1.6.1 The .net 4.5 framework.... or 1.6.1 She said... Boo!..........., right? :)

You can use this pattern too:

var myStr = "1.6.1 Members................................................................ 12\n1.6.2 Accessibility.......................................................... 13\n1.6.3 Type parameters........................................................ 13\n1.6.4 The T generic type aka <T>............................................. 13";

console.log(myStr.split(/ (.+?)\.{2,} ?\d+$\n?/m));
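Because the pattern contains a capture group, the split result alternates between numbers and titles, with an empty string left over at the end. A possible way to pair them up into objects, sketched with illustrative names and a shortened sample:

```javascript
const text = [
  '1.6.1 Members.................... 12',
  '1.6.4 The T generic type aka <T>...... 13',
].join('\n');

// Result shape: [num, title, num, title, ..., ""]
const flat = text.split(/ (.+?)\.{2,} ?\d+$\n?/m);

// Walk the array two items at a time and skip the trailing empty string.
const items = [];
for (let i = 0; i + 1 < flat.length; i += 2) {
  items.push({ num: flat[i], txt: flat[i + 1] });
}

console.log(items);
```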

About a way with a lookahead:

I don't think it is possible, because the only way to skip a character (here, a space between two words) is to consume it as part of the match that starts at an earlier space (the one between the number and the first word). In other words, you rely on the fact that a character cannot be matched more than once.

But if everything in the pattern except the space where you want to split is enclosed in a lookahead, the substring checked by the lookahead is not part of the match (it is only a check, and the regex engine does not consume those characters). So you cannot skip the following spaces, and the regex engine simply carries on to the next space character.
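The difference can be seen in a short sketch. The first pattern is a simplified equivalent of the one in the question. The second is only an illustration of the "consume it" idea, not a pattern from the question, and it assumes that at least two dots follow the title:

```javascript
const sample = '1.6.3 Type parameters................ 13';

// Lookahead only: each space is tested on its own, and the lookahead
// consumes nothing, so every space that passes the check causes a split.
const lookaheadOnly = sample.split(/\s(?=[\w. ]+$)/);
console.log(lookaheadOnly);
// → ["1.6.3", "Type", "parameters................", "13"]

// Consuming variant: the title is matched (and captured) together with the
// first space, so the space inside the title cannot start a new match.
const consuming = sample.split(/\s(?=[a-z])(.*?)\.{2,}\s/i);
console.log(consuming);
// → ["1.6.3", "Type parameters", "13"]
```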
