split line via regex in javascript?

Question

I have this structure of text :

1.6.1 Members................................................................ 12
1.6.2 Accessibility.......................................................... 13
1.6.3 Type parameters........................................................ 13
1.6.4 The T generic type aka <T>............................................. 13

I need to create JS objects :

{ 
  num:"1.6.1",
  txt:"Members"
},
{ 
  num:"1.6.2",
  txt:"Accessibility"
} ...

That's not a problem.

The problem is that I want to extract the values with a regex split that uses a positive lookahead:

Split at the first position where the next character is a letter.

What I have tried:

'1.6.1 Members........... 12'.split(/\s(?=(?:[\w\. ])+$)/i)

This is working fine :

["1.6.1", "Members...........", "12"] // I don't care about the 12.

But If I have 2 words or more :

'1.6.3 Type parameters................ 13'.split(/\s(?=(?:[\w\. ])+$)/i)

The result is :

["1.6.3", "Type", "parameters................", "13"] //again I don't care about 13.

Of course I can join them, but I want the words to stay together.

Question :

How can I enhance my regex NOT to split words ?

Desired result :

["1.6.3", "Type parameters"]

or

["1.6.3", "Type parameters........"] // I will remove extras later

or

["1.6.3", "Type parameters........13"]// I will remove extras later

NB

I know I can split on " " or use some other simpler solution, but (purely for the sake of knowledge) I'm looking for an enhancement to my solution, which splits using a positive lookahead.

nb2 :

The text can contain capital letter in the middle also.

Hey Royi, did any of the solutions work for you, or do you need any tweaks? Please note that we gave you Match instead of Split solutions because Match All and Split are two sides of the same coin, you get the same array but in this case matching is much easier.
– zx81, Commented Jul 16, 2014 at 21:50

anubhava · Accepted Answer · 2014-07-16 12:10:32Z

3

You can use this regex:

/^(\d+(?:\.\d+)*) (\w+(?: \w+)*)/gm

And get your desired matches using matched group #1 and matched group #2.

Online Regex Demo
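
As a minimal sketch of how the two groups could be turned into the objects the question asks for (the names `toc`, `matchRe` and `entries` are illustrative assumptions, and `String.prototype.matchAll` assumes an ES2020 runtime):

```javascript
// Sample input: the first three lines from the question.
const toc = [
  "1.6.1 Members................................................................ 12",
  "1.6.2 Accessibility.......................................................... 13",
  "1.6.3 Type parameters........................................................ 13",
].join("\n");

// Group 1: dotted section number. Group 2: words separated by single spaces.
const matchRe = /^(\d+(?:\.\d+)*) (\w+(?: \w+)*)/gm;

// With the g and m flags, matchAll yields one match per line.
const entries = Array.from(toc.matchAll(matchRe), (m) => ({
  num: m[1],
  txt: m[2],
}));

console.log(entries);
```

Because group 2 only accepts `\w` characters and single spaces, a title such as `The T generic type aka <T>` would be cut off just before `<T>`.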

Update: For String#split you can use this regex:

/ +(?=[A-Z\d])/g

Regex Demo
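
A short sketch of that split applied to one line from the question (the names `tocLine`, `parts` and `entry` are assumptions made for illustration, as is the extra step that strips the dot leader):

```javascript
// One line from the question.
const tocLine = "1.6.3 Type parameters................ 13";

// Split on runs of spaces that are followed by an uppercase letter or a digit.
// The space before "parameters" is followed by a lowercase letter, so it stays.
const parts = tocLine.split(/ +(?=[A-Z\d])/g);

// parts holds three pieces: the number, the title with its dot leader,
// and the page number. Strip the trailing dots and ignore the page number.
const entry = { num: parts[0], txt: parts[1].replace(/\.+$/, "") };

console.log(parts.length, entry);
```

This only keeps the title together while its inner words start with a lowercase letter, since every space before an uppercase letter is a split point.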

Update 2: Since chapter names may also contain capital letters, the following more complex regex is needed:

var re = /(\D +(?=[a-z]))| +(?=[a-z\d])/gmi; 
var str = '1.6.3 Type Foo Bar........................................................ 13';
var m = str.split( re );
console.log(m[0], ',', m.slice(1, -1).join(''), ',', m.pop() );

//=> 1.6.3 , Type Foo Bar........................................................ , 13

edited Jul 16, 2014 at 12:10

answered Jul 16, 2014 at 11:15

anubhava


12 Comments

Royi Namir Over a year ago

Thank you for the reply. But I mentioned something in my NB (again, purely for knowledge regarding positive lookahead splits).

Christoph Over a year ago

Not what OP was asking ("which uses positive lookahead split"), and overly complicated too.

anubhava Over a year ago

But I am not using String#split. This regex is for String#match

Paul Chen Over a year ago

The updated answer is brilliant, but hopefully there are no words in the middle of the title that start with an uppercase letter.

anubhava Over a year ago

Ah so you can have capital letters in between also!! will need to find new ways to split it. btw my earlier regex will work fine with this case as well. Give me some time to try something different for String#split (if at all possible)


zx81 · Answer · 2014-07-17 23:22:27Z

2

EDIT: Since you added 1.6.1 The .net 4.5 framework.... to the requirements, we can tweak the answer to this:

^([\d.]+) ((?:[^.]|\.(?!\.))+)

And if you want to allow sequences of up to three dots in the title, as in 1.6.1 She said... Boo!..........., it's an easy tweak from there ({3} quantifier):

^([\d.]+) ((?:[^.]|\.(?!\.{3}))+)

Original:

^([\d.]+) ([^.]+)

In the regex demo, see the Groups in the right pane.

To retrieve Groups 1 and 2, something like:

var myregex = /^([\d.]+) ((?:[^.]|\.(?!\.))+)/mg;
var theMatchObject = myregex.exec(yourString);
while (theMatchObject != null) {
    // the numbers: theMatchObject[1]
    // the title: theMatchObject[2]
    theMatchObject = myregex.exec(yourString);
}

OUTPUT

Group 1     Group 2
1.6.1       Members
1.6.2       Accessibility
1.6.3       Type parameters
1.6.4       The T generic type aka <T>
1.6.1       The .net 4.5 framework

Explanation

^ asserts that we are at the beginning of the line
The parentheses in ([\d.]+) capture digits and dots to Group 1
The parentheses in ((?:[^.]|\.(?!\.))+) capture to Group 2...
[^.] one char that is not a dot, | OR...
\.(?!\.) a dot that is not followed by a dot...
+ one or more times
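
To see how the three patterns above differ, here is a small comparison sketch. The names `variants`, `samples` and `titleOf` are hypothetical helpers introduced for illustration; the sample lines are the ones quoted in this answer and in the question.

```javascript
// The three patterns from this answer, without the g flag so that
// exec() carries no lastIndex state between calls.
const variants = {
  original: /^([\d.]+) ([^.]+)/,
  singleDots: /^([\d.]+) ((?:[^.]|\.(?!\.))+)/,
  upToThreeDots: /^([\d.]+) ((?:[^.]|\.(?!\.{3}))+)/,
};

const samples = [
  "1.6.3 Type parameters........................................................ 13",
  "1.6.1 The .net 4.5 framework....",
  "1.6.1 She said... Boo!...........",
];

// Return the title (Group 2) or null when the line does not match.
function titleOf(re, line) {
  const m = re.exec(line);
  return m ? m[2] : null;
}

for (const [name, re] of Object.entries(variants)) {
  console.log(name, samples.map((line) => titleOf(re, line)));
}
```

The original pattern stops at the first dot, so it truncates `The .net 4.5 framework`; the second keeps single dots but stops at `She said`; the third keeps runs of up to three dots and stops only at the dot leader.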

edited Jul 17, 2014 at 23:22

answered Jul 16, 2014 at 11:53

zx81


5 Comments

Casimir et Hippolyte Over a year ago

It's indeed the most simple way.

zx81 Over a year ago

That's an easy tweak... Done. :)

zx81 Over a year ago

Also, added an additional tweak in case you'd like to allow sequences of up to three dots in the title, as in 1.6.1 She said... Boo!..........., as well as full explanation.

Royi Namir Over a year ago

Casimir's solution is the shortest one, but since I explicitly asked for a positive lookahead, I chose anubhava's solution. PS: +1

zx81 Over a year ago

You choose whatever you like, but right now mine is the only one that works with 1.6.1 The .net 4.5 framework.... or 1.6.1 She said... Boo!..........., right? :)

Casimir et Hippolyte · Answer · 2014-07-16 17:58:00Z

You can use this pattern too:

var myStr = "1.6.1 Members................................................................ 12\n1.6.2 Accessibility.......................................................... 13\n1.6.3 Type parameters........................................................ 13\n1.6.4 The T generic type aka <T>............................................. 13";

console.log(myStr.split(/ (.+?)\.{2,} ?\d+$\n?/m));
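
Because the pattern contains a capturing group, `split` interleaves the captured titles with the section numbers. A sketch of pairing them into objects follows; the names `tocText`, `pieces` and `result` are assumptions for illustration.

```javascript
// The four lines from the question, joined with newlines.
const tocText = [
  "1.6.1 Members................................................................ 12",
  "1.6.2 Accessibility.......................................................... 13",
  "1.6.3 Type parameters........................................................ 13",
  "1.6.4 The T generic type aka <T>............................................. 13",
].join("\n");

// Each match consumes " title.... page\n" and captures only the title,
// so the array alternates number, title, number, title, ...
// The last match ends at the end of the string, leaving a trailing "".
const pieces = tocText.split(/ (.+?)\.{2,} ?\d+$\n?/m);

// Walk the array two items at a time; the trailing "" is skipped.
const result = [];
for (let i = 0; i + 1 < pieces.length; i += 2) {
  result.push({ num: pieces[i], txt: pieces[i + 1] });
}

console.log(result);
```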

About a way with a lookahead :

I don't think it is possible, because the only way to skip a character (here, a space between two words) is to have consumed it already as part of the match that began at the previous space (the one between the number and the first word). In other words, you rely on the fact that a character cannot be matched more than once.

But if the whole pattern, apart from the space where you want to split, is enclosed in a lookahead, the substring checked by that lookahead is not part of the match result (it is only a check, and the corresponding characters are not consumed by the regex engine). You therefore cannot skip the following spaces, and the regex engine will carry on to the next space character.
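
The limitation can be seen in a short sketch. The names `sample`, `viaLookahead` and `viaLookbehind` are illustrative. The second split uses a lookbehind, which is a different technique from anything proposed in this thread and assumes an ES2018 or later engine.

```javascript
const sample = "1.6.3 Type parameters................ 13";

// Lookahead only: every space followed by a letter qualifies,
// so the space inside the title is a split point as well.
const viaLookahead = sample.split(/ (?=[a-z])/i);
console.log(viaLookahead.length); // 3 pieces: number, "Type", rest of line

// Lookbehind (assumption: ES2018+ engine): split only on the space that
// directly follows the section number at the start of the string.
const viaLookbehind = sample.split(/(?<=^[\d.]+) /);
console.log(viaLookbehind.length); // 2 pieces: number, rest of line
```

A lookbehind can tell the first space apart because it inspects what precedes the space, which is exactly the information a lookahead cannot see.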
