Here is my problem:

  • I have a regular expression that contains one, and only one, capture group.
  • This regular expression cannot be changed.
  • I have a string that will be matched against this regular expression.
  • The regex matches the complete string; it is not a search. If the regex cannot be matched against the string, the function fails before reaching this step.

=> I want to get the position of the captured substring in the string, and its length.

Example:

If my regex is

^.*?\/F?L?(\d+)$

my string is

"( 413) 250/FL250"

I want to get 14 and 3 (the 1-based position of the captured substring, and its length).

In those conditions, search would return 1.

This is a simple example, but the regex can be extremely complex. The principle is always the same, though: one and only one capture group, and I need to find the position of the captured string within the main one.

Thanks a lot for your help, I'm stuck.

EDIT:

So I built something with Ant (our base work environment is Ant), which consists of getting the leftContext of the capture group and then determining its length. To get the leftContext, I simply move the parentheses of the capture group onto the left part. Ex: \d(\s) becomes (\d)\s.

So I have a question about it:

<macrodef name="Get_CaptureGroup_Position" >
    <attribute name="text" />
    <attribute name="mask" />
    <attribute name="start" />
    <attribute name="end" />
    <sequential>

        <var name="_GMLCS_modified_regex"       unset="true"/>
        <var name="_GMLCS_leftContext"          unset="true"/>
        <var name="_GMLCS_leftContext_len"      unset="true"/>
        <var name="_GMLCS_CapturedGroup"        unset="true"/>
        <var name="_GMLCS_CapturedGroup_len"    unset="true"/>

        <propertyregex property="_GMLCS_modified_regex" override="yes"  input="@{mask}" regexp="(.*[^\\])\)([^?].*)" replace="\1\2" />  
        <propertyregex property="_GMLCS_modified_regex" override="yes" input="${_GMLCS_modified_regex}" regexp="(.*[^\\])\(([^?].*)" replace="\1)\2" />
        <var name="_GMLCS_modified_regex" value="(${_GMLCS_modified_regex}" />

        <propertyregex property="_GMLCS_leftContext"    override="yes" input="@{text}" regexp="${_GMLCS_modified_regex}" select="\1" />
        <propertyregex property="_GMLCS_CapturedGroup"  override="yes" input="@{text}" regexp="@{mask}" select="\1" />

        <getAttributeLength text="${_GMLCS_leftContext}"    property="_GMLCS_leftContext_len" />
        <getAttributeLength text="${_GMLCS_CapturedGroup}"  property="_GMLCS_CapturedGroup_len" />

        <math result="_GMLCS_leftContext_len"   operation="+" operand1="${_GMLCS_leftContext_len}" operand2="1" />
        <math result="_GMLCS_CapturedGroup_len" operation="+" operand1="${_GMLCS_leftContext_len}" operand2="${_GMLCS_CapturedGroup_len}" />

        <var name="@{start}" value="${_GMLCS_leftContext_len}" />
        <var name="@{end}" value="${_GMLCS_CapturedGroup_len}" />

        <var name="_GMLCS_modified_regex"       unset="true"/>
        <var name="_GMLCS_leftContext"          unset="true"/>
        <var name="_GMLCS_leftContext_len"      unset="true"/>
        <var name="_GMLCS_CapturedGroup"        unset="true"/>
        <var name="_GMLCS_CapturedGroup_len"    unset="true"/>
    </sequential>
</macrodef>

My question is this: when I pass this regex:

(?:A|.*)/F?L?(\d+)\s*\d*(?:A|.*)

I get:

First property regex:

(?:A|.*)/F?L?(\d+\s*\d*(?:A|.*) = CORRECT

Second property regex:

(?:A|.*)/F?L?)\d+\s*\d*(?:A|.*) = CORRECT

Var:

((?:A|.*)/F?L?)\d+\s*\d*(?:A|.*) = CORRECT

Start and End: 7 and 10 = CORRECT.

This is actually correct, but I believe it should not be. My question is: why were the ")" at the end of the (?:...) blocks not removed?
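
For reference, the three rewriting steps can be reproduced outside Ant. The following is a minimal JavaScript sketch, assuming that propertyregex with replace behaves like a single regex replacement; moveGroupLeft is a hypothetical name used only for illustration.

```javascript
// Hypothetical helper mirroring the three Ant steps; not part of the build.
function moveGroupLeft(mask) {
    // Step 1: drop the last ")" that is not preceded by a backslash and is
    // followed by at least one character other than "?".
    const step1 = mask.replace(/(.*[^\\])\)([^?].*)/, "$1$2");
    // Step 2: turn the last "(" that is not preceded by a backslash and is
    // not followed by "?" into ")".
    const step2 = step1.replace(/(.*[^\\])\(([^?].*)/, "$1)$2");
    // Step 3: open the new group at the very start.
    return { step1, step2, result: "(" + step2 };
}

const steps = moveGroupLeft(String.raw`(?:A|.*)/F?L?(\d+)\s*\d*(?:A|.*)`);
console.log(steps.step1);  // (?:A|.*)/F?L?(\d+\s*\d*(?:A|.*)
console.log(steps.step2);  // (?:A|.*)/F?L?)\d+\s*\d*(?:A|.*)
console.log(steps.result); // ((?:A|.*)/F?L?)\d+\s*\d*(?:A|.*)
```

The sketch shows two things. Each pattern matches only once, and the greedy .* makes it pick the last parenthesis that satisfies the conditions. A ")" at the very end of the mask can never be chosen, because [^?] requires one more character after it.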

2 Answers

Here is the final answer we have for our issue. It's done in Ant, but I think it is transposable to JavaScript:

<macrodef name="Get_CaptureGroup_Position" >
<attribute name="text" />
<attribute name="mask" />
<attribute name="start" />
<attribute name="end" />
<sequential>

    <var name="_GMLCS_modified_regex"       unset="true"/>
    <var name="_GMLCS_leftContext"          unset="true"/>
    <var name="_GMLCS_leftContext_len"      unset="true"/>
    <var name="_GMLCS_CapturedGroup"        unset="true"/>
    <var name="_GMLCS_CapturedGroup_len"    unset="true"/>

    <propertyregex property="_GMLCS_modified_regex" override="yes" input="@{mask}" regexp="^((?:|(?:[^\\]|\\.)*))\(([^?].*)$" replace="(\1\2" />

    <propertyregex property="_GMLCS_leftContext"    override="yes" input="@{text}" regexp="${_GMLCS_modified_regex}" select="\1" />
    <propertyregex property="_GMLCS_CapturedGroup"  override="yes" input="@{text}" regexp="@{mask}" select="\1" />

    <getAttributeLength text="${_GMLCS_leftContext}"    property="_GMLCS_leftContext_len" />
    <getAttributeLength text="${_GMLCS_CapturedGroup}"  property="_GMLCS_CapturedGroup_len" />

    <math result="@{start}" operation="-" operand1="${_GMLCS_leftContext_len}" operand2="${_GMLCS_CapturedGroup_len}" datatype="int"/>
    <math result="@{start}" operation="+" operand1="${@{start}}" operand2="1" datatype="int"/>
    <var name="@{end}" value="${_GMLCS_leftContext_len}" />

    <var name="_GMLCS_modified_regex"       unset="true"/>
    <var name="_GMLCS_leftContext"          unset="true"/>
    <var name="_GMLCS_leftContext_len"      unset="true"/>
    <var name="_GMLCS_CapturedGroup"        unset="true"/>
    <var name="_GMLCS_CapturedGroup_len"    unset="true"/>
</sequential>
</macrodef>
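
As a rough idea of what the JavaScript transposition could look like, here is a minimal sketch of the same steps as the macro above. The function name getCaptureGroupPosition and the shape of the returned object are hypothetical. It assumes the mask has exactly one capture group, that no quantifier is applied to that group, and that no "(" appears inside a character class.

```javascript
// Hypothetical helper, not part of the Ant build: same idea as the macro.
function getCaptureGroupPosition(text, mask) {
    // Move the opening parenthesis of the single capture group to the front,
    // so that group 1 captures the left context plus the original capture.
    const rewrite = /^((?:|(?:[^\\]|\\.)*))\(([^?].*)$/;
    const modified = mask.replace(rewrite, "($1$2");

    const widened = new RegExp(modified).exec(text);
    const original = new RegExp(mask).exec(text);
    if (widened === null || original === null) {
        return null; // the mask does not match the text
    }

    const length = original[1].length;
    const end = widened[1].length;   // 1-based index of the last captured char
    const start = end - length + 1;  // 1-based index of the first captured char
    return { start, end, length };
}

const position = getCaptureGroupPosition(
    "( 413) 250/FL250",
    String.raw`^.*?\/F?L?(\d+)$`
);
console.log(position); // { start: 14, end: 16, length: 3 }
```

With the example from the question this gives a start of 14 and a length of 3, which are the values asked for.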

It is trivial to get the length, as shown in the two methods below, but in the general case it is impossible to get the start and end index of the text captured by a capturing group.

The first method with String.match, for non-global RegExp only:

// reNonGlobal can be a variable containing RegExp object
// or a RegExp object directly specified.
var result = inputString.match(reNonGlobal);

if (result != null) {
    console.log(result[groupNumber].length);
}

The second method with RegExp.exec, for any RegExp:

var arr;
// The RegExp object must be assigned to a variable
var re = ...;

if (re.global) {
    while ((arr = re.exec(inputString)) != null) {
        console.log(arr[groupNumber].length);

        // lastIndex is not advanced when empty string is matched
        // Need to manually advance it to prevent infinite loop
        if (arr[0].length == 0) {
            re.lastIndex += 1;
        }
    }
} else {
    if ((arr = re.exec(inputString)) != null) {
        console.log(arr[groupNumber].length);
    }
}

Using indexOf (or any other method) to locate the index of the captured text is unreliable, and depends on the particular regex and/or input.
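
To illustrate with the regex and input from the question, here is a small sketch (indices are 0-based, so the wanted position 14 corresponds to index 13). The variable names are only for illustration.

```javascript
const re = /^.*?\/F?L?(\d+)$/;
const input = "( 413) 250/FL250";

const captured = re.exec(input)[1];
console.log(captured);                    // "250"

// indexOf finds the first "250", which is not the one the group captured.
console.log(input.indexOf(captured));     // 7

// lastIndexOf happens to be right here, but only because this particular
// regex anchors the group at the end of the string.
console.log(input.lastIndexOf(captured)); // 13
```

Whether indexOf, lastIndexOf, or neither gives the right answer depends on where the regex places the group, which is why this approach cannot be relied on in general.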

6 Comments

One possibility I thought of too would be to dynamically modify the regex string to capture everything that comes before the capture group, and determine its length. ^.*?\/F?L?(\d+)$ would become (^.*?\/F?L?)\d+$. So I would need to split the regex string on the "(" character, but only on a plain "(", not on "(?:". Would you have an idea how to do this split? Thanks
It seems that [^\\]\([^?] gives some results.
@user3870905: You would need to parse the regex if you want to modify it. Even so, it won't help you get the correct starting index of a repeated capturing group.
It's not a problem, we have one and only one capture group. If we have 0 or more than 1, the build fails.
@user3870905: I'm talking about cases such as (a|aaa)* (which can be part of a bigger pattern)
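
To make the (a|aaa)* case concrete, here is a small sketch, assuming JavaScript regex semantics. A repeated capturing group only keeps the text of its last iteration, and that text can occur at several places in the input.

```javascript
const repeated = /^(a|aaa)*$/.exec("aaaa");

console.log(repeated[0]); // "aaaa"
// Only the last iteration of the group is kept.
console.log(repeated[1]); // "a"

// The last iteration matched the final "a" (index 3), but searching for
// the captured text finds the first one.
console.log("aaaa".indexOf(repeated[1])); // 0
```

Moving the parenthesis to the left does not help here either, since the quantifier applies to the group itself.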