0

I want to search a string for an identifier. The identifier can have 4 variations:

REF964758362562
REF964-758362-562
964758362562
964-758362-562

The identifier can be located anywhere in a string or on its own. Examples:

Lorem ipsum REF964-758362-562
Lorem ipsum ABCD964-758362-562 lorem ipsum
Lorem ipsum REF964-758362-562 lorem ipsum
REF964-758362-562 Lorem ipsum 1234-123456-22
Lorem ipsum 964-758362-562 lorem ipsum
REF964758362562
REF964-758362-562
964758362562
964-758362-562

When a hyphen/dash character is used in the identifier, the hyphen will always appear after the third and 9th digits as shown in the examples.

Here is what I have come up with, but I suspect the regular expression is getting too long and can probably be shortened. It also does not work well when the identifier is not at the beginning of the string. Any tips/ideas?

^[A-Z]*REF[A-Z]*([12]\d{3})(\d{6})(\d{2})$|^([12]\d{3})(\d{6})(\d{2})[A-Z]*REF[A-Z]*|^([12]\d{3})(\d{6})(\d{2})$

I have put them in groups because once I have extracted the identifiers, I want to add the hyphens if the identifier does not have them. For example, if the identifier extracted is 964758362562, I want to save it as 964-758362-562.

Here are some tests I have run, and as you can see, not a lot of them match:

testRegex = "^[A-Z]*REF[A-Z]*([12]\\d{3})(\\d{6})(\\d{2})$|^([12]\\d{3})(\\d{6})(\\d{2})[A-Z]*REF[A-Z]*|^([12]\\d{3})(\\d{6})(\\d{2})$";
        PATTERN = Pattern.compile(testRegex, Pattern.CASE_INSENSITIVE);

        m = PATTERN.matcher("Lorem ipsum REF964-758362-562");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("REF964-758362-562 Lorem ipsum 1234-123456-22");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("Lorem ipsum 964-758362-562 lorem ipsum");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("Lorem ipsum ABCD964-758362-562 lorem ipsum");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("REF964758362562");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("REF964-758362-562");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("964758362562");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("964-758362-562");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

Output

No match
Match = Not known
No match
No match
No match
No match
No match
No match
No match
No match

4 Answers

4

Use this regex:

(REF)?964-?758362-?562

The ? makes the preceding group or character optional: either zero or one occurrence.

The "REF" is optional, and the dashes are optional.

To require either both dashes or no dashes at all, use this regex:

(REF)?964-758362-562|(REF)?964758362562
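
To make the difference between the two patterns concrete, here is a minimal sketch. The class name `DashCheck` and the helper `found` are made up for illustration, and it assumes the string is searched with `find()` rather than `matches()`.

```java
import java.util.regex.Pattern;

class DashCheck {
    // Each dash is optional on its own, so a single dash is accepted.
    static final Pattern LOOSE = Pattern.compile("(REF)?964-?758362-?562");
    // Either both dashes or none.
    static final Pattern STRICT =
            Pattern.compile("(REF)?964-758362-562|(REF)?964758362562");

    // Hypothetical helper: true if the pattern occurs anywhere in the input.
    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String oneDash = "Lorem ipsum 964-758362562";
        System.out.println(found(LOOSE, oneDash));               // true
        System.out.println(found(STRICT, oneDash));              // false
        System.out.println(found(STRICT, "REF964-758362-562"));  // true
    }
}
```

Both patterns still hardcode the digits from the question; they only differ in how they treat an identifier with a single dash.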

9 Comments

This will also accept identifiers with only one hyphen/dash. Also it seems that there should be digit class \d instead of specific numbers in regex.
Interesting, I have never seen specific numbers used before. How do those translate in the expression?
@Pshemo, the OP says he's looking for that specific identifier, so I think specific numbers are preferable. I do agree, though, that it may be better to not accept identifiers with only one dash.
@ziggy this is just a hardcoded solution which only searches for the specific id you mention in your post. It's not going to work I'm afraid.
@CasimiretHippolyte, the only difference is it forces no dashes or both dashes
2

It looks like the identifier follows this general pattern:

  • optional REF
  • 3 digits
  • optional hyphen
  • 6 digits
  • hyphen if the first hyphen was present. No hyphen if not.
  • 3 digits

That being the case this pattern will work

(?>REF)?(\\d{3}+)(-?)(\\d{6}+)\\2(\\d{3}+)

Breaking down the pattern:

  • (?>REF)? an atomic group to match "REF", optionally
  • (\\d{3}+) capture 3 digits, possessively (group 1)
  • (-?) capture optional hyphen (group 2)
  • (\\d{6}+) capture 6 digits, possessively (group 3)
  • \\2 back-reference to whatever was captured in the second group
  • (\\d{3}+) capture 3 digits, possessively (group 4)

The nifty trick is to capture the optional hyphen and then back-reference it so that if the first hyphen is present then second must be; conversely if the first hyphen is not present the second cannot be.

Testcase in Java:

// Needs java.text.MessageFormat, java.util.regex.Matcher and java.util.regex.Pattern
public static void main(String[] args) throws Exception {
    final String[] test = {"Lorem ipsum REF964-758362-562",
        "Lorem ipsum ABCD964-758362-562 lorem ipsum",
        "REF964-758362-562 Lorem ipsum 1234-123456-22",
        "Lorem ipsum 964-758362-562 lorem ipsum",
        "REF964758362562",
        "REF964-758362-562",
        "964-758362562",
        "964758362-562",
        "964758362562",
        "964-758362-562"};
    final Pattern patt = Pattern.compile("(?>REF)?(\\d{3}+)(-?)(\\d{6}+)\\2(\\d{3}+)");
    final MessageFormat format = new MessageFormat("{0}-{1}-{2}");
    for (final String in : test) {
        final Matcher mat = patt.matcher(in);
        while (mat.find()) {
            final String id = format.format(new Object[]{mat.group(1), mat.group(3), mat.group(4)});
            System.out.println(id);
        }
    }
}

Output:

964-758362-562
964-758362-562
964-758362-562
964-758362-562
964-758362-562
964-758362-562
964-758362-562
964-758362-562

Your main problem is using Matcher.matches() which requires the whole input to match the pattern. What you actually want is to find the pattern in the input. For this purpose there is the while(Matcher.find()) idiom - this finds each occurrence of the pattern in the input in turn.
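
As a rough illustration of that difference, the sketch below runs the same pattern through both calls. The class name `MatchesVsFind` and the helpers `wholeMatch` and `findAll` are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFind {
    static final Pattern ID =
            Pattern.compile("(?>REF)?(\\d{3}+)(-?)(\\d{6}+)\\2(\\d{3}+)");

    // matches() succeeds only if the pattern covers the entire input.
    static boolean wholeMatch(String input) {
        return ID.matcher(input).matches();
    }

    // find() scans for the next occurrence anywhere in the input.
    static List<String> findAll(String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = ID.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "Lorem ipsum REF964-758362-562 lorem ipsum";
        System.out.println(wholeMatch(input));               // false
        System.out.println(wholeMatch("REF964-758362-562")); // true
        System.out.println(findAll(input));                  // [REF964-758362-562]
    }
}
```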

7 Comments

This will accept 123-123456123. Not sure if it is what OP wants.
@Pshemo you're right, the OP might not want that. Lemme hack a backreference into there.
Yes if a hyphen/dash is used, both dashes should exist.
@ziggy updated the answer with that in mind. Should do the trick.
What does the '>' character do in the (?>REF) group? Thanks
2

The idea in the other answer is quite good, but if you don't want to accept identifiers with only one dash, like 123-123456123, you should use something like

  (REF)?(\\d{3}-\\d{6}-\\d{3}|\\d{12})
//which means 
// optional REF 
// and after that numbers in form
//      XXX-XXXXXX-XXX      OR   XXXXXXXXXXXX
// where X represents any digit

You can surround this regex with \b, which is a word boundary, to make sure that the identifier is a separate word and not part of some other word.
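
Here is a small sketch of what the word boundary changes. The class name `BoundaryDemo` and the helper `first` are invented for this example. Whether an input like ABCD964-758362-562 should count as a match is up to you; the sketch only shows how each variant behaves.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    static final Pattern PLAIN =
            Pattern.compile("(REF)?(\\d{3}-\\d{6}-\\d{3}|\\d{12})");
    static final Pattern BOUNDED =
            Pattern.compile("\\b(REF)?(\\d{3}-\\d{6}-\\d{3}|\\d{12})\\b");

    // Hypothetical helper: first match in the input, or null if there is none.
    static String first(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String glued = "Lorem ipsum ABCD964-758362-562 lorem ipsum";
        System.out.println(first(PLAIN, glued));    // 964-758362-562
        System.out.println(first(BOUNDED, glued));  // null
        System.out.println(first(BOUNDED, "Lorem REF964758362562 ipsum")); // REF964758362562
    }
}
```

With \b in place, the digits glued to ABCD are skipped because there is no word boundary between a letter and a digit.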

2 Comments

I think I prefer my nifty back-reference :p.
@BoristheSpider Yes \\2 seems to solve this problem. Possessive quantifier can also improve performance but for now I didn't focus on that :)
1

You likely wanted to use m.find() instead of m.matches():

    // Needs java.util.regex.Matcher and java.util.regex.Pattern
    String testRegex = "(?:REF)?(\\d{3})(-?)(\\d{6})\\2(\\d{3})";
    Pattern PATTERN = Pattern.compile(testRegex, Pattern.CASE_INSENSITIVE);
    Matcher m = PATTERN.matcher(
            "Lorem ipsum REFREF964-758362-562\n" +
            "Lorem ipsum ABCD964-758362-562 lorem ipsum\n" +
            "Lorem ipsum REF964-758362-562 lorem ipsum\n" +
            "REF964-758362-562 Lorem ipsum 1234-123456-22\n" +
            "Lorem ipsum 964-758362-562 lorem ipsum\n" +
            "REF964758362562\n" +
            "REF964-758362-562\n" +
            "964758362562\n" +
            "964-758362-562");
    while(m.find()) {
        System.out.println(m.group(1)+"-"+m.group(3)+"-"+m.group(4));
    }
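
If you would rather rewrite the identifiers inside the text than print them, one possible variation (not part of the code above) is `Matcher.replaceAll` with group references. This is only a sketch: the class name `NormalizeInPlace` and the helper `normalize` are made up, and it assumes you are fine with the REF prefix being dropped, since it is part of the match.

```java
import java.util.regex.Pattern;

class NormalizeInPlace {
    static final Pattern ID = Pattern.compile(
            "(?:REF)?(\\d{3})(-?)(\\d{6})\\2(\\d{3})", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: rewrites every identifier in the text as XXX-XXXXXX-XXX.
    // $1, $3 and $4 refer to the three digit groups; group 2 is the optional dash.
    static String normalize(String text) {
        return ID.matcher(text).replaceAll("$1-$3-$4");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Lorem ipsum REF964758362562 lorem ipsum"));
        // Lorem ipsum 964-758362-562 lorem ipsum
    }
}
```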
