I am trying to parse a file name according to a given pattern, but I cannot get the match quite right. Here is a sample file name:

CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.doc

And here are my requirements:

Up to the character #, the file name can contain anything. After #, I have to find the character _ or the character - to separate the strings. Each separator is either _ or -, but not both, and the strings between the separators can contain any other character. So, after the character #, I must have exactly three (3) _ or - characters combined. The string should end with .doc, .docx or .odt, but NOT with .ok.doc, .ok.docx or .ok.odt.

Here is what I tried:

(.*)#([^_-]+)[_-]([^_-]+)[_-]([^_-]+)[_-]([^_-]+)\.[doc|odt|docx].*(?<!\.ok)$

But this forces me to end the string with .doc.ok or .docs.ok or .docx.ok, whereas I actually want to retain the file extension at the end.

If I try this:

(.*)#([^_-]+)[_-]([^_-]+)[_-]([^_-]+)[_-]([^_-]+)\..*(?<!ok\.[doc|odt|docx])$

it won't work.

Any help would be appreciated. Thank you :)

1 Answer

It seems you can use

^([^#]*#[^-_]*)[-_](.*)$(?<=(?<!\.ok)\.(?:docx?|odt)$)

Explanation:

  • ^ - start of string (not necessary when used with .matches(), but not harmful)
  • ([^#]*#[^-_]*) - Group 1: any 0+ characters other than # ([^#]*), followed by # and then any 0+ characters other than - and _ ([^-_]*)
  • [-_] - a single separator, either - or _
  • (.*)$ - Group 2: match 0+ characters other than a newline (since DOTALL mode is not specified) up to the end of string BUT...
  • (?<=(?<!\.ok)\.(?:docx?|odt)$) - after reaching the end, check that there is .doc, .docx or .odt at the end (see (?<=\.(?:docx?|odt)$)) and that it is not preceded by .ok (see (?<!\.ok)). In PCRE, these conditions would have to be split; Java regex seems to cope with alternations inside the lookbehind.
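The end-of-string condition can also be tried on its own, which may make it easier to see what the lookbehind contributes. The following is only a minimal sketch: the class name ExtensionCheck, the method hasAllowedEnding, and the short file names are made up for illustration. Note that it uses the group (?:docx?|odt) rather than the [doc|odt|docx] from the question, since square brackets form a character class that matches just one of the listed characters.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class ExtensionCheck {
    // Ends with .doc, .docx or .odt, and that extension is not preceded by ".ok".
    private static final Pattern END_CHECK =
            Pattern.compile("(?<!\\.ok)\\.(?:docx?|odt)$");

    static boolean hasAllowedEnding(String fileName) {
        // find() searches anywhere in the input; the trailing $ pins the match to the end.
        return END_CHECK.matcher(fileName).find();
    }

    public static void main(String[] args) {
        System.out.println(hasAllowedEnding("report.doc"));         // true
        System.out.println(hasAllowedEnding("report.ok.docx"));     // false
        System.out.println(hasAllowedEnding("report.ok.erto.odt")); // true
        System.out.println(hasAllowedEnding("report.txt"));         // false
    }
}
```

A ".ok" that appears earlier in the name is fine; only a ".ok" directly in front of the final extension is rejected.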

A lookahead-based alternative:

^([^#]*#[^-_]*)[-_](?=.*(?<!\.ok)\.(?:docx?|odt)$)(.*)$

See the regex101 demo. It is the same, but all the end-of-string checks are done after matching the - or _.
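For completeness, here is a small Java sketch of the lookahead-based pattern, using the sample file name from the question. The class name LookaheadDemo and the method split are hypothetical and exist only for this illustration; since the separator is not captured here, there are just two groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, used only to exercise the lookahead-based pattern.
class LookaheadDemo {
    private static final Pattern P = Pattern.compile(
            "^([^#]*#[^-_]*)[-_](?=.*(?<!\\.ok)\\.(?:docx?|odt)$)(.*)$");

    // Returns {group 1, group 2} on a match, or null when the name is rejected.
    static String[] split(String fileName) {
        Matcher m = P.matcher(fileName);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = split("CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.doc");
        System.out.println(parts[0]); // CRS-ISAU-RPV#3430
        System.out.println(parts[1]); // Dedalus_Conc.ok.erto_AOTreviglio.doc
        // The ".ok.odt" ending fails the lookahead, so nothing is returned.
        System.out.println(split("CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.ok.odt") == null); // true
    }
}
```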

See the Java demo:

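import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo {
    public static void main(String[] args) {
        List<String> strs = Arrays.asList("CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.doc",
                "CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.docx",
                "CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.odt",
                "CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.ok.docx",
                "CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.ok.odt"
        );
        // Compile the pattern once instead of on every iteration; here the separator is captured as group 2
        Pattern p = Pattern.compile("^([^#]*#[^-_]*)([-_])(.*)$(?<=(?<![.]ok)[.](?:docx?|odt)$)");
        for (String str : strs) {
            System.out.println("----------\nMatching: " + str);
            Matcher m = p.matcher(str);
            if (m.matches()) {
                System.out.println(m.group(1)); // everything up to the first separator after #
                System.out.println(m.group(2)); // the separator itself
                System.out.println(m.group(3)); // the rest of the file name
            } else { System.out.println("No match"); }
        }
    }
}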

3 Comments

Appreciate your answer. It misses one thing from the question, though: "I must have exactly three (3) _ or - characters combined." This means that, taken together, there have to be exactly three _ or - characters after the #. In between these characters there will be strings of any sort, BUT not containing #. The sample file name shows three _ characters, but you can replace any _ with - and it should still match.
Maybe this? ^([^#\n]*#[^-_\n]*)[-_](?=(?:[^-_\n]*[_-]){3}[^-_\n]*$)(?=.*(?<!\.ok)\.(?:docx?|odt)$)(.*)$ (for testing) and "^([^#]*#[^-_]*)[-_](?=(?:[^-_]*[_-]){3}[^-_]*$)(?=.*(?<![.]ok)[.](?:docx?|odt)$)(.*)$" (for using in code).
I solved it with the help of your answer. I will mark your answer as correct :) Thanks a lot!
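A note on the separator count in the pattern suggested in the comments: it consumes one separator with [-_] and then asks the lookahead for {3} more, which appears to require four separators after the # rather than three. The sketch below compares the pattern as posted with an assumed adjustment to {2}. The class name SeparatorCount, its fields, and the extra test file names are hypothetical, and the adjusted pattern is not necessarily what the asker ended up using.

```java
import java.util.regex.Pattern;

// Hypothetical class and field names, used only for illustration.
class SeparatorCount {
    // Pattern as posted in the comment: {3} after one separator is already consumed.
    static final Pattern AS_POSTED = Pattern.compile(
            "^([^#]*#[^-_]*)[-_](?=(?:[^-_]*[_-]){3}[^-_]*$)(?=.*(?<![.]ok)[.](?:docx?|odt)$)(.*)$");

    // Assumed adjustment: {2}, so that 1 consumed + 2 in the lookahead = 3 separators.
    static final Pattern ADJUSTED = Pattern.compile(
            "^([^#]*#[^-_]*)[-_](?=(?:[^-_]*[_-]){2}[^-_]*$)(?=.*(?<![.]ok)[.](?:docx?|odt)$)(.*)$");

    static boolean accepts(Pattern p, String fileName) {
        return p.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String sample = "CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.doc";
        String mixed = "CRS-ISAU-RPV#3430-Dedalus_Conc.ok.erto-AOTreviglio.doc";
        String four = "CRS-ISAU-RPV#3430_Dedalus_Conc_ok_AOTreviglio.doc";
        System.out.println(accepts(AS_POSTED, sample)); // false
        System.out.println(accepts(ADJUSTED, sample));  // true
        System.out.println(accepts(ADJUSTED, mixed));   // true
        System.out.println(accepts(ADJUSTED, four));    // false
        System.out.println(accepts(AS_POSTED, four));   // true
    }
}
```

Neither variant rejects a second # after the first one, so that part of the requirement would need a separate check.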
