Count regex matches

Question

Your task

Given a simple regular expression, you have to count how many strings of length n have a match of length n with the given simple regex. This will just be a subset of regexs. Like, no lookaheads or named groups or recursion or whatever weird things regexs have.

Simple regular expression

For the purposes of this challenge, a regex is said to be simple if, for one, it only contains characters in the ASCII range 32-126. Furthermore, it should only use the following functionalities:

match literal characters, much like the regex abc would only match the string "abc";
match options, like abc|def would match "abc" and "def";
match exactly 0 or 1 occurrence of something, e.g. https? matches "http" and "https";
match 1 or more occurrences of something, e.g. ah+ would match "ah", "ahh", "ahhh", "ahhhh", etc;
match any amount of occurrences of something, e.g. 1* matches "", "1", "11", "111", "1111", etc;
match between n and m occurrences of something, e.g. lo{1,4}l matches only "lol", "lool", "loool" and "looool". If n is ommited, it matches up to m occurrences. If m is ommited, it matches at least n occurrences. Assume at least one of n or m is present;
use () to group, e.g. ab(c|d)ef would match "abcef" and "abdef" (c.f. 2nd item in this list) or (10)+ would match "10", "1010", "101010", "10101010", etc;
use . to match any character (in the ASCII range [32, 126]), so ab. would match "abc", "ab9", "ab)", etc;
use \ to escape the special meaning of a character, e.g. ab? would match "a" and "ab", while ab\? only matches "ab?";
use [] as a group of possible characters. Inside the brackets, all characters lose their special behaviours, except for - and \. This means that, for one, ab[cde] is shorthand for ab(c|d|e) and secondly, ab[?+*] matches "ab?", "ab+" and "ab*"; also related to []:
use - to specify a character range within brackets. The ranges you have to support are a-z, A-Z and 0-9, as well as their subsets, like h-z or 3-8. E.g., the regex ab[c-g] matches "abc", "abd", "abe", "abf" and "abg"; Note that - has no special meaning outside of [] so a-z would only match "a-z".

Input

The input for your program/function/routine/etc should be a string representing the regex and an integer n. For the regex, you can further assume:

all characters that show up are in the ASCII range [32, 126]
if {n,m} is used, then \$n \leq m \$
if - is used inside [] then the specified range is well-formed

Output

The number of strings of length n that match the given regex. You only have to account for characters in the ASCII range [32, 126].

Test cases

".*", 0 -> 1
".*", 1 -> 95
".*", 2 -> 9025
".*", 3 -> 857375
".*", 4 -> 81450625
"abc", 2 -> 0
"abc", 4 -> 0
"ab|ac|ad", 2 -> 3
"a(b|c)", 2 -> 2
"hell(o|oo)", 5 -> 1
"https?", 5 -> 1
"ho{1,4}ly", 6 -> 1
"ho{3,}ly", 137 -> 1
"[abcde]{,2}", 2 -> 25
"(10)+", 7 -> 0
"(10)+", 8 -> 1
"ab\?", 3 -> 1
"[t7]h[i1][s5] is c[0o]d[Ee3] g[0oO][l1L]f", 17 -> 432
"\+351 9[1236] [0-9]{3,3} [0-9]{2,2} [0-9][0-9]", 17 -> 40000000
"-", 1 -> 1
"\\", 1 -> 1
"[+?*]", 1 -> 3
"Abc([d-z]*|(.H)+)", 11 -> 5132188812826241
"ab|ab", 2 -> 1
".(.(.(.(.|a))))|hello", 5 -> 7737809375

This is code code-golf so shortest solution in bytes, wins. If you like this challenge, consider upvoting it... And happy golfing!

Suggested test case: one containing {n:m} without n (i.e. "[abcde]{:2}", 2 -> 25) — Kevin Cruijssen
– Kevin Cruijssen, Commented Feb 13, 2020 at 9:30
Was the notion of simple regex an attempt at allowing someone to solve this by other means than brute force? (If yes, I think it's still much too complicated for that, as almost all key features of regular expressions are included.) — Arnauld
– Arnauld, Commented Feb 13, 2020 at 14:32

Community · Accepted Answer · 2020-06-17 09:04:33Z

Java 10, 356 342 284 275 233 225 220 bytes

import java.util.*;List S;r->k->{S=new Stack();p("",k);k=0;for(var p:S)k+=(p+"").matches(r.replaceAll("\\{(,\\d+\\})","{0$1"))?1:0;return k;}void p(String p,int k){if(k<1)S.add(p);else for(char i=32;i<127;)p(p+i++,k-1);}

Works, but way too slow for \$n\geq4\$.

Try it online.

Explanation:

import java.util.*; // Required import for Stack and List
List S;             // List on class-level, uninitialized

k->r->{             // Method with integer and String parameter and integer return-type
  S=new Stack();    //  Create a new List
  p("",k);          //  Put all permutations of length `k` consisting of printable ASCII
                    //  characters in this list
  k=0;              //  Re-use `k` as counter-integer, starting at 0
  for(var p:S)      //  Loop over all permutations in the List:
    k+=             //   Increase the counter by:
       (p+"")       //    Convert the permutation from Object to String
       .matches(r   //    If it matches the input-regex,
         .replaceAll("\\{(,\\d+\\})","{0$1"))?
                    //    after we've replaced all {,m} with {0,m}
          1         //     Increase the counter by 1
       :            //    Else:
          0;        //     Leave the count the same by increasing with 0
  return k;}        //  And return the counter as result

void p(             // Separated method with two parameters and no return-type, where:
    String p,       //   `p` is the prefix-String, starting empty
    int k){         //   `k` is the length the permutations we want to generate
  if(k<1)           //  If `k` is 0:
    S.add(p);       //   Add the current prefix-String to the List
  else              // Else:
    for(char i=32;i<127;)
                    //  Loop over the printable ASCII characters:
      p(            //   And do a recursive call, with:
        p+i++,      //    The prefix-String appended with the current character
        k-1);}      //    And `k` decreased by 1

Some notes:

Java's regex only supports {n,m} and {n,}, since {,m} could be written as {0,m}.
And Java's String#matches builtin implicitly adds a leading ^ and trailing $ to check the entire String.

Will leave a comment appearing to be commenting on the answer but actually end up being a meta continuation of a now existent joke. — lyxal
– lyxal ♦, Commented Feb 13, 2020 at 10:09
@RGS I doubt it.. Java's regex uses {n,m} instead of {n:m}, so I have to account for that. In addition, it only supports {n,} and {n,m} (since {,m} can be {0,m}). So some bytes are already lost there in comparison to your answer. And Python just has better/shorter builtins in general tbh.. :) — Kevin Cruijssen
– Kevin Cruijssen, Commented Feb 13, 2020 at 12:14

Community · Accepted Answer · 2020-06-17 09:04:33Z

3

JavaScript (Node.js), 86 bytes

Another brute-force solution taking input as (n)(regex).

n=>g=(p,s='',c=127)=>n*!s[n-1]?--c>>5&&g(p,s+Buffer([c]))+g(p,s,c):!!s.match(`^${p}$`)

Try it online!

Commented

n =>                      // n = target string length
  g = (                   // g is a recursive function taking:
    p,                    //   p = regex pattern
    s = '',               //   s = current string
    c = 127               //   c = current ASCII code
  ) =>                    //
    n * !s[n - 1] ?       // if n is not equal to 0 and s is less than n char. long:
      --c >> 5 &&         //   decrement c; if it's greater than or equal to 32:
      g(                  //     do a 1st recursive call where:
        p,                //
        s + Buffer([c])   //       the character with ASCII code c is added to s
                          //       c is not passed so that it's reset to 127
      ) +                 //
      g(                  //     do a 2nd recursive call where:
        p, s, c           //       this value of c is discarded
      )                   //
    :                     // else:
      !!s.match(`^${p}$`) //   return true (1) if s matches p, or false (0) otherwise

edited Jun 17, 2020 at 9:04

CommunityBot

1

answered Feb 13, 2020 at 14:26

Arnauld

206k21 gold badges187 silver badges670 bronze badges

\$\begingroup\$ +1 Looking good! I see you also had to do the ^_$ tweak :) \$\endgroup\$

RGS
– RGS

2020-02-13 14:30:20 +00:00
Commented Feb 13, 2020 at 14:30
1

\$\begingroup\$ I have the feeling you can contribute so many tips to the JavaScript tips page xD I.e. the Buffer([c]) to convert from integer to character. PS: Why do codegolfers always like to obscure their code? ;) The >>5 can simply be >31 (which imo makes it a lot more readable). Not that it matters here, since the byte-count remains the same, but stil. \$\endgroup\$

Kevin Cruijssen
– Kevin Cruijssen

2020-02-13 15:00:48 +00:00
Commented Feb 13, 2020 at 15:00
1

\$\begingroup\$ @KevinCruijssen Buffer() doesn't exist in JS. It's a Node thing. So maybe we need a Node tips page, but I'm not sure there are enough Node-specific tricks to justify it. Or maybe we should include Node in Tips for Golfing in ECMAScript 6 and above. \$\endgroup\$

Arnauld
– Arnauld

2020-02-13 15:43:55 +00:00
Commented Feb 13, 2020 at 15:43

Add a comment |

RGS · Accepted Answer · 2020-02-13 19:24:18Z

3

Python 3, 139 137 129 bytes

Saved 2 naughty bytes thanks to @Kevin, then 6 thanks to @ovs and then 2 more by rewriting f, thanks to @wilkben!

lambda r,n:sum(1for s in map(''.join,i.product(map(chr,range(32,127)),repeat=n))if re.match(f"^{r}$",s))
import re,itertools as i

You can try the smaller test cases online. My solution generates all strings of length n with the generator g and then tries to match the regex to the whole string. If the test case is the regex r, I use the regex ^r$ to force the match to be on the whole string.

edited Feb 13, 2020 at 19:24

answered Feb 13, 2020 at 11:53

RGS

14.2k1 gold badge62 silver badges117 bronze badges

1

\$\begingroup\$ This fails on any test case that uses the {n:m} syntax, since Python regexes don't have that (see the [abcde]{:2}, 2 -> 0 X in the results of your test suite). \$\endgroup\$

Grimmy
– Grimmy

2020-02-13 12:17:09 +00:00
Commented Feb 13, 2020 at 12:17
1

\$\begingroup\$ @Grimmy you are absolutely right! My problem was that I wrote the challenge believing the syntax I knew was {n:m} and didn't even check it. Since there is only one other answer that also benefits from me correcting the syntax to {n,m} which was what I intended, I edited the OP. Thanks for being on the lookout! Now the test cases are fine :) \$\endgroup\$

RGS
– RGS

2020-02-13 12:36:18 +00:00
Commented Feb 13, 2020 at 12:36
1

\$\begingroup\$ g can be 6 bytes shorter with some rearrangements: g=lambda n:[""][n:]or(s+chr(c+32)for c in range(95)for s in g(n-1)). \$\endgroup\$

ovs
– ovs

2020-02-13 16:15:00 +00:00
Commented Feb 13, 2020 at 16:15
\$\begingroup\$ -8 bytes using itertools lambda r,n:sum(1for s in map(''.join,i.product(map(chr,range(32,127)),repeat=n))if re.match(f"^{r}$",s)) import re,itertools as i \$\endgroup\$

wilkben
– wilkben

2020-02-13 18:32:07 +00:00
Commented Feb 13, 2020 at 18:32

Add a comment |

Humpty Dumpty · Accepted Answer · 2020-02-13 16:53:56Z

1

perl -E, 79 bytes

($,,$")=@ARGV;@@="";@@=map{$,=$_;map$,.chr,32..126}@@for 1..$,;say~~grep/$"/,@@

answered Feb 13, 2020 at 16:53

Humpty Dumpty

111 bronze badge

1

\$\begingroup\$ Welcome to this community! Can you please include a tio.run link, so that people can check your code passes the test cases? :) \$\endgroup\$

RGS
– RGS

2020-02-13 17:12:11 +00:00
Commented Feb 13, 2020 at 17:12

Add a comment |

Stack Exchange Network

Count regex matches

Your task

Simple regular expression

Input

Output

Test cases

4 Answers 4

Java 10, 356 342 284 275 233 225 220 bytes

JavaScript (Node.js), 86 bytes

Commented

Python 3, 139 137 129 bytes

perl -E, 79 bytes

Your Answer

Linked

Hot Network Questions

Count regex matches

Your task

Simple regular expression

Input

Output

Test cases

4 Answers 4

Java 10, 356 342 284 275 233 225 220 bytes

JavaScript (Node.js), 86 bytes

Commented

Python 3, 139 137 129 bytes

perl -E, 79 bytes

Your Answer

Sign up or log in

Post as a guest

Linked

Related

Hot Network Questions