1

I need a regular expression that matches a JavaScript string literal: a string enclosed in single quotes, in which any single quote inside the string must be escaped with a backslash.

Examples of strings I want to match:

'abcdefg'
'abc\'defg'
'abc\'de\'fg'
1
  • Can the string contain anything other than alpha chars and escaped single-quote? Commented Dec 10, 2012 at 11:23

4 Answers

2

This regex matches every valid JavaScript string literal surrounded by single quotes (') and rejects all invalid ones. Note that strict mode is assumed.

/'(?:[^'\\\n\r\u2028\u2029]|\\(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4})|\\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029]))*'/

Or a shorter version:

/'(?:[^'\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*'/

The regex above is based on the definition of StringLiteral (ignoring the double-quoted version) in the ECMAScript Language Specification, 5.1 Edition, published in June 2011.
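As a quick sanity check, here is a minimal sketch that runs the shorter regex against the strings from the question plus two malformed ones. The `^` and `$` anchors, the variable names, and the two extra samples are assumptions added for illustration; they are not part of the answer's regex.

```javascript
// Shorter regex from the answer, anchored (an assumption) so the whole
// input must be exactly one single-quoted literal.
const singleQuoted = /^'(?:[^'\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*'$/;

// Backslashes are doubled here so each string holds the literal source text.
const samples = [
  "'abcdefg'",
  "'abc\\'defg'",
  "'abc\\'de\\'fg'",
  "'abc'def'",   // unescaped quote in the middle
  "'abc\\x4'",   // incomplete hex escape
];

const results = samples.map((s) => singleQuoted.test(s));
console.log(results); // [ true, true, true, false, false ]
```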

The regex for a JavaScript string literal surrounded by double quotes (") is almost the same:

/"(?:[^"\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*"/

Let's dissect the monster (the longer version, since it is a direct translation of the grammar):

  • A StringLiteral (ignoring the double-quoted version) starts and ends with ', as can be seen in the regex. Between the quotes is an optional sequence of SingleStringCharacter, which explains the *: zero or more characters.

  • SingleStringCharacter is defined as:

    SingleStringCharacter ::
           SourceCharacter but not one of ' or \ or LineTerminator
           \ EscapeSequence
           LineContinuation
    

    [^'\\\n\r\u2028\u2029] corresponds to the first rule

    \\(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}) corresponds to the second rule

    \\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029]) corresponds to the third rule

  • Let's look at the first rule: SourceCharacter but not one of ' or \ or LineTerminator. This first rule deals with "normal" characters.

    SourceCharacter is any Unicode code unit.

    LineTerminator is Line Feed <LF> (\u000A or \n), Carriage Return <CR> (\u000D or \r), Line Separator <LS> (\u2028) or Paragraph Separator <PS> (\u2029).

    So we will just use a negative character class to represent this rule: [^'\\\n\r\u2028\u2029].

  • For the second rule, which deals with escape sequences, you can see \ before EscapeSequence, as it appears in the regex. As for EscapeSequence, this is its grammar (strict mode):

    EscapeSequence ::
            CharacterEscapeSequence
            0 [lookahead ∉ DecimalDigit]
            HexEscapeSequence
            UnicodeEscapeSequence
    

    ['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9] is the regex for CharacterEscapeSequence. It can actually be simplified to [^\n\r\u2028\u2029xu0-9]

    The first part is SingleEscapeCharacter, which includes ', ", \, and for control characters b, f, n, r, t, v.

    The second part is NonEscapeCharacter, which is SourceCharacter but not one of EscapeCharacter or LineTerminator. EscapeCharacter is defined as SingleEscapeCharacter, DecimalDigit or x (for hex escape sequence) or u (for unicode escape sequence).

    0(?![0-9]) is the regex for the second rule of EscapeSequence. It specifies the null character \0.

    x[0-9a-fA-F]{2} is the regex for HexEscapeSequence

    u[0-9a-fA-F]{4} is the regex for UnicodeEscapeSequence

  • The third rule deals with strings that span multiple lines. Let's look at the grammar of LineContinuation and the related productions:

    LineContinuation ::
            \ LineTerminatorSequence
    
    LineTerminatorSequence :: 
            <LF> 
            <CR> [lookahead ∉ <LF> ]
            <LS>
            <PS>
            <CR> <LF>
    

    \\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029]) corresponds to the above grammar.
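The three alternatives dissected above can also be exercised one at a time. The sketch below is only an illustration: the names `plainChar`, `escapeSequence` and `lineContinuation` are hypothetical labels, and the `^`/`$` anchors are added so that each piece must match the whole test input.

```javascript
// Hypothetical names; each regex is one alternative of the long version,
// anchored so it has to match the entire input.
const plainChar = /^[^'\\\n\r\u2028\u2029]$/;
const escapeSequence = /^\\(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4})$/;
const lineContinuation = /^\\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029])$/;

// Each row: [regex, input, expected result of regex.test(input)]
const cases = [
  [plainChar, "a", true],
  [plainChar, '"', true],             // a double quote is an ordinary character here
  [plainChar, "'", false],
  [plainChar, "\\", false],
  [plainChar, "\n", false],           // raw line terminator
  [escapeSequence, "\\n", true],      // SingleEscapeCharacter
  [escapeSequence, "\\q", true],      // NonEscapeCharacter
  [escapeSequence, "\\0", true],      // null character
  [escapeSequence, "\\01", false],    // 0 followed by a decimal digit
  [escapeSequence, "\\x41", true],
  [escapeSequence, "\\x4", false],    // needs exactly two hex digits
  [escapeSequence, "\\u0041", true],
  [escapeSequence, "\\u41", false],   // needs exactly four hex digits
  [lineContinuation, "\\\n", true],   // backslash + <LF>
  [lineContinuation, "\\\r\n", true], // backslash + <CR><LF>
  [lineContinuation, "\\\u2028", true],
  [lineContinuation, "\\n", false],   // backslash + letter n is an escape, not a continuation
];

const failures = cases.filter(([re, input, expected]) => re.test(input) !== expected);
console.log(failures.length === 0 ? "all cases behave as expected" : failures);
```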


3 Comments

Umm, this seems weird, because those regexes allow var str = ''' or var str = """ but they shouldn't... :/
Okay so, after digging a bit, var str = ''' is allowed because your two regexes don't have ^ and $ to specify start and end. So, this one is "better" I think: ^"(?:[^"\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*"$|^'(?:[^'\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*'$ It handles both '' and "" strings :)
@Leiko: ''' cannot be matched by the regex above. You probably get a match because the leading '' is matched as an empty string literal. Whether to add anchors depends on your use case.
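The behaviour discussed in these comments can be reproduced with a short sketch. It assumes the shorter single-quote regex from the answer; the variable names are made up for the example.

```javascript
// Shorter regex from the answer, without and with anchors.
const unanchored = /'(?:[^'\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*'/;
const anchored = /^'(?:[^'\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*'$/;

const input = "'''";

// Unanchored: the regex finds the empty literal '' at the start of the input.
const found = unanchored.exec(input);
console.log(found[0], found.index); // '' 0

// Anchored: the whole input has to be one literal, so ''' is rejected.
console.log(anchored.test(input)); // false
console.log(anchored.test("''"));  // true
```

Unanchored matching suits scanning source text for literals, while the anchored form suits validating a single candidate string.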
0

Try this one:

/'(?:[^'\\]|\\'|\\(?!'))*'/

Test it in your console:

/'(?:[^'\\]|\\'|\\(?!'))*'/.exec("'abc\\\'de\\\'fg'")

It'll match an opening ', then a closing ', with any number of the following in between:

  • a character that is NOT ' or \, or
  • \' (an escaped quote), or
  • \ not followed by '

If you want it to match the entire string, use the ^ start-of-string and $ end-of-string markers:

/^'(?:[^'\\]|\\'|\\(?!'))*'$/

... which will match 'string' and 'string\'s are awesome', but not 'string's are awesome' or 'string's.
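Here is a small sketch that checks those four strings against the anchored regex. The last sample is an extra assumption, not taken from the answer: it shows that a literal ending in an escaped backslash, which is valid JavaScript, is rejected because the pattern treats the final \' as an escaped quote.

```javascript
const simple = /^'(?:[^'\\]|\\'|\\(?!'))*'$/;

// Each row: [input, expected result]. Backslashes are doubled in the
// JavaScript source so that each input holds a single literal backslash.
const checks = [
  ["'string'", true],
  ["'string\\'s are awesome'", true],
  ["'string's are awesome'", false],
  ["'string's", false],
  ["'a\\\\'", false], // the text 'a\\' is a valid literal, yet it is rejected
];

for (const [input, expected] of checks) {
  console.log(input, simple.test(input), "expected:", expected);
}
```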

1 Comment

Good catch, fixed the answer. Now just one * so it doesn't backtrack, and matches an empty string. Also changed to a non-capturing group. Cheers.
0

It's not that hard...

You also need to handle other escape sequences such as \n, \r and \\. Breaking a line inside the literal without escaping it is not valid in JavaScript; you must use the \n sequence.

/^'([^\\'\n\r]|\\'|\\n|\\r|\\\\)*'$/

In execution:

var sample = ["'abcdefg'", // Valid
              "'abc\\'defg'", // Valid
              "'abc\\'de\\'fg'", // Valid
              "'abc\\'\\r\\nde\\'fg'", // Valid
              "'abc\\'\r\nde\\'fg'", // Invalid
              "'abc'def'" // Invalid
             ];
for(var i = 0; i < sample.length; i++)
    console.log(sample[i].match( /^'([^\\'\n\r]|\\'|\\n|\\r|\\\\)*'$/ ));
  1. ^ tells the matcher that what follows must match at the beginning of the string
  2. ' matches the opening ' delimiter
  3. ( opens a group
  4. [^\\'\n\r] matches any character other than \ and ', and also excludes the line-break characters \n and \r
  5. | if the alternative to its left did not match, the alternative to its right is tried
  6. \\' matches \'
  7. |\\n or matches the two-character sequence \n
  8. |\\r or matches the two-character sequence \r
  9. |\\\\ or matches the two-character sequence \\
  10. )* closes the group and lets it repeat any number of times, including zero (an empty string, for example)
  11. ' matches the closing ' delimiter
  12. $ tells the matcher that this must be the end of the string

8 Comments

But the string 'abc'def' is matched even though it is not valid.
Still easy to fix: I've changed the answer a little by adding ^ and $, so malformed strings no longer match.
This undergenerates the possible string literals in JavaScript. Another problem is performance: * after [^\\'] can be removed to reduce backtracking.
Actually, I disagree that [^\\']* will cause a performance issue, because it matches every character up to a ' or \ char. For the string '12345', [^\\']* matches 5 chars in a row (5 ticks), then fails (1 tick), then tries \\', which also fails (1 tick), and leaves the group after about 7 ticks. Without the *, it would match 1 char (1 tick), process the group (1 tick), match the next char, and so on 5 times, leaving the group after about 12 ticks.
@JoséRobertoAraújoJúnior: The performance problem happens when the input string fails to match. jsperf.com/regex-star-in-star
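The point about failing input can be explored with a rough sketch like the one below. Both patterns are simplified assumptions modelled on the regexes discussed in these comments, not the exact ones from the answer's edit history, and the timings depend entirely on the engine and machine, so none are claimed here.

```javascript
// Nested quantifier: [^\\']* inside (...)*, the shape the comments discuss.
const nested = /^'([^\\']*|\\')*'$/;
// One character per iteration of the group.
const flat = /^'([^\\']|\\')*'$/;

const good = "'abc\\'def'";       // well-formed literal
const bad = "'" + "a".repeat(18); // closing quote missing, so the match fails

// Hypothetical helper: runs one test and reports the result and elapsed time.
function timed(re, input) {
  const start = Date.now();
  const matched = re.test(input);
  return { matched, ms: Date.now() - start };
}

console.log("nested, good:", timed(nested, good));
console.log("flat,   good:", timed(flat, good));
console.log("nested, bad: ", timed(nested, bad));
console.log("flat,   bad: ", timed(flat, bad));
```

Both patterns accept and reject the same two inputs. Any difference shows up only in how much backtracking the engine does before giving up on the failing input, and that work grows quickly for the nested form as the run of ordinary characters gets longer.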
-2

Try this

/^'([a-z]*(?:\')?[a-z])+'$/

See example here

str = 'abc\'de\'fg';

match = str.match(/^([a-z\\']+)$/g);

Tested in Firebug console. Works with or without the escape chars.

7 Comments

This is not a JavaScript RegEx
Then why do the tags say javascript and regex?
I meant your answer is not a JavaScript compatible RegEx, returns null. JavaScript regex's have slightly different notation than regex for other languages.
Javascript string can't contain numbers ? Why [a-z] only ? Plus, this regex allows var str = '''
@Leiko I initially asked the OP if the string could contain anything other than alpha chars and backslashed single quotes but didn't receive a reply, therefore I took the provided examples at face value. Also, var str = ''' is not a valid string unless you escape the middle quote.
