feat($parse): support unicode identifier names #3848

clee704 · 2013-09-03T10:52:58Z

Support Unicode identifier names as defined in Section 7.6, Identifier Names
and Identifiers, of the ECMAScript Language Specification
(http://www.ecma-international.org/ecma-262/5.1/#sec-7.6),
except for Unicode escape sequences, which are hard to implement
without changing too much existing code.

Closes #3847

clee704 · 2013-09-03T10:55:04Z

FYI, the Unicode code points were generated by this Python 2 script:

import re
import unicodedata

def ranges(arr):
  intervals = []
  start = None
  prev = None
  for n in arr:
    if start is None:
      start = n
    elif prev + 1 != n:
      if start == prev:
        intervals.append(start)
      else:
        intervals.append((start, prev - start + 1))
      start = n
    prev = n
  if start == prev:
    intervals.append(start)
  else:
    intervals.append((start, prev - start + 1))
  ret = {}
  for r in intervals:
    if isinstance(r, tuple):
      start, length = r
      if length in ret:
        ret[length].append(start - sum(ret[length]))
      else:
        ret.setdefault(length, []).append(start)
    else:
      if 1 in ret:
        ret[1].append(r - sum(ret[1]))
      else:
        ret.setdefault(1, []).append(r)
  for length in ret:
    codes = ret[length]
    if len(codes) == 1:
      ret[length] = codes[0]
  return ret

def codes(*categories):
  if not hasattr(codes, '_map'):
    codes._map = {}
    for i in range(0x10000):
      codes._map.setdefault(unicodedata.category(unichr(i)), []).append(i)
  ret = []
  for cat in categories:
    ret.extend(codes._map[cat])
  ret.sort()
  return ret

def dump(ranges, indent=2, starting_indent=2, line_limit=100):
  indent_level_0 = ' ' * starting_indent
  indent_level_1 = ' ' * (starting_indent + indent)
  indent_level_2 = ' ' * (starting_indent + indent * 2)
  temp = ['{\n']
  prev_linelen = None
  for length in sorted(ranges.keys()):
    line = ''.join(str(length) + ': ' + str(ranges[length]))
    if prev_linelen is None or prev_linelen + 2 + len(line) >= line_limit:
      if prev_linelen is not None:
        temp.append(',\n')
      prev_linelen = len(indent_level_1) + len(line)
      if prev_linelen >= line_limit:
        temp.append(indent_level_1 + str(length) + ': [\n')
        temp2 = []
        prev_linelen2 = None
        for code in ranges[length]:
          if prev_linelen2 is None or prev_linelen2 + 2 + len(str(code)) >= line_limit:
            if prev_linelen2 is not None:
              temp2.append(',\n')
            temp2.append(indent_level_2 + str(code))
            prev_linelen2 = len(indent_level_2) + len(str(code))
          else:
            temp2.append(', ' + str(code))
            prev_linelen2 += 2 + len(str(code))
        temp.extend(temp2)
        temp.append('\n' + indent_level_1 + ']')
      else:
        temp.append(indent_level_1 + line)
    else:
      temp.append(', ')
      temp.append(line)
      prev_linelen += 2 + len(line)
  temp.append('\n' + indent_level_0 + '}')
  return ''.join(temp)

# See http://www.ecma-international.org/ecma-262/5.1/#sec-7.6
UnicodeLetter = dump(ranges(codes('Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl')))
UnicodeCombiningMark = dump(ranges(codes('Mn', 'Mc')))
UnicodeDigit = dump(ranges(codes('Nd')))
UnicodeConnectorPunctuation = dump(ranges(codes('Pc')))

print '  var letterRanges =', UnicodeLetter + ';'
print '  var combiningMarkRanges =', UnicodeCombiningMark + ';'
print '  var digitRanges =', UnicodeDigit + ';'
print '  var connectorPunctuationRanges =', UnicodeConnectorPunctuation + ';'

def load(d):
  return eval(re.sub(r'([0-9]+):', r'"\1":', d))

def test(ranges, *categories):
  arr = [False] * 0x10000
  for length in ranges:
    codes = ranges[length]
    if not isinstance(codes, list):
      codes = [codes]
    start = None
    for inc in codes:
      start = inc if start is None else start + inc
      end = start + int(length)
      for code in range(start, end):
        arr[code] = True
  for i, truth in enumerate(arr):
    assert truth == (unicodedata.category(unichr(i)) in categories)

# test
print 'Running self-tests...'
test(load(UnicodeLetter), 'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl')
test(load(UnicodeCombiningMark), 'Mn', 'Mc')
test(load(UnicodeDigit), 'Nd')
test(load(UnicodeConnectorPunctuation), 'Pc')
print 'OK'
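
For readers decoding the generated tables: judging from ranges() and test() above, each table maps a run length to the start points of the runs of that length, and every start after the first is stored as a difference from the previous start. The following is a minimal Python 3 sketch of that scheme, not the script itself. The names encode and decode are hypothetical, and unlike the script it always keeps lists instead of collapsing a single-element list to a bare number.

```python
def encode(code_points):
    """Delta-encode runs of consecutive code points, keyed by run length.

    Assumes code_points is sorted and free of duplicates.
    """
    runs = []  # list of (start, length) pairs
    for cp in code_points:
        if runs and runs[-1][0] + runs[-1][1] == cp:
            # cp extends the current run by one
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((cp, 1))

    table = {}
    last_start = {}  # previous run start seen for each length
    for start, length in runs:
        # The first entry is absolute; later entries are offsets
        # from the previous start of the same run length.
        table.setdefault(length, []).append(start - last_start.get(length, 0))
        last_start[length] = start
    return table


def decode(table):
    """Expand an encoded table back into the set of code points."""
    members = set()
    for length, deltas in table.items():
        start = 0
        for delta in deltas:
            start += delta
            members.update(range(start, start + length))
    return members


sample = [65, 66, 67, 97, 98, 99, 170, 181]
print(encode(sample))  # {3: [65, 32], 1: [170, 11]}
```

Grouping by run length lets the many runs of equal length share one key, and the offsets keep most numbers short, which is presumably what keeps the generated JavaScript literals small.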

clee704 · 2013-09-03T13:22:06Z

Sorry for repeatedly closing and reopening this. Travis is behaving strangely. Everything passes locally, and one of the builds has actually passed; the problem is that there are two builds for the same build number.

clee704 · 2013-09-09T06:11:29Z

The overhead of this feature is now 3.2 KB for minified code and 1.9 KB for minified and gzipped code, which is about a 6% increase over the master branch.

petebacondarwin · 2013-09-17T10:36:48Z

The file size increase is a concern. I wonder if one could ship an "international" version of the library that supported unicode identifiers? @IgorMinar - what do you think?

petebacondarwin · 2013-09-17T10:38:25Z

@clee704 - Can you confirm that you have signed the CLA? Thanks.

clee704 · 2013-09-17T11:32:44Z

@petebacondarwin Yes, I have. My name is Choongmin Lee.

clee704 · 2013-09-17T11:40:21Z

If the file size is a concern, maybe we could make a separate module for i18n and put this code there. I guess that would not increase the size of $parse much, though I'm not sure, as I'm completely new to how Angular is built.

clee704 · 2013-09-17T11:54:05Z

src/ng/parse.js

This is not strictly correct, although it was not correct before this PR either. After a dot character a new identifier should start, so after a dot we should check for isIdent() instead of ch == '.' || isIdentPart(ch) || isNumber(ch). If this eventually gets into core, I'll fix it.
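
To illustrate the point about dots, here is a minimal Python sketch of a scanner that only accepts a dot when an identifier start follows it. It is not the parse.js code: read_path, is_ident_start and is_ident_part are hypothetical names, the character classes follow the ES5.1 Section 7.6 categories used in the script above, and Unicode escape sequences are left out, as in this PR.

```python
import unicodedata

LETTER_CATEGORIES = {'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl'}
PART_ONLY_CATEGORIES = {'Mn', 'Mc', 'Nd', 'Pc'}


def is_ident_start(ch):
    # IdentifierStart: UnicodeLetter, '$' or '_'
    return ch in '$_' or unicodedata.category(ch) in LETTER_CATEGORIES


def is_ident_part(ch):
    # IdentifierPart adds combining marks, digits, connector
    # punctuation, and the zero-width (non-)joiner.
    return (is_ident_start(ch)
            or unicodedata.category(ch) in PART_ONLY_CATEGORIES
            or ch in '\u200c\u200d')


def read_path(text, pos=0):
    """Read a dotted identifier path; return (path, end_index)."""
    end = pos
    expect_start = True  # True at the beginning and right after a dot
    while end < len(text):
        ch = text[end]
        if expect_start:
            if not is_ident_start(ch):
                break
            expect_start = False
        elif ch == '.':
            # Consume the dot only if a new identifier starts after it.
            if end + 1 >= len(text) or not is_ident_start(text[end + 1]):
                break
            expect_start = True
        elif not is_ident_part(ch):
            break
        end += 1
    return text[pos:end], end


print(read_path('user.\u00e9t\u00e9 + 1'))  # ('user.été', 8)
print(read_path('a.1b'))                    # ('a', 1)
```

With the looser check quoted above, a digit directly after a dot would be consumed as part of the path; here the scan stops before the dot instead.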

petebacondarwin · 2013-09-18T12:53:36Z

@clee704 - I think moving this into an optional angular-unicode module would be a good idea. Perhaps we could move isIdent() into an AngularJS service that could be overridden/decorated in this extra module?

gurdiga · 2013-10-23T02:54:07Z

I agree with @petebacondarwin that it is a good idea to have this (un)pluggable and maybe even customizable. It is ~4K of code that will be called a lot and some people may not like incurring the speed and KB penalty.

For my project I only need 10 more letters; Russian needs 66 letters, and I guess the same holds for many European languages. These cases would have a considerably smaller footprint than the all-in-one approach.
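
As a rough illustration of the customizable variant, the sketch below builds identifier checks from ASCII plus a caller-supplied alphabet. It is written in Python for brevity and is not an AngularJS API: make_ident_checks and RUSSIAN are hypothetical names, and only the character sets are meant to carry over.

```python
import string

ASCII_START = frozenset(string.ascii_letters + '$_')


def make_ident_checks(extra_letters=''):
    """Return (is_ident_start, is_ident_part) for ASCII plus extra_letters."""
    start_chars = ASCII_START | frozenset(extra_letters)
    part_chars = start_chars | frozenset(string.digits)

    def is_ident_start(ch):
        return ch in start_chars

    def is_ident_part(ch):
        return ch in part_chars

    return is_ident_start, is_ident_part


# The 66 Russian letters: U+0410..U+044F plus the two forms of "yo".
RUSSIAN = ''.join(chr(c) for c in range(0x0410, 0x0450)) + '\u0401\u0451'

default_start, _ = make_ident_checks()
russian_start, russian_part = make_ident_checks(RUSSIAN)

print(len(RUSSIAN))             # 66
print(default_start('\u0436'))  # False
print(russian_start('\u0436'))  # True
```

A per-language table like this is a few dozen characters, compared with the full Unicode category tables generated earlier in the thread.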

I’m wondering what’s holding this back from getting into the public release.

//cc @clee704

clee704 · 2013-12-22T09:23:05Z

I'm closing this in favor of #4747.

clee704 closed this Sep 3, 2013

clee704 reopened this Sep 3, 2013

clee704 closed this Sep 3, 2013

clee704 reopened this Sep 3, 2013

refactor($parse): reduce code size

8d18710

clee704 reviewed Sep 17, 2013

petebacondarwin mentioned this pull request Oct 7, 2013

feat(expressions): allow non-English Unicode letters in identifiers #4308

Closed

clee704 closed this Dec 22, 2013
