This repository was archived by the owner on Apr 12, 2024. It is now read-only.

clee704 commented Sep 3, 2013

Support Unicode identifier names as defined in Section 7.6, Identifier Names
and Identifiers, of the ECMAScript Language Specification
(http://www.ecma-international.org/ecma-262/5.1/#sec-7.6),
except for Unicode escape sequences, which are hard to implement
without changing too much existing code.

Closes #3847
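
The rule the description refers to can be sketched in a few lines. This is a minimal illustration in Python, not the PR's JavaScript: the function names are hypothetical, the \uXXXX escape form is left out (as in the PR), and the categories come from whatever Unicode version the local Python ships, which may differ from the tables in the PR.

```python
import unicodedata

# General categories named in ES5.1 section 7.6.
LETTER = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}   # UnicodeLetter
PART_ONLY = {"Mn", "Mc", "Nd", "Pc"}            # marks, digits, connector punctuation
ZWNJ, ZWJ = "\u200c", "\u200d"


def is_ident_start(ch):
    """IdentifierStart, without the \\uXXXX escape form."""
    return ch in ("$", "_") or unicodedata.category(ch) in LETTER


def is_ident_part(ch):
    """IdentifierPart: a start character, a mark, digit, connector, ZWNJ, or ZWJ."""
    return (is_ident_start(ch)
            or unicodedata.category(ch) in PART_ONLY
            or ch in (ZWNJ, ZWJ))


def is_identifier_name(text):
    """True if text is a non-empty IdentifierName (reserved words are not checked)."""
    return (bool(text)
            and is_ident_start(text[0])
            and all(is_ident_part(ch) for ch in text[1:]))
```

Under these assumptions, is_identifier_name("변수") and is_identifier_name("$scope") hold, while is_identifier_name("1abc") does not.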

clee704 (Author) commented Sep 3, 2013

FYI, the unicode code points are generated by this python script:

import re
import unicodedata

def ranges(arr):
  intervals = []
  start = None
  prev = None
  for n in arr:
    if start is None:
      start = n
    elif prev + 1 != n:
      if start == prev:
        intervals.append(start)
      else:
        intervals.append((start, prev - start + 1))
      start = n
    prev = n
  if start == prev:
    intervals.append(start)
  else:
    intervals.append((start, prev - start + 1))
  ret = {}
  for r in intervals:
    if isinstance(r, tuple):
      start, length = r
      if length in ret:
        ret[length].append(start - sum(ret[length]))
      else:
        ret.setdefault(length, []).append(start)
    else:
      if 1 in ret:
        ret[1].append(r - sum(ret[1]))
      else:
        ret.setdefault(1, []).append(r)
  for length in ret:
    codes = ret[length]
    if len(codes) == 1:
      ret[length] = codes[0]
  return ret

def codes(*categories):
  if not hasattr(codes, '_map'):
    codes._map = {}
    for i in range(0x10000):
      codes._map.setdefault(unicodedata.category(unichr(i)), []).append(i)
  ret = []
  for cat in categories:
    ret.extend(codes._map[cat])
  ret.sort()
  return ret

def dump(ranges, indent=2, starting_indent=2, line_limit=100):
  indent_level_0 = ' ' * starting_indent
  indent_level_1 = ' ' * (starting_indent + indent)
  indent_level_2 = ' ' * (starting_indent + indent * 2)
  temp = ['{\n']
  prev_linelen = None
  for length in sorted(ranges.keys()):
    line = str(length) + ': ' + str(ranges[length])
    if prev_linelen is None or prev_linelen + 2 + len(line) >= line_limit:
      if prev_linelen is not None:
        temp.append(',\n')
      prev_linelen = len(indent_level_1) + len(line)
      if prev_linelen >= line_limit:
        temp.append(indent_level_1 + str(length) + ': [\n')
        temp2 = []
        prev_linelen2 = None
        for code in ranges[length]:
          if prev_linelen2 is None or prev_linelen2 + 2 + len(str(code)) >= line_limit:
            if prev_linelen2 is not None:
              temp2.append(',\n')
            temp2.append(indent_level_2 + str(code))
            prev_linelen2 = len(indent_level_2) + len(str(code))
          else:
            temp2.append(', ' + str(code))
            prev_linelen2 += 2 + len(str(code))
        temp.extend(temp2)
        temp.append('\n' + indent_level_1 + ']')
      else:
        temp.append(indent_level_1 + line)
    else:
      temp.append(', ')
      temp.append(line)
      prev_linelen += 2 + len(line)
  temp.append('\n' + indent_level_0 + '}')
  return ''.join(temp)

# See http://www.ecma-international.org/ecma-262/5.1/#sec-7.6
UnicodeLetter = dump(ranges(codes('Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl')))
UnicodeCombiningMark = dump(ranges(codes('Mn', 'Mc')))
UnicodeDigit = dump(ranges(codes('Nd')))
UnicodeConnectorPunctuation = dump(ranges(codes('Pc')))

print '  var letterRanges =', UnicodeLetter + ';'
print '  var combiningMarkRanges =', UnicodeCombiningMark + ';'
print '  var digitRanges =', UnicodeDigit + ';'
print '  var connectorPunctuationRanges =', UnicodeConnectorPunctuation + ';'

def load(d):
  return eval(re.sub(r'([0-9]+):', r'"\1":', d))

def test(ranges, *categories):
  arr = [False] * 0x10000
  for length in ranges:
    codes = ranges[length]
    if not isinstance(codes, list):
      codes = [codes]
    start = None
    for inc in codes:
      start = inc if start is None else start + inc
      end = start + int(length)
      for code in range(start, end):
        arr[code] = True
  for i, truth in enumerate(arr):
    assert truth == (unicodedata.category(unichr(i)) in categories)

# test
print 'Running self-tests...'
test(load(UnicodeLetter), 'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl')
test(load(UnicodeCombiningMark), 'Mn', 'Mc')
test(load(UnicodeDigit), 'Nd')
test(load(UnicodeConnectorPunctuation), 'Pc')
print 'OK'
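
The table format the script emits may be easier to follow on a tiny input. The following is a simplified Python 3 sketch of the same idea, not the script itself: runs of consecutive code points are grouped by run length, and the starts within each group are delta-encoded. The names encode and decode are illustrative, and unlike the script this version always keeps lists, even for a single entry.

```python
def encode(code_points):
    """Group sorted code points into runs, keyed by run length.

    Each value lists run starts, delta-encoded: the first entry is absolute,
    every later entry is the distance from the previous start in that list.
    """
    runs = []  # (start, length) pairs
    for cp in code_points:
        if runs and runs[-1][0] + runs[-1][1] == cp:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((cp, 1))                        # begin a new run
    table, last_start = {}, {}
    for start, length in runs:
        table.setdefault(length, []).append(start - last_start.get(length, 0))
        last_start[length] = start
    return table


def decode(table):
    """Expand a table produced by encode() back into a sorted list."""
    out = []
    for length, deltas in table.items():
        start = 0
        for delta in deltas:
            start += delta
            out.extend(range(start, start + length))
    return sorted(out)


# ASCII digits, "ABC", "_", and "abc" as code points.
sample = list(range(48, 58)) + [65, 66, 67, 95, 97, 98, 99]
table = encode(sample)   # {10: [48], 3: [65, 32], 1: [95]}
```

Here the two runs of length 3 start at 65 and 97, so the second is stored as the delta 32, and decode(table) returns the original list.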

@clee704 clee704 closed this Sep 3, 2013
@clee704 clee704 reopened this Sep 3, 2013
@clee704 clee704 closed this Sep 3, 2013
@clee704 clee704 reopened this Sep 3, 2013
clee704 (Author) commented Sep 3, 2013

Sorry for the repeated closing and reopening. Travis is behaving strangely. Everything passes locally, and one of the builds has actually passed; the problem is that there are two builds for the same build number.

clee704 (Author) commented Sep 9, 2013

The overhead of this feature is now 3.2 KB for minified code and 1.9 KB for minified and gzipped code, which is about a 6% increase over the master branch.
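
Figures like these can be reproduced with a short measurement helper. This is only a sketch of one way to do it in Python; the helper names are hypothetical, and the numbers in the usage lines are made-up inputs, not the sizes of any AngularJS build.

```python
import gzip


def size_report(data):
    """Return (raw_bytes, gzipped_bytes) for a bytes object."""
    return len(data), len(gzip.compress(data))


def percent_increase(before, after):
    """Relative growth from before to after, in percent."""
    return 100.0 * (after - before) / before


# Hypothetical input: a repetitive 10,000-byte payload standing in for a build.
raw, gzipped = size_report(b"var a = 1;" * 1000)
growth = percent_increase(100, 106)   # illustrative sizes only
```

Running size_report on the minified file from each branch and passing the two results to percent_increase gives the kind of comparison quoted above.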

petebacondarwin (Contributor) commented

The file size increase is a concern. I wonder if one could ship an "international" version of the library that supported unicode identifiers? @IgorMinar - what do you think?

petebacondarwin (Contributor) commented

@clee704 - Can you confirm that you have signed the CLA? Thanks.

clee704 (Author) commented Sep 17, 2013

@petebacondarwin Yes, I have. My name is Choongmin Lee.

clee704 (Author) commented Sep 17, 2013

If file size is a concern, maybe we could make a separate module for i18n and put this code there. I guess that should not increase the size of $parse much, though I'm not sure, as I'm completely new to how Angular is built.

clee704 (Author) commented

This is not strictly correct, although it was not correct before this PR either. A new identifier should start after a dot character, so after the dot we should check isIdent() instead of ch == '.' || isIdentPart(ch) || isNumber(ch). If this eventually gets into core, I'll fix it.
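
The stricter rule described here can be sketched as a small scanner. This is a Python illustration of the idea only, not the $parse lexer: read_path and the two predicates are hypothetical names, and the predicates follow the ES5.1 section 7.6 categories without \uXXXX escapes.

```python
import unicodedata

LETTER = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
PART_ONLY = {"Mn", "Mc", "Nd", "Pc"}


def is_ident_start(ch):
    return ch in ("$", "_") or unicodedata.category(ch) in LETTER


def is_ident_part(ch):
    return (is_ident_start(ch)
            or unicodedata.category(ch) in PART_ONLY
            or ch in ("\u200c", "\u200d"))


def read_path(text, pos=0):
    """Read a dotted path such as a.b.c starting at pos.

    Returns (segments, next_pos). Raises ValueError when the character at pos,
    or the character right after a dot, cannot start an identifier.
    """
    segments = []
    while True:
        if pos >= len(text) or not is_ident_start(text[pos]):
            raise ValueError("identifier expected at index %d" % pos)
        end = pos + 1
        while end < len(text) and is_ident_part(text[end]):
            end += 1
        segments.append(text[pos:end])
        if end < len(text) and text[end] == ".":
            pos = end + 1   # a new identifier must start after the dot
        else:
            return segments, end
```

With this check, "a.b + c" yields the segments a and b and stops at the space, while "a.1b" and "a..b" are rejected instead of being accepted as one long token.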

petebacondarwin (Contributor) commented

@clee704 - I think moving this into an optional angular-unicode module would be a good idea. Perhaps we could move isIdent() into an AngularJS service that could be overridden/decorated in this extra module?
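
The shape of that suggestion, a default identifier rule that an optional module can replace, might look like the sketch below. It is a language-neutral illustration written in Python, not AngularJS's service or decorator API, and every name in it is hypothetical.

```python
import unicodedata


def ascii_ident_start(ch):
    """Default rule shipped with the core: ASCII letters, $ and _ only."""
    return ch in ("$", "_") or "a" <= ch <= "z" or "A" <= ch <= "Z"


def unicode_ident_start(ch):
    """Rule an optional add-on could supply: any ES5.1 UnicodeLetter category."""
    letters = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
    return ch in ("$", "_") or unicodedata.category(ch) in letters


class Lexer:
    """Toy lexer whose identifier rule is injected, so it can be swapped."""

    def __init__(self, is_ident_start=ascii_ident_start):
        self.is_ident_start = is_ident_start

    def starts_identifier(self, text):
        return bool(text) and self.is_ident_start(text[0])


core = Lexer()                          # small default build
extended = Lexer(unicode_ident_start)   # with the optional rule plugged in
```

The core keeps the cheap ASCII check, and only users who load the extra rule pay for the larger tables.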

gurdiga (Contributor) commented Oct 23, 2013

I agree with @petebacondarwin that it is a good idea to have this (un)pluggable and maybe even customizable. It is ~4K of code that will be called a lot and some people may not like incurring the speed and KB penalty.

For my project I only need 10 more letters; Russian needs 66, and I guess the same may be true of many European languages. These cases would have a considerably smaller footprint than the all-in-one version.
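
A per-language table of that size is easy to express. The sketch below, in Python with hypothetical names, builds an identifier-start check from ASCII plus a caller-supplied set of code points, using the 66 Russian letters (А-Я, а-я, Ё, ё) as the example set.

```python
# А..Я and а..я occupy U+0410..U+044F; Ё and ё sit outside that block.
RUSSIAN = set(range(0x0410, 0x0450)) | {0x0401, 0x0451}


def make_ident_start(extra_code_points):
    """Build an identifier-start check: ASCII letters, $, _, plus extras."""
    extra = frozenset(extra_code_points)

    def is_ident_start(ch):
        return (ch in ("$", "_")
                or "a" <= ch <= "z"
                or "A" <= ch <= "Z"
                or ord(ch) in extra)

    return is_ident_start


is_ru_ident_start = make_ident_start(RUSSIAN)
```

The set holds exactly 66 code points, so a build that targets one alphabet carries a table far smaller than the full Unicode ranges.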

I’m wondering what’s holding this back from getting into the public release.

//cc @clee704

clee704 (Author) commented Dec 22, 2013

I'm closing this in favor of #4747.

@clee704 clee704 closed this Dec 22, 2013
Development

Successfully merging this pull request may close these issues.

Support unicode variable names in scope

3 participants