1

I have a predefined character->type dictionary. For example, 'a' is a lowercase letter, 1 is a digit, ')' is a punctuation symbol, etc. With the following script, I label all characters in a given string:

labels=''
for ch in list(example):
    try:
        l = character_type_dict[ch]
        print(l)
        labels = labels+l
    except KeyError:
        labels = labels+'o'
        print('o')
labels

For example, given "1,234.45kg (in metric system)" as input, the code produces dpdddpddwllwpllwllllllwllllllp as output.

Now, I would like to split the string based on the groups. The output should look something like this:

['1',',','234','.','45','kg',' ','(','in',' ','metric',' ','system',')']

That is, it should split based on the character-type borders. Any ideas how this might be done efficiently?

  • I think labels is wrong. It treats k as w and g as l. Commented Mar 6, 2018 at 16:13
  • Oh, thanks for noticing. I probably need to debug the dictionary creation step. Commented Mar 6, 2018 at 17:29

5 Answers

3

labels is wrong (it is 'dpdddpddwllwpllwllllllwllllllp' in your example but I believe it should be 'dpdddpddllwpllwllllllwllllllp')

Anyway, you can (ab)use itertools.groupby:

from itertools import groupby

example = "1,234.45kg (in metric system)"
labels = 'dpdddpddllwpllwllllllwllllllp'

output = [''.join(group)
          for _, group in groupby(example, key=lambda ch: labels[example.index(ch)])]

print(output)
# ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
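
The lookup labels[example.index(ch)] above finds the first occurrence of ch on every call, so it relies on each character always carrying the same label, and it rescans the string each time. A minimal sketch of an alternative, assuming labels holds exactly one entry per character of the text, pairs the two with zip instead. The helper name split_by_labels is hypothetical, not part of the answer:

```python
from itertools import groupby


def split_by_labels(text, labels):
    """Split text wherever the label of adjacent characters changes."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    # Pair every character with its own label, then group on the label.
    pairs = zip(text, labels)
    return [''.join(ch for ch, _ in group)
            for _, group in groupby(pairs, key=lambda pair: pair[1])]


example = "1,234.45kg (in metric system)"
labels = 'dpdddpddllwpllwllllllwllllllp'

print(split_by_labels(example, labels))
# ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
```

Because each character is matched with the label at its own position, this version should also work when the same character gets different labels in different places.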

1

You can compute labels more concisely (and quite possibly more quickly):

labels = ''.join(character_type_dict.get(ch, 'o') for ch in example)

Or, with a helper function:

character_type = lambda ch: character_type_dict.get(ch, 'o')
labels = ''.join(map(character_type, example))

But you don't need labels to split the string; with the help of itertools.groupby, you can just split directly:

import itertools

splits = list(''.join(g)
              for _, g in itertools.groupby(example, key=character_type))

A possibly more interesting result is a list of tuples of groupings and their associated types:

 >>> list((''.join(g), code)
 ...      for code, g in itertools.groupby(example, key=character_type))
 [('1', 'd'), (',', 'p'), ('234', 'd'), ('.', 'p'), ('45', 'd'), ('kg', 'l'),
  (' ', 'w'), ('(', 'p'), ('in', 'l'), (' ', 'w'), ('metric', 'l'), (' ', 'w'),
  ('system', 'l'), (')', 'p')]

I computed character_type_dict as follows:

import string

character_type_dict = {}
for code, chars in (('w', string.whitespace),
                    ('l', string.ascii_letters),
                    ('d', string.digits),
                    ('p', string.punctuation)):
  for char in chars:
    character_type_dict[char] = code

But I could also have done this (as I figured out later):

import string
from collections import ChainMap
character_type_dict = dict(ChainMap(*({c: t for c in getattr(string, n)}
                                      for t, n in (('w', 'whitespace'),
                                                   ('d', 'digits'),
                                                   ('l', 'ascii_letters'),
                                                   ('p', 'punctuation')))))
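
For comparison, the same mapping can probably be built with a single dict comprehension and no ChainMap. This is only a sketch; the CLASSES name is made up for illustration:

```python
import string

# (type code, characters of that type); CLASSES is a hypothetical name.
CLASSES = (('w', string.whitespace),
           ('l', string.ascii_letters),
           ('d', string.digits),
           ('p', string.punctuation))

# One entry per character, mapping it to its type code.
character_type_dict = {ch: code for code, chars in CLASSES for ch in chars}

print(character_type_dict['a'], character_type_dict['1'], character_type_dict[')'])
# l d p
```

The four character sets do not overlap, so the order of the entries in CLASSES should not affect the result.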

1 Comment

Thanks for the really comprehensive answer.
1

Just remember the type of the previous character:

import string
character_type = {c: "l" for c in string.ascii_letters}
character_type.update({c: "d" for c in string.digits})
character_type.update({c: "p" for c in string.punctuation})
character_type.update({c: "w" for c in string.whitespace})

example = "1,234.45kg (in metric system)"

x = []
prev = None
for ch in example:
    try:
        l = character_type[ch]
        if l == prev:
            x[-1].append(ch)
        else:
            x.append([ch])
    except KeyError:
        print(ch)
    else:
        prev = l
x = map(''.join, x)
print(list(x))
# ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
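
Note that the loop above prints and then drops any character that is missing from character_type. If such characters should be kept instead, one possible variant uses dict.get with a fallback label. This is a sketch: the name split_keep_unknown is hypothetical, and the 'o' fallback is borrowed from the question:

```python
import string

character_type = {c: "l" for c in string.ascii_letters}
character_type.update({c: "d" for c in string.digits})
character_type.update({c: "p" for c in string.punctuation})
character_type.update({c: "w" for c in string.whitespace})


def split_keep_unknown(text, fallback="o"):
    """Group adjacent characters of the same type, keeping unknown ones."""
    groups = []
    prev = None
    for ch in text:
        # Unknown characters get the fallback label instead of raising KeyError.
        label = character_type.get(ch, fallback)
        if label == prev:
            groups[-1].append(ch)
        else:
            groups.append([ch])
        prev = label
    return [''.join(g) for g in groups]


print(split_keep_unknown("5€ off"))
# ['5', '€', ' ', 'off']
```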

1

Another algorithmic approach. Instead of try: except:, it is nicer to use the dictionary's get(key, default_value) method.

import string

character_type_dict = {}
for ch in string.ascii_lowercase:
    character_type_dict[ch] = 'l'
for ch in string.digits:
    character_type_dict[ch] = 'd'
for ch in string.punctuation:
    character_type_dict[ch] = 'p'
for ch in string.whitespace:
    character_type_dict[ch] = 'w'

example = "1,234.45kg (in metric system)"

split_list = []
split_start = 0
for i in range(len(example) - 1):
    if character_type_dict.get(example[i], 'o') != character_type_dict.get(example[i + 1], 'o'):
        split_list.append(example[split_start: i + 1])
        split_start = i + 1
split_list.append(example[split_start:])

print(split_list)
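
The same border test can be wrapped in a function. The sketch below assumes the mapping from this answer (lowercase letters only, so uppercase letters fall back to 'o'). The name split_at_borders is hypothetical, and the guard for an empty string is an addition:

```python
import string

character_type_dict = {}
for code, chars in (('l', string.ascii_lowercase),
                    ('d', string.digits),
                    ('p', string.punctuation),
                    ('w', string.whitespace)):
    for ch in chars:
        character_type_dict[ch] = code


def split_at_borders(text):
    """Cut text at every position where the character type changes."""
    if not text:
        return []  # avoid returning [''] for empty input
    types = [character_type_dict.get(ch, 'o') for ch in text]
    pieces = []
    start = 0
    for i in range(1, len(text)):
        if types[i] != types[i - 1]:
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces


print(split_at_borders("1,234.45kg (in metric system)"))
# ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
```

Looking up each type once up front avoids calling get twice per character.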

1

Taking this as an algorithmic puzzle:

import string

# dummy mapping
character_type_dict = {}
character_type_dict.update({c: "l" for c in string.ascii_letters})
character_type_dict.update({c: "d" for c in string.digits})
character_type_dict.update({c: "p" for c in string.punctuation})
character_type_dict.update({c: "w" for c in string.whitespace})
example = "1,234.45kg (in metric system)"
# type of the first character ('o' for anything not in the mapping)
last = character_type_dict.get(example[0], 'o')
temp = example[0]
res = []
for ch in example[1:]:
  cur = character_type_dict.get(ch, 'o')
  if cur != last:
    # type changed: close the current group and start a new one
    res.append(temp)
    temp = ''
  temp += ch
  last = cur
res.append(temp)

Results in:

['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
