Split string based on predefined character types

Question

I have a predefined character -> type dictionary. For example, 'a' is a lowercase letter, '1' is a digit, ')' is a punctuation symbol, etc. With the following script, I label every character in a given string:

labels=''
for ch in list(example):
    try:
        l = character_type_dict[ch]
        print(l)
        labels = labels+l
    except KeyError:
        labels = labels+'o'
        print('o')
labels

For example, given "1,234.45kg (in metric system)" as input, the code produces dpdddpddwllwpllwllllllwllllllp as output.

Now, I would like to split the string based on these groups. The output should look something like this:

['1',',','234','.','45','kg',' ','(','in',' ','metric',' ','system',')']

That is, it should split based on the character-type borders. Any ideas how this might be done efficiently?

I think labels is wrong. It treats k as w and g as l.
– DeepSpace, Commented Mar 6, 2018 at 16:13
Oh, thanks for noticing. I probably need to debug the dictionary creation step.
– Ahmadov, Commented Mar 6, 2018 at 17:29

DeepSpace · Accepted Answer · 2018-03-06 16:29:29Z

3

labels is wrong (it is 'dpdddpddwllwpllwllllllwllllllp' in your example but I believe it should be 'dpdddpddllwpllwllllllwllllllp')

Anyway, you can ~~use~~ abuse itertools.groupby:

from itertools import groupby

example = "1,234.45kg (in metric system)"
labels = 'dpdddpddllwpllwllllllwllllllp'

output = [''.join(group)
          for _, group in groupby(example, key=lambda ch: labels[example.index(ch)])]

print(output)
# ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']
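
A note on the key function above: example.index(ch) returns the position of the first occurrence of ch, which still gives the right label here because a given character always maps to the same type, but it rescans the string for every character. A minimal sketch of an alternative, assuming labels has exactly one entry per character of example, pairs the two sequences with zip and groups on the label:

```python
from itertools import groupby

example = "1,234.45kg (in metric system)"
labels = 'dpdddpddllwpllwllllllwllllllp'

# Pair each character with its label, then group consecutive pairs by label.
output = [''.join(ch for ch, _ in group)
          for _, group in groupby(zip(example, labels), key=lambda pair: pair[1])]

print(output)
```

This would also keep working if the same character could carry different labels at different positions, which the index-based lookup cannot handle.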

edited Mar 6, 2018 at 16:29

answered Mar 6, 2018 at 16:19


rici · Accepted Answer · 2018-03-06 17:33:18Z

You can compute labels more concisely (and quite possibly more quickly):

labels = ''.join(character_type_dict.get(ch, 'o') for ch in example)

Or, with a helper function:

character_type = lambda ch: character_type_dict.get(ch, 'o')
labels = ''.join(map(character_type, example))

But you don't need labels to split the string; with the help of itertools.groupby, you can just split directly:

import itertools

splits = list(''.join(g)
              for _, g in itertools.groupby(example, key=character_type))

A possibly more interesting result is a vector of tuples of types and associated groupings:

 >>> list((''.join(g), code)
 ...      for code, g in itertools.groupby(example, key=character_type))
 [('1', 'd'), (',', 'p'), ('234', 'd'), ('.', 'p'), ('45', 'd'), ('kg', 'l'),
  (' ', 'w'), ('(', 'p'), ('in', 'l'), (' ', 'w'), ('metric', 'l'), (' ', 'w'),
  ('system', 'l'), (')', 'p')]

I computed character_type_dict as follows:

import string

character_type_dict = {}
for code, chars in (('w', string.whitespace),
                    ('l', string.ascii_letters),
                    ('d', string.digits),
                    ('p', string.punctuation)):
    for char in chars:
        character_type_dict[char] = code
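
Putting the pieces of this answer together, here is a rough self-contained sketch. The function name split_by_type and its default parameter are made up for illustration; the 'o' fallback for unmapped characters follows the question's labelling script.

```python
import string
from itertools import groupby

# Character -> type mapping built from the string module's ASCII classes.
character_type_dict = {}
for code, chars in (('w', string.whitespace),
                    ('l', string.ascii_letters),
                    ('d', string.digits),
                    ('p', string.punctuation)):
    for char in chars:
        character_type_dict[char] = code


def split_by_type(text, type_dict=character_type_dict, default='o'):
    """Split text at every change of character type (hypothetical helper)."""
    def key(ch):
        # Characters missing from the mapping fall back to the default type.
        return type_dict.get(ch, default)
    return [''.join(group) for _, group in groupby(text, key=key)]


print(split_by_type("1,234.45kg (in metric system)"))
print(split_by_type("5€ ok"))
```

The second call shows the fallback: '€' is not in string.punctuation (which is ASCII-only), so it lands in its own 'o' group.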

But I could also have done this (as I figured out later):

from collections import ChainMap
character_type_dict = dict(ChainMap(*({c: t for c in getattr(string, n)}
                                      for t, n in (('w', 'whitespace'),
                                                   ('d', 'digits'),
                                                   ('l', 'ascii_letters'),
                                                   ('p', 'punctuation')))))
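
As a quick sanity check, the loop and the ChainMap construction should produce the same mapping, since the four character classes do not overlap. The names groups, loop_dict and chain_dict below are only for illustration:

```python
import string
from collections import ChainMap

groups = (('w', string.whitespace), ('l', string.ascii_letters),
          ('d', string.digits), ('p', string.punctuation))

# Loop-based construction.
loop_dict = {}
for code, chars in groups:
    for char in chars:
        loop_dict[char] = code

# ChainMap-based construction: one small dict per character class.
chain_dict = dict(ChainMap(*({c: code for c in chars} for code, chars in groups)))

print(loop_dict == chain_dict)
```

If the classes did overlap, the two could differ: ChainMap gives precedence to the first mapping that contains a key, whereas the loop lets later assignments overwrite earlier ones.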

Graipher · Accepted Answer · 2018-03-06 16:25:58Z

1

Just remember the type of the previous character:

import string
character_type = {c: "l" for c in string.ascii_letters}
character_type.update({c: "d" for c in string.digits})
character_type.update({c: "p" for c in string.punctuation})
character_type.update({c: "w" for c in string.whitespace})

example = "1,234.45kg (in metric system)"

x = []
prev = None
for ch in example:
    try:
        l = character_type[ch]
        if l == prev:
            x[-1].append(ch)
        else:
            x.append([ch])
    except KeyError:
        print(ch)
    else:
        prev = l
x = map(''.join, x)
print(list(x))
# ['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']

answered Mar 6, 2018 at 16:25


AleksMat · Accepted Answer · 2018-03-06 16:31:53Z

Another algorithmic approach. Instead of try: except:, it is nicer to use the dictionary's get(key, default_value) method.

import string

character_type_dict = {}
for ch in string.ascii_lowercase:
    character_type_dict[ch] = 'l'
for ch in string.digits:
    character_type_dict[ch] = 'd'
for ch in string.punctuation:
    character_type_dict[ch] = 'p'
for ch in string.whitespace:
    character_type_dict[ch] = 'w'

example = "1,234.45kg (in metric system)"

split_list = []
split_start = 0
for i in range(len(example) - 1):
    if character_type_dict.get(example[i], 'o') != character_type_dict.get(example[i + 1], 'o'):
        split_list.append(example[split_start: i + 1])
        split_start = i + 1
split_list.append(example[split_start:])

print(split_list)
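
The same neighbour-comparison idea can be wrapped in a function. This is only a sketch: split_on_type_change is a made-up name, and the empty-string guard is an addition, since the loop above would yield [''] for an empty input.

```python
import string

# Same kind of mapping as above (assumption: ASCII classes only).
character_type_dict = {}
for code, chars in (('l', string.ascii_lowercase), ('d', string.digits),
                    ('p', string.punctuation), ('w', string.whitespace)):
    for ch in chars:
        character_type_dict[ch] = code


def split_on_type_change(text, type_dict, default='o'):
    """Slice text wherever two neighbouring characters differ in type."""
    if not text:
        return []  # avoid returning [''] for empty input
    pieces = []
    start = 0
    # i is the index of the right-hand character of each adjacent pair
    for i, (left, right) in enumerate(zip(text, text[1:]), start=1):
        if type_dict.get(left, default) != type_dict.get(right, default):
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces


print(split_on_type_change("1,234.45kg (in metric system)", character_type_dict))
print(split_on_type_change("", character_type_dict))
```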

flowjow · Accepted Answer · 2018-03-06 16:36:33Z

1

Taking this as an algorithmic puzzle:

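import string

# dummy mapping (built with update(); concatenating dict.items() only works on Python 2)
character_type_dict = {c: "l" for c in string.ascii_letters}
character_type_dict.update({c: "d" for c in string.digits})
character_type_dict.update({c: "p" for c in string.punctuation})
character_type_dict.update({c: "w" for c in string.whitespace})
example = "1,234.45kg (in metric system)"
# start with the type of the first character, not the character itself
last = character_type_dict.get(example[0], 'o')
temp = example[0]
res = []
for ch in example[1:]:
  try:
    cur = character_type_dict[ch]
    if cur != last:
      res.append(temp)
      temp = ''
    temp += ch
    last = cur
  except KeyError:
    # characters missing from the mapping are skipped
    last = 'o'
res.append(temp)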

Results in:

['1', ',', '234', '.', '45', 'kg', ' ', '(', 'in', ' ', 'metric', ' ', 'system', ')']

edited Mar 6, 2018 at 16:36

answered Mar 6, 2018 at 16:24
