I'm trying to do a very simple text reduction: I want to collapse every run of spaces into a single space, except for spaces inside double quotes (" "). I also have some semantic actions that depend on each character read, which is why I don't want to use a regex. It's a kind of pseudo-FSM model.

So here's the deal:

s = '''that's my     string, "   keep these spaces     "    but reduce these '''

Desired output:

that's my string, "   keep these spaces     " but reduce these

What I would like to do is something like this (I've left out the '"' case to keep the example simple):

out = ""
for i in range(len(s)):

  if s[i].isspace():
    out += ' '
    while s[i].isspace():
      i += 1

  else:
    out += s[i]

I don't quite understand how the scopes are created or shared in this case.

Thank you for any advice.

3
  • what are the variables line and lineCpy? Commented Jan 10, 2014 at 20:10
  • The problem is that once you have skipped all of the spaces in the while loop, the for loop simply assigns i the next value from range(), ignoring your i += 1. So you don't skip the spaces that meet the s[i].isspace() condition; you just iterate over them again. Commented Jan 10, 2014 at 20:16
  • aah, sorry, I've missed them, they are both the s string, I'm blind I guess. Commented Jan 10, 2014 at 20:17
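
To illustrate the point made in the comments, here is a minimal sketch. The first loop shows that a for loop rebinds i on every pass, so changes made in the body are discarded. The function collapse_spaces is a hypothetical name, not something from the question; it uses a while loop with a manually managed index, which is one way to make the "skip ahead" idea work. Like the question's example, it ignores the double-quote case.

```python
# A for loop takes i from range() on every pass, so changing i in the body
# has no effect on the next iteration.
seen = []
for i in range(3):
    seen.append(i)
    i += 10  # discarded as soon as the next iteration starts


def collapse_spaces(s):
    """Replace each run of whitespace with one space (quotes are ignored here)."""
    out = []
    i = 0
    while i < len(s):
        if s[i].isspace():
            out.append(' ')
            # Skip the whole run; the bounds check avoids running off the end.
            while i < len(s) and s[i].isspace():
                i += 1
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)
```

With a while loop the index is yours to control, so the inner loop's i += 1 really does move the outer loop forward.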

6 Answers

1

I also have some semantic action dependent on each character read ... It's some kind of pseudo FSM model.

You could actually implement an FSM:

s = '''that's my     string, "   keep these spaces     "    but reduce these '''


normal, quoted, eating = 0,1,2
state = eating
result = ''
for ch in s:
  if (state, ch) == (eating, ' '):
    continue
  elif (state,ch) == (eating, '"'):
    result += ch
    state = quoted
  elif state == eating:
    result += ch
    state = normal
  elif (state, ch) == (quoted, '"'):
    result += ch
    state = normal
  elif state == quoted:
    result += ch
  elif (state,ch) == (normal, '"'):
    result += ch
    state = quoted
  elif (state,ch) == (normal, ' '):
    result += ch
    state = eating
  else: # state == normal
    result += ch

print(result)

Or, the data-driven version:

actions = {
    'normal' : {
        ' ' : lambda x: ('eating', ' '),
        '"' : lambda x: ('quoted', '"'),
        None: lambda x: ('normal', x)
    },
    'eating' : {
        ' ' : lambda x: ('eating', ''),
        '"' : lambda x: ('quoted', '"'),
        None: lambda x: ('normal', x)
    },
    'quoted' : {
        '"' : lambda x: ('normal', '"'),
        '\\': lambda x: ('escaped', '\\'),
        None: lambda x: ('quoted', x)
    },
    'escaped' : {
        None: lambda x: ('quoted', x)
    }
}

def reduce(s):
    result = ''
    state = 'eating'
    for ch in s:
        state, ch = actions[state].get(ch, actions[state][None])(ch)
        result += ch
    return result

s = '''that's my     string, "   keep these spaces     "    but reduce these '''
print(reduce(s))
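
Since the question mentions semantic actions for each character, here is a rough sketch of how the same states could be combined with a per-character hook. The function name collapse_outside_quotes and the on_char parameter are assumptions made for illustration; they are not part of the answer above, and the hook signature is just one possible choice.

```python
def collapse_outside_quotes(s, on_char=None):
    """Collapse runs of spaces outside double quotes.

    on_char (hypothetical hook) is called as on_char(state, ch) for every
    character before it is processed.
    """
    state = 'eating'  # same start state as the answer above
    out = []
    for ch in s:
        if on_char is not None:
            on_char(state, ch)
        if state == 'escaped':
            # Character after a backslash inside quotes: keep it verbatim.
            out.append(ch)
            state = 'quoted'
        elif state == 'quoted':
            out.append(ch)
            if ch == '"':
                state = 'normal'
            elif ch == '\\':
                state = 'escaped'
        elif ch == ' ':
            # Emit only the first space of a run.
            if state == 'normal':
                out.append(ch)
            state = 'eating'
        else:
            out.append(ch)
            state = 'quoted' if ch == '"' else 'normal'
    return ''.join(out)


# Example hook: record every (state, character) pair that was seen.
trace = []
text = '''that's my     string, "   keep these spaces     "    but reduce these '''
result = collapse_outside_quotes(text, on_char=lambda st, ch: trace.append((st, ch)))
```

The hook sees the state before the transition, which may or may not be what your semantic actions need; calling it after the transition is an equally valid design.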

3 Comments

I've started doing that :)
Works nice, I only add escape \" sequence check and it should be enough for my purpose.
Or see the data-driven version for a more explicit state machine, with \" escaping.
1

Use shlex to split your string into quoted and unquoted parts, then use a regex on the unquoted parts to replace each run of whitespace with a single space.

1 Comment

That's rather ingenious actually, but it's not going to work. It should fail on the single quote in "that's" in his example, and in other similar cases. I wonder if there is an appropriate parser somewhere in the standard library though. EDIT: looks like shlex might be configurable to do this. I leave it to you to sort this out :)
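
The failure mentioned in the comment above can be reproduced with a short sketch. It assumes the default shlex.split behaviour, where the apostrophe in "that's" opens a single-quoted section that is never closed.

```python
import shlex

s = '''that's my     string, "   keep these spaces     "    but reduce these '''

error = None
try:
    # With default settings the apostrophe starts a quoted section.
    shlex.split(s)
except ValueError as exc:
    error = exc
```

Here error ends up holding a ValueError, which is why the quote characters need to be restricted before shlex is usable for this input.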
1

As already suggested, I'd use the standard shlex module instead, with some adjustments:

import shlex

def reduce_spaces(s):
    lex = shlex.shlex(s)
    lex.quotes = '"'             # ignore single quotes
    lex.whitespace_split = True  # split tokens only on whitespace
    tokens = iter(lex.get_token, lex.eof)  # exhaust the lexer
    return ' '.join(tokens)

>>> s = '''that's my   string, "   keep these spaces     "   but reduce these '''
>>> reduce_spaces(s)
'that\'s my string, "   keep these spaces     " but reduce these'

0
i = iter((i for i,char in enumerate(s) if char=='"'))
zones = list(zip(*[i]*2))  # a list of all the "zones" where spaces should not be manipulated
answer = []
space = False
for i,char in enumerate(s):
    if not any(zone[0] <= i <= zone[1] for zone in zones):
        if char.isspace():
            if not space:
                answer.append(char)
        else:
            answer.append(char)
    else:
        answer.append(char)
    space = char.isspace()

print(''.join(answer))

And the output:

>>> s = '''that's my     string, "   keep these spaces     "    but reduce these '''
>>> i = iter((i for i,char in enumerate(s) if char=='"'))
>>> zones = list(zip(*[i]*2))
>>> answer = []
>>> space = False
>>> for i,char in enumerate(s):
...     if not any(zone[0] <= i <= zone[1] for zone in zones):
...         if char.isspace():
...             if not space:
...                 answer.append(char)
...         else:
...             answer.append(char)
...     else:
...         answer.append(char)
...     space = char.isspace()
... 
>>> print(''.join(answer))
that's my string, "   keep these spaces     " but reduce these 
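
The zip(*[i]*2) line in this answer may be the least obvious part, so here is a small sketch of what it does on its own. The sample string and the name quote_positions are made up for illustration.

```python
sample = 'a "b c" d "e" f'

# One shared iterator over the indices of every double quote.
quote_positions = iter(i for i, ch in enumerate(sample) if ch == '"')

# zip pulls from the same iterator twice per tuple, so consecutive
# indices are paired as (opening quote, closing quote).
zones = list(zip(*[quote_positions] * 2))
```

Because zip stops at the shortest input, an unmatched final quote is silently dropped instead of forming a zone.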

0

It is a bit of a hack, but you could reduce runs of spaces to a single space with a one-liner.

one_space = lambda s: ' '.join([part for part in s.split(' ') if part])

This joins the parts that are not empty (splitting on ' ' yields an empty string for every extra space in a run), separated by a single space. The harder part, of course, is separating out the exceptional part in double quotes. In real production code you would also want to be careful with cases like escaped double quotes. But presuming that you have only well-behaved input, you can separate those parts out as well. I presume that in real code you may have more than one double-quoted section.
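
A quick sketch of why filtering the split result works. The sample string is made up, and the variable names are only for illustration.

```python
sample = 'a   b  c'

# Splitting on a single space leaves an empty string for each extra space.
parts = sample.split(' ')

# Dropping the empty strings and rejoining gives single spaces.
squeezed = ' '.join(part for part in parts if part)
```

Note that this also removes leading and trailing spaces, because those produce empty strings at the ends of the list too.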

You can do this by splitting your string on double quotes, applying one_space only to the even-indexed items (the parts outside the quotes), and appending the odd-indexed items (the quoted parts) unchanged. I believe this is right from working through some examples.

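def fix_spaces(s):
  dbl_parts = s.split('"')
  # Even-indexed parts lie outside the quotes; odd-indexed parts are quoted text.
  # Note: one_space also strips the spaces right next to a quote.
  normalize = lambda i: one_space(dbl_parts[i]) if not i%2 else dbl_parts[i]
  # Rejoin with '"' to restore the quote characters removed by split('"').
  return '"'.join([normalize(i) for i in range(len(dbl_parts))])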

1 Comment

As I said I need also to assign semantic actions to several characters, so I don't think this approach would be transparent enough to do this.
0

I'm a bit concerned about whether this solution is readable. I modified the string the OP suggested so that it includes multiple pairs of double quotes.

s = '''that's my     string,   "   keep these spaces     "" as    well    as these    "    reduce these"   keep these spaces too   "   but not these  '''
s_split = s.split('"')

# The substrings in odd positions of list s_split should retain their spaces.
# These elements have, however, lost their double quotes during .split('"'),
# so add them back for the new string. For the substrings in even positions,
# remove the multiple spaces in between by splitting them again using .split()
# and joining them with a single space. However, this will not conserve
# leading and trailing spaces. In order to conserve them, add a dummy
# character (in this case '-') at the start and end of the substring before
# the split. Remove the dummy bits after the split.
#
# Finally join the elements in new_string_list to create the desired string.

new_string_list = ['"' + x + '"' if i%2 == 1
                   else ' '.join(('-' + x + '-').split())[1:-1]                   
                   for i,x in enumerate(s_split)]
new_string = ''.join(new_string_list)
print(new_string)

Output is

that's my string, "   keep these spaces     "" as    well    as these    " reduce these"   keep these spaces too   " but not these 
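
The dummy-character trick from the comments in this answer can be looked at in isolation. This is only a sketch; the helper name squeeze_keep_edges is hypothetical and does not appear in the answer.

```python
def squeeze_keep_edges(part):
    """Collapse whitespace runs in part, keeping one space at each edge if present."""
    # Pad with a dummy character so leading/trailing whitespace survives
    # .split() as a separator, then cut the two dummy characters off again.
    padded = '-' + part + '-'
    return ' '.join(padded.split())[1:-1]


examples = [squeeze_keep_edges(p) for p in ('  a   b  ', 'a   b', '')]
```

A run of spaces at either edge is reduced to one space rather than removed, which is what keeps the quoted sections separated from their neighbours.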
