8

So, for input:

accessibility,random good bye

I want output:

a11y,r4m g2d bye

Basically, I need to abbreviate every word of four or more letters in the following format: first_letter + number_of_letters_in_between + last_letter

I tried this:

re.sub(r"([A-Za-z])([A-Za-z]{2,})([A-Za-z])", r"\1" + str(len(r"\2")) + r"\3", s)

But it does not work. In JS, I would easily do:

str.replace(/([A-Za-z])([A-Za-z]{2,})([A-Za-z])/g, function(m, $1, $2, $3){
   return $1 + $2.length + $3;
});

How do I do the same in Python?

EDIT: I cannot afford to lose any punctuation present in the original string.

  • re is a bit of an overkill for this, in my opinion. I'd just use mystring[0]+str(len(mystring)-2)+mystring[-1] and an if statement to see when to apply this. Commented May 30, 2015 at 10:23
  • @AleksanderLidtke I thought about it, but mystring contains several separate words (like accessibility,random good bye) and is not itself a single word. Commented May 30, 2015 at 10:24
  • @AleksanderLidtke, what about the comma? How are you separating the words? Commented May 30, 2015 at 10:27
  • mystring is just one word. If you have comma separated words you can just do mycomaseparatedstring.split(',') to get a list of the contents of mycomaseparatedstring separated by commas. Then proceed as with mystring. Sorry, thought this was clear - it was to me because I know Python, perhaps I should have been clearer. Commented May 30, 2015 at 11:24
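
A minimal sketch of the per-word rule suggested in the first comment, assuming the input is already a single word. The name abbreviate_word is hypothetical and does not come from the thread; splitting a sentence into words without losing punctuation is a separate problem that the answers below address.

```python
def abbreviate_word(word):
    """Abbreviate a single word of four or more characters, e.g. 'good' -> 'g2d'."""
    if len(word) >= 4:
        # first letter + number of letters in between + last letter
        return word[0] + str(len(word) - 2) + word[-1]
    return word


print(abbreviate_word("accessibility"))
print(abbreviate_word("bye"))
```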

7 Answers

8

What you are doing in JavaScript is right: you are passing an anonymous function. In Python, however, you are passing a constant string ("\12\3", because len(r"\2") is evaluated before re.sub is even called), not a function that can be evaluated for each match.
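
To make this concrete, here is a small sketch that builds the replacement string from the question and shows that it is fixed text before re.sub ever sees it. On the CPython versions I am aware of, re.sub then rejects "\12" as a reference to a group that does not exist, but treat the exact error behaviour as an assumption and run it on your own interpreter.

```python
import re

pattern = r"([A-Za-z])([A-Za-z]{2,})([A-Za-z])"

# len(r"\2") counts the two characters '\' and '2', so this is plain fixed text.
replacement = r"\1" + str(len(r"\2")) + r"\3"
print(repr(replacement))

try:
    print(re.sub(pattern, replacement, "accessibility,random good bye"))
except re.error as exc:
    # Reached when the template is rejected, e.g. as an invalid group reference.
    print("re.error:", exc)
```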

While anonymous functions in Python aren't quite as useful as they are in JS, they do the job here:

>>> import re
>>> re.sub(r"([A-Za-z])([A-Za-z]{2,})([A-Za-z])", lambda m: "{}{}{}".format(m.group(1), len(m.group(2)), m.group(3)), "accessibility, random good bye")
'a11y, r4m g2d bye'

What happens here is that the lambda is called for each substitution, taking a match object. I then retrieve the needed information and build a substitution string from that.
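
As an illustration of "called for each substitution", the sketch below uses a named function instead of a lambda and records every word it is handed. The names abbreviate and seen are only for this example.

```python
import re

seen = []


def abbreviate(match):
    # Record the whole matched word to show the callback runs once per match.
    seen.append(match.group(0))
    return "{}{}{}".format(match.group(1), len(match.group(2)), match.group(3))


result = re.sub(
    r"([A-Za-z])([A-Za-z]{2,})([A-Za-z])",
    abbreviate,
    "accessibility, random good bye",
)
print(result)
print(seen)
```

The three-letter word never reaches the callback, because the pattern needs at least four letters to match.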


2 Comments

@Kasra how is that? It does exactly what the author wanted and is a close analogy to his code in JS
@Kasra indeed it does. This is completely punctuation agnostic.
3

The issue you're running into is that len(r'\2') is always 2, not the length of the second capturing group in your regular expression. You can use a lambda expression to create a function that works just like the code you would use in JavaScript:

re.sub(r"([A-Za-z])([A-Za-z]{2,})([A-Za-z])",
       lambda m: m.group(1) + str(len(m.group(2))) + m.group(3),
       s)

The m argument to the lambda is a match object, and the calls to its group method are equivalent to the backreferences you were using before.
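
A quick way to see that correspondence is to inspect a match object directly. This sketch uses re.search on a single word; the variable names are illustrative only.

```python
import re

m = re.search(r"([A-Za-z])([A-Za-z]{2,})([A-Za-z])", "random")

# group(n) plays the role of the \n backreference (or $n in the JS callback).
first, middle, last = m.group(1), m.group(2), m.group(3)
print(first, middle, last)

# group() with no argument is the whole match.
print(m.group())
print(first + str(len(middle)) + last)
```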

It might be easier to just use a simple word matching pattern with no capturing groups (group() can still be called with no argument to get the whole matched text):

re.sub(r'\w{4,}', lambda m: m.group()[0] + str(len(m.group())-2) + m.group()[-1], s)
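
One caveat with this shorter pattern: \w also matches digits and underscores, so it treats identifiers such as user_name2 as a single word. The sketch below compares it with a letters-only pattern; the helper name abbreviate and the sample text are made up for illustration.

```python
import re


def abbreviate(pattern, text):
    # Replace each whole match with first char + inner length + last char.
    return re.sub(
        pattern,
        lambda m: m.group()[0] + str(len(m.group()) - 2) + m.group()[-1],
        text,
    )


sample = "user_name2 good"
print(abbreviate(r"\w{4,}", sample))        # digits and '_' count as word characters
print(abbreviate(r"[A-Za-z]{4,}", sample))  # only runs of ASCII letters are touched
```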

2 Comments

Very short nag that the author used [A-Za-z] in his original solution and that you may want to change your alternative solution to that instead of \w.
Accepted for giving solution as well as highlighting my issue.
2
tmp, out = "",""
for ch in s:
    if ch.isspace() or ch in {",", "."}:
        out += "{}{}{}{}".format(tmp[0], len(tmp) - 2, tmp[-1], ch) if len(tmp) > 3 else tmp + ch
        tmp = ""
    else:
        tmp += ch
out += "{}{}{}".format(tmp[0], len(tmp) - 2, tmp[-1]) if len(tmp) > 3 else tmp
print(out)

a11y,r4m g2d bye

If you only want alpha characters use str.isalpha:

tmp, out = "", ""
for ch in s:
    if not ch.isalpha():
        out += "{}{}{}{}".format(tmp[0], len(tmp) - 2, tmp[-1], ch) if len(tmp) > 3 else tmp + ch
        tmp = ""
    else:
        tmp += ch
out += "{}{}{}".format(tmp[0], len(tmp) - 2, tmp[-1]) if len(tmp) > 3 else tmp
print(out)

a11y,r4m g2d bye

The logic is the same for both; only the check differs. If not ch.isalpha() is True, we have found a non-alpha character, so we process the tmp string and add it to the out string. If len(tmp) is not greater than 3, as per the requirement, we just add the tmp string plus the current char to out.

We need a final out += "{}{}{}".format(...) outside the loop to catch the case where the string does not end in a comma, space, etc. If the string did end in a non-alpha character, tmp would be empty and we would be adding an empty string, so it makes no difference to the output.

It will preserve punctuation and spaces:

s = "accessibility,random   good bye !!    foobar?"

def func(s):
    tmp, out = "", ""
    for ch in s:
        if not ch.isalpha():
            out += "{}{}{}{}".format(tmp[0], len(tmp) - 2, tmp[-1], ch) if len(tmp) > 3 else tmp + ch
            tmp = ""
        else:
            tmp += ch
    # Flush the last word and return everything accumulated so far.
    return out + ("{}{}{}".format(tmp[0], len(tmp) - 2, tmp[-1]) if len(tmp) > 3 else tmp)

print(func(s))
a11y,r4m   g2d bye !!    f4r?
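
The same character-by-character idea can also be sketched with itertools.groupby, which is a different technique from the manual loop above: it groups consecutive characters by whether they are alphabetic, so each word and each run of punctuation or spaces arrives as one chunk. The function name abbreviate is hypothetical, and str.isalpha also accepts non-ASCII letters.

```python
from itertools import groupby


def abbreviate(text):
    parts = []
    # Consecutive characters with the same isalpha() result form one run.
    for is_alpha, run in groupby(text, key=str.isalpha):
        chunk = "".join(run)
        if is_alpha and len(chunk) > 3:
            chunk = "{}{}{}".format(chunk[0], len(chunk) - 2, chunk[-1])
        parts.append(chunk)
    return "".join(parts)


print(abbreviate("accessibility,random   good bye !!    foobar?"))
```

Because non-alphabetic runs are appended unchanged, no trailing flush step is needed.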


1

Keep it simple...

>>> s = "accessibility,random good bye"
>>> re.sub(r'\B[A-Za-z]{2,}\B', lambda x: str(len(x.group())), s)
'a11y,r4m g2d bye'

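\B matches at any position that is not a word boundary, i.e. between two word characters or between two non-word characters. Anchoring the letter run with \B on both sides makes it match every character of a word except the first and last.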
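
To see what the pattern actually captures, the sketch below lists the inner runs it matches. It also shows one edge case worth knowing about: \B is defined in terms of \w, so letters that sit next to an underscore or digit have no boundary there and the whole letter run gets replaced. The helper name count_inner is made up for this example.

```python
import re

inner = r"\B[A-Za-z]{2,}\B"
s = "accessibility,random good bye"


def count_inner(match):
    # Replace the matched inner letters with how many there were.
    return str(len(match.group()))


print(re.findall(inner, s))            # the inner letters of each long word
print(re.sub(inner, count_inner, s))

# Edge case: '_' is a word character, so there is no boundary around 'abcd'.
print(re.sub(inner, count_inner, "_abcd_"))
```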

1 Comment

Excellent! Never thought of that!
1

As a more explicit alternative, you can pass a separate function to re.sub and use the simple regex r"(\b[a-zA-Z]+\b)".

>>> def replacer(x): 
...    g=x.group(0)
...    if len(g)>3:
...        return '{}{}{}'.format(g[0],len(g)-2,g[-1])
...    else :
...        return g
... 
>>> re.sub(r"(\b[a-zA-Z]+\b)", replacer, s)
'a11y,r4m g2d bye'

Also, as a more general approach, to get the replaced words in a list you can use a list comprehension with re.finditer:

>>> from operator import sub
>>> rep=['{}{}{}'.format(i.group(0)[0],abs(sub(*i.span()))-2,i.group(0)[-1]) if len(i.group(0))>3 else i.group(0) for i in re.finditer(r'(\w+)',s)]
>>> rep
['a11y', 'r4m', 'g2d', 'bye']

re.finditer returns an iterator that yields match objects; you can iterate over it and get the start and end of each match with the span() method.
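
For reference, here is a small sketch of what span() gives you for each match; the difference end - start is simply the length of the matched word, which is what the comprehension above computes in a roundabout way. It uses a letters-only pattern, which is a choice made for this example.

```python
import re

s = "accessibility,random good bye"

for m in re.finditer(r"[A-Za-z]+", s):
    start, end = m.span()
    # end - start equals len(m.group()).
    print(m.group(), (start, end), end - start)
```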


0

Using regex and comprehension:

import re
s = "accessibility,random good bye"
print("".join(w[0] + str(len(w) - 2) + w[-1] if len(w) > 3 else w for w in re.split(r"(\W)", s)))

Gives:

a11y,r4m g2d bye
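
This works because a capturing group in the re.split pattern keeps the separators in the result, so joining the pieces restores the punctuation. The sketch below shows the intermediate list and adds an isalpha() guard so that only pure letter runs are abbreviated; the guard and the name abbreviate are additions for illustration, and with the guard a token such as abc1 is left alone.

```python
import re


def abbreviate(text):
    # The capturing group makes re.split return the separators as list items.
    tokens = re.split(r"(\W)", text)
    return "".join(
        t[0] + str(len(t) - 2) + t[-1] if len(t) > 3 and t.isalpha() else t
        for t in tokens
    )


print(re.split(r"(\W)", "good bye!"))
print(abbreviate("accessibility,random good bye"))
```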

1 Comment

This will abbreviate any four-or-more character long run of non-word characters. Try s='foo... bar' to see for yourself!
-1

Have a look at the following code

sentence = "accessibility,random good bye"
sentence = sentence.replace(',', " ")
sentence_list = sentence.split(" ")
for item in sentence_list:
    if len(item) >= 4:
        print(item[0] + str(len(item) - 2) + item[-1])

The only thing you need to take care of is commas and other punctuation characters.

