61
votes

Python has a built in function sum, which is effectively equivalent to:

def sum2(iterable, start=0):
    return start + reduce(operator.add, iterable)

for all types of parameters except strings. It works for numbers and lists, for example:

 sum([1,2,3], 0) = sum2([1,2,3],0) = 6    #Note: 0 is the default value for start, but I include it for clarity
 sum({888:1}, 0) = sum2({888:1},0) = 888

Why were strings specially left out?

 sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
 sum2(['foo','bar'], '') = 'foobar'

I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.

Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, and no banning was there for, say, lists.

Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?

16
  • 19
    @NullUserException: it makes as much sense to "sum" strings as it is to "sum" lists. Commented Aug 19, 2010 at 19:31
  • 2
    @NullUserException: It would be great if you were right, but sadly the + operation on strings is already overloaded to mean concatentation. So with + we already construct string "sums". Commented Aug 19, 2010 at 19:57
  • 3
    @S.Lott: I meant summing a sequence of lists compared to summing a sequence of strings. As it happens, "sum" of a list of lists concatenates the lists. You can sum two lists using + to concatenate them. You can sum two strings using + to concatenate them. So it makes as much sense to define sum as concatenation for strings as it is for lists. That is what I meant. Whether this is good or is bad is beside the question. Commented Aug 19, 2010 at 20:12
  • 2
    @S.Lott: read my question again. It is quite clear there. I said: "for all types of parameters except strings. It works for numbers and lists, for example." Which means that numbers and lists are parameters in much the same way strings are. How did you understand the comparison between sum and "".join? Commented Aug 19, 2010 at 20:57
  • 3
    @S.Lott Not to beat a dead horse, but I read the question and understood it instantly. And on a more technical level, characters in a Python string are just strings themselves, you can technically /can/ sum the characters, resulting in ordinary concatenation. (','.join('foo'), for example, returns 'f,o,o'.) Commented Dec 4, 2012 at 1:34

8 Answers 8

49
votes

Python tries to discourage you from "summing" strings. You're supposed to join them:

"".join(list_of_strings)

It's a lot faster, and uses much less memory.

A quick benchmark:

$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop

Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.

BTW, this restriction has been in place "forever", i.e., since the sum was added as a built-in function (rev. 32347)

Sign up to request clarification or add additional context in comments.

9 Comments

Actually, I think it's the other way around: reduce and sum are similar for lists, because they do more or less the same thing. That's why I used this benchmark: reduce, representing sum, versus join, which is an optimised way of "adding" strings.
I am feeling old now. I understand "forever" in Python as before 2.0. Things done after the introduction of new style classes are not that "forever".
It does seem "unpythonic" to discourage this usage. I agree with Muhammad that it's a strange exception. But it seems this is the best reason available.
I think it's there because of the chronological sequence of implementations: when they implemented the sum builtin, they already had join for strings. So, to avoid people unknowingly (or knowingly, but lazily/mischievously) using sum for strings, they specifically disallowed it. Since there wasn't a specific "sum" for lists (or other types), they made the exception only for strings. Nowadays, I think they'd keep it this way, even if someone came up with a specific "sum", for backwards compatibility.
It's not optimizing for a common case. Optimization would be using that same if statement blocking it to instead invoke the (already written) str.join code. Since that would be such an easy optimization to make, I'd assume the intent is to deliberate fail noisily to teach people about the existence of str.join.
|
27
votes

You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway..

>>> class ZeroObject(object):
...  def __add__(self, other):
...   return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'

3 Comments

I find this very interesting. Of course it is too clever by half, but it adds to the understanding of this feature.
And it’s a little faster than the reduce version.
Still it's weird that Python checks for strings, but not for lists or tuples.
17
votes

Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup

In the builtin_sum function we have this bit of code:

     /* reject string values for 'start' parameter */
        if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
            PyErr_SetString(PyExc_TypeError,
                "sum() can't sum strings [use ''.join(seq) instead]");
            Py_DECREF(iter);
            return NULL;
        }
        Py_INCREF(result);
    }

So.. that's your answer.

It's explicitly checked in the code and rejected.

4 Comments

It's interesting to see the code, but the question was "Why aren't strings summed", not "How did they exclude strings from summing?" ...
Well, I meant, the reason why its not working is because its explicitly banned in the code. Others seemed to answer by explaining why you shouldn't do it.
Thanks for sharing the source code. But I get a different traceback when I try to sum two strings. sum(['foo', 'bar']) -> TypeError: unsupported operand type(s) for +: 'int' and 'str'. But sum(['foo', 'bar'], '') -> TypeError: sum() can't sum strings [use ''.join(seq) instead]. Any ideas why the first one returns a different traceback? I think it's more likely to occur that way.
I just noticed the answer to my above comment is explained by @u0b34a0f6ae below. I now realise the start=0 argument is the initial value for the summation process.
14
votes

From the docs:

The preferred, fast way to concatenate a sequence of strings is by calling ''.join(sequence).

By making sum refuse to operate on strings, Python has encouraged you to use the correct method.

3 Comments

So it's part of the 'one way to do it' philosophy. (As they could just have made sum treat strings specially)
@ThomasAhle: There is a computational reason why sum is the wrong tool for the job. sum builds the result incrementally. Doing so with strings requires (potentially many) temporary immutable strings. If done naively, this process requires lots of temporary memory and lots of copying. In contrast, if done with ''.join(), the appropriate amount of memory is allocated from the beginning, and result is built right there at once. See Martelli's exhortation.
Sure, but you can do lots of inefficient things in Python, and often it doesn't matter. I can't think of any other cases than 'sum of strings' where there are tests against doing something, hard coded in the Python C code!
11
votes

Short answer: Efficiency.

Long answer: The sum function has to create an object for each partial sum.

Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.

doubles are always the same size, which makes sum's running time O(1)×N = O(N).

int (formerly known as long) is arbitary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).

For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).

Thus, summing strings would be much slower than summing numbers.

str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.

3 Comments

That argument is wrong because the same would hold for lists and tuples, which can be summed. As stated by HS, Python explicitly check for strings, and only for strings, which just doesn't make sense.
@Philipp: The missing piece is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.
Excellent answer. Except for "would be much slower" has to be "is much slower" as has been proven.
10
votes
+25

The Reason Why

@dan04 has an excellent explanation for the costs of using sum on large lists of strings.

The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples and other O(n**2) data structures. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.

2 Comments

Is there any reason why the sum() implementation doesn't just call ''.join() when a string is encountered in sum() instead of returning an error instructing you to call ''.join()?
@Slater: Yes. But I don't recall what it was. :(
4
votes

Edit: Moved the parts about immutability to history.

Basically, its a question of preallocation. When you use a statement such as

sum(["a", "b", "c", ..., ])

and expect it to work similar to a reduce statement, the code generated looks something like

v1 = "" + "a" # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b" # must allocate v2 and set its size to len("a") + len("b")
...
res = v10000 + "$" # must allocate res and set its size to len(v9999) + len("$")

In each of these steps a new string is created, which for one might give some copying overhead as the strings are getting longer and longer. But that’s maybe not the point here. What’s more important, is that every new string on each line must be allocated to it’s specific size (which. I don’t know it it must allocate in every iteration of the reduce statement, there might be some obvious heuristics to use and Python might allocate a bit more here and there for reuse – but at several points the new string will be large enough that this won’t help anymore and Python must allocate again, which is rather expensive.

A dedicated method like join, however has the job to figure out the real size of the string before it starts and would therefore in theory only allocate once, at the beginning and then just fill that new string, which is much cheaper than the other solution.

5 Comments

It's true, but that's not the whole story. Integers are immutable as well. But the join operation on strings is specialised, takes the whole list into account, and, therefore, is much faster.
Yeah, ok maybe immutability is not the real problem here. (Though I can imagine that register-sized integers suffer less from immutability on the Python-side than strings do.) But then I think that preallocation actually is the whole story.
@Debiliski: The preallocation was probably the missing link for me; it tells why "",join behaves so much better than a generic sum. I had to look at the code to understand. I think the immutability is a red herring.
@Debiliski: Unfortunately I have accepted the other answer, else I would have suggested that you edit yours to make the preallocation and explanation more prominent.
Does it actually pre-allocate? Join's argument can be an arbitrary iterator, which you might only be able to traverse once. Without being familiar with the code, I'd guess the key point is at the C level you've got an over-allocated array which you can mutate in place; since it hasn't been released to python code yet this doesn't break the "python strings are immutable" law. The repeated allocations/copying can be amortized to O(n) if you always increase the size by at least double when you need more (rather than doing it for every string).
3
votes

I dont know why, but this works!

import operator
def sum_of_strings(list_of_strings):
    return reduce(operator.add, list_of_strings)

1 Comment

Because reduce is not sum -- but it will still have horrible performance for large lists.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.