Which is faster, string interpolation or addition in python?

Back

Which is faster? String Interpolation or Addition (in Python)?

There comes a time in every young girls life where she asks the age old question: "Which one of these methods of doing string operations is more performant?"

Okay, so maybe not every child in the world is asking these questions, but I am. And after briefly critiquing a friend's code and hearing about the performance issues he had fixed and worked with I got to thinking. Well, actually it was a different friend jokingly calling the code academic and not enterprise that made me look twice.

For whatever reason, I honed in on the use of string addition, namely:

print("          >>>>>>>>>>>>> Start String: "+self.final_tape+" <<<<<<<<<<<<<<<")

Where I was left wondering if the use of two strings and a variable would be very different from the use of a single string and an interpolation of a variable. My intuition told me interpolation would be faster.

Hoping to prove this to myself, I asked my friend to change his code (and trust me he was so pleased to touch month old code again -- not) to use string interpolation to find out. His busy beaver code was a good way to see if there was a performance difference. His results?

Interpolation: 1 minute 8 seconds
Addition: 1 minute 11 seconds

So not much of a difference really, but a difference. So the next question is why? The best way to find out is to ask the code!

As you can see in the gist, the addition function has 3 loads, 2 adds and then the return. While the interpolation method has 2 loads, 1 module and 1 return. While it's only 2 operations difference, it does make a difference if the code is in a critical path. It also makes a difference depending on how many strings we are stringing together (excuse the pun).

>>> def addStr3(s,s2):
...     return ">" + s + ":" + s2 + "<"
... 
>>> def addStr4(s,s2):
...     return ">%s:%s<" % (s,s2)
... 

>>> dis.dis(addStr3)
  2           0 LOAD_CONST               1 ('>')
              3 LOAD_FAST                0 (s)
              6 BINARY_ADD          
              7 LOAD_CONST               2 (':')
             10 BINARY_ADD          
             11 LOAD_FAST                1 (s2)
             14 BINARY_ADD          
             15 LOAD_CONST               3 ('<')
             18 BINARY_ADD          
             19 RETURN_VALUE        
>>> dis.dis(addStr4)
  2           0 LOAD_CONST               1 ('>%s:%s<')
              3 LOAD_FAST                0 (s)
              6 LOAD_FAST                1 (s2)
              9 BUILD_TUPLE              2
             12 BINARY_MODULO       
             13 RETURN_VALUE        

As you can see, If we are trying to put in 2 variable strings with text inbetween we'll end up with 5 loads, 4 adds, and 1 return in the addition case. But the interpolation case, with 3 loads, 1 tuple build, and 1 modulo and return has increased only by 2 operations. Compared to the increase from 6 to 10 operations, this is a big deal for micro optimizations.

Continuing further, the question begs to be asked, is it a sequence? After all, computers do things by the book and we can expect instructions to be compiled down to op codes consistently. So adding yet another variable in should result in similar results:

>>> def addStr5(s,s2,s3):
...     return ">"+s+":"+s2+":"+s3+"<"
... 
>>> def addStr6(s,s2,s3):
...     return ">%s:%s:%s<" % (s,s2,s3)
... 
>>> dis.dis(addStr5)
  2           0 LOAD_CONST               1 ('>')
              3 LOAD_FAST                0 (s)
              6 BINARY_ADD          
              7 LOAD_CONST               2 (':')
             10 BINARY_ADD          
             11 LOAD_FAST                1 (s2)
             14 BINARY_ADD          
             15 LOAD_CONST               2 (':')
             18 BINARY_ADD          
             19 LOAD_FAST                2 (s3)
             22 BINARY_ADD          
             23 LOAD_CONST               3 ('<')
             26 BINARY_ADD          
             27 RETURN_VALUE        
>>> dis.dis(addStr6)
  2           0 LOAD_CONST               1 ('>%s:%s:%s<')
              3 LOAD_FAST                0 (s)
              6 LOAD_FAST                1 (s2)
              9 LOAD_FAST                2 (s3)
             12 BUILD_TUPLE              3
             15 BINARY_MODULO       
             16 RETURN_VALUE        

This time, the addition function uses 7 loads, 6 adds, and 1 return and the interpolation functions uses 4 loads, 1 tuple build, 1 modulo, and 1 return. It should be fairly obvious at this point that each additional string, if continued in the way we're adding the string in, will result in 4 additional operations for the addition case, and 1 for the interpolation.

Total Operations
-----------------
Add     Interpolate
6       4
10      5
14      6 
18      8

This does not neccesary mean that interpolation is faster though! The number of operations doesn't matter if one of those is a blocking or long process. So what we really need to do is compare similarities and diffferences between the operations. How many LOAD_CONST? LOAD_FAST? What's the difference between BINARY_ADD vs BINARY_MODULO? How long does it take to BUILD_TUPLE?

Unfortunately, I'm not as familiar with Python OpCodes as some people are. I can see that a peephole optimizer might be able to make the interpolated code even faster by replacing the LOAD_FAST in a row with LOAD_TWO_FAST to load the variables faster. The key questions of comparing Binary Add to Binary Modulo is a bigger issue though. So let us consider how the two are implemented. Binary Add is simple enough, simply add each bit, then possibly carry a 1 over. The computer does this with an AND gate and an XOR gate for each bit. While there are likely plenty of optimizations and clever tricks used in today's computers, it's semi-safe to assume that the BINARY_ADD will hit each group of bytes with a few operations per bit to add the two areas of memory together. Of course, that's with numbers and we're working with strings so we'll need to come back to this.

So, what about BINARY_MODULO? Well, the documentation doesn't say much besides it performing the operation on the tops of the stack, so our only real option is to look into the source code. Looking at ceval.c (In the Python folder) we can find the C code that will execute when we reach any op code. For example, here's BUILD_TUPLE

case BUILD_TUPLE:
    x = PyTuple_New(oparg);
    if (x != NULL) {
        for (; --oparg >= 0;) {
            w = POP();
            PyTuple_SET_ITEM(x, oparg, w);
        }
        PUSH(x);
        continue;
    }
    break;

The PyTuple_* functions are defined in the Objects folder, in tupleobject.c and the tuples themselves are implemented as simple lists. The performance of which is determined by the size of the list itself. In the case of addStr4, N=2 in what is likely to be an O(N) operation (If I'm reading tupleobject.c correctly). The PyTuple_New function is swiftly followed by the items on the stack (from LOAD_FAST) being placed into the tuple by PyTuple_SET_ITEM, this being a constant time operation as it is simply pointer arithmetic under the hood:

int
PyTuple_SetItem(register PyObject *op, register Py_ssize_t i, PyObject
*newitem)
{
    register PyObject *olditem;
    register PyObject **p;
    if (!PyTuple_Check(op) || op->ob_refcnt != 1) {
        Py_XDECREF(newitem);
        PyErr_BadInternalCall();
        return -1;
    }
    if (i < 0 || i >= Py_SIZE(op)) {
        Py_XDECREF(newitem);
        PyErr_SetString(PyExc_IndexError,
                        "tuple assignment index out of range");
        return -1;
    }
    p = ((PyTupleObject *)op) -> ob_item + i;
    olditem = *p;
    *p = newitem;
    Py_XDECREF(olditem);
    return 0;
}

Overall the build tuple function is, roughly, 2*O(N) or simply O(N) where N is the size of the tuple.

What about some other operations? Let's look at BINARY_ADD:

case BINARY_ADD:
    w = POP();
    v = TOP();
    if (PyInt_CheckExact(v) && PyInt_CheckExact(w)) {
        /* INLINE: int + int */
        register long a, b, i;
        a = PyInt_AS_LONG(v);
        b = PyInt_AS_LONG(w);
        /* cast to avoid undefined behaviour
           on overflow */
        i = (long)((unsigned long)a + b);
        if ((i^a) < 0 && (i^b) < 0)
            goto slow_add;
        x = PyInt_FromLong(i);
    }
    else if (PyString_CheckExact(v) &&
             PyString_CheckExact(w)) {
        x = string_concatenate(v, w, f, next_instr);
        /* string_concatenate consumed the ref to v */
        goto skip_decref_vx;
    }
    else {
      slow_add:
        x = PyNumber_Add(v, w);
    }
    Py_DECREF(v);
  skip_decref_vx:
    Py_DECREF(w);
    SET_TOP(x);
    if (x != NULL) continue;
    break;

Ok, so let's attempt to break this down a bit and understand it! The PyInt_CheckExact is a simple macro defined in Include/intobject.h which compares the object type to a reference of the PyInt_Type, so nothing crazy or time consuming there. The code does basic arithmetic on longs unless it's adding two strings or non-integers. The string_concatenate function does different work depending on the next instruction, fortunately in our case we're not doing any STORE_ operations, so we don't fall into any of the special cases defined around line 4811 in ceval.c and the function does a small amount of processing in order to add the two strings together. Taking the size of the two strings and then using memcpy to quickly move the text over. In other words, we'll count the string size (2 O(N) operations), then copy the text into the newly allocated memory (another 2 O(N) operations).

Intuitively, we can now understand why multiple additions of strings together might have a performance issue now. In the same way that java allocates a new String object to do string concatenation (if you're doing "" + "" in Java), Python has to move the strings into a new zone of memory and copy them each time. Essentially the same way. In java, we get around this issue (and in fact the JVM optimizes to this case:) by using a StringBuilder. Which uses a list, much like our Tuple Interpolation does.

Finally, for completeness, let's look at BINARY_MODULO.

case BINARY_MODULO:
    w = POP();
    v = TOP();
    if (PyString_CheckExact(v))
        x = PyString_Format(v, w);
    else
        x = PyNumber_Remainder(v, w);
    Py_DECREF(v);
    Py_DECREF(w);
    SET_TOP(x);
    if (x != NULL) continue;
    break;

Surprised that it's shorter than the add function? Don't be! The code is a deceptive beast! In our case, the PyString_CheckExact returns true (this is the same type of check as the Integer type check as before) and we run the PyString_Format function on our two variables. This function is nearly 500 lines long and defined in Objects/stringobject.c on line 4226. Despite it's length it's not actually that complicated. It steps through the string looking for a format specifier. In our case, we'll skip the first 146 lines or so through if statements and end up at line 4440 where the case 's': is defined.

Within this case (and falling through to the r case as well) we'll do a few type checks before retrieving the variable as a string through the PyString_AS_STRING macro. This macro does no error checking and is quite fast. As noted by its definition in Include/stringobject.h:

/* Macro, trading safety for speed */
#define PyString_AS_STRING(op) (((PyStringObject *)(op))->ob_sval)

So long as our string isn't unicode, we'll use this fast version. Otherwise we'll drop into the last 44 lines of the function and have to do a bit of decoding and argument shifting with Tuples. Since we're just talking about normal use cases with typical strings, we won't get into this. And in short can say that we'll take one pass through the string (with possible, but unlikely resizing), and have a constant time operation to retrieve the string values of our values.

A side note: The reason I say "unlikely resizing" is because the length of the initial buffer is:

fmtcnt = PyString_GET_SIZE(format);
reslen = rescnt = fmtcnt + 100;

Which will be 100 characters greater than whatever we've already put into the formatter. If you're working with numbers or small strings, then it's unlikely you'd hit this length. Even if you do, the _PyString_Resize function (defined in Object/stringobject.c) uses realloc under the hood so we don't need to copy the old string since we're extending the memory to fit the requested size. In other words, this method is going to be more efficient than the BINARY_ADD pretty much always unless your machine has very little memory.

In conclusion, interpolating strings is faster than adding them together. There is a caveat to this of course, in the case of a single string adding to a single string, it wouldn't surprise me to see the normal addition do better than the interpolation. This thought being based off of the logic surrounding the formatting code being a bit slower than a single copy from one call to BINARY_ADD. You can affirm this behavior yourself by noting how as the number of arguments to process increases, the better the interpolation code does versus the addition.

Feel free to clone my example repository and run the makefile to view how the interpolation fairs on your machine compared to my results! It would be interesting to see if different OS's or versions show difference in this micro optimization.

Other Posts

comments powered by Disqus