### Which is faster? String Interpolation or Addition (in Python)?

There comes a time in every young girls life where she asks the age old question: "Which one of these methods of doing string operations is more performant?"

Okay, so maybe not every child in the world is asking these questions, but I am. And after briefly critiquing a friend's code and hearing about the performance issues he had fixed and worked with I got to thinking. Well, actually it was a different friend jokingly calling the code academic and not enterprise that made me look twice.

For whatever reason, I honed in on the use of string addition, namely:

```
print(" >>>>>>>>>>>>> Start String: "+self.final_tape+" <<<<<<<<<<<<<<<")
```

Where I was left wondering if the use of two strings and a variable
would be very different from the use of a single string and an
*interpolation of a variable*. My intuition told me interpolation would
be faster.

Hoping to prove this to myself, I asked my friend to change his code (and trust me he was so pleased to touch month old code again -- not) to use string interpolation to find out. His busy beaver code was a good way to see if there was a performance difference. His results?

Interpolation: 1 minute 8 seconds

Addition: 1 minute 11 seconds

So not much of a difference really, but a difference. So the next question is why? The best way to find out is to ask the code!

As you can see in the gist, the addition function has 3 loads, 2 adds and then the return. While the interpolation method has 2 loads, 1 module and 1 return. While it's only 2 operations difference, it does make a difference if the code is in a critical path. It also makes a difference depending on how many strings we are stringing together (excuse the pun).

```
>>> def addStr3(s,s2):
... return ">" + s + ":" + s2 + "<"
...
>>> def addStr4(s,s2):
... return ">%s:%s<" % (s,s2)
...
>>> dis.dis(addStr3)
2 0 LOAD_CONST 1 ('>')
3 LOAD_FAST 0 (s)
6 BINARY_ADD
7 LOAD_CONST 2 (':')
10 BINARY_ADD
11 LOAD_FAST 1 (s2)
14 BINARY_ADD
15 LOAD_CONST 3 ('<')
18 BINARY_ADD
19 RETURN_VALUE
>>> dis.dis(addStr4)
2 0 LOAD_CONST 1 ('>%s:%s<')
3 LOAD_FAST 0 (s)
6 LOAD_FAST 1 (s2)
9 BUILD_TUPLE 2
12 BINARY_MODULO
13 RETURN_VALUE
```

As you can see, If we are trying to put in 2 variable strings with text inbetween we'll end up with 5 loads, 4 adds, and 1 return in the addition case. But the interpolation case, with 3 loads, 1 tuple build, and 1 modulo and return has increased only by 2 operations. Compared to the increase from 6 to 10 operations, this is a big deal for micro optimizations.

Continuing further, the question begs to be asked, is it a sequence? After all, computers do things by the book and we can expect instructions to be compiled down to op codes consistently. So adding yet another variable in should result in similar results:

```
>>> def addStr5(s,s2,s3):
... return ">"+s+":"+s2+":"+s3+"<"
...
>>> def addStr6(s,s2,s3):
... return ">%s:%s:%s<" % (s,s2,s3)
...
>>> dis.dis(addStr5)
2 0 LOAD_CONST 1 ('>')
3 LOAD_FAST 0 (s)
6 BINARY_ADD
7 LOAD_CONST 2 (':')
10 BINARY_ADD
11 LOAD_FAST 1 (s2)
14 BINARY_ADD
15 LOAD_CONST 2 (':')
18 BINARY_ADD
19 LOAD_FAST 2 (s3)
22 BINARY_ADD
23 LOAD_CONST 3 ('<')
26 BINARY_ADD
27 RETURN_VALUE
>>> dis.dis(addStr6)
2 0 LOAD_CONST 1 ('>%s:%s:%s<')
3 LOAD_FAST 0 (s)
6 LOAD_FAST 1 (s2)
9 LOAD_FAST 2 (s3)
12 BUILD_TUPLE 3
15 BINARY_MODULO
16 RETURN_VALUE
```

This time, the addition function uses 7 loads, 6 adds, and 1 return and the interpolation functions uses 4 loads, 1 tuple build, 1 modulo, and 1 return. It should be fairly obvious at this point that each additional string, if continued in the way we're adding the string in, will result in 4 additional operations for the addition case, and 1 for the interpolation.

```
Total Operations
-----------------
Add Interpolate
6 4
10 5
14 6
18 8
```

This does not neccesary mean that interpolation is faster though! The
number of operations doesn't matter if one of those is a blocking or
long process. So what we really need to do is compare similarities and
diffferences between the operations. How many `LOAD_CONST`

? `LOAD_FAST`

?
What's the difference between `BINARY_ADD`

vs `BINARY_MODULO`

? How long
does it take to `BUILD_TUPLE`

?

Unfortunately, I'm not as familiar with Python OpCodes as some people
are. I can see that a peephole optimizer might be able to make the
interpolated code even faster by replacing the `LOAD_FAST`

in a row with
`LOAD_TWO_FAST`

to load the variables faster. The key questions of
comparing Binary Add to Binary Modulo is a bigger issue though. So let
us consider how the two are implemented. Binary Add is simple enough,
simply add each bit, then possibly carry a 1 over. The computer does
this with an AND gate and an XOR gate for each bit. While there are
likely plenty of optimizations and clever tricks used in today's
computers, it's semi-safe to assume that the `BINARY_ADD`

will hit each
group of bytes with a few operations per bit to add the two areas of
memory together. Of course, that's with numbers and we're working with
strings so we'll need to come back to this.

So, what about `BINARY_MODULO`

? Well, the documentation doesn't say
much besides it performing the operation on the tops of the stack, so
our only real option is to look into the source code. Looking at
ceval.c (In the Python folder) we can find the C code that will execute
when we reach any op code. For example, here's `BUILD_TUPLE`

```
case BUILD_TUPLE:
x = PyTuple_New(oparg);
if (x != NULL) {
for (; --oparg >= 0;) {
w = POP();
PyTuple_SET_ITEM(x, oparg, w);
}
PUSH(x);
continue;
}
break;
```

The PyTuple_* functions are defined in the Objects folder, in
tupleobject.c and the tuples themselves are implemented as simple lists.
The performance of which is determined by the size of the list itself.
In the case of addStr4, N=2 in what is likely to be an O(N) operation
(If I'm reading tupleobject.c correctly). The `PyTuple_New`

function is
swiftly followed by the items on the stack (from `LOAD_FAST`

) being
placed into the tuple by `PyTuple_SET_ITEM`

, this being a constant time
operation as it is simply pointer arithmetic under the hood:

```
int
PyTuple_SetItem(register PyObject *op, register Py_ssize_t i, PyObject
*newitem)
{
register PyObject *olditem;
register PyObject **p;
if (!PyTuple_Check(op) || op->ob_refcnt != 1) {
Py_XDECREF(newitem);
PyErr_BadInternalCall();
return -1;
}
if (i < 0 || i >= Py_SIZE(op)) {
Py_XDECREF(newitem);
PyErr_SetString(PyExc_IndexError,
"tuple assignment index out of range");
return -1;
}
p = ((PyTupleObject *)op) -> ob_item + i;
olditem = *p;
*p = newitem;
Py_XDECREF(olditem);
return 0;
}
```

Overall the build tuple function is, roughly, 2*O(N) or simply O(N) where N is the size of the tuple.

What about some other operations? Let's look at `BINARY_ADD`

:

```
case BINARY_ADD:
w = POP();
v = TOP();
if (PyInt_CheckExact(v) && PyInt_CheckExact(w)) {
/* INLINE: int + int */
register long a, b, i;
a = PyInt_AS_LONG(v);
b = PyInt_AS_LONG(w);
/* cast to avoid undefined behaviour
on overflow */
i = (long)((unsigned long)a + b);
if ((i^a) < 0 && (i^b) < 0)
goto slow_add;
x = PyInt_FromLong(i);
}
else if (PyString_CheckExact(v) &&
PyString_CheckExact(w)) {
x = string_concatenate(v, w, f, next_instr);
/* string_concatenate consumed the ref to v */
goto skip_decref_vx;
}
else {
slow_add:
x = PyNumber_Add(v, w);
}
Py_DECREF(v);
skip_decref_vx:
Py_DECREF(w);
SET_TOP(x);
if (x != NULL) continue;
break;
```

Ok, so let's attempt to break this down a bit and understand it! The
`PyInt_CheckExact`

is a simple macro defined in Include/intobject.h
which compares the object type to a reference of the `PyInt_Type`

, so
nothing crazy or time consuming there. The code does basic arithmetic on
longs unless it's adding two strings or non-integers. The
`string_concatenate`

function does different work depending on the next
instruction, fortunately in our case we're not doing any `STORE_`

operations, so we don't fall into any of the special cases defined
around line 4811 in ceval.c and the function does a small amount of
processing in order to add the two strings together. Taking the size of
the two strings and then using memcpy to quickly move the text over.
In other words, we'll count the string size (2 O(N) operations), then
copy the text into the newly allocated memory (another 2 O(N)
operations).

Intuitively, we can now understand why multiple additions of strings together might have a performance issue now. In the same way that java allocates a new String object to do string concatenation (if you're doing "" + "" in Java), Python has to move the strings into a new zone of memory and copy them each time. Essentially the same way. In java, we get around this issue (and in fact the JVM optimizes to this case:) by using a StringBuilder. Which uses a list, much like our Tuple Interpolation does.

Finally, for completeness, let's look at `BINARY_MODULO`

.

```
case BINARY_MODULO:
w = POP();
v = TOP();
if (PyString_CheckExact(v))
x = PyString_Format(v, w);
else
x = PyNumber_Remainder(v, w);
Py_DECREF(v);
Py_DECREF(w);
SET_TOP(x);
if (x != NULL) continue;
break;
```

Surprised that it's shorter than the add function? Don't be! The code is
a deceptive beast! In our
case, the `PyString_CheckExact`

returns true (this is the same type of
check as the Integer type check as before) and we run the
`PyString_Format`

function on our two variables. This function is nearly
500 lines long and defined in Objects/stringobject.c on line 4226.
Despite it's length it's not actually *that* complicated. It steps
through the string looking for a format specifier. In our case, we'll
skip the first 146 lines or so through if statements and end up at line
4440 where the `case 's':`

is defined.

Within this case (and falling through to the `r`

case as well) we'll do
a few type checks before retrieving the variable as a string through the
`PyString_AS_STRING`

macro. This macro does no error checking and is
quite fast. As noted by its definition in Include/stringobject.h:

```
/* Macro, trading safety for speed */
#define PyString_AS_STRING(op) (((PyStringObject *)(op))->ob_sval)
```

So long as our string isn't unicode, we'll use this fast version. Otherwise we'll drop into the last 44 lines of the function and have to do a bit of decoding and argument shifting with Tuples. Since we're just talking about normal use cases with typical strings, we won't get into this. And in short can say that we'll take one pass through the string (with possible, but unlikely resizing), and have a constant time operation to retrieve the string values of our values.

A side note: The reason I say "unlikely resizing" is because the length of the initial buffer is:

```
fmtcnt = PyString_GET_SIZE(format);
reslen = rescnt = fmtcnt + 100;
```

Which will be 100 characters greater than whatever we've already put
into the formatter. If you're working with numbers or small strings,
then it's unlikely you'd hit this length. Even if you do, the
`_PyString_Resize`

function (defined in Object/stringobject.c) uses
realloc under the hood so we don't need to copy the old string since
we're extending the memory to fit the requested size. In other words,
this method is going to be more efficient than the `BINARY_ADD`

pretty
much always unless your machine has very little memory.

In conclusion, interpolating strings is faster than adding them
together. There is a caveat to this of course, in the case of a single
string adding to a single string, it wouldn't surprise me to see the
normal addition do better than the interpolation. This thought being
based off of the logic surrounding the formatting code being a bit
slower than a single copy from one call to `BINARY_ADD`

. You can affirm
this behavior yourself by noting how as the number of arguments to
process increases, the better the interpolation code does versus the
addition.

Feel free to clone my example repository and run the makefile to view how the interpolation fairs on your machine compared to my results! It would be interesting to see if different OS's or versions show difference in this micro optimization.