Understanding Python Bytecode

Hello World!

That’s our favorite phrase to print every time we start learning a new programming language, but here I try to explain how the interpreter understands the code.

Python compiles the program to the bytecode which is an internal representation of python to the interpreter.

Consider the following code which takes two function arguments that returns the multiplication of those two values,

def multiplication(value1, value2):
      multiplied_value = value1 * value2
      return multiplied_value

The raw compiled byte code of the function gives,

multiplication.__code__.co_code
b'|\x00|\x01\x14\x00}\x02|\x02S\x00'

and the bytecodes as numbers are,

list(multiplication.__code__.co_code)
[124, 0, 124, 1, 20, 0, 125, 2, 124, 2, 83, 0]

It’s difficult(almost impossible) for a human to read and understand the bytecode conversion directly. We have a python package(called dis) that is available to convert the bytecode into human-readable instruction.

On converting the above bytecode(production function) into machine instruction(as compiler sees it):

import dis
dis.dis(multiplication)

2           0 LOAD_FAST                0 (value1)
            2 LOAD_FAST                1 (value2)
            4 BINARY_MULTIPLY
            6 STORE_FAST               2 (multiplied_value)

3           8 LOAD_FAST                2 (multiplied_value)
            10 RETURN_VALUE

Cpython uses the stack-based virtual machine where the instructions take place using push and pop(stack data structure)

Three types of stacks:

Call stack - it always have one active item which is called ‘frame’. main of the program where it belongs at the bottom. Every function call pushes a new frame to the call stack and each time the function returns, its frames get popped out of the stack
Each frame there is an evaluation stack, (i.e, data stack) it’s mainly responsible for the execution of python function, pushing the frames to stack, executing them and popping them.
Each frame contains block stack, which helps the program to keep track of control structures ie blocks, it gets popped off every time program or loop comes out of the block.

Based on the stacks if we see out bytecode listings for multiplication() function it executes in the order:

LOAD_FAST for the value 1 and value 2 gets pushed sequentially to the top of the evaluation stack.
BINARY_MULTIPLY is the operation expression representation on the evaluation stack which is explained here, which multiplies TOS 1(Top Of Stack) and TOS 2.
STORE_FAST gets pushed to the top of stack after the BINARY_MULTIPLY operation, which stores the result in a local variable(multiplied_value) as per the funtion definition
LOAD_FAST instruction loads the variable(multiplied_value) from the local memory on top of the evaluation stack before returning the value
RETURN_VALUE instruction, returns the TOS position to the caller function so that instruction continues from there and it goes on.

There is always the question of why python execution is faster inside the function unlike outside. just like the scenario mentioned below,

case 1:

def main():
    for i in range(10**9):
        pass

if __name__ == '__main__':
    main()

The execution time for the above script is,

22.35s user 0.15s system 94% cpu 23.838 total

bytecode instructions for the disassembly part(inside function) looks like,

2           0 SETUP_LOOP              16 (to 18)
            2 LOAD_GLOBAL              0 (range)
            4 LOAD_CONST               1 (1000000000)
            6 CALL_FUNCTION            1
            8 GET_ITER
      >>   10 FOR_ITER                 4 (to 16)
            12 STORE_FAST               0 (i)

3          14 JUMP_ABSOLUTE           10
      >>   16 POP_BLOCK
      >>   18 LOAD_CONST               0 (None)
            20 RETURN_VALUE

case 2:

for i in range(10**9):
    pass

41.65s user 0.34s system 85% cpu 49.053 total

1           0 SETUP_LOOP              16 (to 18)
            2 LOAD_NAME                0 (range)
            4 LOAD_CONST               0 (1000000000)
            6 CALL_FUNCTION            1
            8 GET_ITER
      >>   10 FOR_ITER                 4 (to 16)
            12 STORE_NAME               1 (i)

2          14 JUMP_ABSOLUTE           10
      >>   16 POP_BLOCK
      >>   18 LOAD_CONST               1 (None)
            20 RETURN_VALUE

If it is called directly without any function, the execution time is observed that it take almost double the time to complete.

But what makes it different when the loop is executed inside the function like in the first case? Digging the bytecode instructions for these scenarios, STORE_FAST is used in case 1 which makes the execution faster but the latter uses STORE_NAME instruction.

why can’t CPython has the implementation of STORE_FAST everywhere if it can improve the performance?

STORE_FAST is used for referencing the local variables whereas the STORE_NAME is referred to as the global one.

if you look in-depth, you can see the implementation for global variables are in dictionary(which takes dictionary name, key elemen, and value as its argument) and for local variables are in lists with variable as its index(so that makes the lookup pretty fast). The convincing reason behind this implementation is assumed to be the local variable address will get flushed out after the termination of particular loop/function but the global variable has to keep track of variable with class and its object properties.