Data Types in Algorand Python

In this article, we’ll explore the differences between Python’s built-in data types, Algorand Virtual Machine types, and Algorand Python types.

What is a Data Type?

I like to think of a data type as a set of values, and a set of operations on those values.

For example, we can think of Python’s bool type as the set containing True and False:

$$Bool = \{True, False\}$$

And Python’s int type as the (infinite) set of all integers:

$$Int = \mathbb{Z} = \{...,-2,-1,0,1,2,...\}$$

While Python supports arbitrarily-large integers, smart contracts run in much more constrained environments.

The set of possible values for AVM types is always finite, and that has important implications for how we develop smart contracts.

Working with Integers

The AVM is a stack interpreter, with elements of the stack restricted to two types: uint64 and bytes.

These are represented by the UInt64 and Bytes types in Algorand Python.

The set of possible uint64 values is:

$$U_{64} = \{x \in \mathbb{N} \mid x \lt 2^{64}\}$$

So unlike int in Python, a uint64 cannot be negative, nor can it exceed \(2^{64}-1\).
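Before moving on, here's that value set as a plain-Python predicate (`is_uint64` is my own helper, not an SDK function):

```python
# Plain-Python check for membership in the uint64 value set
def is_uint64(x: int) -> bool:
    return 0 <= x < 2**64

print(is_uint64(-1))         # False: negative values are excluded
print(is_uint64(2**64))      # False: one past the maximum
print(is_uint64(2**64 - 1))  # True: the largest uint64
```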

This means the Python function:

def add(a: int, b: int) -> int:
    return a + b

Does not have the same domain and codomain as the Algorand Python subroutine:

@subroutine
def add(a: UInt64, b: UInt64) -> UInt64:
    return a + b

In the latter, we can’t add negative numbers, for example.

But there’s one more important difference to note: add in Python is a binary operation. In Algorand Python, it’s a partial binary operation.

What’s the difference?

A binary operation is a function that combines two elements of a set to produce another element of the same set:

$$f: S \times S \to S$$

The domain here is the set of ordered pairs:

$$dom(f) = S \times S = \{(a,b) \mid a \in S, b \in S\}$$

And by definition, the function has to be closed on \(S\), meaning:

$$\forall a,b \in S,\ f(a, b) \in S$$

This is true for int in Python: adding any two integers will return another integer.

But in the AVM, adding some pairs of (a, b) is not possible, because the result would exceed the maximum possible value of a uint64.
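We can model this partiality in ordinary Python (a sketch only; `checked_add` is a hypothetical name) by making the operation fail outside its domain, just as the AVM does:

```python
MAX_UINT64 = 2**64 - 1

def checked_add(a: int, b: int) -> int:
    # Partial operation: defined only where the sum stays within uint64 bounds
    if a + b > MAX_UINT64:
        raise OverflowError("uint64 addition overflow")
    return a + b

print(checked_add(1, 2))  # 3
# checked_add(MAX_UINT64, 1) would raise OverflowError: the pair is outside the domain
```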

That’s why the official definition of + in TEAL is simply: ‘A plus B. Fail on overflow.’

A partial binary function, then, is one that maps only a subset of the pairs in \(S \times S\) to \(S\).

The subset of addable pairs is:

$$\{(a, b) \in U_{64} \times U_{64} \mid a + b \in U_{64}\}$$

Which is the same as saying:

$$\{(a, b) \in U_{64} \times U_{64} \mid b \le \max(U_{64}) - a\}$$

We can use this formula to define a total function for adding uint64s:

MAX_UINT64 = 2**64 - 1

@subroutine
def safe_add(a: UInt64, b: UInt64) -> UInt64:
    return a + b if b <= MAX_UINT64 - a else UInt64(MAX_UINT64)

But in doing so we’ve lost some information. Did we prevent the result from overflowing, or would it have fallen within the bounds anyway?

Let’s see if we can track this with product types.

Product Types

A product type is a composite structure formed from combining one or more types.

If we have a tuple: (a: bool, b: bool), what’s the cardinality of the set of possible values?

It’s the product of the cardinality of each type:

$$|Bool| \times |Bool| = |\{True, False\}| \times |\{True, False\}|$$

Which is 4.
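We can confirm the count by enumerating the Cartesian product in plain Python:

```python
from itertools import product

# All possible (bool, bool) pairs: the Cartesian product Bool × Bool
pairs = list(product([True, False], repeat=2))
print(pairs)       # [(True, True), (True, False), (False, True), (False, False)]
print(len(pairs))  # 4
```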

What about (a: bool, b: int)?

$$|Bool| \times |\mathbb{Z}| = 2 \times \infty$$

That’s a bit weird to think about.

Remembering that all Algorand Python types are finite, we know that the cardinality of (a: bool, b: UInt64) is the much more friendly \(2 \times 2^{64}\).

Incidentally, bool and tuple are two of the built-in Python types that we can use in Algorand Python.

Let’s modify the safe_add function we wrote earlier, so that it returns a pair of UInt64, bool:

MAX_UINT64 = 2**64 - 1

@subroutine
def safe_add(a: UInt64, b: UInt64) -> tuple[UInt64, bool]:
    if b <= MAX_UINT64 - a:
        return a + b, False
    return UInt64(MAX_UINT64), True

Now we know the result of the addition, and whether or not the result was limited.

But the meaning of the return type isn’t exactly obvious - what do True and False represent?

We can give them descriptive names by turning the tuple into a NamedTuple:

from typing import NamedTuple


MAX_UINT64 = 2**64 - 1

# Errors
NONE = False
OVERFLOW = True

class Result(NamedTuple):
    data: UInt64
    error: bool

@subroutine
def safe_add(a: UInt64, b: UInt64) -> Result:
    if b <= MAX_UINT64 - a:
        return Result(data=a + b, error=NONE)
    return Result(data=UInt64(MAX_UINT64), error=OVERFLOW)
💡
An Enum would be nice here for the constants, but it’s not supported yet in Algorand Python.
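As a plain-Python model of the subroutine above (with int standing in for UInt64, so it runs outside the AVM):

```python
from typing import NamedTuple

MAX_UINT64 = 2**64 - 1

class Result(NamedTuple):
    data: int
    error: bool

def safe_add(a: int, b: int) -> Result:
    # Clamp to the maximum and flag the overflow, mirroring the subroutine above
    if b <= MAX_UINT64 - a:
        return Result(data=a + b, error=False)
    return Result(data=MAX_UINT64, error=True)

print(safe_add(1, 2))           # Result(data=3, error=False)
print(safe_add(MAX_UINT64, 1))  # Result(data=18446744073709551615, error=True)
```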

In Python, NamedTuple is a subclass of tuple. You can do some funky things with it like accessing the element names dynamically:

>>> class Result(NamedTuple):
...     data: int
...     error: bool

>>> r = Result(data=1, error=False)
>>> print(r.data)
1
>>> print(getattr(r, 'data'))
1

If you attempt that in Algorand Python, you’ll get a compiler error:

Returning Any from function declared to return "UInt64"

So what have we actually defined here with the Result class?

The best way to think of it, in my opinion, is as a record type.

The identifiers we’ve given to the elements (data and error) are figments of the compiler - they do not exist in the runtime representation.

When this gets compiled down to AVM bytecode, the data will be represented the same way a tuple is represented.

That seems odd, right? It means the two are in some sense equivalent.

Technically speaking, tuples and records are the same up to an isomorphism, because there exists an invertible, structure-preserving mapping between the two.

Let’s define this in Algorand Python:

class Result(NamedTuple):
    data: UInt64
    error: bool

@subroutine
def tuple_to_result(pair: tuple[UInt64, bool]) -> Result:
    data, error = pair
    return Result(data, error)

@subroutine
def result_to_tuple(result: Result) -> tuple[UInt64, bool]:
    return result.data, result.error

tuple_to_result is a morphism \(f: a \to b\)

result_to_tuple is a morphism \(g: b \to a\)

We know that \(g\) is the inverse of \(f\), because their composition is the same as the identity morphism: lambda x: x.

In other words, applying the two in sequence gets us back the original argument:

>>> result_to_tuple(tuple_to_result(pair)) == pair
True
>>> tuple_to_result(result_to_tuple(result)) == result
True
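The same round trip can be checked in ordinary Python, with int standing in for UInt64:

```python
from typing import NamedTuple

class Result(NamedTuple):
    data: int
    error: bool

def tuple_to_result(pair: tuple) -> Result:
    data, error = pair
    return Result(data, error)

def result_to_tuple(result: Result) -> tuple:
    return result.data, result.error

pair = (42, False)
result = Result(42, False)
assert result_to_tuple(tuple_to_result(pair)) == pair      # g ∘ f = id
assert tuple_to_result(result_to_tuple(result)) == result  # f ∘ g = id
```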

But here’s where it gets really funky.

Python tuples are supported as arguments to subroutines, local variables, return types.

Source: Algorand Python Documentation - Types

You may have wondered why tuples are allowed at all.

If each element in the AVM stack can only be a uint64 or a bytes type, how can we have a tuple of (uint64, uint64)?

Tuples as Arguments to Subroutines

Let’s define a Python function to construct a pair of (a, b) from its constituent parts:

def pair(a, b): return (a, b)

Notice this looks awfully like the identity function: \(f: (a, b) \to (a, b)\).

In Python, there’s a difference between a function that takes multiple arguments, and a function that takes a tuple as an argument:

def pair(ab: tuple): return ab

But I’m going to argue that in the category of Python functions, the two are the same up to isomorphism.

To convert the first function, which takes an a and a b, into a function that takes a pair of (a, b):

>>> def pair(a, b): return (a, b)
>>> def to_unary_fn(f):
...     return lambda ab: f(*ab)
>>> ab = (1, 2)
>>> to_unary_fn(pair)(ab)
(1, 2)

And inversely, to convert a function that takes a pair of (a, b) into one that takes two arguments, a and b:

>>> def pair(ab: tuple): return ab
>>> def to_binary_fn(f):
...     return lambda *ab: f(ab)
>>> to_binary_fn(pair)(1, 2)
(1, 2)

Composing the two should of course give us the identity morphism:

>>> def pair(ab: tuple): return ab
>>> def to_unary_fn(f):
...     return lambda ab: f(*ab)
>>> def to_binary_fn(f):
...     return lambda *ab: f(ab)
>>> ab = (1, 2)
>>> to_unary_fn(to_binary_fn(pair))(ab) == pair(ab)
True

In fact, any two ways of grouping arguments into pairs and nested pairs are isomorphic.

I’ll spare you the details, but we could easily map (a, (b, c)) to ((a, b), c) and back.
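For the curious, here's that mapping sketched in plain Python (the `assoc_*` names are mine):

```python
def assoc_right_to_left(nested):
    # (a, (b, c)) -> ((a, b), c)
    a, (b, c) = nested
    return ((a, b), c)

def assoc_left_to_right(nested):
    # ((a, b), c) -> (a, (b, c))
    (a, b), c = nested
    return (a, (b, c))

x = (1, (2, 3))
print(assoc_right_to_left(x))  # ((1, 2), 3)
assert assoc_left_to_right(assoc_right_to_left(x)) == x  # composition is the identity
```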

The power of the isomorphism lies in the fact that the AVM doesn’t need to support tuples natively.

Subroutines can have multiple arguments, so the compiler can simply transform a function that takes a tuple into a function that takes two arguments.

To prove this, let’s write a simple contract:

class Contract(ARC4Contract):

    @arc4.abimethod
    def add(self, a: UInt64, b: UInt64) -> UInt64:
        return a + b

    @arc4.abimethod
    def sum(self, pair: tuple[UInt64, UInt64]) -> UInt64:
        a, b = pair
        return a + b

The first method takes two arguments (ignoring self): a and b.

The second method takes a tuple of (a, b), unpacks it into separate variables, and adds the elements.

If we compile the contract and look at the generated TEAL code for the corresponding subroutines:

// contract.Contract.add(a: uint64, b: uint64) -> uint64:
add:
    // contract.py:41-42
    // @arc4.abimethod
    // def add(self, a: UInt64, b: UInt64) -> UInt64:
    proto 2 1
    // contract.py:43
    // return a + b
    frame_dig -2
    frame_dig -1
    +
    retsub


// contract.Contract.sum(pair.0: uint64, pair.1: uint64) -> uint64:
sum:
    // contract.py:45-46
    // @arc4.abimethod
    // def sum(self, pair: tuple[UInt64, UInt64]) -> UInt64:
    proto 2 1
    // contract.py:48
    // return a + b
    frame_dig -2
    frame_dig -1
    +
    retsub

They’re the same.

Notice that in the comments, we can see that the signature of add is still: add(a: uint64, b: uint64) -> uint64

But the signature of sum has been transformed: sum(pair.0: uint64, pair.1: uint64) -> uint64

It’s a thing of beauty.

But how was the tuple provided to the contract, and how did it get converted into two uint64s?

Let’s take a look at the Application Binary Interface (ABI).

Application Binary Interface (ABI)

The AVM only supports two types: uint64 and bytes. So any time we want to send something more complex to a smart contract, it has to be encoded in binary (as a bytes type).

In the example shown previously, we were able to concatenate the bytes of two uint64s together to form an encoded tuple.

The compiler is clever enough to know how to decode this into two uint64s, and push them onto the stack:

    // class Contract(ARC4Contract):
    txna ApplicationArgs 1
    dup
    intc_1 // 0
    extract_uint64
    swap
    pushint 8 // 8
    extract_uint64
    // @arc4.abimethod
    callsub sum

The extract_uint64 operation allows us to take any 64 bits (8 bytes) from a bytes type and turn it into a uint64.

So we can concatenate any number of integers, and extract the ith element (counting from zero) by calling extract_uint64(..., i * 8).

Notice how efficient this is. There’s no need to encode the length of each element, since it’s static and known at compile-time.
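We can model this encoding scheme in plain Python with the struct module (`encode_uint64s` is a made-up helper, and `extract_uint64` mirrors the AVM op rather than being the real one):

```python
import struct

def encode_uint64s(*values: int) -> bytes:
    # Concatenate each value as 8 big-endian bytes, just like the ARC-4 encoding
    return b"".join(struct.pack(">Q", v) for v in values)

def extract_uint64(data: bytes, offset: int) -> int:
    # Mirror of the AVM extract_uint64 op: read 8 bytes at `offset` as big-endian
    return struct.unpack_from(">Q", data, offset)[0]

encoded = encode_uint64s(7, 42, 99)
print(len(encoded))                    # 24: three 8-byte elements, no length prefixes
print(extract_uint64(encoded, 0 * 8))  # 7
print(extract_uint64(encoded, 2 * 8))  # 99
```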

But how do our off-chain and on-chain code act so harmoniously?

How does the compiler know exactly how we’ve encoded a tuple of integers?

This is all specified in the ARC-4 Application Binary Interface standard.

It’s supported by both the SDKs for off-chain code (py-algorand-sdk and algokit-utils) and the compiler for on-chain code (Algorand Python).

It works so seamlessly that often you won’t notice the encoding and decoding taking place.

ARC-4 Data Types

ARC-4 gives us a bunch of additional data types to work with in our smart contracts, but with one important caveat: all ARC-4 types are represented as bytes in the AVM.

This is the source of a lot of confusion, so let’s walk through it in detail.

Algorand Python internally has a concept of a ‘bytes-backed type’.

This means that for any ARC-4 type you use, you can access the underlying Bytes, exactly as it’s represented in the AVM:

from typing import TypeAlias

Arc4Pair: TypeAlias = arc4.Tuple[arc4.UInt64, arc4.UInt64]

class Contract(ARC4Contract):

    @arc4.abimethod
    def to_bytes(self, pair: Arc4Pair) -> Bytes:
        return pair.bytes

So we have a mapping from an ARC-4 type to a Bytes type.

What about the inverse?

class Contract(ARC4Contract):

    ...

    @arc4.abimethod
    def from_bytes(self, pair: Bytes) -> Arc4Pair:
        return Arc4Pair.from_bytes(pair)

I’m sure you can see where I’m going with this:

class Contract(ARC4Contract):

    @arc4.abimethod
    def to_bytes(self, pair: Arc4Pair) -> Bytes:
        return pair.bytes

    @arc4.abimethod
    def from_bytes(self, pair: Bytes) -> Arc4Pair:
        return Arc4Pair.from_bytes(pair)

    @arc4.abimethod
    def from_after_to(self, pair: Arc4Pair) -> Arc4Pair:
        return self.from_bytes(self.to_bytes(pair))

    @arc4.abimethod
    def to_after_from(self, pair: Bytes) -> Bytes:
        return self.to_bytes(self.from_bytes(pair))

But there’s one thing worth noting here: while every Arc4Pair has a bytes representation, not every byte sequence corresponds to a valid Arc4Pair.

We can always go from an ARC-4 type to Bytes and back, but the reverse is not always true - some Bytes cannot be parsed as an ARC-4 type.
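To make that concrete, here's a plain-Python predicate (my own sketch) that decides whether a byte sequence is a valid ARC-4 string encoding:

```python
import struct

def is_valid_arc4_string(data: bytes) -> bool:
    # Valid iff a 2-byte big-endian length header matches the remaining payload
    if len(data) < 2:
        return False
    (declared_length,) = struct.unpack_from(">H", data, 0)
    return declared_length == len(data) - 2

print(is_valid_arc4_string(b"\x00\x03foo"))  # True: header says 3, payload is 3 bytes
print(is_valid_arc4_string(b"foo"))          # False: header would claim 26223 bytes
```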

Here’s an interesting question (I hope!): is there a mapping between the operations on Arc4Pair and the operations on Bytes?

In a sense, there has to be. The AVM only has two types: uint64 and bytes, and operations on those types.

So, not only does every ARC-4 type have to be representable as bytes, but every operation on an ARC-4 type has to be composable from the operations on bytes.

Operations on ARC-4 Types

Let’s write a simple contract that concatenates two arc4.Strings:

class Contract(ARC4Contract):

    @arc4.abimethod
    def concat(self, a: arc4.String, b: arc4.String) -> arc4.String:
        return a + b

An arc4.String is encoded as ‘a variable length byte array prefixed with a 16-bit big endian header indicating the length of the data’ (source).

So to concatenate two strings in the AVM, it would make sense to:

  1. Extract the two byte sequences (ignoring the header)

  2. Concatenate those bytes together

  3. Calculate the length of the concatenated bytes

  4. Convert the length integer to bytes and prepend it to the concatenated bytes

And that’s exactly what the compiled TEAL code does:

// contract.Contract.concat(a: bytes, b: bytes) -> bytes:
concat:
    // contract.py:70-71
    // @arc4.abimethod
    // def concat(self, a: arc4.String, b: arc4.String) -> arc4.String:
    proto 2 1
    // contract.py:72
    // return a + b
    frame_dig -2
    extract 2 0
    frame_dig -1
    extract 2 0
    concat
    dup
    len
    itob
    extract 6 2
    swap
    concat
    retsub
💡
You can find the complete list of TEAL opcodes here.
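Here's the same four-step recipe in plain Python, operating on the raw encoded bytes (both helper names are mine):

```python
import struct

def arc4_string_encode(payload: bytes) -> bytes:
    # 16-bit big-endian length header followed by the raw bytes
    return struct.pack(">H", len(payload)) + payload

def arc4_string_concat(a: bytes, b: bytes) -> bytes:
    # Steps 1-2: strip each 2-byte header and concatenate the payloads
    body = a[2:] + b[2:]
    # Steps 3-4: compute the new length and prepend it as a fresh header
    return struct.pack(">H", len(body)) + body

greeting = arc4_string_encode(b"Hello, ")
name = arc4_string_encode(b"World!")
print(arc4_string_concat(greeting, name))  # b'\x00\rHello, World!' (13-byte payload)
```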

When we transform ARC-4 types in a smart contract, we effectively lift the object into an AVM bytes type, apply the equivalent transformation, and then lower it back to an ARC-4 type.

If you find yourself asking the question: ‘why can’t I do x with an ARC-4 type?’, the answer is likely that there is no way to express the transformation using AVM operations on bytes, or that doing so would be computationally infeasible in a smart contract.

And that is a real sticking point - these abstractions come at a cost.

It took 12 operations just to concatenate two strings.

Surely we can do better.

Native Types

Algorand Python also offers ‘native’ types, which are mostly bounded variants of uint64 or bytes.

For example, bool is a uint64 restricted to the set \(\{0, 1\}\).

Thinking back to types as sets, we can say that the possible values of bool are:

$$\{x \in U_{64} \mid x \in \{0, 1\}\}$$

And the set of operations on bool is exactly the same as the set of operations on uint64.

There’s often a clear path from an ARC-4 type to a native type, which can be accessed via the native property:

@subroutine
def to_native(x: arc4.Bool) -> bool:
    return x.native

Interestingly, there’s a native String type in Algorand Python that represents UTF-8 encoded strings.

It’s not as powerful as the Python str type, but it is more efficient than arc4.String for operations like concatenation.

We can use the native property to convert an arc4.String to a String:

@subroutine
def to_native(x: arc4.String) -> String:
    return x.native

Which removes the header indicating the length of the string.

We can also pass String arguments to ABI methods:

class Contract(ARC4Contract):

    @arc4.abimethod
    def concat(self, a: String, b: String) -> String:
        return a + b

But how is that possible?

The ABI method signature is actually identical in either case:

# ABI: concat(string,string)string
def concat(self, a: String, b: String) -> String: ...

# ABI: concat(string,string)string
def concat(self, a: arc4.String, b: arc4.String) -> arc4.String: ...

The difference is how the two are treated internally.

An ABI string is immediately parsed into a String by removing the header:

    txna ApplicationArgs 1
    extract 2 0
    txna ApplicationArgs 2
    extract 2 0

From that point onwards, operations like concatenation can be performed much more efficiently:

// contract.Contract.concat(a: bytes, b: bytes) -> bytes:
concat:
    // contract.py:83-84
    // @arc4.abimethod
    // def concat(self, a: String, b: String) -> String:
    proto 2 1
    // contract.py:85
    // return a + b
    frame_dig -2
    frame_dig -1
    concat
    retsub

One Final Musing

If you’ve made it this far in the blog post, you are clearly a glutton for punishment.

Earlier, I mentioned that:

We can always go from an ARC-4 type to Bytes and back, but the reverse is not always true…

Isomorphic types are those which can be cast to each other without loss of information.

A Bytes value can hold up to 4096 bytes of information.

An arc4.String can also have a length of 4096 bytes, but the first two bytes are always reserved for the header.

So what happens if we try to concatenate two arc4.Strings, each containing 2048 bytes?

class Contract(ARC4Contract):

    @arc4.abimethod
    def arc4_string_length(self) -> UInt64:
        return (
            arc4.String.from_bytes(op.bzero(2048)) + arc4.String.from_bytes(op.bzero(2048))
        ).bytes.length

First, the initial two bytes are removed from each of the strings (the compiler assumes these are encoded headers).

This leaves us with two byte-arrays of length 2046.

These are then concatenated together, forming a byte-array of length 4092.

And finally, a new 2-byte header is prepended, leaving us with a byte-array of length 4094.

By converting Bytes into an arc4.String, we’ve lost some information.
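The arithmetic is easy to sanity-check in plain Python (`concat_encoded_length` is a throwaway helper):

```python
def concat_encoded_length(a_len: int, b_len: int) -> int:
    # Strip a 2-byte header from each operand, concatenate, prepend one new header
    return (a_len - 2) + (b_len - 2) + 2

print(concat_encoded_length(2048, 2048))  # 4094: two bytes of capacity lost
```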

The compiled TEAL code is below:

// contract.Contract.arc4_string_length() -> uint64:
arc4_string_length:
    // contract.py:51
    // arc4.String.from_bytes(op.bzero(2048)) + arc4.String.from_bytes(op.bzero(2048))
    intc_2 // 2048
    bzero
    extract 2 0
    dup
    concat
    dup
    len
    itob
    extract 6 2
    swap
    concat
    // contract.py:50-52
    // return (
    //     arc4.String.from_bytes(op.bzero(2048)) + arc4.String.from_bytes(op.bzero(2048))
    // ).bytes.length
    len
    retsub

It is possible to end up with an arc4.String that takes up exactly 4096 bytes - we just have to change the code accordingly:

class Contract(ARC4Contract):

    @arc4.abimethod
    def arc4_string_length(self) -> UInt64:
        return (
            arc4.String.from_bytes(op.bzero(2049)) + arc4.String.from_bytes(op.bzero(2049))
        ).bytes.length

But critically, the first two bytes are always going to be reserved for the header.

So concatenating two arc4.Strings isn’t exactly the same as concatenating their bytes representations.

By contrast, if we use the native String type:

class Contract(ARC4Contract):

    @arc4.abimethod
    def native_string_length(self) -> UInt64:
        return (
            String.from_bytes(op.bzero(2048)) + String.from_bytes(op.bzero(2048))
        ).bytes.length

The result is 4096, as you might expect.

So, I would imagine, String and bytes are isomorphic.

Happy coding! 🥳