Data Types in Algorand Python
In this article, we’ll explore the differences between Python’s built-in data types, Algorand Virtual Machine types, and Algorand Python types.
What is a Data Type?
I like to think of a data type as a set of values, and a set of operations on those values.
For example, we can think of Python’s bool type as the set containing True and False:
$$Bool = \{True, False\}$$
And Python’s int type as the (infinite) set of all integers:
$$Int = \mathbb{Z} = \{...,-2,-1,0,1,2,...\}$$
While Python supports arbitrarily-large integers, smart contracts run in much more constrained environments.
The set of possible values for AVM types is always finite, and that has important implications for how we develop smart contracts.
Working with Integers
The AVM is a stack-based interpreter, with elements of the stack restricted to two types: uint64 and bytes.
These are represented by the UInt64 and Bytes types in Algorand Python.
The set of possible uint64 values is:
$$U_{64} = \{x \in \mathbb{N} \mid x \lt 2^{64}\}$$
So unlike int in Python, a uint64 cannot be negative, nor can it exceed \(2^{64}-1\).
This means the Python function:
def add(a: int, b: int) -> int:
    return a + b
Does not have the same domain and codomain as the Algorand Python subroutine:
@subroutine
def add(a: UInt64, b: UInt64) -> UInt64:
    return a + b
In the latter, we can’t add negative numbers, for example.
But there’s one more important difference to note: add in Python is a binary operation. In Algorand Python, it’s a partial binary operation.
What’s the difference?
A binary operation is a function that combines two elements of a set to produce another element of the same set:
$$f: S \times S \to S$$
The domain here is the set of ordered pairs:
$$dom(f) = S \times S = \{(a,b) \mid a \in S, b \in S\}$$
And by definition, the function has to be closed on \(S\), meaning:
$$\forall a,b \in S,\ f(a, b) \in S$$
This is true for int in Python: adding any two integers will return another integer.
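We can sanity-check this closure property in plain Python, using operands that would already overflow a uint64 (an illustrative snippet, not Algorand Python):

```python
# Python ints are arbitrary precision, so + is closed on int:
# even huge operands produce another int, never an overflow
a = 2**64 - 1
b = 2**64 - 1
total = a + b
assert isinstance(total, int)
assert total == 2**65 - 2
```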
But in the AVM, adding some pairs (a, b) is not possible, because the result would exceed the maximum possible value of a uint64.
That’s why the official definition of + in TEAL is: ‘A plus B. Fail on overflow.’
A partial binary operation, then, is one that maps only a subset of the pairs in \(S \times S\) to \(S\).
The subset of addable pairs is:
$$\{(a, b) \in U_{64} \times U_{64} \mid a + b \in U_{64}\}$$
Which is the same as saying:
$$\{(a, b) \in U_{64} \times U_{64} \mid b \le \max(U_{64}) - a\}$$
We can use this formula to define a total function for adding uint64s:
MAX_UINT64 = 2**64 - 1

@subroutine
def safe_add(a: UInt64, b: UInt64) -> UInt64:
    return a + b if b <= MAX_UINT64 - a else UInt64(MAX_UINT64)
But in doing so we’ve lost some information. Did we prevent the result from overflowing, or would it have fallen within the bounds anyway?
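The loss is easy to demonstrate with a plain-Python model of the clamping subroutine (ordinary ints stand in for UInt64 here):

```python
MAX_UINT64 = 2**64 - 1

def safe_add(a: int, b: int) -> int:
    # Plain-Python model of the clamping subroutine above
    return a + b if b <= MAX_UINT64 - a else MAX_UINT64

# A clamped result is indistinguishable from a sum that
# happened to land exactly on the maximum
assert safe_add(MAX_UINT64, 1) == safe_add(MAX_UINT64, 0)
```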
Let’s see if we can track this with product types.
Product Types
A product type is a composite type formed by combining one or more other types.
If we have a tuple (a: bool, b: bool), what’s the cardinality of the set of possible values?
It’s the product of the cardinality of each type:
$$|Bool| \times |Bool| = |\{True, False\}| \times |\{True, False\}|$$
Which is 4.
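We can confirm the count by enumerating the set in plain Python:

```python
from itertools import product

bools = (True, False)
# Every possible value of the tuple (a: bool, b: bool)
pairs = list(product(bools, bools))
assert len(pairs) == 4
```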
What about (a: bool, b: int)?
$$|Bool| \times |\mathbb{Z}| = 2 \times \infty$$
That’s a bit weird to think about.
Remembering that all Algorand Python types are finite, we know that the cardinality of (a: bool, b: UInt64) is the much more friendly \(2 \times 2^{64} = 2^{65}\).
Incidentally, bool and tuple are two of the built-in Python types that we can use in Algorand Python.
Let’s modify the safe_add function we wrote earlier, so that it returns a pair of (UInt64, bool):
MAX_UINT64 = 2**64 - 1

@subroutine
def safe_add(a: UInt64, b: UInt64) -> tuple[UInt64, bool]:
    if b <= MAX_UINT64 - a:
        return a + b, False
    return UInt64(MAX_UINT64), True
Now we know the result of the addition, and whether or not the result was limited.
But the meaning of the return type isn’t exactly obvious - what do True and False represent?
We can give them descriptive names by turning the tuple into a NamedTuple:
from typing import NamedTuple

MAX_UINT64 = 2**64 - 1

# Errors
NONE = False
OVERFLOW = True

class Result(NamedTuple):
    data: UInt64
    error: bool

@subroutine
def safe_add(a: UInt64, b: UInt64) -> Result:
    if b <= MAX_UINT64 - a:
        return Result(data=a + b, error=NONE)
    return Result(data=UInt64(MAX_UINT64), error=OVERFLOW)
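Here’s a plain-Python model of the same logic, with ordinary ints standing in for UInt64, that exercises both branches:

```python
from typing import NamedTuple

MAX_UINT64 = 2**64 - 1

class Result(NamedTuple):
    data: int  # stands in for UInt64 off-chain
    error: bool

def safe_add(a: int, b: int) -> Result:
    # Mirrors the Algorand Python subroutine above
    if b <= MAX_UINT64 - a:
        return Result(data=a + b, error=False)
    return Result(data=MAX_UINT64, error=True)

assert safe_add(1, 2) == Result(data=3, error=False)
assert safe_add(MAX_UINT64, 1) == Result(data=MAX_UINT64, error=True)
```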
Enum would be nice here for the constants, but it’s not supported yet in Algorand Python.
In Python, NamedTuple is a subclass of tuple. You can do some funky things with it, like accessing the element names dynamically:
>>> class Result(NamedTuple):
...     data: int
...     error: bool
>>> r = Result(data=1, error=False)
>>> print(r.data)
1
>>> print(getattr(r, 'data'))
1
If you attempt that in Algorand Python, you’ll get a compiler error:
Returning Any from function declared to return "UInt64"
So what have we actually defined here with the Result class?
The best way to think of it, in my opinion, is as a record type.
The identifiers we’ve given to the elements (data and error) are figments of the compiler - they do not exist in the runtime representation.
When this gets compiled down to AVM bytecode, the data will be represented the same way a tuple is represented.
That seems odd, right? It means the two are in some sense equivalent.
Technically speaking, tuples and records are the same up to an isomorphism, because there exists an invertible, structure-preserving mapping between the two.
Let’s define this in Algorand Python:
class Result(NamedTuple):
    data: UInt64
    error: bool

@subroutine
def tuple_to_result(pair: tuple[UInt64, bool]) -> Result:
    data, error = pair
    return Result(data, error)

@subroutine
def result_to_tuple(result: Result) -> tuple[UInt64, bool]:
    return result.data, result.error
tuple_to_result is a morphism \(f: a \to b\)
result_to_tuple is a morphism \(g: b \to a\)
We know that \(g\) is the inverse of \(f\), because their composition is the same as the identity morphism: lambda x: x.
In other words, applying the two in sequence gets us back the original argument:
>>> result_to_tuple(tuple_to_result(pair)) == pair
True
>>> tuple_to_result(result_to_tuple(result)) == result
True
But here’s where it gets really funky.
Python tuples are supported as subroutine arguments, local variables, and return types.
You may have wondered why tuples are allowed at all.
If each element on the AVM stack can only be a uint64 or a bytes, how can we have a tuple of (uint64, uint64)?
Tuples as Arguments to Subroutines
Let’s define a Python function to construct a pair (a, b) from its constituent parts:
def pair(a, b): return (a, b)
Notice this looks an awful lot like the identity function: \(f: (a, b) \to (a, b)\).
In Python, there’s a difference between a function that takes multiple arguments, and a function that takes a tuple as an argument:
def pair(ab: tuple): return ab
But I’m going to argue that in the category of Python functions, the two are the same up to isomorphism.
To convert the first function, which takes an a and a b, into a function that takes a pair of (a, b):
>>> def pair(a, b): return (a, b)
>>> def to_unary_fn(f):
...     return lambda ab: f(*ab)
>>> ab = (1, 2)
>>> to_unary_fn(pair)(ab)
(1, 2)
And inversely, to convert a function that takes a pair of (a, b) into one that takes two arguments, a and b:
>>> def pair(ab: tuple): return ab
>>> def to_binary_fn(f):
...     return lambda *ab: f(ab)
>>> to_binary_fn(pair)(1, 2)
(1, 2)
Composing the two should of course give us the identity morphism:
>>> def pair(ab: tuple): return ab
>>> def to_unary_fn(f):
...     return lambda ab: f(*ab)
>>> def to_binary_fn(f):
...     return lambda *ab: f(ab)
>>> ab = (1, 2)
>>> to_unary_fn(to_binary_fn(pair))(ab) == pair(ab)
True
In fact, any combination of arguments, pairs, and nested pairs is isomorphic.
I’ll spare you the details, but we could easily map (a, (b, c)) to ((a, b), c) and back.
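For the curious, the two reassociation morphisms could be sketched in plain Python like this:

```python
def reassoc(t):
    # (a, (b, c)) -> ((a, b), c)
    a, (b, c) = t
    return ((a, b), c)

def reassoc_inv(t):
    # ((a, b), c) -> (a, (b, c))
    (a, b), c = t
    return (a, (b, c))

# Composing the two in either order gives the identity
assert reassoc_inv(reassoc((1, (2, 3)))) == (1, (2, 3))
assert reassoc(reassoc_inv(((1, 2), 3))) == ((1, 2), 3)
```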
The power of the isomorphism lies in the fact that the AVM doesn’t need to support tuples natively.
Subroutines can have multiple arguments, so the compiler can simply transform a function that takes a tuple into a function that takes two arguments.
To prove this, let’s write a simple contract:
class Contract(ARC4Contract):
    @arc4.abimethod
    def add(self, a: UInt64, b: UInt64) -> UInt64:
        return a + b

    @arc4.abimethod
    def sum(self, pair: tuple[UInt64, UInt64]) -> UInt64:
        a, b = pair
        return a + b
The first method takes two arguments (ignoring self): a and b.
The second method takes a tuple of (a, b), unpacks it into separate variables, and adds the elements.
If we compile the contract and look at the generated TEAL code for the corresponding subroutines:
// contract.Contract.add(a: uint64, b: uint64) -> uint64:
add:
    // contract.py:41-42
    // @arc4.abimethod
    // def add(self, a: UInt64, b: UInt64) -> UInt64:
    proto 2 1
    // contract.py:43
    // return a + b
    frame_dig -2
    frame_dig -1
    +
    retsub

// contract.Contract.sum(pair.0: uint64, pair.1: uint64) -> uint64:
sum:
    // contract.py:45-46
    // @arc4.abimethod
    // def sum(self, pair: tuple[UInt64, UInt64]) -> UInt64:
    proto 2 1
    // contract.py:48
    // return a + b
    frame_dig -2
    frame_dig -1
    +
    retsub
They’re the same.
Notice that in the comments, we can see that the signature of add is still: add(a: uint64, b: uint64) -> uint64
But the signature of sum has been transformed: sum(pair.0: uint64, pair.1: uint64) -> uint64
It’s a thing of beauty.
But how was the tuple provided to the contract, and how did it get converted into two uint64s?
Let’s take a look at the Application Binary Interface (ABI).
Application Binary Interface (ABI)
The AVM only supports two types: uint64 and bytes. So any time we want to send something more complex to a smart contract, it has to be encoded in binary (as a bytes value).
In the example shown previously, we were able to concatenate the bytes of two uint64s together to form an encoded tuple.
The compiler is clever enough to know how to decode this into two uint64s, and push them onto the stack:
// class Contract(ARC4Contract):
txna ApplicationArgs 1
dup
intc_1 // 0
extract_uint64
swap
pushint 8 // 8
extract_uint64
// @arc4.abimethod
callsub sum
The extract_uint64 operation allows us to take any 64 bits (8 bytes) from a bytes value and turn it into a uint64.
So we can concatenate n integers, and extract the i-th element (zero-indexed) by calling extract_uint64(..., i * 8).
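We can model this layout off-chain in plain Python (an illustrative sketch of the encoding, not the SDK’s actual implementation):

```python
import struct

def encode_static_pair(a: int, b: int) -> bytes:
    # Two uint64s as big-endian bytes, simply concatenated:
    # static types need no length prefixes
    return struct.pack(">QQ", a, b)

def extract_uint64(data: bytes, offset: int) -> int:
    # Off-chain model of the AVM's extract_uint64 op
    return int.from_bytes(data[offset:offset + 8], "big")

encoded = encode_static_pair(1, 2)
assert len(encoded) == 16
assert extract_uint64(encoded, 0 * 8) == 1
assert extract_uint64(encoded, 1 * 8) == 2
```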
Notice how efficient this is. There’s no need to encode the length of each element, since it’s static and known at compile-time.
But how are our off-chain code and on-chain code acting so harmoniously?
How does the compiler know exactly how we’ve encoded a tuple of integers?
This is all specified in the ARC-4 Application Binary Interface standard.
It’s supported by both the SDKs for off-chain code (py-algorand-sdk and algokit-utils) and the compiler for on-chain code (Algorand Python).
It works so seamlessly that often you won’t notice the encoding and decoding taking place.
ARC-4 Data Types
ARC-4 gives us a bunch of additional data types to work with in our smart contracts, but with one important caveat: all ARC-4 types are represented as bytes in the AVM.
This is the source of a lot of confusion, so let’s walk through it in detail.
Algorand Python internally has a concept of a ‘bytes-backed type’.
This means that for any ARC-4 type you use, you can access the underlying Bytes, exactly as it’s represented in the AVM:
from typing import TypeAlias

Arc4Pair: TypeAlias = arc4.Tuple[arc4.UInt64, arc4.UInt64]

class Contract(ARC4Contract):
    @arc4.abimethod
    def to_bytes(self, pair: Arc4Pair) -> Bytes:
        return pair.bytes
So we have a mapping from an ARC-4 type to a Bytes type.
What about the inverse?
class Contract(ARC4Contract):
    ...

    @arc4.abimethod
    def from_bytes(self, pair: Bytes) -> Arc4Pair:
        return Arc4Pair.from_bytes(pair)
I’m sure you can see where I’m going with this:
class Contract(ARC4Contract):
    @arc4.abimethod
    def to_bytes(self, pair: Arc4Pair) -> Bytes:
        return pair.bytes

    @arc4.abimethod
    def from_bytes(self, pair: Bytes) -> Arc4Pair:
        return Arc4Pair.from_bytes(pair)

    @arc4.abimethod
    def from_after_to(self, pair: Arc4Pair) -> Arc4Pair:
        return self.from_bytes(self.to_bytes(pair))

    @arc4.abimethod
    def to_after_from(self, pair: Bytes) -> Bytes:
        return self.to_bytes(self.from_bytes(pair))
But there’s one thing worth noting here: while every Arc4Pair has a bytes representation, not every byte sequence corresponds to a valid Arc4Pair.
We can always go from an ARC-4 type to Bytes and back, but the reverse is not always true - some Bytes cannot be parsed as an ARC-4 type.
Here’s an interesting question (I hope!): is there a mapping between the operations on Arc4Pair and the operations on Bytes?
In a sense, there has to be. The AVM only has two types, uint64 and bytes, and operations on those types.
So, not only does every ARC-4 type have to be representable as bytes, but every operation on an ARC-4 type has to be composable from the operations on bytes.
Operations on ARC-4 Types
Let’s write a simple contract that concatenates two arc4.Strings:
class Contract(ARC4Contract):
    @arc4.abimethod
    def concat(self, a: arc4.String, b: arc4.String) -> arc4.String:
        return a + b
An arc4.String is encoded as ‘a variable length byte array prefixed with a 16-bit big endian header indicating the length of the data’ (source).
So to concatenate two strings in the AVM, it would make sense to:
1. Extract the two byte sequences (ignoring the headers)
2. Concatenate those bytes together
3. Calculate the length of the concatenated bytes
4. Convert the length integer to bytes and prepend it to the concatenated bytes
And that’s exactly what the compiled TEAL code does:
// contract.Contract.concat(a: bytes, b: bytes) -> bytes:
concat:
    // contract.py:70-71
    // @arc4.abimethod
    // def concat(self, a: arc4.String, b: arc4.String) -> arc4.String:
    proto 2 1
    // contract.py:72
    // return a + b
    frame_dig -2
    extract 2 0
    frame_dig -1
    extract 2 0
    concat
    dup
    len
    itob
    extract 6 2
    swap
    concat
    retsub
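The same steps can be modelled off-chain in plain Python (an illustrative sketch, not what the compiler emits):

```python
def arc4_encode(s: str) -> bytes:
    # ARC-4 string: 2-byte big-endian length header + UTF-8 data
    data = s.encode("utf-8")
    return len(data).to_bytes(2, "big") + data

def arc4_concat(a: bytes, b: bytes) -> bytes:
    # Mirror the TEAL above: strip both headers, concatenate,
    # then prepend the new length as a 2-byte header
    body = a[2:] + b[2:]
    return len(body).to_bytes(2, "big") + body

assert arc4_concat(arc4_encode("foo"), arc4_encode("bar")) == arc4_encode("foobar")
```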
When we transform ARC-4 types in a smart contract, we effectively lift the object into an AVM bytes value, apply the equivalent transformation, and then lower it back to an ARC-4 type.
If you find yourself asking the question ‘why can't I do x with an ARC-4 type?’, the answer is likely that there is no way to express the transformation using AVM operations on bytes, or that doing so would be computationally infeasible in a smart contract.
And that is a real sticking point - these abstractions come at a cost.
It took 12 operations just to concatenate two strings.
Surely we can do better.
Native Types
Algorand Python also offers ‘native’ types, which are mostly bounded variants of uint64 or bytes.
For example, bool is a uint64 bound to the range \(\{0, 1\}\).
Thinking back to types as sets, we can say that the possible values of bool are:
$$\{x \in U_{64} \mid x \in \{0, 1\}\}$$
And the set of operations on bool is exactly the same as the set of operations on uint64.
There’s often a clear path from an ARC-4 type to a native type, which can be accessed via the native property:
@subroutine
def to_native(x: arc4.Bool) -> bool:
    return x.native
Interestingly, there’s a native String type in Algorand Python that represents UTF-8 encoded strings.
It’s not as powerful as the Python str type, but it is more efficient than arc4.String for operations like concatenation.
We can use the native property to convert an arc4.String to a String:
@subroutine
def to_native(x: arc4.String) -> String:
    return x.native
Which removes the header indicating the length of the string.
We can also pass String arguments to ABI methods:
class Contract(ARC4Contract):
    @arc4.abimethod
    def concat(self, a: String, b: String) -> String:
        return a + b
But how is that possible?
The ABI method signature is actually identical in either case:
# ABI: concat(string,string)string
def concat(self, a: String, b: String) -> String: ...
# ABI: concat(string,string)string
def concat(self, a: arc4.String, b: arc4.String) -> arc4.String: ...
The difference is how the two are treated internally.
An ABI string is immediately parsed into a String by removing the header:
txna ApplicationArgs 1
extract 2 0
txna ApplicationArgs 2
extract 2 0
From that point onwards, operations like concatenation can be performed much more efficiently:
// contract.Contract.concat(a: bytes, b: bytes) -> bytes:
concat:
    // contract.py:83-84
    // @arc4.abimethod
    // def concat(self, a: String, b: String) -> String:
    proto 2 1
    // contract.py:85
    // return a + b
    frame_dig -2
    frame_dig -1
    concat
    retsub
One Final Musing
If you’ve made it this far in the blog post, you are clearly a glutton for punishment.
Earlier, I mentioned that:
We can always go from an ARC-4 type to Bytes and back, but the reverse is not always true…
Isomorphic types are those which can be cast to each other without loss of information.
A Bytes value can hold up to 4096 bytes of information.
An arc4.String can also have a length of 4096 bytes, but the first two bytes are always reserved for the header.
So what happens if we try to concatenate two arc4.Strings, each containing 2048 bytes?
class Contract(ARC4Contract):
    @arc4.abimethod
    def arc4_string_length(self) -> UInt64:
        return (
            arc4.String.from_bytes(op.bzero(2048)) + arc4.String.from_bytes(op.bzero(2048))
        ).bytes.length
First, the initial two bytes are removed from each of the strings (the compiler assumes these are encoded headers).
This leaves us with two byte-arrays of length 2046.
These are then concatenated together, forming a byte-array of length 4092.
And finally, a new 2-byte header is prepended, leaving us with a byte-array of length 4094.
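The arithmetic above generalises; a small plain-Python helper (hypothetical, for illustration) makes the header accounting explicit:

```python
def concat_encoded_length(n1: int, n2: int) -> int:
    # Length of the concatenation of two arc4.Strings whose full
    # encodings (header included) are n1 and n2 bytes long
    body = (n1 - 2) + (n2 - 2)  # both headers are stripped first
    return 2 + body             # one new header is prepended

assert concat_encoded_length(2048, 2048) == 4094
```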
By converting Bytes into an arc4.String, we’ve lost some information.
The compiled TEAL code is below:
// contract.Contract.arc4_string_length() -> uint64:
arc4_string_length:
    // contract.py:51
    // arc4.String.from_bytes(op.bzero(2048)) + arc4.String.from_bytes(op.bzero(2048))
    intc_2 // 2048
    bzero
    extract 2 0
    dup
    concat
    dup
    len
    itob
    extract 6 2
    swap
    concat
    // contract.py:50-52
    // return (
    //     arc4.String.from_bytes(op.bzero(2048)) + arc4.String.from_bytes(op.bzero(2048))
    // ).bytes.length
    len
    retsub
It is possible to end up with an arc4.String that takes up exactly 4096 bytes - we just have to change the code accordingly:
class Contract(ARC4Contract):
    @arc4.abimethod
    def arc4_string_length(self) -> UInt64:
        return (
            arc4.String.from_bytes(op.bzero(2049)) + arc4.String.from_bytes(op.bzero(2049))
        ).bytes.length
But critically, the first two bytes are always going to be reserved for the header.
So concatenating two arc4.Strings isn’t exactly the same as concatenating their bytes representations.
By contrast, if we use the native String type:
class Contract(ARC4Contract):
    @arc4.abimethod
    def native_string_length(self) -> UInt64:
        return (
            String.from_bytes(op.bzero(2048)) + String.from_bytes(op.bzero(2048))
        ).bytes.length
The result is 4096, as you might expect.
So, I would imagine, String and bytes are isomorphic.
Happy coding! 🥳