Example 1: Versioning
Welcome to Taichi AOT by Examples. This is a step-by-step tutorial to help you master everything about Taichi program deployment with ahead-of-time (AOT) compilation and the Taichi Runtime C-API. Because there is already plenty of documentation about writing Taichi programs in Python, we will focus on integrating the Taichi Runtime and AOT-compiled kernels into your native applications.
To start with, let's make the simplest possible API call to ensure the Taichi Runtime is in place. We can check the currently installed Taichi Runtime version with get_version().
ti::Version version = ti::get_version();
Taichi Runtime version is synchronized with the Python package, and it's recommended to use AOT modules and the runtime library from the same Taichi version.
std::cout << "hello, this is taichi runtime " << version.major() << "."
<< version.minor() << "." << version.patch() << "!" << std::endl;
The above C++ code may give the following output:
hello, this is taichi runtime 1.5.0!
Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/01-version
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Example 2: Taichi Arch
An arch is an execution backend of Taichi Runtime. Depending on build-time flags and the current platform installation, Taichi Runtime can support one or more archs. You can use get_available_archs() to enumerate all the archs available in the current environment.
std::vector<TiArch> archs = ti::get_available_archs();
std::cout << "the following archs are supported:" << std::endl;
for (TiArch arch : archs) {
switch (arch) {
case TI_ARCH_VULKAN:
std::cout << "- vulkan" << std::endl;
break;
case TI_ARCH_METAL:
std::cout << "- metal" << std::endl;
break;
case TI_ARCH_CUDA:
std::cout << "- cuda" << std::endl;
break;
case TI_ARCH_X64:
std::cout << "- x64" << std::endl;
break;
case TI_ARCH_ARM64:
std::cout << "- arm64" << std::endl;
break;
case TI_ARCH_OPENGL:
std::cout << "- opengl" << std::endl;
break;
case TI_ARCH_GLES:
std::cout << "- gles" << std::endl;
break;
default:
break;
}
}
The above C++ code may give the following output:
the following archs are supported:
- metal
Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/02-arch
Example 3: Runtime Instance
ti::Runtime is an instance context of the Taichi Runtime. It is responsible for device memory allocation, memory transfers, device state tracking, and Taichi kernel launches.
You can create a runtime instance targeting a specific API backend like this:
#ifdef __APPLE__
ti::Runtime runtime(TI_ARCH_METAL);
#else
ti::Runtime runtime(TI_ARCH_VULKAN);
#endif
std::cout << "created runtime" << std::endl;
If your working environment has multiple GPUs installed, you can choose which one to use with its device index. By default, Taichi selects the most powerful compute device according to an internal algorithm.
#ifdef __APPLE__
ti::Runtime runtime(TI_ARCH_METAL, 0);
#else
ti::Runtime runtime(TI_ARCH_VULKAN, 0);
#endif
std::cout << "created runtime with device #0" << std::endl;
The above C++ code may give the following output:
created runtime
created runtime with device #0
Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/03-runtime
Example 4: Memory Allocation
Allocate a piece of memory on device. Device memory is usually local to the compute device and is not accessible from the CPU.
ti::Memory device_memory = runtime.allocate_memory(4 * sizeof(uint32_t));
Host-accessible memory can be accessed from the CPU, but on-device memory traffic during kernel launches could be much slower.
ti::Memory host_accessible_memory =
runtime.allocate_memory(4 * sizeof(uint32_t), /*host_access=*/true);
You can map the device memory to get a host visible pointer to the memory content.
void *mapped = host_accessible_memory.map();
for (uint32_t i = 0; i < 4; ++i) {
((uint32_t*)mapped)[i] = i;
}
After host memory access, don't forget to unmap the memory. Some platforms don't allow the CPU and the GPU to access the same piece of memory at the same time, and violating this can lead to a crash.
host_accessible_memory.unmap();
You can also use read() and write() for convenience.
std::vector<uint32_t> readback_data(4);
host_accessible_memory.read(readback_data.data(),
readback_data.size() * sizeof(uint32_t));
std::cout << "readback data has the following values:";
for (uint32_t x : readback_data) {
std::cout << " " << x;
}
std::cout << std::endl;
Please note that Taichi Runtime doesn't validate memory mappings. Attempts to map non-host-accessible memory can lead to unrecoverable program termination (usually a segfault), so please do not map any device-only memory. The same rule applies to the read() and write() methods.
//void *a_null_ptr = device_memory.map();
The above C++ code may give the following output:
readback data has the following values: 0 1 2 3
Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/04-memory
Example 5: ND-Array
Compared with ti::Memory, an ND-array provides a more structured view over the underlying memory.
An ND-array is a multi-dimensional dense array, similar to ndarray in NumPy or Tensor in PyTorch, but not exactly the same. Because in simulation and graphics we operate not only on scalars but, more often, on vectors and matrices, Taichi's ND-array generalizes the concept of a tensor element: an element can be not only a scalar but also a vector or a matrix, hence the elem_shape attribute, which describes the shape of each element.
You can allocate a host-accessible 4x4 array of 2-component vectors like this:
ti::NdArray<float> arr = runtime.allocate_ndarray<float>({4, 4}, {2}, true);
You can access host-accessible ND-array data from the CPU with read() and write().
Note that ND-array has the same host-accessibility restriction as raw memory allocations. Accessing non-host-accessible ND-arrays from the CPU can lead to unrecoverable program termination (usually segfault).
std::vector<float> canvas(4 * 4 * 2);
for (size_t h = 0; h < 4; ++h) {
for (size_t w = 0; w < 4; ++w) {
canvas[(h * 4 + w) * 2 + 0] = (w + 0.5f) / 4.0f;
canvas[(h * 4 + w) * 2 + 1] = (h + 0.5f) / 4.0f;
}
}
arr.write(canvas);
To help communicate vectors and matrices between the CPU and the GPU, you can directly read() and write() composite types (structures) from a ti::NdArray, as long as the composite type's size is a multiple of the ND-array's scalar type size.
struct Vec2 {
float x;
float y;
};
static_assert(sizeof(Vec2) == sizeof(float) * 2, "mismatched size");
std::cout << "ndarray data:" << std::setprecision(3) << std::fixed << std::endl;
std::vector<Vec2> canvas2(4 * 4);
arr.read(canvas2);
for (size_t h = 0; h < 4; ++h) {
for (size_t w = 0; w < 4; ++w) {
const Vec2 &vec = canvas2[h * 4 + w];
std::cout << "(" << vec.x << "," << vec.y << ")" << " ";
}
std::cout << std::endl;
}
The above C++ code may give the following output:
ndarray data:
(0.125,0.125) (0.375,0.125) (0.625,0.125) (0.875,0.125)
(0.125,0.375) (0.375,0.375) (0.625,0.375) (0.875,0.375)
(0.125,0.625) (0.375,0.625) (0.625,0.625) (0.875,0.625)
(0.125,0.875) (0.375,0.875) (0.625,0.875) (0.875,0.875)
Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/05-ndarray
Example 6: Device Command
Taichi Runtime C-API functions not only include host-side procedures like memory allocation and mapping; there are also device commands like kernel launches and memory copies. In this example we demonstrate the concept of device commands and their correct usage, using a device-to-device memory copy.
ti::NdArray<float> src = runtime.allocate_ndarray<float>({4}, {}, true);
ti::NdArray<float> dst = runtime.allocate_ndarray<float>({4}, {}, true);
src.write({1.0f, 2.0f, 3.0f, 4.0f});
Enqueue a memory copy command to runtime's default queue; copy the underlying data of ND-array src to dst.
src.slice().copy_to(dst.slice());
Without synchronization we have no idea what's in dst right now. In this case, dst is very likely to be filled with zeros or random uninitialized data.
std::vector<float> data(4);
dst.read(data);
std::cout << "floats in `dst`:" << std::endl;
for (float x : data) {
std::cout << x << " ";
}
std::cout << std::endl;
Note that the compute device might not receive your device commands immediately, so the computation could be deferred. It's recommended to call flush() first to send all enqueued device commands to the device and start execution; however, this is not strictly necessary.
runtime.flush();
To guarantee that all previously enqueued device commands have finished execution, we have to wait on the runtime's default queue until it's done. This might block your CPU thread for a significant length of time. Again, it's recommended to call flush() first so that the GPU and CPU tasks run asynchronously and you can do something else on the CPU while the GPU is busy calculating.
runtime.wait();
After synchronization, we can be sure that the data in src has been copied to dst.
dst.read(data);
std::cout << "floats in `dst`:" << std::endl;
for (float x : data) {
std::cout << x << " ";
}
std::cout << std::endl;
The above C++ code may give the following output:
floats in `dst`:
0 0 0 0
floats in `dst`:
1 2 3 4
Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/06-device-command
Example 7: AOT Module
AOT modules are containers of ahead-of-time compiled Taichi kernels. You can compile AOT modules with the ti.aot.Module APIs in Python.
After compilation, you can load AOT modules from the filesystem directly like this:
ti::AotModule aot_module =
runtime.load_aot_module("07-aot-module/assets/module.tcm");
std::cout << "loaded aot module from filesystem" << std::endl;
But if you want more control over how the module is loaded, you can implement the loading logic yourself and create the AOT module from a data buffer. You can also use tools like bin2c to embed the module data in your source code.
std::ifstream f("07-aot-module/assets/module.tcm",
std::ios::binary | std::ios::in | std::ios::ate);
std::vector<uint8_t> tcm_data(f.tellg());
f.seekg(0, std::ios::beg);
f.read((char*)tcm_data.data(), tcm_data.size());
ti::AotModule aot_module = runtime.create_aot_module(tcm_data);
std::cout << "created aot module from buffer data" << std::endl;
If you build a Taichi AOT module with the following Python script:
module = ti.aot.Module()
module.archive(module_path)
The above C++ code may give the following output:
loaded aot module from filesystem
created aot module from buffer data
Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/07-aot-module
Example 8: Kernel and Positional Argument
Taichi AOT modules may contain precompiled Taichi kernels, which you can extract by name.
For example, we have a kernel called chessboard that prints a chessboard pattern of interleaving zeros and ones to an ND-array.
ti::AotModule aot_module =
runtime.load_aot_module("08-kernel/assets/module.tcm");
ti::Kernel k_chessboard = aot_module.get_kernel("chessboard");
To launch the kernel, we first need to assign the positional arguments. There are two ways to set up kernel arguments. First, you can assign arguments by their indices. This works best if you have many arguments but only a few of them are updated frequently.
k_chessboard[0] = arr;
If the argument list changes frequently, like when you are prototyping a new algorithm, you might want to clear and sequentially push the arguments so you don't have to reorder the indices all the time.
k_chessboard.clear_args();
k_chessboard.push_arg(arr);
When all the arguments are ready, you can launch the kernel with launch(). A kernel launch is a device command, so it will be pushed to the default queue of runtime.
k_chessboard.launch();
Remember that we don't know whether a device command has finished execution unless we explicitly wait() on it.
runtime.wait();
std::vector<uint32_t> arr_data(16);
arr.read(arr_data);
for (size_t h = 0; h < 4; ++h) {
for (size_t w = 0; w < 4; ++w) {
std::cout << arr_data.at(h * 4 + w) << " ";
}
std::cout << std::endl;
}
If you build a Taichi AOT module with the following Python script:
@ti.kernel
def chessboard(arr: ti.types.ndarray(dtype=ti.u32, ndim=2)):
for i, j in arr:
arr[i, j] = (i % 2) ^ (j % 2)
arr = ti.ndarray(ti.u32, (4, 4))
chessboard(arr)
print(arr.to_numpy())
module = ti.aot.Module()
module.add_kernel(chessboard, template_args={ "arr": arr })
module.archive(module_path)
The above C++ code may give the following output:
0 1 0 1
1 0 1 0
0 1 0 1
1 0 1 0
Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/08-kernel
Example 9: Error Handling
When your project is complicated enough, or you are targeting multiple platforms that might differ from your development environment, errors can occur, caused either by incorrect usage in your code or by an implementation defect in Taichi Runtime.
Taichi Runtime tries its best to ensure that incorrect usage doesn't propagate and crash your application. You might want to regularly check whether any error occurred during your previous API calls.
ti::Error error = ti::get_last_error();
In most cases it reports TI_ERROR_SUCCESS, indicating everything is fine.
assert(error.error == TI_ERROR_SUCCESS);
But if you missed something, you will get a semantic error code and a message telling you exactly what went wrong.
ti::Runtime runtime(TI_ARCH_MAX_ENUM);
error = ti::get_last_error();
std::cout << "error code: " << error.error << std::endl;
std::cout << "error message: " << error.message << std::endl;
For any API that constructs or returns a new object, the returned object won't be valid.
std::cout << "runtime is valid? " << (runtime.is_valid() ? "true" : "false")
<< std::endl;
The above C++ code may give the following output:
error code: -1
error message: arch
runtime is valid? false
Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/09-error