Example 1: Versioning

Welcome to Taichi AOT by Examples. This is a step-by-step tutorial to help you master Taichi program deployment with ahead-of-time (AOT) compilation and the Taichi Runtime C-API. Because there is already plenty of documentation about writing Taichi programs in Python, we will focus on integrating the Taichi Runtime and AOT-compiled kernels into your native applications.

To start with, let's make the simplest possible API call to ensure the Taichi Runtime is in place. We can check the currently installed Taichi Runtime version with get_version.

ti::Version version = ti::get_version();

The Taichi Runtime version is synchronized with the Python package, and it's recommended to use AOT modules and the runtime library from the same Taichi version.

std::cout << "hello, this is taichi runtime " << version.major() << "."
          << version.minor() << "." << version.patch() << "!" << std::endl;

The above C++ code may give the following output:

hello, this is taichi runtime 1.5.0!
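
If your application ships AOT modules built against a specific Taichi version, you might want to fail fast when the installed runtime is older than expected. Below is a minimal sketch; the 1.5 minimum used here is just an assumption matching the output above:

// Hypothetical minimum version required by the AOT modules we ship.
const uint32_t REQUIRED_MAJOR = 1;
const uint32_t REQUIRED_MINOR = 5;
if (version.major() < REQUIRED_MAJOR ||
    (version.major() == REQUIRED_MAJOR && version.minor() < REQUIRED_MINOR)) {
  std::cerr << "installed taichi runtime is older than the bundled aot modules expect"
            << std::endl;
}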

Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/01-version

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Example 2: Taichi Arch

An arch is an execution backend of Taichi Runtime. Depending on build-time flags and what is installed on the current platform, Taichi Runtime can support one or more archs. You can use get_available_archs to enumerate all the archs available in the current environment.

std::vector<TiArch> archs = ti::get_available_archs();

std::cout << "the following archs are supported:" << std::endl;
for (TiArch arch : archs) {
  switch (arch) {
  case TI_ARCH_VULKAN:
    std::cout << "- vulkan" << std::endl;
    break;
  case TI_ARCH_METAL:
    std::cout << "- metal" << std::endl;
    break;
  case TI_ARCH_CUDA:
    std::cout << "- cuda" << std::endl;
    break;
  case TI_ARCH_X64:
    std::cout << "- x64" << std::endl;
    break;
  case TI_ARCH_ARM64:
    std::cout << "- arm64" << std::endl;
    break;
  case TI_ARCH_OPENGL:
    std::cout << "- opengl" << std::endl;
    break;
  case TI_ARCH_GLES:
    std::cout << "- gles" << std::endl;
    break;
  default:
    break;
  }
}

The above C++ code may give the following output:

the following archs are supported:
- metal
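
If your application supports several backends, one possible pattern is to pick the first arch you can handle from this list and pass it to the runtime constructor introduced in the next example. Below is a sketch assuming a Vulkan-then-Metal-then-CUDA preference; adjust the order to your own priorities:

// Pick the first backend we can handle, in order of preference.
TiArch chosen = TI_ARCH_MAX_ENUM;
for (TiArch preferred : {TI_ARCH_VULKAN, TI_ARCH_METAL, TI_ARCH_CUDA}) {
  if (chosen != TI_ARCH_MAX_ENUM) {
    break;
  }
  for (TiArch arch : archs) {
    if (arch == preferred) {
      chosen = arch;
      break;
    }
  }
}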

Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/02-arch

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Example 3: Runtime Instance

ti::Runtime is an instance context of the Taichi Runtime. It is responsible for device memory allocation, memory transfers, device state tracking, and Taichi kernel launches.

You can create a runtime instance targeting a specific arch like this:

#ifdef __APPLE__
ti::Runtime runtime(TI_ARCH_METAL);
#else
ti::Runtime runtime(TI_ARCH_VULKAN);
#endif
std::cout << "created runtime" << std::endl;

If your working environment has multiple GPUs installed, you can choose which one to use by its device index. By default, Taichi selects the most powerful compute device according to an internal algorithm.

#ifdef __APPLE__
ti::Runtime runtime(TI_ARCH_METAL, 0);
#else
ti::Runtime runtime(TI_ARCH_VULKAN, 0);
#endif
std::cout << "created runtime with device #0" << std::endl;

The above C++ code may give the following output:

created runtime
created runtime with device #0

Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/03-runtime

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Example 4: Memory Allocation

Let's allocate a piece of memory on the device. Device memory is usually local to the compute device and is not accessible from the CPU.

ti::Memory device_memory = runtime.allocate_memory(4 * sizeof(uint32_t));

Host-accessible memory can be accessed from the CPU, but its on-device memory traffic during kernel launches can be much slower.

ti::Memory host_accessible_memory =
    runtime.allocate_memory(4 * sizeof(uint32_t), /*host_access=*/true);

You can map host-accessible memory to get a host-visible pointer to the memory content.

void *mapped = host_accessible_memory.map();
for (uint32_t i = 0; i < 4; ++i) {
  ((uint32_t*)mapped)[i] = i;
}

After host access, don't forget to unmap the memory. Some platforms don't allow the CPU and the GPU to access the same piece of memory at the same time, and leaving it mapped can lead to a crash.

host_accessible_memory.unmap();

You can also use read() and write() for convenience.

std::vector<uint32_t> readback_data(4);
host_accessible_memory.read(readback_data.data(),
                            readback_data.size() * sizeof(uint32_t));

std::cout << "readback data has the following values:";
for (uint32_t x : readback_data) {
  std::cout << " " << x;
}
std::cout << std::endl;
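
write() is the mirror of read(): it uploads a host buffer in one call, without an explicit map() and unmap(). Below is a minimal sketch, assuming write() takes a pointer and a byte size like read() above:

std::vector<uint32_t> upload_data = {4, 5, 6, 7};
host_accessible_memory.write(upload_data.data(),
                             upload_data.size() * sizeof(uint32_t));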

Please note that Taichi Runtime doesn't validate memory mappings. Attempting to map non-host-accessible memory can lead to unrecoverable program termination (usually a segfault), so please do not map any device-only memory. The same rule applies to the read() and write() methods.

//void *a_null_ptr = device_memory.map();

The above C++ code may give the following output:

readback data has the following values: 0 1 2 3

Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/04-memory

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Example 5: ND-Array

Compared with ti::Memory, an ND-array provides a more structured view over the underlying memory.

An ND-array is a multi-dimensional dense array, similar to ndarray in NumPy or Tensor in PyTorch, but not exactly the same. In simulation and graphics we operate not only on scalars but, more often, on vectors and matrices, so Taichi's ND-array generalizes the concept of a tensor element: an element can be a scalar, a vector, or a matrix, and the elem_shape attribute describes the shape of each element.

You can allocate a host-accessible 4x4 array of 2-component vectors like this:

ti::NdArray<float> arr = runtime.allocate_ndarray<float>({4, 4}, {2}, true);
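
For comparison, an ND-array whose elements are plain scalars simply passes an empty elem_shape, as the later examples do. A minimal sketch:

ti::NdArray<float> scalar_arr =
    runtime.allocate_ndarray<float>({4, 4}, {}, /*host_access=*/true);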

You can access host-accessible ND-array data from the CPU with read() and write().

Note that ND-arrays have the same host-accessibility restriction as raw memory allocations. Accessing non-host-accessible ND-arrays from the CPU can lead to unrecoverable program termination (usually a segfault).

std::vector<float> canvas(4 * 4 * 2);
for (size_t h = 0; h < 4; ++h) {
  for (size_t w = 0; w < 4; ++w) {
    canvas[(h * 4 + w) * 2 + 0] = (w + 0.5f) / 4.0f;
    canvas[(h * 4 + w) * 2 + 1] = (h + 0.5f) / 4.0f;
  }
}
arr.write(canvas);

To help communicate vectors and matrices between the CPU and the GPU, you can directly read() and write() composite types (structures) from a ti::NdArray, as long as the composite type's size is a multiple of the ND-array's scalar type size.

struct Vec2 {
  float x;
  float y;
};
static_assert(sizeof(Vec2) == sizeof(float) * 2, "mismatched size");

std::cout << "ndarray data:" << std::setprecision(3) << std::fixed << std::endl;
std::vector<Vec2> canvas2(4 * 4);
arr.read(canvas2);
for (size_t h = 0; h < 4; ++h) {
  for (size_t w = 0; w < 4; ++w) {
    const Vec2 &vec = canvas2[h * 4 + w];
    std::cout << "(" << vec.x << "," << vec.y << ")" << " ";
  }
  std::cout << std::endl;
}

The above C++ code may give the following output:

ndarray data:
(0.125,0.125) (0.375,0.125) (0.625,0.125) (0.875,0.125)
(0.125,0.375) (0.375,0.375) (0.625,0.375) (0.875,0.375)
(0.125,0.625) (0.375,0.625) (0.625,0.625) (0.875,0.625)
(0.125,0.875) (0.375,0.875) (0.625,0.875) (0.875,0.875)

Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/05-ndarray

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Example 6: Device Command

The Taichi Runtime C-API does not only provide host-side procedures like memory allocation and mapping; it also exposes device commands like kernel launches and memory copies. In this example we will demonstrate the concept of device commands and their correct usage, using a device-to-device memory copy.

ti::NdArray<float> src = runtime.allocate_ndarray<float>({4}, {}, true);
ti::NdArray<float> dst = runtime.allocate_ndarray<float>({4}, {}, true);
src.write({1.0f, 2.0f, 3.0f, 4.0f});

Enqueue a memory copy command to the runtime's default queue to copy the underlying data of ND-array src to dst.

src.slice().copy_to(dst.slice());

Without synchronization we have no idea what's in dst right now. In this case, dst is very likely to be filled with zeros or random uninitialized data.

std::vector<float> data(4);
dst.read(data);

std::cout << "floats in `dst`:" << std::endl;
for (float x : data) {
  std::cout << x << " ";
}
std::cout << std::endl;

Note that the compute device might not receive your device commands immediately, so the computation could be deferred. It's recommended to call flush() to send all enqueued device commands to the device and start execution, although this is not strictly necessary.

runtime.flush();

To guarantee that all previously enqueued device commands have finished execution, we have to wait on the runtime's default queue until it is done.

wait() might block your CPU thread for a significant length of time. Again, it's recommended to call flush() first so that GPU and CPU work can overlap: you can do something else on the CPU while the GPU is busy calculating.

runtime.wait();

After synchronization, we can be sure that the data in src has been copied to dst.

dst.read(data);

std::cout << "floats in `dst`:" << std::endl;
for (float x : data) {
  std::cout << x << " ";
}
std::cout << std::endl;

The above C++ code may give the following output:

floats in `dst`:
0 0 0 0
floats in `dst`:
1 2 3 4
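
Putting the pieces together, a typical pattern is to flush early, overlap CPU work with the GPU, and wait only right before you need the results. Below is a sketch; do_cpu_work is a hypothetical placeholder for your own CPU-side work:

src.slice().copy_to(dst.slice());  // enqueue the copy as before
runtime.flush();                   // send enqueued commands and start execution
do_cpu_work();                     // hypothetical CPU-side work overlapping with the GPU copy
runtime.wait();                    // block until all enqueued device commands have finished
dst.read(data);                    // now it is safe to read the copied values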

Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/06-device-command

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Example 7: AOT Module

AOT modules are containers of ahead-of-time compiled Taichi kernels. You can compile AOT modules with ti.aot.Module APIs in Python.

After compilation, you can load AOT modules from the filesystem directly like this:

ti::AotModule aot_module =
    runtime.load_aot_module("07-aot-module/assets/module.tcm");
std::cout << "loaded aot module from filesystem" << std::endl;

But if you want more control over how the module is loaded, you can implement the loading logic yourself and create the AOT module from a data buffer. You can also use tools like bin2c to embed the module data in your source code.

std::ifstream f("07-aot-module/assets/module.tcm",
                std::ios::binary | std::ios::in | std::ios::ate);
std::vector<uint8_t> tcm_data(f.tellg());
f.seekg(0, std::ios::beg);
f.read((char*)tcm_data.data(), tcm_data.size());

ti::AotModule aot_module = runtime.create_aot_module(tcm_data);
std::cout << "created aot module from buffer data" << std::endl;

If you build a Taichi AOT module with the following Python script:

module = ti.aot.Module()
module.archive(module_path)

The above C++ code may give the following output:

loaded aot module from filesystem
created aot module from buffer data

Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/07-aot-module

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Example 8: Kernel and Positional Argument

Taichi AOT modules may contain precompiled Taichi kernels. You can retrieve the kernels by their names.

For example, we have a kernel called chessboard that writes a chessboard pattern of interleaved zeros and ones to an ND-array.

ti::AotModule aot_module =
    runtime.load_aot_module("08-kernel/assets/module.tcm");
ti::Kernel k_chessboard = aot_module.get_kernel("chessboard");
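
The snippets below launch the kernel on an ND-array named arr, which is not shown in this listing. Based on the Python script at the end of this example, it could be allocated as a host-accessible 4x4 ND-array of u32; a minimal sketch:

ti::NdArray<uint32_t> arr =
    runtime.allocate_ndarray<uint32_t>({4, 4}, {}, /*host_access=*/true);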

To launch the kernel, we first need to assign the positional arguments. There are two ways to set up kernel arguments. First, you can assign arguments by their indices. This works best if you have many arguments but only a few of them are updated frequently.

k_chessboard[0] = arr;

If the argument list changes frequently, like when you are prototyping a new algorithm, you might want to clear the argument list and push the arguments sequentially so you don't have to reorder the indices all the time.

k_chessboard.clear_args();
k_chessboard.push_arg(arr);

When all the arguments are ready, you can launch the kernel with launch(). Kernel launch is a device command, so it will be pushed to the runtime's default queue.

k_chessboard.launch();

Remember that we don't know whether a device command has finished execution unless we explicitly wait() on it.

runtime.wait();

std::vector<uint32_t> arr_data(16);
arr.read(arr_data);
for (size_t h = 0; h < 4; ++h) {
  for (size_t w = 0; w < 4; ++w) {
    std::cout << arr_data.at(h * 4 + w) << " ";
  }
  std::cout << std::endl;
}

If you build a Taichi AOT module with the following Python script:

@ti.kernel
def chessboard(arr: ti.types.ndarray(dtype=ti.u32, ndim=2)):
    for i, j in arr:
        arr[i, j] = (i % 2) ^ (j % 2)

arr = ti.ndarray(ti.u32, (4, 4))
chessboard(arr)
print(arr.to_numpy())

module = ti.aot.Module()
module.add_kernel(chessboard, template_args={ "arr": arr })
module.archive(module_path)

The above C++ code may give the following output:

0 1 0 1
1 0 1 0
0 1 0 1
1 0 1 0

Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/08-kernel

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Example 9: Error Handling

When your project is complicated enough, or you are targeting platforms other than your development environment, errors can occur, caused either by incorrect usage in your code or by an implementation defect in Taichi Runtime.

Taichi Runtime tries its best to ensure that incorrect usage doesn't propagate and crash your application. You might want to regularly check whether any error occurred during your previous API calls.

ti::Error error = ti::get_last_error();

In most cases it reports TI_ERROR_SUCCESS, indicating everything is fine.

assert(error.error == TI_ERROR_SUCCESS);

But if you missed something, get_last_error will give you a semantic error code and a message telling you exactly what went wrong.

ti::Runtime runtime(TI_ARCH_MAX_ENUM);
error = ti::get_last_error();
std::cout << "error code: " << error.error << std::endl;
std::cout << "error message: " << error.message << std::endl;

For any API that constructs or returns a new object, the returned object won't be valid when an error occurs.

std::cout << "runtime is valid? " << (runtime.is_valid() ? "true" : "false")
          << std::endl;

The above C++ code may give the following output:

error code: -1
error message: arch
runtime is valid? false
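
In a larger application you might wrap this pattern into a small helper that turns any pending error into a C++ exception after each group of API calls. Below is a sketch; check_taichi_error is a hypothetical helper name, not part of the C-API:

// Turn any pending Taichi Runtime error into an exception (needs <stdexcept>).
void check_taichi_error() {
  ti::Error error = ti::get_last_error();
  if (error.error != TI_ERROR_SUCCESS) {
    throw std::runtime_error(error.message);
  }
}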

Check out this example on Github: https://github.com/PENGUINLIONG/TaichiAotByExamples/tree/main/09-error

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.