mlkem-native and mldsa-native in CHERIoT – Post-Quantum Cryptography Alliance

The CHERIoT platform is a hardware-software co-designed system that provides memory safety and fine-grained compartmentalisation in embedded devices. It began as a Microsoft Research project that was open sourced in 2023. It is now collaboratively developed with three maintainers: myself, Yucong Tao (Microsoft), and Ben Laurie (Google). The first CHERIoT microcontrollers taped out last year and they’re expected to be available as commodity parts this year.

CHERIoT is a CHERI system. C pointers in CHERI systems are represented by a hardware data type, a CHERI capability, that both represents an address and gives the rights to a region of memory. CHERI capabilities have bounds and permissions associated with them. The bounds can be narrowed and the permissions removed via a simple register-to-register operation, but they cannot be increased. This means that you can pass some code a pointer to an object and it can then derive a pointer whose bounds cover a single field, and then pass that on to another function. The final function can access that field, but not the enclosing object.
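The monotonic narrowing described above can be illustrated with a toy software model in portable C. This is only a sketch: real CHERI enforces these rules in hardware on every load and store, and the `toy_cap` type and its helper functions are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: real capabilities are a hardware type, not a C struct. */
enum { PERM_LOAD = 1, PERM_STORE = 2 };

typedef struct {
    uintptr_t base;   /* lowest address the holder may access */
    size_t length;    /* number of accessible bytes starting at base */
    unsigned perms;   /* bitmask of PERM_* rights */
    bool valid;       /* stands in for the hardware tag bit */
} toy_cap;

/* Derive a capability covering [offset, offset + length) of the parent.
 * Any attempt to exceed the parent's bounds yields an unusable result. */
toy_cap toy_cap_narrow(toy_cap parent, size_t offset, size_t length)
{
    toy_cap child = parent;
    if (!parent.valid || offset > parent.length ||
        length > parent.length - offset) {
        child.valid = false;
        return child;
    }
    child.base = parent.base + offset;
    child.length = length;
    return child;
}

/* Permissions can only be removed: the result is the intersection. */
toy_cap toy_cap_restrict(toy_cap parent, unsigned keep)
{
    toy_cap child = parent;
    child.perms = parent.perms & keep;
    return child;
}

/* Would an access of `size` bytes at `addr` with right `perm` be allowed? */
bool toy_cap_allows(toy_cap cap, uintptr_t addr, size_t size, unsigned perm)
{
    return cap.valid && (cap.perms & perm) == perm &&
           addr >= cap.base && size <= cap.length &&
           addr - cap.base <= cap.length - size;
}
```

Narrowing a capability for a whole structure down to one field gives a value that permits access to that field and nothing else, and there is no operation in the model that grows the bounds back.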

CHERIoT uses this core primitive to create a rich compartmentalisation model. A CHERIoT compartment is similar to a shared library, except that its globals and code are private. You can invoke entry points in a compartment as if you were calling any other function. You share data with other compartments simply by passing pointers to the objects that you wish to share. This makes it easy to build components with mutual distrust relationships between them and that respect the principle of least privilege.

CHERIoT also provides auditing tools that let you inspect the communications between compartments. At firmware-link time, the linker produces a report including which entry points every compartment exports, which ones it calls, what pre-shared objects it has access to, which devices it can directly access, and a variety of other things. This lets you reason about the worst case for what damage a complete compromise of a compartment can do.

I wrote about our initial experiences with mlkem-native / mldsa-native from a CHERIoT perspective on the CHERIoT blog, but that post assumes a lot of knowledge of CHERIoT. This post is intended to provide that context.

Building ML-KEM and ML-DSA libraries

CHERIoT has a notion of shared libraries that differs somewhat from most mainstream systems. A CHERIoT shared library has no global mutable state. The threat model for a shared library is roughly that it is equivalent to embedding the code in your compartment. This means that you get code-size savings at the expense of not being protected against supply-chain attacks.

Code built as a CHERIoT shared library can be shared between multiple compartments without the (low, but non-zero) cost of a compartment transition. For this sharing to be safe, CHERIoT shared libraries may not have mutable globals. All mutable state that they operate over is owned by the calling compartment and is passed as arguments. This means that shared libraries can’t accidentally leak state between compartments.
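As a rough illustration of that constraint, here is a small library-style API with no mutable globals, where every function operates on state that the caller owns and passes in. The example uses an Adler-32 checksum as a stand-in because it is short. The `checksum_*` names are hypothetical and have nothing to do with the mlkem-native or mldsa-native APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All mutable state lives in this caller-owned structure. */
typedef struct {
    uint32_t sum_a;
    uint32_t sum_b;
} checksum_ctx;

/* Initialise caller-provided state; the library keeps nothing itself. */
void checksum_init(checksum_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->sum_a = 1;
    ctx->sum_b = 0;
}

/* Fold more bytes into the caller's state (Adler-32 update step). */
void checksum_update(checksum_ctx *ctx, const uint8_t *data, size_t len)
{
    assert(ctx != NULL);
    for (size_t i = 0; i < len; i++) {
        ctx->sum_a = (ctx->sum_a + data[i]) % 65521u;
        ctx->sum_b = (ctx->sum_b + ctx->sum_a) % 65521u;
    }
}

/* Read the result out of the caller's state. */
uint32_t checksum_final(const checksum_ctx *ctx)
{
    assert(ctx != NULL);
    return (ctx->sum_b << 16) | ctx->sum_a;
}
```

Two callers that each hold their own `checksum_ctx` cannot observe one another's data through the library, because the library has nowhere to keep it.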

Building mldsa-native and mlkem-native as a CHERIoT shared library was relatively easy. Both require platform-specific hooks for providing entropy and securely erasing a buffer, which are trivial to provide. The only other thing that we need to do for a CHERIoT shared library is explicitly annotate the functions that are exported from the library. The ML[DK]_CONFIG_EXTERNAL_API_QUALIFIER macro in the two projects is already placed on all public functions and so we just needed to define this to __cheriot_libcall.
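A minimal sketch of what that configuration might look like follows. The two macro names are the `MLK_` and `MLD_` expansions of the pattern above. The fallback definition assumes that the CHERIoT SDK provides `__cheriot_libcall` as a preprocessor macro and exists only so that the fragment also compiles off-target. `example_exported_function` is a hypothetical placeholder, not a function from either project.

```c
/* Off-target fallback (assumption: on CHERIoT the SDK headers define
 * __cheriot_libcall as a macro, so this branch is never taken there). */
#ifndef __cheriot_libcall
#define __cheriot_libcall
#endif

/* Mark every public function of each library as a library export. */
#define MLK_CONFIG_EXTERNAL_API_QUALIFIER __cheriot_libcall
#define MLD_CONFIG_EXTERNAL_API_QUALIFIER __cheriot_libcall

/* Hypothetical public function showing where the qualifier ends up. */
MLK_CONFIG_EXTERNAL_API_QUALIFIER int example_exported_function(int x);

int example_exported_function(int x)
{
    return x + 1;
}
```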

With that done, it was possible to build both libraries as CHERIoT shared libraries. This makes it possible to use them from multiple CHERIoT compartments. The shared library for mldsa-native is about 18 KiB of code, mlkem-native around 12 KiB. Both include copies of the FIPS-202 (SHA3) implementation, so there’s some scope for code-size reduction in the future.

Reducing stack usage

Post-quantum crypto is famous for using a lot of state. When we started CHERIoT support, both mlkem-native and mldsa-native allocated all intermediate state on the stack. ML-DSA, in particular, was very stack-intensive. Signing a message with ML-DSA44 used around 60 KiB of stack space.

This doesn’t seem like a lot of space on a desktop system, but embedded stacks are typically much smaller. A CHERIoT thread typically has a 1-2 KiB stack, often smaller. Threads doing TLS handshakes had stacks that were very large by embedded-systems standards: around 6 KiB.

Stack memory is annoying on embedded devices because there’s no demand paging: stack memory must be reserved for the worst case and remains reserved even when not in use. If two threads need to be able to use 60 KiB of stack at different phases of computation, you need to reserve 120 KiB. It’s fairly common for embedded devices to have 128 KiB of total data memory, so that’s pretty much all of it reserved for two stacks. In contrast, heap memory can be dynamically allocated and freed, so two different components doing PQC operations that require 60 KiB of space at different times can reuse the same 60 KiB.
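The arithmetic behind that comparison can be written down directly. This sketch assumes, as in the example above, that the large allocations are never live at the same time, so the heap needs to cover only the largest single demand. The function names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stacks are reserved per thread for the worst case, so the peaks add up. */
size_t stack_reservation_kib(const size_t *peak_kib, size_t threads)
{
    assert(peak_kib != NULL);
    size_t total = 0;
    for (size_t i = 0; i < threads; i++) {
        total += peak_kib[i];
    }
    return total;
}

/* Heap memory is reused, so (assuming the peaks never overlap in time)
 * only the largest single peak has to be available. */
size_t heap_reservation_kib(const size_t *peak_kib, size_t threads)
{
    assert(peak_kib != NULL);
    size_t largest = 0;
    for (size_t i = 0; i < threads; i++) {
        if (peak_kib[i] > largest) {
            largest = peak_kib[i];
        }
    }
    return largest;
}
```

For two threads that each peak at 60 KiB, the first function gives 120 KiB and the second gives 60 KiB.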

The CHERIoT heap is unusual in embedded systems: it provides deterministic spatial and temporal memory safety. As soon as an object has been freed, any thread that attempts to dereference a pointer to it will trap. For more details on how this works, see our 2023 MICRO paper. This means that it’s safe to use even for sensitive things such as the intermediate state of crypto operations: use-after-free or bounds-safety bugs elsewhere won’t accidentally leak keys or tamper with the results.

There’s one additional feature of the CHERIoT heap that requires some special handling. CHERIoT is a capability system. This extends from the lowest level (pointers are capabilities, the hardware won’t let you do load or store operations unless you present an authorising capability in one of the register operands to the instruction) to the higher-level parts. Privileged operations that another compartment performs on your behalf are authorised by a capability. The shared heap requires you to pass it a capability that authorises allocation.

The software capability mechanism is built on top of CHERI’s sealing mechanism, which lets you build type-safe tamper-proof opaque pointers. The capability that authorises allocation is a sealed pointer to an object that holds a quota and some other metadata. This mechanism lets a compartment allocate memory up to some defined limit and lets firmware authors audit how much memory each component can allocate.

This all works nicely for code that is statically linked into a compartment but the PQC code is in a shared library. As such, for it to perform heap allocation, it must receive a context parameter from the caller and forward it to the allocator. PR 1467 in mlkem-native (and a follow-on port to mldsa-native) added support for this via an optional context parameter that’s plumbed through to the macros for custom allocation and free.
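The shape of that plumbing can be sketched in portable C. Everything here is a hypothetical stand-in: `alloc_ctx` models the allocation capability as a plain quota, `LIB_ALLOC` and `LIB_FREE` are not the real hook macro names from either project, and `malloc` takes the place of the CHERIoT shared heap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for an allocation capability: a quota owned by the caller. */
typedef struct {
    size_t remaining;
} alloc_ctx;

/* Allocate against the caller's quota; fail if it would be exceeded. */
void *ctx_alloc(alloc_ctx *ctx, size_t size)
{
    if (ctx == NULL || size > ctx->remaining) {
        return NULL;
    }
    void *p = malloc(size);
    if (p != NULL) {
        ctx->remaining -= size;
    }
    return p;
}

/* Release the memory and return its size to the caller's quota. */
void ctx_free(alloc_ctx *ctx, void *p, size_t size)
{
    assert(ctx != NULL);
    if (p == NULL) {
        return;
    }
    free(p);
    ctx->remaining += size;
}

/* Hypothetical hook macros: the library only ever allocates through these,
 * forwarding whatever context the caller supplied. */
#define LIB_ALLOC(ctx, size) ctx_alloc((ctx), (size))
#define LIB_FREE(ctx, p, size) ctx_free((ctx), (p), (size))

enum { WORKSPACE_BYTES = 60 * 1024 };

/* Library entry point: large intermediate state goes on the heap,
 * paid for by the caller's quota, instead of on the stack. */
int lib_operation_sketch(alloc_ctx *ctx)
{
    unsigned char *work = LIB_ALLOC(ctx, WORKSPACE_BYTES);
    if (work == NULL) {
        return -1;
    }
    work[0] = 1; /* placeholder for the real computation */
    LIB_FREE(ctx, work, WORKSPACE_BYTES);
    return 0;
}
```

A caller with enough quota gets a successful result and its full quota back afterwards. A caller with too little quota gets an error rather than an allocation charged to someone else.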

With this merged, the same ML-DSA operations that required 60 KiB of stack now work nicely on 4 KiB stacks, with a negligible impact on performance.

Performance

Part of our motivation for this work was to give a baseline for performance comparison for hardware implementations of ML-KEM and ML-DSA. I was pleasantly surprised at the performance, since I’d heard that the performance of PQC algorithms was very bad on embedded devices.

For performance measurement, we used Microsoft’s CHERIoT small and fast FPGA emulator (SAFE) test platform. This provides, among other things, a verilator build that’s a software simulation of a CHERIoT core. We used CHERIoT Ibex, the lower-performance area-optimised core. ML-KEM768 key encapsulation and decapsulation were both around two million cycles. ML-DSA operations were also on the order of a few million cycles.

The first devices using the CHERIoT Ibex are expected to ship later this year and run at 250 MHz, so that’s still 50 signature-verification operations per second with the software implementations. There are some use cases that require better performance, but for things like TLS session establishment this is in the noise for embedded devices (which typically have 1-2 TLS sessions active at most and rarely perform the handshake more frequently than about once per hour).
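The throughput estimate is a single division of clock rate by cycles per operation. As a sketch, with a hypothetical helper name: 50 operations per second at 250 MHz corresponds to a budget of five million cycles per operation. That figure is inferred from the numbers above rather than being a separate measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Whole operations completed per second at a given clock rate. */
uint64_t ops_per_second(uint64_t clock_hz, uint64_t cycles_per_op)
{
    assert(cycles_per_op != 0);
    return clock_hz / cycles_per_op;
}
```

By the same calculation, an operation that takes two million cycles would run 125 times per second at 250 MHz.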

Hardware implementations are likely to be useful for avoiding power side channels, but for anything where an attacker with physical access being able to extract the key is not part of the threat model, the software implementations are likely to be sufficient.

Future plans

Having the existing code work as a shared library is a first step for metaphorically kicking the tyres but it’s not the end. We plan on incorporating the ML-DSA code into a secure update service for CHERIoT, for example.

More importantly, we plan to wrap the library in a compartment and use the same sealing mechanism that encapsulates allocator capabilities to protect keys. This will let a compartment have an opaque handle to a key, but not directly access it. All the compartment will be able to do is present it to a crypto service compartment to perform an operation, such as signature verification. No matter how buggy the consumer of the verification API is, it never has direct access to its own keys and so can’t accidentally leak them.
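A toy model of that opaque-handle pattern in portable C may help to make the idea concrete. It is a sketch under heavy assumptions. On CHERIoT the handle would be a sealed capability that hardware makes unforgeable, whereas here it is just a slot number with a check value. The keyed operation is a trivial XOR placeholder, not cryptography. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { KEY_BYTES = 16, MAX_KEYS = 4 };

/* What the consumer holds: it names a key but contains no key material. */
typedef struct {
    uint32_t slot;
    uint32_t check;
} key_handle;

/* Private to the (hypothetical) crypto service compartment. */
static uint8_t key_store[MAX_KEYS][KEY_BYTES];
static uint32_t key_check[MAX_KEYS];
static uint32_t keys_used;

/* Store a key inside the service and hand back only a handle. */
int service_import_key(const uint8_t key[KEY_BYTES], key_handle *out)
{
    if (out == NULL || keys_used == MAX_KEYS) {
        return -1;
    }
    memcpy(key_store[keys_used], key, KEY_BYTES);
    key_check[keys_used] = 0xC0DE0000u + keys_used;
    out->slot = keys_used;
    out->check = key_check[keys_used];
    keys_used++;
    return 0;
}

/* Perform a keyed operation on the caller's behalf. The XOR fold is a
 * placeholder for a real operation and offers no security at all. */
int service_tag(key_handle h, const uint8_t *msg, size_t len, uint8_t *tag_out)
{
    if (tag_out == NULL || h.slot >= keys_used ||
        h.check != key_check[h.slot]) {
        return -1; /* unknown or forged handle */
    }
    uint8_t tag = 0;
    for (size_t i = 0; i < len; i++) {
        tag ^= (uint8_t)(msg[i] ^ key_store[h.slot][i % KEY_BYTES]);
    }
    *tag_out = tag;
    return 0;
}
```

The consumer can ask the service to use the key, but nothing in the API returns the key bytes, so a bug in the consumer has nothing to leak.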

The experience of working with the mlkem-native and mldsa-native projects has been overwhelmingly positive. I can think of very few open source projects that have been as pleasant to work with.

Contributed By: David Chisnall

Co-founder and Director of Systems Architecture

SCI Semiconductor

David Chisnall’s background spans operating systems, compilers, security, and computer architecture. He is the author of The Definitive Guide to the Xen Hypervisor and was elected for two terms to the FreeBSD Core Team. He joined the CHERI project at the University of Cambridge in 2012 to lead the compilers / languages strand of the research project. He remains a Visiting Researcher in this group. He joined Microsoft in 2018 to, among other things, lead their engagement with the Digital Security by Design Programme, a £170M programme that included the creation of Arm Morello, a CHERI extension to the ARMv8.2 ISA and a test chip based on the Neoverse N1. As part of this, he led the creation of the CHERIoT Platform, an open-source CHERI ISA co-designed with an RTOS, programming model, and set of language extensions. In 2023, he left Microsoft to co-found SCI Semiconductor, which is now co-maintaining the CHERIoT Platform along with other industry partners. SCI taped out the first CHERIoT chip in 2025 and aims for mass production in 2026.