
Contributed By: Chun-Ming Chiu (Academia Sinica); LFX Fall 2025 Mentee

In September 2025, I began my journey on LFX Mentorship and started contributing to the mldsa-native project. Now, as the 12-week mentorship program is nearing its end, I want to record my learning experience and share my reflections in this blog post.

What is LFX Mentorship

The Linux Foundation Mentorship Program, or LFX Mentorship for short, offers up-and-coming developers a welcoming, guided path into open source contributions. During the 12- or 24-week mentorship program, mentees work under their mentors’ guidance and make meaningful contributions to the specific project they chose back in the application process.

What is mldsa-native

The mldsa-native project is a C implementation of the ML-DSA post-quantum signature standard, with emphasis on security, performance, and portability. It is intended to be imported and integrated into larger cryptographic libraries that also offer other cryptographic schemes.

As one of the newest additions to the Post-Quantum Code Package (PQCP) project, mldsa-native is supported by the Post-Quantum Cryptography Alliance (PQCA) under the Linux Foundation. Similar to its sibling project mlkem-native, mldsa-native is anticipated to be merged into liboqs and aws-lc. In fact, the integration is currently underway, as we have just released the first version on Nov. 14, 2025.

What Did I Do

The goal of my 12-week mentorship was to contribute to the codebase by improving its performance, security and maintainability, and ultimately to help push the project to reach its first release. More specifically, the team planned out a list of features/issues to complete/resolve before the first release, and my mission was to implement them and submit pull requests (PRs).

Before our first meeting, my mentor Matthias Kannwischer had carefully picked out some of the issues according to my skill set and sorted them by difficulty. This was extremely helpful for my learning experience, as I could familiarize myself with the codebase through some easier tasks and gradually move towards the more challenging ones.

The First Step

The first task I worked on was cleaning up the mld_attempt_signature_generation function. More specifically, I refactored the code to use a centralized exit path, and improved readability and minimized confusion by renaming variables to better match the FIPS 204 specification. Just two days into the mentorship, I was able to create my first PR for this project, which thankfully passed CI and code review without problems and got merged on the same day.
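For readers who have not seen the pattern before, here is a minimal sketch of what a centralized exit path looks like in C. The function, buffer sizes, and names below are made up for this post and are not the actual mldsa-native code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Minimal sketch of a centralized exit path (illustrative only): every
 * failure branch jumps to a single cleanup label, so sensitive buffers
 * are zeroized and freed in exactly one place. */
static int example_sign_attempt(uint8_t sig[64], const uint8_t *msg, size_t msglen)
{
    int ret = -1;
    uint8_t *scratch = calloc(1, 64);

    if (scratch == NULL || msg == NULL || msglen == 0)
    {
        goto cleanup; /* all error paths converge on the same exit */
    }

    /* ... the real computation would go here; placeholder output ... */
    memcpy(sig, scratch, 64);
    ret = 0;

cleanup:
    if (scratch != NULL)
    {
        memset(scratch, 0, 64); /* real code would use a secure memset */
        free(scratch);
    }
    return ret;
}
```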

Looking back on it now, I think this first task really jump-started my learning process. It encouraged me to start playing with the code and try out the build system and other automation scripts. More importantly, it gave me a chance to check out the specification document and learn more about the signature scheme.

Performance Optimization

After the first PR, I felt confident and was ready to tackle the next few tasks on the list, which were mainly about improving performance.

Obtaining performance competitive with state-of-the-art implementations was a key objective of the mldsa-native project, and the team aimed to achieve this goal on x86-64/AVX2 and AArch64/NEON before the first release. In the literature on optimized cryptographic implementations, using vector instructions to exploit data-level parallelism is a well-established practice. As such, it was clear that, to even have a chance of competing with the state of the art, the performance-sensitive parts of the cryptographic scheme had to be implemented in native code (assembly or intrinsics).

At the same time, we still want mldsa-native to be portable to other ISAs and platforms. To get the best of both worlds, the team chose to follow the same approach as mlkem-native and adopt a split frontend/backend design:

  • The frontend is written in portable C code. It handles all the higher-level logic of the cryptographic scheme. During its computation, the frontend will invoke the backend for the lower-level primitives such as polynomial arithmetic and the FIPS202/SHA3 hash functions.
  • The backends are exposed to the frontend through a unified interface. When a routine is invoked, the computation is delegated to one of the available implementations: either an optimized native implementation targeting that specific platform, or a default portable C implementation if a specialized version is unavailable.

This extra level of indirection allows us to optimize for performance on common platforms while still guaranteeing portability to all others.
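As a rough illustration of the idea (not the actual mldsa-native sources; every name below is invented for this sketch), the indirection can be as simple as a compile-time switch: the frontend always calls one wrapper, and the wrapper either forwards to an arch-specific native routine or falls back to the portable C one.

```c
#include <stdint.h>

#define EXAMPLE_N 256

/* Portable fallback, always available. */
static void example_ntt_c(int32_t p[EXAMPLE_N])
{
    (void)p; /* ... portable C number-theoretic transform ... */
}

#if defined(EXAMPLE_USE_NATIVE_NTT)
/* Provided by an arch-specific backend (AVX2, NEON, ...). */
void example_ntt_native(int32_t p[EXAMPLE_N]);
#endif

/* The frontend only ever calls this wrapper. */
void example_ntt(int32_t p[EXAMPLE_N])
{
#if defined(EXAMPLE_USE_NATIVE_NTT)
    example_ntt_native(p);
#else
    example_ntt_c(p);
#endif
}
```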

When I started working on this, the infrastructure was already in place, and native implementations for FIPS202 were already completed. On the polynomial arithmetic side, although the most critical routines like the NTT and rejection sampling were also done, many other simpler but nonetheless time-consuming routines were still missing their native implementations. With Matthias’s support and our weekly discussions, I was able to complete 3 PRs (#483, #501, #510), which added native backends for poly_use_hint, poly_chknorm, and polyz_unpack respectively. I also assisted in the PRs (#411, #468) for poly_decompose and the base-case multiplications (poly_pointwise_montgomery/polyvecl_pointwise_acc_montgomery).

The AVX2 implementations are ported from the Dilithium team’s reference implementation with some minor adjustments and improvements, whereas the NEON implementations are essentially written from scratch: although in some cases the AVX2 code can be translated into NEON (more or less) instruction by instruction, architectural differences, for example the number and size of vector registers or the available types of multiplication instructions, usually mean that novel approaches are required to implement the same algorithm efficiently.
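To give a flavor of what such native code looks like, here is a small NEON sketch of an infinity-norm check in the spirit of poly_chknorm. It is a toy I wrote for this post with made-up names, not the mldsa-native backend, and it glosses over details such as the exact coefficient ranges and the unrolling used in the real routine.

```c
#include <arm_neon.h>
#include <stdint.h>

#define EXAMPLE_N 256

/* Illustrative NEON sketch: returns nonzero if any |coefficient| >= bound,
 * processing four signed 32-bit coefficients per iteration. */
static int example_chknorm_neon(const int32_t a[EXAMPLE_N], int32_t bound)
{
    const int32x4_t vbound = vdupq_n_s32(bound);
    uint32x4_t flags = vdupq_n_u32(0);

    for (int i = 0; i < EXAMPLE_N; i += 4)
    {
        int32x4_t v = vld1q_s32(&a[i]);
        int32x4_t vabs = vabsq_s32(v);                      /* |coefficients| */
        flags = vorrq_u32(flags, vcgeq_s32(vabs, vbound));  /* per-lane check */
    }

    /* Reduce the four lanes to a single flag. */
    return vmaxvq_u32(flags) != 0;
}
```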

With all these optimizations combined, along with other improvements from various contributors, we were able to close the gap with the state-of-the-art implementations, even outperforming them in some benchmark environments.

Table 1: Comparison of performance between the reference AVX2 implementation and ours, before and after the optimization efforts. The benchmark was done on Intel Sapphire Rapids Xeon Platinum 8488C (AWS EC2, c7i.metal-24xl), with SMT and TurboBoost disabled, compiled with gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 and flags -march=native -mtune=native -O3. Please see (the history of) this issue for more information on the benchmarking environment, and instructions on how to perform the benchmark yourself.

| Algorithm | Operation | Reference | mldsa-native (Before) | vs Reference | mldsa-native (After) | vs Reference | vs Before |
|-----------|-----------|-----------|-----------------------|--------------|----------------------|--------------|-----------|
| ML-DSA-44 | Keypair   | 74205     | 102580                | 0.723x       | 56936                | 1.303x       | 1.802x    |
| ML-DSA-44 | Sign      | 214186    | 366215                | 0.585x       | 201768               | 1.062x       | 1.815x    |
| ML-DSA-44 | Verify    | 78043     | 99831                 | 0.782x       | 62133                | 1.256x       | 1.607x    |
| ML-DSA-65 | Keypair   | 129274    | 224941                | 0.575x       | 102210               | 1.265x       | 2.201x    |
| ML-DSA-65 | Sign      | 348492    | 610618                | 0.571x       | 326290               | 1.068x       | 1.871x    |
| ML-DSA-65 | Verify    | 125750    | 162640                | 0.773x       | 101816               | 1.235x       | 1.597x    |
| ML-DSA-87 | Keypair   | 204270    | 252407                | 0.809x       | 152578               | 1.339x       | 1.654x    |
| ML-DSA-87 | Sign      | 426632    | 728542                | 0.586x       | 366239               | 1.165x       | 1.989x    |
| ML-DSA-87 | Verify    | 200015    | 240790                | 0.831x       | 151741               | 1.318x       | 1.587x    |

Bound Reasoning Comments

Obtaining competitive performance was a huge step towards the goal of reaching the first release. Nonetheless, there was still plenty of work left to do. After discussing with Matthias, I decided to take on the challenge of adding bound reasoning comments to the AVX2 and NEON backends.

Bounding the magnitude of intermediate values during a computation, sometimes referred to as “range analysis”, is a critical challenge when implementing algorithms that rely on modular arithmetic. Although it is in some sense similar to error analysis in floating-/fixed-point arithmetic, the stakes are significantly higher here, especially in the context of cryptography. First and foremost, the output of the algorithm, and by extension the output of the whole cryptographic scheme, would be incorrect if (unexpected) integer overflows occurred during the computation. Moreover, if an attacker can control the input, in some cases it is possible for them to recover the secret key by observing whether such errors occurred.

To ensure that this never happens, and to convince downstream developers (or any curious readers) that our guarantee holds, it is desirable to supplement the code with pen-and-paper bounds for each of the intermediate values, along with their proofs/justifications. However, one can imagine that this is no trivial task, and more often than not, publicly available implementations do not have such documentation. Nevertheless, to improve both the credibility and the maintainability of the codebase, the team decided that, for all non-trivial arithmetic backends, the bounds and their explanations must be documented (in the form of C/assembly comments within the source files) as part of the first release.
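To make this concrete, below is the flavor of such a bound comment, attached to a Montgomery-style reduction similar to the one in the reference implementation. The constants follow the ML-DSA modulus q = 8380417; the function name and the exact wording of the comment are mine, not the mldsa-native sources.

```c
#include <stdint.h>

#define EXAMPLE_Q 8380417     /* the ML-DSA modulus q */
#define EXAMPLE_QINV 58728449 /* q^(-1) mod 2^32 */

/* Montgomery reduction with an explicit bound comment.
 *
 * Input:  a with |a| <= 2^31 * q, so that the 64-bit subtraction below
 *         cannot overflow (|a - t*q| <= 2^32 * q < 2^56).
 * Output: r with r = a * 2^(-32) (mod q) and -q < r < q.
 */
static int32_t example_montgomery_reduce(int64_t a)
{
    /* t = a * q^(-1) mod 2^32, interpreted as a signed 32-bit value. */
    int32_t t = (int32_t)((uint64_t)a * EXAMPLE_QINV);
    /* a - t*q is divisible by 2^32; its top half is the result. */
    return (int32_t)((a - (int64_t)t * EXAMPLE_Q) >> 32);
}
```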

When I started working on the comments, there were already CBMC contracts for each of the routines, consisting of preconditions and postconditions annotated at the routine’s declaration. These were written for the portable C implementation, and CBMC can automatically check that, as long as the input satisfies the preconditions, the C code always returns an output that satisfies the postconditions. We wanted to use the same preconditions/postconditions for the native implementations as well, and we plan to use HOL Light to formally prove that the native code indeed complies with the contract. (We are actually working on that right now; see this PR by Jake Massimo for example.) Before that, though, my job was to manually annotate bounds for the intermediate values within each routine, and to make sure that the final output indeed satisfies the bound given in the postconditions.
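For illustration, a contract of this shape could look roughly like the following. I am writing it with raw CBMC clauses here; the actual codebase hides them behind its own macros, and the names and the concrete bound below are only placeholders. A snippet like this is consumed by CBMC’s tooling rather than compiled as ordinary C.

```c
#include <stdint.h>

#define EXAMPLE_N 256
#define EXAMPLE_Q 8380417

typedef struct
{
    int32_t coeffs[EXAMPLE_N];
} example_poly;

/* Illustrative contract: given a valid polynomial, every output
 * coefficient has magnitude strictly below q after the call. */
void example_invntt(example_poly *r)
__CPROVER_requires(__CPROVER_is_fresh(r, sizeof(example_poly)))
__CPROVER_assigns(*r)
__CPROVER_ensures(__CPROVER_forall {
    int k;
    (0 <= k && k < EXAMPLE_N) ==>
        (r->coeffs[k] > -EXAMPLE_Q && r->coeffs[k] < EXAMPLE_Q)
});
```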

One may think that this is only a temporary measure and isn’t really necessary if we plan to do formal verification anyway. However, our perspective is that, while formal verification undoubtedly gives us the strongest guarantee, it doesn’t replace documentation that actually records the intuition behind the implementation details. (Why did we choose to perform modular reduction here? How did we know that it never overflows before that?) By writing these design ideas down, we expect the codebase to serve as a useful reference for future research and educational purposes.

During the process, I actually found out that the contract for the inverse NTT does not hold for the NEON implementation. Specifically, the original exclusive bound ⌈3q/4⌉ = 6285313 on the output coefficients’ magnitude, which the AVX2 implementation meets just fine, is too strict for the NEON one: in the worst case, one of the output coefficients actually reaches 6390331 ≈ 0.76q. Fortunately, this is still small enough for the rest of the computation to work correctly.

Now, we could have just relaxed the exclusive bound from ⌈3q/4⌉ = 6285313 to 6390332 = 6390331 + 1 and been done with it. However, we disliked how implementation-specific this bound is, and worried that we might need to relax it yet again when we eventually add implementations for other ISAs. After extensive discussions, we decided to settle the matter once and for all and relax it all the way to q. This allows future implementations to use different modular multiplication techniques and NTT tricks. For the existing implementations, the bound q is also cleaner to state and much easier to prove.

We did have to be careful about this, though: as detected by CBMC, once the bound is relaxed this much, the output of mld_polyveck_caddq during key generation is no longer guaranteed to be (unsigned-)canonical (because of this addition in between). To rectify this, we had to add an extra reduction. Thankfully, benchmark results showed that the overhead incurred by the extra reduction is insignificant (less than 1% on most platforms), so the benefits are definitely worth the cost. Please check out this PR by Matthias for more details and comparisons.
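For context, caddq itself is a tiny branch-free routine. The sketch below is my own illustrative version of the idea (following the reference implementation’s well-known trick, not the mldsa-native code), together with the kind of bound comment that explains when its output is, or is not, unsigned-canonical.

```c
#include <stdint.h>

#define EXAMPLE_Q 8380417

/* Branch-free conditional add of q: adds q to a coefficient exactly when
 * it is negative.
 *
 * Bound reasoning: if the input satisfies -q < a < q, the output lies in
 * [0, q), i.e. it is unsigned-canonical.  If the input may exceed that
 * range (say, because a value bounded by q had something added to it
 * beforehand), the output can end up >= q and an extra reduction is
 * needed afterwards.
 */
static int32_t example_caddq(int32_t a)
{
    /* (a >> 31) is all ones if a < 0 and zero otherwise. */
    a += (a >> 31) & EXAMPLE_Q;
    return a;
}
```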

I understand that this example is perhaps too technical, but I hope it still gives you a rough idea of the importance and challenges of range analysis, and illustrates the delicate balance between performance, security, and maintainability that we are striving to achieve.

In the end, adding the bound reasoning comments was a lot more time-consuming than I originally expected. I started working on this right after #468 got merged on Oct. 23, and finished my three related PRs (#560, #629, #667) on Nov. 12. Nevertheless, I am very thankful to have been entrusted with this important mission, and I am glad that my comments can help people understand these beautiful algorithms better.

Conclusion

The mentorship program was an unforgettable experience for me. It was a privilege to meet and work with the brilliant members of the team, and getting to contribute to an impactful open source project was a dream come true.

I want to thank my mentor Matthias Kannwischer again for all the guidance and support throughout the whole mentorship. I also want to thank Hanno Becker for all the detailed code reviews and suggestions, and especially the timely assistance during my bound reasoning endeavor.