Discussion:
curve25519-donna stack usage
(too old to reply)
Jason A. Donenfeld
2016-11-09 18:00:20 UTC
Permalink
Hey guys,

I use a curve25519-donna variant inside of WireGuard [1]. It runs in a
kthread in kernel space, which only has 8k of stack in total. Some
circuitous paths in the kernel into code actually amount to having
much less stack available. I could allocate curve25519 variables on
the heap instead, or try to do various other traditional programming
techniques to reduce usage. But before I put too much time into that,
I was wondering if anybody else has ran into this limitation with
-donna and if there are other common portable implementations of
curve25519 that use less stack while remaining performant, or if there
are various other tricks to reduce stack usage.

Regards,
Jason


[1] https://www.wireguard.io/
Mike Hamburg
2016-11-09 20:00:09 UTC
Permalink
Hi Jason,

By my measurements, curve25519-donna-c64 -O3 uses 840 bytes of stack on x86-64. With -O2, it’s 1128 bytes. But the x86 version uses much more stack, so maybe that’s your problem.

I have a tiny implementation of Curve25519. According to -fstack-usage, it uses as few as 372 bytes of stack on x86-64 and 336 bytes on x86, depending on compilation options.

Since it’s optimized for size, it doesn’t perform as well as Donna. The factor is 2-4 on x86-64 depending on compilation options, but only ~25% slower on x32 if I’m measuring correctly. The code is relatively portable, detecting bit size using __SIZEOF_INT128__. My code also has ARM asm intrinsics, so it might outperform Donna on some ARM platforms. I haven’t benched this.

My code also supports nonstandard x-only signature production and verification at the cost of slightly higher stack usage.

This implementation is part of a package that I wrote at work, so I can’t share it with you yet. I’m trying to get it open sourced under an MIT license, but I have to talk to legal about this. So it’s portable but not common. But let me know if you want it, it might help me get it through legal.

Cheers,
— Mike
Post by Jason A. Donenfeld
Hey guys,
I use a curve25519-donna variant inside of WireGuard [1]. It runs in a
kthread in kernel space, which only has 8k of stack in total. Some
circuitous paths in the kernel into code actually amount to having
much less stack available. I could allocate curve25519 variables on
the heap instead, or try to do various other traditional programming
techniques to reduce usage. But before I put too much time into that,
I was wondering if anybody else has ran into this limitation with
-donna and if there are other common portable implementations of
curve25519 that use less stack while remaining performant, or if there
are various other tricks to reduce stack usage.
Regards,
Jason
[1] https://www.wireguard.io/
_______________________________________________
Curves mailing list
https://moderncrypto.org/mailman/listinfo/curves
Jason A. Donenfeld
2016-11-09 21:01:47 UTC
Permalink
Hi Mike,

Indeed the stack usage for 32bit is huge. On MIPS32r2 with -O3, donna
uses 2872 bytes of stack!

I'd love to see your micro implementation! Let me know if you ever
succeed in MITing that. The performance hit might be a bit of a
bummer, though, but on 32bit 25% seems quite acceptable.

Jason
Mike Hamburg
2016-11-09 21:21:03 UTC
Permalink
OK, I’ll let you know.

It might be that you can undo some of my size optimizations, and set compile options, to get some or all of the perf back. For example, I don’t have a dedicated squaring routine because I’m optimizing for code size as well as stack size.

— Mike
Post by Jason A. Donenfeld
Hi Mike,
Indeed the stack usage for 32bit is huge. On MIPS32r2 with -O3, donna
uses 2872 bytes of stack!
I'd love to see your micro implementation! Let me know if you ever
succeed in MITing that. The performance hit might be a bit of a
bummer, though, but on 32bit 25% seems quite acceptable.
Jason
Jason A. Donenfeld
2016-11-10 00:13:36 UTC
Permalink
I just tried out the so called "tweet nacl implementation", because it
has very tiny stack requirements. It was 26 times slower than donna.
Wow!
Mike Hamburg
2016-12-09 18:52:33 UTC
Permalink
OK, I’ve released my tiny x25519 code as open source. This is the platform-agnostic version. The ARM asm version isn’t there, it’s staying proprietary for now :-/. But you can get most of the effect by intrinsic’ing umaal and friends.

https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.c <https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.c>
https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.h <https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.h>

Presumably this code could be accelerated somewhat by using a dedicated sqr() routine, or by unrolling loops and inlining code. Maybe I’ll get around to that at some point, but there’s a bunch more to be done with that repository to make it useful.

These files also have a totally nonstandard signature implementation, the only real advantage of which is that it adds very little code.

Let me know what you think, or if you find any bugs or missing features.

Cheers,
— Mike
Post by Jason A. Donenfeld
I just tried out the so called "tweet nacl implementation", because it
has very tiny stack requirements. It was 26 times slower than donna.
Wow!
Thomas DuBuisson
2016-12-09 19:03:57 UTC
Permalink
OK, I’ve released my tiny x25519 code as open source.
Do you think it would be worth proving equivalence of your code with
another implementation, such as -donna? If so, how similar are the
structures and fundamental operations?

-Thomas
This is the
platform-agnostic version. The ARM asm version isn’t there, it’s staying
proprietary for now :-/. But you can get most of the effect by
intrinsic’ing umaal and friends.
https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.c
https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.h
Presumably this code could be accelerated somewhat by using a dedicated
sqr() routine, or by unrolling loops and inlining code. Maybe I’ll get
around to that at some point, but there’s a bunch more to be done with that
repository to make it useful.
These files also have a totally nonstandard signature implementation, the
only real advantage of which is that it adds very little code.
Let me know what you think, or if you find any bugs or missing features.
Cheers,
— Mike
I just tried out the so called "tweet nacl implementation", because it
has very tiny stack requirements. It was 26 times slower than donna.
Wow!
_______________________________________________
Curves mailing list
https://moderncrypto.org/mailman/listinfo/curves
Mike Hamburg
2016-12-09 19:46:21 UTC
Permalink
Post by Thomas DuBuisson
Post by Mike Hamburg
OK, I’ve released my tiny x25519 code as open source.
Do you think it would be worth proving equivalence of your code with
another implementation, such as -donna? If so, how similar are the
structures and fundamental operations?
-Thomas
Lots of things are worth proving if you’re not the one doing it :-)

Seriously though. Donna uses unsaturated arithmetic, but my code uses saturated arithmetic to save memory, and because on some platforms like ARM that’s more efficient anyway. So I’d have to check mainly against carry-handling bugs.

My code uses a very similar formula for the ladder step to Donna. However, it uses the modified condswap schedule from RFC 7748 (a trivial change), and it recomputes BB = AA-E to save memory.

My code uses a different power chain than Donna for the inversion. With X25519_USE_POWER_CHAIN set, it uses a power chain that’s 2M longer but uses one fewer temporary, again to save memory. Without that flag, it uses a slower algorithm (square and usually multiply) that saves code size.

In sum, it should be possible to verify it using gfverif, but it might be easier to just show equivalence to some master Python or SAGE implementation instead of donna.

— Mike
Post by Thomas DuBuisson
Post by Mike Hamburg
This is the
platform-agnostic version. The ARM asm version isn’t there, it’s staying
proprietary for now :-/. But you can get most of the effect by
intrinsic’ing umaal and friends.
https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.c
https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.h
Presumably this code could be accelerated somewhat by using a dedicated
sqr() routine, or by unrolling loops and inlining code. Maybe I’ll get
around to that at some point, but there’s a bunch more to be done with that
repository to make it useful.
These files also have a totally nonstandard signature implementation, the
only real advantage of which is that it adds very little code.
Let me know what you think, or if you find any bugs or missing features.
Cheers,
— Mike
I just tried out the so called "tweet nacl implementation", because it
has very tiny stack requirements. It was 26 times slower than donna.
Wow!
_______________________________________________
Curves mailing list
https://moderncrypto.org/mailman/listinfo/curves
Jason A. Donenfeld
2016-11-09 20:24:47 UTC
Permalink
Post by Jason A. Donenfeld
much less stack available. I could allocate curve25519 variables on
the heap instead,
This disgusting solution lives here in this branch:
https://git.zx2c4.com/WireGuard/commit/?h=jd/curve25519-kmalloc

Barf.
Loading...