[curves] curve25519-donna stack usage

Discussion:

Jason A. Donenfeld

2016-11-09 18:00:20 UTC

Hey guys,

I use a curve25519-donna variant inside of WireGuard [1]. It runs in a
kthread in kernel space, which only has 8k of stack in total. Some
circuitous paths in the kernel into code actually amount to having
much less stack available. I could allocate curve25519 variables on
the heap instead, or try to do various other traditional programming
techniques to reduce usage. But before I put too much time into that,
I was wondering if anybody else has ran into this limitation with
-donna and if there are other common portable implementations of
curve25519 that use less stack while remaining performant, or if there
are various other tricks to reduce stack usage.

Regards,
Jason

[1] https://www.wireguard.io/

Mike Hamburg

2016-11-09 20:00:09 UTC

Permalink

Hi Jason,

By my measurements, curve25519-donna-c64 -O3 uses 840 bytes of stack on x86-64. With -O2, itâs 1128 bytes. But the x86 version uses much more stack, so maybe thatâs your problem.

I have a tiny implementation of Curve25519. According to -fstack-usage, it uses as few as 372 bytes of stack on x86-64 and 336 bytes on x86, depending on compilation options.

Since itâs optimized for size, it doesnât perform as well as Donna. The factor is 2-4 on x86-64 depending on compilation options, but only ~25% slower on x32 if Iâm measuring correctly. The code is relatively portable, detecting bit size using __SIZEOF_INT128__. My code also has ARM asm intrinsics, so it might outperform Donna on some ARM platforms. I havenât benched this.

My code also supports nonstandard x-only signature production and verification at the cost of slightly higher stack usage.

This implementation is part of a package that I wrote at work, so I canât share it with you yet. Iâm trying to get it open sourced under an MIT license, but I have to talk to legal about this. So itâs portable but not common. But let me know if you want it, it might help me get it through legal.

Cheers,
â Mike

Post by Jason A. Donenfeld
Hey guys,
I use a curve25519-donna variant inside of WireGuard [1]. It runs in a
kthread in kernel space, which only has 8k of stack in total. Some
circuitous paths in the kernel into code actually amount to having
much less stack available. I could allocate curve25519 variables on
the heap instead, or try to do various other traditional programming
techniques to reduce usage. But before I put too much time into that,
I was wondering if anybody else has ran into this limitation with
-donna and if there are other common portable implementations of
curve25519 that use less stack while remaining performant, or if there
are various other tricks to reduce stack usage.
Regards,
Jason
[1] https://www.wireguard.io/
_______________________________________________
Curves mailing list
https://moderncrypto.org/mailman/listinfo/curves

Jason A. Donenfeld

2016-11-09 21:01:47 UTC

Permalink

Hi Mike,

Indeed the stack usage for 32bit is huge. On MIPS32r2 with -O3, donna
uses 2872 bytes of stack!

I'd love to see your micro implementation! Let me know if you ever
succeed in MITing that. The performance hit might be a bit of a
bummer, though, but on 32bit 25% seems quite acceptable.

Jason

Mike Hamburg

2016-11-09 21:21:03 UTC

Permalink

OK, Iâll let you know.

It might be that you can undo some of my size optimizations, and set compile options, to get some or all of the perf back. For example, I donât have a dedicated squaring routine because Iâm optimizing for code size as well as stack size.

â Mike

Post by Jason A. Donenfeld
Hi Mike,
Indeed the stack usage for 32bit is huge. On MIPS32r2 with -O3, donna
uses 2872 bytes of stack!
I'd love to see your micro implementation! Let me know if you ever
succeed in MITing that. The performance hit might be a bit of a
bummer, though, but on 32bit 25% seems quite acceptable.
Jason

Jason A. Donenfeld

2016-11-10 00:13:36 UTC

Permalink

I just tried out the so called "tweet nacl implementation", because it
has very tiny stack requirements. It was 26 times slower than donna.
Wow!

Mike Hamburg

2016-12-09 18:52:33 UTC

Permalink

OK, Iâve released my tiny x25519 code as open source. This is the platform-agnostic version. The ARM asm version isnât there, itâs staying proprietary for now :-/. But you can get most of the effect by intrinsicâing umaal and friends.

https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.c <https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.c>
https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.h <https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.h>

Presumably this code could be accelerated somewhat by using a dedicated sqr() routine, or by unrolling loops and inlining code. Maybe Iâll get around to that at some point, but thereâs a bunch more to be done with that repository to make it useful.

These files also have a totally nonstandard signature implementation, the only real advantage of which is that it adds very little code.

Let me know what you think, or if you find any bugs or missing features.

Cheers,
â Mike

Post by Jason A. Donenfeld
I just tried out the so called "tweet nacl implementation", because it
has very tiny stack requirements. It was 26 times slower than donna.
Wow!

Thomas DuBuisson

2016-12-09 19:03:57 UTC

Permalink

OK, I’ve released my tiny x25519 code as open source.

Do you think it would be worth proving equivalence of your code with
another implementation, such as -donna? If so, how similar are the
structures and fundamental operations?

-Thomas

This is the
platform-agnostic version. The ARM asm version isn’t there, it’s staying
proprietary for now :-/. But you can get most of the effect by
intrinsic’ing umaal and friends.
https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.c
https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.h
Presumably this code could be accelerated somewhat by using a dedicated
sqr() routine, or by unrolling loops and inlining code. Maybe I’ll get
around to that at some point, but there’s a bunch more to be done with that
repository to make it useful.
These files also have a totally nonstandard signature implementation, the
only real advantage of which is that it adds very little code.
Let me know what you think, or if you find any bugs or missing features.
Cheers,
— Mike
I just tried out the so called "tweet nacl implementation", because it
has very tiny stack requirements. It was 26 times slower than donna.
Wow!
_______________________________________________
Curves mailing list
https://moderncrypto.org/mailman/listinfo/curves

Mike Hamburg

2016-12-09 19:46:21 UTC

Permalink

Post by Thomas DuBuisson

Post by Mike Hamburg
OK, Iâve released my tiny x25519 code as open source.

Do you think it would be worth proving equivalence of your code with
another implementation, such as -donna? If so, how similar are the
structures and fundamental operations?
-Thomas

Lots of things are worth proving if youâre not the one doing it :-)

Seriously though. Donna uses unsaturated arithmetic, but my code uses saturated arithmetic to save memory, and because on some platforms like ARM thatâs more efficient anyway. So Iâd have to check mainly against carry-handling bugs.

My code uses a very similar formula for the ladder step to Donna. However, it uses the modified condswap schedule from RFC 7748 (a trivial change), and it recomputes BB = AA-E to save memory.

My code uses a different power chain than Donna for the inversion. With X25519_USE_POWER_CHAIN set, it uses a power chain thatâs 2M longer but uses one fewer temporary, again to save memory. Without that flag, it uses a slower algorithm (square and usually multiply) that saves code size.

In sum, it should be possible to verify it using gfverif, but it might be easier to just show equivalence to some master Python or SAGE implementation instead of donna.

â Mike

Post by Thomas DuBuisson

Post by Mike Hamburg
This is the
platform-agnostic version. The ARM asm version isnât there, itâs staying
proprietary for now :-/. But you can get most of the effect by
intrinsicâing umaal and friends.
https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.c
https://sourceforge.net/p/strobe/code/ci/master/tree/x25519.h
Presumably this code could be accelerated somewhat by using a dedicated
sqr() routine, or by unrolling loops and inlining code. Maybe Iâll get
around to that at some point, but thereâs a bunch more to be done with that
repository to make it useful.
These files also have a totally nonstandard signature implementation, the
only real advantage of which is that it adds very little code.
Let me know what you think, or if you find any bugs or missing features.
Cheers,
â Mike
I just tried out the so called "tweet nacl implementation", because it
has very tiny stack requirements. It was 26 times slower than donna.
Wow!
_______________________________________________
Curves mailing list
https://moderncrypto.org/mailman/listinfo/curves

Jason A. Donenfeld

2016-11-09 20:24:47 UTC

Permalink

Post by Jason A. Donenfeld
much less stack available. I could allocate curve25519 variables on
the heap instead,

This disgusting solution lives here in this branch:
https://git.zx2c4.com/WireGuard/commit/?h=jd/curve25519-kmalloc

Barf.