Discussion:
new 25519 measurements of formally verified implementations
Jason A. Donenfeld
2018-01-17 19:44:01 UTC
Hello,

For use in WireGuard [0], I've been evaluating hacl-star [1] and
fiat-crypto [2], which both have formally verified implementations of
25519. The former has a 64-bit implementation that has made its way
into NSS and the latter has a 32-bit implementation that has made its
way into BoringSSL.

In trying to benchmark these, I found ordinary techniques much too
noisy, so I've devised a way of loading this code into the kernel and
disabling all interrupts. This results in very little variation
between trials. If you're running this on your workstation, you'll
encounter a complete lockup of all interactivity while the tests are
in progress. You can find the tool here:
https://git.zx2c4.com/kbench9000/about/
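The idea behind the tool can be approximated in userspace, minus the interrupt disabling: time a tight loop of calls over many trials and take the median, which throws away the trials inflated by preemption. The sketch below is illustrative only (the function and parameter names are mine, not kbench9000's), and nanosecond timers are a rough stand-in for the kernel module's cycle counting:

```python
# Userspace sketch of the benchmarking idea: run many trials of a tight
# loop of calls and report the median trial, which is robust against
# scheduler/interrupt noise. (The kernel-module approach goes further by
# disabling interrupts outright.) All names here are illustrative.
import statistics
import time

def bench(func, calls_per_trial=100, trials=25):
    """Return the median nanoseconds per call of func across trials."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter_ns()
        for _ in range(calls_per_trial):
            func()
        elapsed = time.perf_counter_ns() - start
        samples.append(elapsed / calls_per_trial)
    # The median discards trials that were interrupted mid-loop.
    return statistics.median(samples)

if __name__ == "__main__":
    ns = bench(lambda: sum(range(100)))
    print(f"{ns:.0f} ns per call")
```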

Loading in a few implementations and running them on my big laptop and
on my mini laptop yielded these results:

Xeon E3-1505M v5 [3] -- 14nm Skylake Laptop
donna64: 121794 cycles per call
hacl64: 109783 cycles per call
sandy2x: 103071 cycles per call
amd64: 108580 cycles per call
fiat32: 232710 cycles per call
donna32: 411588 cycles per call

i5-5200U [4] -- 14 nm Broadwell Laptop
donna64: 137200 cycles per call
hacl64: 127001 cycles per call
sandy2x: 126201 cycles per call
amd64: 123694 cycles per call
fiat32: 253234 cycles per call
donna32: 438816 cycles per call

So it seems that the formally verified implementations are now quite
competitive. The speed-up of fiat32 over donna32 also comes with the
advantage of using significantly less stack space, which is a big help
(to me at least) on smaller systems.

Regards,
Jason


[0] https://www.wireguard.com
[1] https://github.com/mitls/hacl-star
[2] https://github.com/mit-plv/fiat-crypto
[3] https://ark.intel.com/products/89608/Intel-Xeon-Processor-E3-1505M-v5-8M-Cache-2_80-GHz
[4] https://ark.intel.com/products/85212/Intel-Core-i5-5200U-Processor-3M-Cache-up-to-2_70-GHz
D. J. Bernstein
2018-01-26 12:06:00 UTC
Tung Chou's sandy2x code was (as the name suggests) optimized for Sandy
Bridge. For Haswell and Skylake, the slides from Julio Lopez in

https://hyperelliptic.org/tanja/lc17/ascrypto.html

report two followup implementations producing roughly 25% speedups for
Curve25519; see slide 67/83.

I do think that the hacl64 Curve25519 speeds are fast enough for pretty
much everybody, and verification is certainly a huge plus, but people
who want more speed should be aware of what's possible---and people
working on Curve25519 verification shouldn't think they're done yet!

---Dan
Jason A. Donenfeld
2018-01-28 00:42:42 UTC
Hey Dan,

Thanks for the pointer and the link to the slides. I've heard about
this implementation before, but I was never able to get a hold of the
source to try it out. I just emailed him to see if it's available
somewhere. Looks like there's a conference paper from Latincrypt 2015
that describes its implementation.

Jason
Jason A. Donenfeld
2018-01-31 15:37:12 UTC
I've loaded fiat64 into the latest kbench curve testing branch, and
it seems to be the fastest generic C version, at least on my Skylake
laptop, edging slightly ahead of hacl64:

donna64: 121790 cycles per call
hacl64: 109782 cycles per call
fiat64: 108984 cycles per call
sandy2x: 102996 cycles per call
amd64: 108563 cycles per call
fiat32: 232826 cycles per call
donna32: 412092 cycles per call
Armando Faz Hernández
2018-01-31 20:36:24 UTC
Post by Jason A. Donenfeld
I've loaded fiat64 into the latest kbench curve testing branch, and
it seems to be the fastest generic C version, at least on my Skylake
Hi Jason,

It would be interesting to see your benchmarks run against the code
in [1], which is derived from our SAC 2017 paper [2].

[1] https://github.com/armfazh/rfc7748_precomputed
[2] How to (pre-)compute a ladder, SAC 2017.
https://doi.org/10.1007/978-3-319-72565-9_9


--
Armando Faz Hernández, PhD Candidate.
Instituto de Computação, Unicamp.
Campinas, Brasil.
Jason A. Donenfeld
2018-02-01 11:36:35 UTC
Hi Armando,

Sure, I'll have a look at this.

I've also found https://github.com/armfazh/hp-ecc-vec . Is this the
code related to your 2015 paper entitled, "Fast Implementation of
Curve25519 Using AVX2"? Or the presentation Dan mentioned a few posts
up? Or both at once?

Also, would you consider relicensing these as GPLv2 so that they can
be used in the Linux kernel? (Alternatively, BSD/MIT/Public Domain for
everyone else?)

Jason
Jason A. Donenfeld
2018-02-01 13:19:53 UTC
Hi Armando,

I've started importing your precomputation implementation into kernel
space for use in kbench9000 (and in WireGuard and the kernel crypto
library too, of course).

- The first problem remains the license. The kernel requires
GPLv2-compatible code. GPLv3 isn't compatible with GPLv2. This isn't
up to me at all, unfortunately, so this stuff will have to be licensed
differently in order to be useful.

- It looks like the precomputation implementation is failing some unit
tests! Perhaps it's not properly reducing incoming public points?

{
.private = { 1 },
.public = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
},
.result = { 0xb3, 0x2d, 0x13, 0x62, 0xc2, 0x48, 0xd6, 0x2f, 0xe6,
0x26, 0x19, 0xcf, 0xf0, 0x4d, 0xd4, 0x3d, 0xb7, 0x3f, 0xfc, 0x1b,
0x63, 0x8, 0xed, 0xe3, 0xb, 0x78, 0xd8, 0x73, 0x80, 0xf1, 0xe8, 0x34 }
}

[ 8855.567043] Expected: b3 2d 13 62 c2 48 d6 2f e6 26 19 cf f0 4d d4
3d .-.b.H./.&...M.=
[ 8855.567044] Expected: b7 3f fc 1b 63 08 ed e3 0b 78 d8 73 80 f1 e8
34 .?..c....x.s...4
[ 8855.567046] Actual: eb 1b 2b df 13 6a 3e bc 30 9f a4 f7 a1 95 a7 08
..+..j>.0.......
[ 8855.567047] Actual: 11 7f 7c e4 6e 65 a4 44 48 22 4d 00 78 54 70 5b
..|.ne.DH"M.xTp[
[ 8855.567048] kbench9000: precomp self-test 4: FAIL

There's the vector if you'd like to play with it. The other test
vectors I have do pass, though, which is good I suppose.

On the plus side, the implementation is super fast:

With turbo on, on my E3-1505Mv5, I'm getting:

donna64: 121793 cycles per call
hacl64: 109793 cycles per call
fiat64: 108937 cycles per call
sandy2x: 103003 cycles per call
amd64: 108688 cycles per call
precomp: 83391 cycles per call
fiat32: 232835 cycles per call
donna32: 411511 cycles per call

The benchmark of your precomputation implementation has what's
referred to by medical doctors as "less digits".

Regards,
Jason
Armando Faz Hernández
2018-02-23 20:08:46 UTC
Post by Jason A. Donenfeld
Hi Armando,
I've started importing your precomputation implementation into kernel
space for use in kbench9000 (and in WireGuard and the kernel crypto
library too, of course).
- The first problem remains the license. The kernel requires
GPLv2-compatible code. GPLv3 isn't compatible with GPLv2. This isn't
up to me at all, unfortunately, so this stuff will have to be licensed
differently in order to be useful.
The rfc7748_precomputed library is now released under LGPLv2.1.
We are happy to see our code integrated in more projects.
Post by Jason A. Donenfeld
- It looks like the precomputation implementation is failing some unit
tests! Perhaps it's not properly reducing incoming public points?
There's the vector if you'd like to play with it. The other test
vectors I have do pass, though, which is good I suppose.
Thanks for this observation. The code was failing to handle some carry bits,
producing incorrect outputs for numbers between 2p and 2^256. I have now
rewritten some of the GF(2^255-19) operations to cover all of these cases.
More tests were added, including fuzz testing against the HACL implementation.

Code is available at:
https://github.com/armfazh/rfc7748_precomputed (commit c79ca5e...)

*Disclaimer: More testing and work is needed for the GF(2^448-2^224-1)
arithmetic.
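The carry problem described above shows up in the standard "fold the top bits" reduction for 2^255 - 19: since 2^255 = 19 (mod p), the bits above position 254 are folded back in as top * 19, but for inputs between 2p and 2^256 a single fold can itself carry back out, so a second fold and a final conditional subtraction are needed. A sketch with plain Python integers standing in for the limb arithmetic:

```python
# Sketch of the reduction pitfall for GF(2^255 - 19). Because
# 2^255 = 19 (mod p), the bits above position 254 can be folded back in
# as (top * 19). One fold is not enough for inputs in [2p, 2^256): the
# fold can carry out again, so a second fold and a final conditional
# subtraction are required for a fully reduced result.
P = 2**255 - 19
MASK = 2**255 - 1

def reduce256(x: int) -> int:
    """Fully reduce a value in [0, 2^256) modulo p."""
    assert 0 <= x < 2**256
    x = (x & MASK) + 19 * (x >> 255)   # first fold
    x = (x & MASK) + 19 * (x >> 255)   # second fold absorbs the carry
    if x >= P:                         # final conditional subtraction
        x -= P
    return x

def one_fold(x: int) -> int:
    """The incomplete variant: one fold, no conditional subtract."""
    return (x & MASK) + 19 * (x >> 255)

# A value in [2p, 2^256) exposes the difference:
x = 2**256 - 1
print(one_fold(x) == x % P)   # prints False: one fold is not enough
print(reduce256(x) == x % P)  # prints True: the full chain is correct
```

Optimized limb-based code does the same folds across 64-bit words, which is where a dropped carry bit produces exactly the kind of wrong results seen in the test vector.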
Post by Jason A. Donenfeld
donna64: 121793 cycles per call
hacl64: 109793 cycles per call
fiat64: 108937 cycles per call
sandy2x: 103003 cycles per call
amd64: 108688 cycles per call
precomp: 83391 cycles per call
fiat32: 232835 cycles per call
donna32: 411511 cycles per call
The benchmark of your precomputation implementation has what's
referred to by medical doctors as "less digits".
Due to the bug fixes, a slight loss of performance was observed; however,
other operations were also optimized, counteracting the losses.
Let us know about your new measurements.



--
Armando Faz Hernández, PhD Candidate.
Instituto de Computação, Unicamp.
Campinas, Brasil.
Jason A. Donenfeld
2018-02-23 21:42:10 UTC
Hey Armando,

Thanks for taking the time to fix things up.

I've loaded this into my kbench9000 software
<https://git.zx2c4.com/kbench9000/about/>, in the branch
"jd/curve-comparison", and tested it on two Skylake systems -- a
laptop and a server. With turbo disabled, results are fairly similar
between the two:

Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
donna64: 160942 cycles per call
hacl64: 140902 cycles per call
fiat64: 144106 cycles per call
sandy2x: 136074 cycles per call
precomp_bmi2: 121350 cycles per call
precomp_adx: 117676 cycles per call
amd64: 143628 cycles per call
fiat32: 307971 cycles per call
donna32: 544254 cycles per call

Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
donna64: 162308 cycles per call
hacl64: 141948 cycles per call
fiat64: 146188 cycles per call
sandy2x: 135502 cycles per call
precomp_bmi2: 121061 cycles per call
precomp_adx: 117636 cycles per call
amd64: 146382 cycles per call
fiat32: 307777 cycles per call
donna32: 548081 cycles per call

Your implementations are the two precomp_ ones. I split them into
separate paths for adx and for bmi2, so that both could be tested
together on one system.

Regards,
Jason
