Discussion:
Curve19119: A legacy-level little brother of Curve25519
(too old to reply)
Björn Haase
2017-08-01 21:35:52 UTC
Permalink
Thank you, Mike, for your reply!
In addition to FourQ and Curve19119, other fast-ish options include
2^252-2^232-1 (but maybe not on tiny micros);
NIST’s 2^192-2^64-1 (but again maybe not on the M0?); and the
Goldi-like 2^216-2^108-1. But I agree that you
should stick with Curve25519 if you can afford it.
I've looked a bit closer at FourQ in the last days. I came to the
conclusion that the speedup factor of ~2 reported by Liu, Longa,
Pereira, Reparaz and Seo [LLPRS] in https://eprint.iacr.org/2017/434.pdf
are realistic for the AVR and MSP430. I moreover expect that this might
hold also for the M0 target (not reported by LLPRS).

It must, however, be clear that one buys this with significantly larger
RAM requirements (that is most precious, particularly in the very small
targets) and I believe that it will also be much harder to protect FourQ
against side channel attacks, especially regarding fault injection,
without loosing the speed advantage.
Elligator 2 works for any elliptic curve (over a field of odd characteristic IIRC) with at least one point of
order 2. So you can use it on FourQ.
Thank you for pointing this out. I think that this makes FourQ for these
targets also an interresting choice for PAKE.

I don't believe, however that the claimed huge speedup factor for the M4
from LLPRS actually holds. The Curve25519 figures taken for the
comparision with FourQ for the M4 (~1.4 million cycles cor X25519) are
in my perception much less optimized than the LLPRS implementation for
their "complex" composed field.

I expect that actual speed advantage of FourQ on the M4 might in fact
only be in the range of ~25%. For the M4, the UMAAL instruction gives
you carry accumulation essentially for free and I know that
constant-time field arithmetics on the M4 for the highly regular fe25519
field is actually a factor of ~1.5 faster for Curve25519 than the values
reported for FourQ by LLPRS. I think that this will eat up most of the
advantage from the endomorphisms. Actually this might hold also for
other "larger" targets bigger than the M4, where the regular 255 bit
field could be implemented ay faster than the more complex "complex"
field of FourQ?
In addition to FourQ and Curve19119, other fast-ish options include
NIST’s 2^192-2^64-1 (but again maybe not on the M0?);
The problem in my opinion is to implement the many conditional additions
for the solinas prime in constant time. At least, I did try it and I did
not find a way to implement that efficiently.
and the Goldi-like 2^216-2^108-1 or 2^252-2^232-1 (but maybe not on tiny micros);
Here I expect that the fact that the field is not really much smaller
than for 2^255 - 1 will be the reason that prevents significant speedups
in comparison to Curve25519.
I realized at some point that “SPAKE2-EE” is actually an existing protocol called “PAK”, which is why I didn’t ever write it up.
On a constrained device, a SPEKE variant (like PACE) is a faster and simpler choice. Also I thought the SPEKE patent expired, so why not use that? SPAKE2 (-EE and otherwise) has a stronger security proof, supports augmentation, and is potentially faster with precomputation.
Actually I did mix up EKE and SPEKE regarding the patent that has
expired this spring. We had chosen PACE predominantly because our first
products came on the market way earlier than the expiration date. I
think also that it is a good idea to make both parties contribute
entropy for generating a session-specific ephemeral generator, such as
done in PACE.

When using such an ephemeral generator, one will need to mix in the
entropy from both parties together with the password somehow. I think
that the method used by PACE for doing this is not so bad. We do want
predominantly protect the password. Note that regarding side channels it
might be beneficial to actually use the password only in a symmetric
encryption primitive (such as in PACE). I expect that it will be much
easier to protect a symmetric cypher against side channel leakage (e.g.
by masking) than e.g. a SHA variant function used for mixing the entropies.
I don’t know if there is an augmented version of SPEKE.
Yes it is possible to augment SPEKE. We have ourselves developed an
augmented version of PACE for our bluetooth product line. It's not yet
published but we are working on a paper on that topic.
It is possible to compute Elligator straight to affine with only one exponentiation, using the inverse square root trick.
You want x = -A/(1+ur^2), or -A-that, so you need to simultaneously divide by 1+ur^2 and compute the Legendre symbol of the putative y^2 = x(x^2+Ax+1). If you compute that projectively to get y^2 = a/b, then L(a/b) = L(ab), so you don’t need to do the division first. Let c=ab, d=1+ur^2. Then you can compute
s := (cd^2) ^ ((p-3)/2)
Then (s^2 * cd^2) = (cd^2)^(p-2) = 1/cd^2, so that 1/d = (s^2 * cd^2) * cd unless c=0 or d=0. But if c=0 then the point is (0,0) or infinity, and if d=0 then it’s infinity, so in either case you should return 0 anyway. At the same time, (s * cd^2) = (cd^2)^((p-1)/2) = L(cd^2) = L(c) unless d=0. So that gives you the inverse and Legendre symbol at the same time. A similar trick with (cd^2)^((p-5)/8) can give you the square root, but you don’t actually need that here if you’re using an x-only ladder.
Thank you for pointing this out. I did not yet find time to analyze the
details, but this will indeed give an additional speedup for any SPEKE
and PACE implementation running on an curve for that Elligator2 is suitable.

Thank you again, Mike, for your valuable feedback.

Yours,

Björn.
Mike Hamburg
2017-08-01 22:45:36 UTC
Permalink
In addition to FourQ and Curve19119, other fast-ish options include
NIST’s 2^192-2^64-1 (but again maybe not on the M0?);
The problem in my opinion is to implement the many conditional additions for the solinas prime in constant time. At least, I did try it and I did not find a way to implement that efficiently.
Right, that’s why not on the M0. On the M4 with all the DSP extensions it might work better.
and the Goldi-like 2^216-2^108-1 or 2^252-2^232-1 (but maybe not on tiny micros);
Here I expect that the fact that the field is not really much smaller than for 2^255 - 1 will be the reason that prevents significant speedups in comparison to Curve25519.
Like the P192 field, it depends on the microcontroller.

I’m not very familiar with optimizations for the M0 or M3, so I can’t really comment there. The M0 has no widening multiplier and the M3 has a non-constant-time widening multiplier.

On the M4, you are probably using packed arithmetic with UMAAL, so the 216-bit field doesn’t help much.

On other 32-bit microcontrollers (eg PowerPC, ARC, RV32G or maybe ARM with NEON), the 216-bit field should be much faster than a “ref10”-style Curve25519 implementation: it uses 48 multiply-accumulates per field mul instead of 100, with a smaller speedup on squaring. Considering you also have fewer scalar bits, you might reach a 2:1 speedup overall.

Cheers,
— Mike

Loading...