Discussion:
[curves] FourQ
Trevor Perrin
2015-09-12 22:33:04 UTC
Permalink
There's an updated paper and new code for MSR's FourQ curve:

http://eprint.iacr.org/2015/565

http://research.microsoft.com/en-us/projects/fourqlib/

I tossed the numbers into the spreadsheet at [1], but the paper has a
better performance analysis across several platforms.

What do people think?

Without using the endomorphisms the performance is better than 25519,
and then endomorphisms are close to a 2x speedup. And if unencumbered
use of the endomorphisms is just ~4 years away [2], that's not that
long, in the scheme of things.
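
For intuition, here's a toy Python sketch of where the ~2x comes from.
It's not FourQ's actual four-dimensional decomposition; the additive
group, the eigenvalue lam, and the scalar split below are stand-ins,
just to show that sharing one doubling chain across two half-length
scalars roughly halves the doublings:

  import secrets

  n = (1 << 127) - 1            # stand-in prime group order
  lam = secrets.randbelow(n)    # stand-in eigenvalue: phi(P) = lam*P

  def add(a, b): return (a + b) % n    # toy "point addition" in Z/n
  def dbl(a):    return (2 * a) % n    # toy "point doubling"

  def joint_mul(k1, k2, P, Q):
      # Straus-Shamir: k1*P + k2*Q with a single shared doubling chain
      R, doublings = 0, 0
      for i in range(max(k1.bit_length(), k2.bit_length()) - 1, -1, -1):
          R = dbl(R); doublings += 1
          if (k1 >> i) & 1: R = add(R, P)
          if (k2 >> i) & 1: R = add(R, Q)
      return R, doublings

  k1, k2 = secrets.randbits(64), secrets.randbits(64)  # half-length parts
  k = (k1 + k2 * lam) % n       # the full scalar they stand for
  P, phiP = 1, lam              # P and its toy endomorphism image
  R, d = joint_mul(k1, k2, P, phiP)
  assert R == k * P % n
  print(d, "doublings instead of", k.bit_length())

(Real GLV-style code also has to decompose a given k into half-length
(k1, k2) via lattice reduction; the sketch sidesteps that by choosing
k1 and k2 first.)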


Trevor

[1] https://docs.google.com/spreadsheets/d/1SO3NGX-EgIZ1slw9uExb5FoeFy5TVkuA2lEutP6roYI/edit#gid=0

[2] https://moderncrypto.org/mail-archive/curves/2014/000133.html
Michael Hamburg
2015-09-13 10:33:44 UTC
Permalink
Nice. It’s what, 20% faster than before?

My impression had been that GF((2^127-1)^2) has somewhat slower field multiplication than the fastest GF(p) at the same security level (e.g., Montgomery-friendly primes), and makes up for it with faster inversion. But crunching the numbers, it looks like MSR’s latest code has field arithmetic about as fast as any GF(p), at least on the processors they measured, and of course still much faster inversion. (I have no idea how the comparison would turn out on ARM or Broadwell, not having studied that area very carefully.) It looks like they’re using a btr+adc lazy reduction, and it turns out to be very efficient.
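
To make that concrete, here’s a rough Python model of the field (plain
bignum arithmetic, so it says nothing about the btr+adc trick, which
only shows up at the assembly level). The structural point is that
inverting in GF(p^2) costs just one inversion in the 127-bit base
field:

  p = (1 << 127) - 1     # p = 3 (mod 4), so i^2 = -1 gives a field

  def fp2_mul(x, y):
      # (a + b*i)(c + d*i), Karatsuba-style: 3 base-field mults
      a, b = x; c, d = y
      t0, t1 = a * c % p, b * d % p
      t2 = (a + b) * (c + d) % p
      return ((t0 - t1) % p, (t2 - t0 - t1) % p)

  def fp2_inv(x):
      # (a + b*i)^-1 = (a - b*i) / (a^2 + b^2): one GF(p) inversion
      a, b = x
      inv = pow((a * a + b * b) % p, p - 2, p)  # Fermat inversion
      return (a * inv % p, -b * inv % p)

  x = (123456789, 987654321)
  assert fp2_mul(x, fp2_inv(x)) == (1, 0)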

Cheers,
— Mike
Post by Trevor Perrin
http://eprint.iacr.org/2015/565
http://research.microsoft.com/en-us/projects/fourqlib/
I tossed the numbers into the spreadsheet at [1], but the paper has a
better performance analysis across several platforms.
What do people think?
Without using the endomorphisms the performance is better than 25519,
and then endomorphisms are close to a 2x speedup. And if unencumbered
use of the endomorphisms is just ~4 years away [2], that's not that
long, in the scheme of things.
Trevor
[1] https://docs.google.com/spreadsheets/d/1SO3NGX-EgIZ1slw9uExb5FoeFy5TVkuA2lEutP6roYI/edit#gid=0
[2] https://moderncrypto.org/mail-archive/curves/2014/000133.html
D. J. Bernstein
2015-09-15 15:49:28 UTC
Permalink
Post by Trevor Perrin
What do people think?
The critical statement is "59,000" Haswell cycles for FourQ, compared to
60556 Haswell cycles (reported by eBATS) for Kummer.

What's amusing about this is that Haswell is the only platform where we
didn't bother writing an asm implementation for Kummer---this is a very
simple C implementation with intrinsics. Anyone want to bet on what the
results of an asm implementation will be?
Post by Trevor Perrin
Without using the endomorphisms the performance is better than 25519
Somewhat faster than 25519, but much slower _and_ less conservative than
Kummer. If the endomorphisms aren't used then the rankings are clearly

fast: Kummer, then FourQ, then 25519
conservative: 25519, then Kummer, then FourQ

so FourQ isn't Pareto-optimal. Being able to use the endomorphisms to
save time is the only thing that makes FourQ potentially interesting,
but it's also exactly the part covered by the GLV patents.

---Dan
Michael Hamburg
2015-09-15 17:13:33 UTC
Permalink
Post by D. J. Bernstein
Post by Trevor Perrin
What do people think?
The critical statement is "59,000" Haswell cycles for FourQ, compared to
60556 Haswell cycles (reported by eBATS) for Kummer.
What's amusing about this is that Haswell is the only platform where we
didn't bother writing an asm implementation for Kummer---this is a very
simple C implementation with intrinsics. Anyone want to bet on what the
results of an asm implementation will be?
Post by Trevor Perrin
Without using the endomorphisms the performance is better than 25519
Somewhat faster than 25519, but much slower _and_ less conservative than
Kummer. If the endomorphisms aren't used then the rankings are clearly
fast: Kummer, then FourQ, then 25519
conservative: 25519, then Kummer, then FourQ
so FourQ isn't Pareto-optimal. Being able to use the endomorphisms to
save time is the only thing that makes FourQ potentially interesting,
but it's also exactly the part covered by the GLV patents.
---Dan
I agree that FourQ is not a conservative option.

But while patents chill the use of this work for the next few years,
FourQ does have the advantage over Kummer that it can be used for
signatures and other non-ECDH systems.

Also, it is an interesting curve whether or not it’s actually the absolute
fastest.

— Mike
Salz, Rich
2015-09-15 17:20:54 UTC
Permalink
Post by Michael Hamburg
Also, it is an interesting curve whether or not it’s actually the absolute
fastest.
And it has a cool name.
D. J. Bernstein
2015-09-15 22:01:25 UTC
Permalink
Post by Michael Hamburg
FourQ does have the advantage over Kummer that it can be used for
signatures and other non-ECDH systems.
That's an obsolete view of Kummer. What "hyperand" showed is how to
build groups that can be viewed simultaneously as small-coefficient
Kummer surfaces, for fast ECDH, and as Edwards curves, for fast
signatures etc. For example, near the end of

http://cr.yp.to/talks/2015.07.09/slides-djb-20150709-a4.pdf

you can find an elliptic curve over F_{p^2}, where of course p is
2^127-1, having

* group order 32*prime (higher security than FourQ's 392*prime),
* twist order 12*prime (much safer than FourQ), and
* full support for the Kummer ladder with 5-digit coefficients.

Another example has group order 720*prime, twist order 260*prime, and
amazingly small coefficients---even better for computation than the
Gaudry--Schost surface.

I don't think FourQ is doing anything for signatures that won't work for
these curves---it's the same field, the same curve shape, etc. So the
only interesting question is DH, which is why I commented before on DH.

It's unfortunate that the FourQ paper doesn't acknowledge what the
previous literature says about this. The principle here can't be as
simple as "we don't care about speeds until implementations have been
published": the authors also fail to compare to, e.g., the speeds from
Andrew Moon and the more recent speeds from Tung Chou on most of the
platforms they've selected, even though all of that code was publicly
available before the first version of this paper appeared.

I suppose that seeing this sort of stunt provides extra incentive for
designers and implementors to submit to eBATS, and for me to hurry up
and get eBATS updates out the door faster. I've been working on a new
system that will get benchmarks done much more quickly (with the same
API for implementations), but I realize that this shouldn't take time
away from maintaining the existing system.

---Dan
Trevor Perrin
2015-09-16 00:13:21 UTC
Permalink
Post by D. J. Bernstein
It's unfortunate that the FourQ paper doesn't acknowledge what the
previous literature says about this. The principle here can't be as
simple as "we don't care about speeds until implementations have been
published": the authors also fail to compare to, e.g., the speeds from
Andrew Moon and the more recent speeds from Tung Chou on most of the
platforms they've selected, even though all of that code was publicly
available before the first version of this paper appeared.
I don't understand your criticism:

They cite Tung Chou's (not-yet-conference-published!) "sandy2x" work
for Sandy Bridge and Ivy Bridge, which according to [1] are the
platforms it was "tailored for". Those are also the platforms Tung
Chou presented numbers for in [2].

Tung Chou's 25519 code brings performance closer to FourQ (without
endomorphisms) on those platforms, but 25519 is still slower, and much
slower with endomorphisms. Maybe having numbers for Haswell etc. would
flesh out the picture, but I wouldn't expect it to change the overall
conclusion that FourQ is faster.

So it seems like they did a good job comparing with recent state of the art.

My only quibble with the FourQ performance analysis is that summing the
var-base and fixed-base numbers into an "ephem. DH" figure and claiming
this represents the "context of a real-world protocol" isn't that helpful.

Different protocols will have different ratios of variable-base to
fixed-base (keygen) operations. MSR likes to argue against the
Montgomery ladder, since it doesn't lend itself to fixed-base
optimization, but I think that's a separate debate that doesn't add
much here.

Trevor

[1] https://sites.google.com/a/crypto.tw/blueprint/
[2] http://csrc.nist.gov/groups/ST/ecc-workshop-2015/presentations/session6-chou-tung.pdf
D. J. Bernstein
2015-09-16 10:21:19 UTC
Permalink
Post by Trevor Perrin
I don't understand your criticism
People who are being encouraged to venture beyond conservative ECC
should be given an accurate picture of what the speedup actually is.

Certainly there _is_ a speedup. This isn't news; see, e.g., the Kummer
paper and the literature cited there. The problem is that the FourQ
paper quantitatively _exaggerates_ the FourQ speedup. Consider, for
example, the following statement from the paper:

When considering the results for the DH key exchange, FourQ performs
1.8--3.5 times faster than Curve25519.

The ratios here come from Table 5, dividing the "ephem. DH" numbers
(what they mean is one-time DH: fixed-base time + variable-base time)
between

* the "FourQ" column and
* the best number in the "Curve25519" columns.

For example, the "3.5" claim comes from the Haswell row of the table,
where the FourQ paper reports

* "92" thousand Haswell cycles for FourQ, from 59000 variable-base
plus 33000 fixed-base, and

* "324" thousand Haswell cycles for Curve25519, from 162000
variable-base plus 162000 fixed-base.

It's true that 324/92 is 3.5. What's wildly inaccurate is the claim that
Curve25519 requires 162000 Haswell cycles for fixed-base scalar
multiplication. This claim was already massively disproven by Andrew
Moon's published software years ago, which the authors fail to cite or
compare to even though I specifically told them months ago that they
were obliged to compare to it. They do cite Tung Chou's newer software,
which was also publicly available months before the first version of
this paper and also massively disproves the same claim, but they fail to
benchmark that software on Haswell.
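
Concretely, here is the table arithmetic, plus what happens to the
headline ratio if one plugs in a faster fixed-base time for Curve25519
(the 50000 below is a placeholder, not a measurement; the point is only
the direction the ratio moves):

  fourq = 59000 + 33000       # FourQ: variable-base + fixed-base
  c25519 = 162000 + 162000    # the paper's Curve25519 "ephem. DH" total

  print(c25519 / fourq)              # 3.52...: the "3.5" claim
  print((162000 + 50000) / fourq)    # placeholder fast fixed-base: ~2.3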

Most applications don't care about fixed-base speed. That's why, for
many years, nobody bothered writing fast fixed-base Curve25519 code,
even though everybody knew that huge speedups were possible. Eventually
people _did_ write fast fixed-base Curve25519 code, and any application
that cares about fixed-base performance will simply use that code. The
FourQ paper's pretense that these speedups don't exist---in particular,
its failure to compare to this code---really isn't defensible.

The "Pineview" and "Kaveri" figures are corrupted in the same way, and
on top of that are missing an important disclaimer that the previous
literature didn't bother optimizing for these platforms. To some extent
this is also the Haswell situation---for example, the 162000 cycles are
from the Westmere-optimized code in the Ed25519 paper---although Kummer
did target Haswell, and the gap between Ivy Bridge and Haswell isn't as
large as the gap to other CPUs.

The reason that this is important is that the literature clearly shows
that proper ECC optimization is CPU-specific. Anybody can take some
obscure CPU M that hasn't been seriously targeted before and produce
speedups on M. This is helpful for people who want the best performance
on M, it can be useful information regarding M optimization techniques,
and it can predict that some primitives will have an advantage on M---
but reporting "speedup from non-M-optimized implementation of X to
M-optimized implementation of Y" as if it were "speedup from X to Y" is
unreasonable.

Exaggerations are, unfortunately, pervasive in the performance picture
painted in the FourQ paper.

---Dan
Trevor Perrin
2015-09-16 23:33:55 UTC
Permalink
Post by D. J. Bernstein
Certainly there _is_ a speedup. This isn't news; see, e.g., the Kummer
paper and the literature cited there. The problem is that the FourQ
paper quantitatively _exaggerates_ the FourQ speedup. Consider, for
When considering the results for the DH key exchange, FourQ performs
1.8--3.5 times faster than Curve25519.
The ratios here come from Table 5, dividing the "ephem. DH" numbers
(what they mean is one-time DH: fixed-base time + variable-base time)
between
Agreed, 3.5x is a little unfair: they assume a 1:1 ratio of fixed-base
to variable-base operations is typical, but then compare a 25519
implementation that doesn't have a fixed-base optimization against a
FourQ implementation that does.

Their broader claim is:

"it is [...] between two and three times faster than Curve25519."
http://research.microsoft.com/en-us/projects/fourqlib/

"it is between two and three times faster than curves that are
currently under consideration as NIST alternatives, such as
Curve25519."
http://eprint.iacr.org/2015/565.pdf

Comparing variable-base, and FourQ with endomorphisms, their numbers
are 2.5-2.75x faster than the CHES 2011 implementation, and 2.1-2.2x
faster than Tung Chou's on Sandy Bridge and Ivy Bridge.

Considering all this, it looks roughly like:
- FourQ is a little faster (~10%) than 25519 without endomorphisms
- endomorphisms give close to 2x speedup
- so overall a little over 2x for variable-base (but only a little
faster for fixed-base)?

Seem about right?


Trevor
Watson Ladd
2015-09-18 12:10:47 UTC
Permalink
Post by Trevor Perrin
http://eprint.iacr.org/2015/565
http://research.microsoft.com/en-us/projects/fourqlib/
I tossed the numbers into the spreadsheet at [1], but the paper has a
better performance analysis across several platforms.
What do people think?
Without using the endomorphisms the performance is better than 25519,
and then endomorphisms are close to a 2x speedup. And if unencumbered
use of the endomorphisms is just ~4 years away [2], that's not that
long, in the scheme of things.
The FourQ paper insists that rejecting invalid points is a viable
implementation strategy that provides compatibility with existing
software. Recently teams have independently rediscovered (or perhaps
just republicized) vulnerabilities in Bouncycastle version 1.50 that
stemmed from not validating points.

It may be true that their software properly handles all inputs, and
carefully documents what callers must do to get the claimed security.
But in practice we know that reimplementation frequently happens, and
that these reimplementations frequently contain issues around point
validation. When callers are asked to apply nontrivial amounts of
care, they often fail.

This is an issue for Kummer surfaces also, but there we do not know
how to attack invalid points.
Post by Trevor Perrin
Trevor
[1] https://docs.google.com/spreadsheets/d/1SO3NGX-EgIZ1slw9uExb5FoeFy5TVkuA2lEutP6roYI/edit#gid=0
[2] https://moderncrypto.org/mail-archive/curves/2014/000133.html
--
"Man is born free, but everywhere he is in chains".
--Rousseau.
Trevor Perrin
2015-09-18 23:42:13 UTC
Permalink
[...]
Post by Watson Ladd
Post by Trevor Perrin
What do people think?
[...]
Post by Watson Ladd
The FourQ paper insists that rejecting invalid points is a viable
implementation strategy that provides compatibility with existing
software. Recently teams have independently rediscovered (or perhaps
just republicized) vulnerabilities in Bouncycastle version 1.50 that
stemmed from not validating points.
It may be true that their software properly handles all inputs, and
carefully documents what callers must do to get the claimed security.
But in practice we know that reimplementation frequently happens, and
that these reimplementations frequently contain issues around point
validation.
The paper notes that single-coordinate ladders aren't efficient on
FourQ, so "twist-security" is irrelevant.

With non-ladder implementations you'll have to validate FourQ points
if you're not decompressing, but that's generally true - including for
25519 and 448!

FourQ's decompression is particularly efficient. So one could argue
FourQ implementations are likely to always use compressed points, and
thus are *less* likely to expose themselves to invalid-curve attacks
than curves where compression is more costly.
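
For reference, the on-curve test itself is cheap. A minimal Python
sketch, using edwards25519's parameters since they're well known (the
same test shape applies to any twisted Edwards curve, FourQ included,
with its own a and d; subgroup/cofactor checks are a separate matter):

  p = 2**255 - 19
  a = -1 % p
  d = -121665 * pow(121666, p - 2, p) % p   # edwards25519's d

  def on_curve(x, y):
      # a received or decompressed (x, y) must satisfy
      # a*x^2 + y^2 = 1 + d*x^2*y^2 (mod p)
      x2, y2 = x * x % p, y * y % p
      return (a * x2 + y2) % p == (1 + d * x2 * y2) % p

  # sanity check with the standard base point (y = 4/5 mod p)
  Bx = 15112221349535400772501151409588531511454012693041857206046113283949847762202
  By = 4 * pow(5, p - 2, p) % p
  assert on_curve(Bx, By) and not on_curve(Bx, (By + 1) % p)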


Trevor
D. J. Bernstein
2015-09-18 13:15:58 UTC
Permalink
Post by Trevor Perrin
- FourQ is a little faster (~10%) than 25519 without endomorphisms
Maybe, but for such small differences one has to look very carefully at
what exactly is being measured (e.g., is point validation included? what
exactly are the assumptions on the input and output?) and of course also
the quantitative security level (2^122.5 vs. 2^125.8---one expects this
to have a close-to-cubic effect).

What's clear from the literature is that about 10% of the time in a
typical conservative scalarmult is spent on inversion, and of course
this part becomes much faster for a quadratic extension field. On the
other hand, a prime field provides more flexibility in adapting to the
multipliers available on the platform. Surely Curve25519 should use 5
limbs with Intel's new 52-bit multiplication instructions, for example,
while GF((2^127-1)^2) will be stuck using 6 limbs.
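
The limb counts are simple back-of-envelope arithmetic (radix chosen to
feed 52-bit multipliers; this ignores the headroom details a real
implementation would tune):

  from math import ceil

  print(ceil(255 / 52))       # 5 limbs for the 2^255 - 19 field
  print(2 * ceil(127 / 52))   # 2 * 3 = 6 limbs for GF((2^127 - 1)^2)
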
Post by Trevor Perrin
- endomorphisms give close to 2x speedup
Yes, that's about right.
Post by Trevor Perrin
This is an issue for Kummer surfaces also, but there we do not know
how to attack invalid points.
What "hyperand" recommends is to simply send compressed elliptic-curve
points, rather than sending Kummer points. Unlike FourQ, hyperand
provides twist-secure curves, so it's safe to skip point validation. The
receiver then switches over to the fast Kummer formulas to compute the
DH shared secret.

---Dan
Trevor Perrin
2015-09-19 01:05:59 UTC
Permalink
Post by D. J. Bernstein
Post by Trevor Perrin
- FourQ is a little faster (~10%) than 25519 without endomorphisms
Maybe, but for such small differences one has to look very carefully at
what exactly is being measured (e.g., is point validation included? what
exactly are the assumptions on the input and output?) and of course also
the quantitative security level (2^122.5 vs. 2^125.8---one expects this
to have a close-to-cubic effect).
On the other hand, 25519 has received more optimization over the years,
and Tung Chou's implementation uses more assembly than FourQLib [1,2].

Considering this, MSR's overall estimate of 2-3x speedup (with
endomorphisms) for variable-base ops vs 25519 seems reasonable, even
though the speedup over Tung Chou's code on Sandy Bridge/Ivy Bridge is
only 2.1-2.2x.

I guess we'll get more information over time.

Trevor

[1] https://sites.google.com/a/crypto.tw/blueprint/
[2] http://research.microsoft.com/en-us/projects/fourqlib/
