Discussion:
Granger-Moss primes for 32 bit implementations
Kyle Butt
2017-09-09 04:21:46 UTC
TLDR: How much do square roots matter for ECC primes?

I've been working on finding Granger Moss primes where t fits in a 32 bit
integer.
Along the way, I worked out tighter bounds for the min and max values after
reduction.

In the paper, they aren't explicit about how much extra room is needed to
handle additions for curve operations. For edwards curves, you need to
account for a factor of 6. This changes their formula
ceil(log_2(m / 2)) + 2k + 5 <= 2w
to:
ceil(log_2(3 * m) + 2k + 5 <= 2w

for m + 1 = 17 and 19, this requires k < 26.5

Similarly for their reduction algorithm, they work in bits, but I did the
math on the actual bounds.

Assuming z_i = 2^63 - 1, after 1 reduction the max value is 2^(63 - l) - 1
+ (t - c)
t - c = (b - 1) * c, the max value carried from z_(i+1), assuming c < b,
after 2 reductions the max value is:
2^(63 - 2*l) - 1 + c + (t - c). = 2^(63 - 2*l) + t - 1. The first c occurs
because t/b = c.
The minimum value is easier to calculate. In this case, the worst case
carry is 0.
so the min value is 2^(63 - 2 * l).

Upon multiplying these values are subtracted. If their difference is less
than 2^27.5, then we can adjust the formula above to:

ceil(log_2(3 * m) + 2k + 4 <= 2w

This does place a greater constraint on l, but it yields larger primes.

because we know that the result of the product takes 1 less bit.
This allows us to construct larger primes for m + 1 = 17, 19.
For m+1 = 19 we can construct a prime with 483 bits, specifically:
p = phi 18 t, where t = (2^3 - 1) * (2^24); b = 2^24; l = 24

Unfortunately, these primes are not fast square root primes, by
construction. It should be clear that p % (2^24) = 1. How much does this
matter? S is 24 in this case, so Tonelli Shanks should be reasonably fast.
Equality checking is also a little tricky due to the redundancy, but a
subtraction single round of modular carries can be done in parallel. Then
you can verify that all the limbs are the same value.

I found an edwards curve that satisfies the safecurves criteria, the
constant is unusual but rigid:
It is the least d in absolute value such that x^2 + y^2 = 1 + (d/b)x^2*y^2
has cofactor {4, 8} and so does its twist. The reason for choosing such a d
is that coefficient multiplication requires reduction, which means that
even for small constants, they must be in the montgomery domain, taking up
at least 2 limbs. But for a constant d/b it can be a single limb.

the least such d is: -56904
for that curve the trace
is: 1805876552616329139365429358604192390835234313183868353172046570215054410

I prototyped the code in Haskell, for quick turnaround and so that I have
something where I can print the intermediate values as test vectors for
other implementations. I can tidy it up and share it if there's interest.
It's not written for speed but for verification.

I plan to try and implement a cuda or opencl version, to try and take
advantage of the parallel nature of the primes.

Thoughts or suggestions? Would it be worthwhile to write any of this up in
a more formal setting?

I think at a minimum borrowing the half bit by computing tighter bounds is
pretty cool.

Kyle
Mike Hamburg
2017-09-09 04:55:30 UTC
I might possibly have written an ECC implementation that documents: âThe following operations are not implemented for NIST P224 because I didnât want to compute square roots: âŠâ

At least with only 2^24+1, you donât have to implement Sutherlandâs algorithm. But the cost of Tonelli-Shanks is nontrivial, since it can still take hundreds of multiplications.

Anyway, Iâll be interested to know what performance you get.
Post by Kyle Butt
TLDR: How much do square roots matter for ECC primes?
I've been working on finding Granger Moss primes where t fits in a 32 bit integer.
Along the way, I worked out tighter bounds for the min and max values after reduction.
In the paper, they aren't explicit about how much extra room is needed to handle additions for curve operations. For edwards curves, you need to account for a factor of 6. This changes their formula
ceil(log_2(m / 2)) + 2k + 5 <= 2w
ceil(log_2(3 * m) + 2k + 5 <= 2w
for m + 1 = 17 and 19, this requires k < 26.5
Similarly for their reduction algorithm, they work in bits, but I did the math on the actual bounds.
Assuming z_i = 2^63 - 1, after 1 reduction the max value is 2^(63 - l) - 1 + (t - c)
2^(63 - 2*l) - 1 + c + (t - c). = 2^(63 - 2*l) + t - 1. The first c occurs because t/b = c.
The minimum value is easier to calculate. In this case, the worst case carry is 0.
so the min value is 2^(63 - 2 * l).
ceil(log_2(3 * m) + 2k + 4 <= 2w
This does place a greater constraint on l, but it yields larger primes.
because we know that the result of the product takes 1 less bit.
This allows us to construct larger primes for m + 1 = 17, 19.
p = phi 18 t, where t = (2^3 - 1) * (2^24); b = 2^24; l = 24
Unfortunately, these primes are not fast square root primes, by construction. It should be clear that p % (2^24) = 1. How much does this matter? S is 24 in this case, so Tonelli Shanks should be reasonably fast. Equality checking is also a little tricky due to the redundancy, but a subtraction single round of modular carries can be done in parallel. Then you can verify that all the limbs are the same value.
It is the least d in absolute value such that x^2 + y^2 = 1 + (d/b)x^2*y^2 has cofactor {4, 8} and so does its twist. The reason for choosing such a d is that coefficient multiplication requires reduction, which means that even for small constants, they must be in the montgomery domain, taking up at least 2 limbs. But for a constant d/b it can be a single limb.
the least such d is: -56904
for that curve the trace is: 1805876552616329139365429358604192390835234313183868353172046570215054410
I prototyped the code in Haskell, for quick turnaround and so that I have something where I can print the intermediate values as test vectors for other implementations. I can tidy it up and share it if there's interest. It's not written for speed but for verification.
I plan to try and implement a cuda or opencl version, to try and take advantage of the parallel nature of the primes.
Thoughts or suggestions? Would it be worthwhile to write any of this up in a more formal setting?
I think at a minimum borrowing the half bit by computing tighter bounds is pretty cool.
Kyle
_______________________________________________
Curves mailing list
https://moderncrypto.org/mailman/listinfo/curves
Kyle Butt
2017-09-11 17:37:21 UTC
Post by Mike Hamburg
I might possibly have written an ECC implementation that documents: âThe
following operations are not implemented for NIST P224 because I didnât
want to compute square roots: âŠâ
At least with only 2^24+1, you donât have to implement Sutherlandâs
algorithm. But the cost of Tonelli-Shanks is nontrivial, since it can
still take hundreds of multiplications.
I implemented it and added counting. For the first 10,000 residues the
average # of multiplies for the iterative portion is 183. 22 of those would
have to be done anyway if p were 3 mod 4.
So 161 extra field multiplications for square roots. Taking the square root
of u/v requires an additional 24 squares and 9 multiplies.
I at least figured out how to do equality checks in band without having to
reduce the value completely. My insight was to set the first component to t
+ t/2, and then do a parallel carry. If the value is 0, all the carries
will be done, and the you check that all components are the same.

I'll try to get some speed #'s but it's at least nice to know that
Tonelli-Shanks isn't too difficult to implement (at least in Haskell), and
that the # of multiplies isn't insurmountable.

For example, my phi 17 prime has p-1 = 2^22*q, 431 bits, and 136 mulitplies
per field multiply or square. Comparing 24 vs 22, you should expect 133
extra field multiplies per square root.
The Ed41417 prime paper lists 144 multiplies per square. Saving 8
mulitplies per field multiplication, with ~8 field multiplies per point
addition is 64 multiplies per point, saving a field multiply just over
It's easy to see that over a scalar multiply, you could theoretically make
it up.

I've never done CUDA/OpenCL, but these should all vectorize nicely. It will
be a worthwhile reason to learn vector programming.
Post by Mike Hamburg
Anyway, Iâll be interested to know what performance you get.
Post by Kyle Butt
TLDR: How much do square roots matter for ECC primes?
I've been working on finding Granger Moss primes where t fits in a 32
bit integer.
Post by Kyle Butt
Along the way, I worked out tighter bounds for the min and max values
after reduction.
Post by Kyle Butt
In the paper, they aren't explicit about how much extra room is needed
to handle additions for curve operations. For edwards curves, you need to
account for a factor of 6. This changes their formula
Post by Kyle Butt
ceil(log_2(m / 2)) + 2k + 5 <= 2w
ceil(log_2(3 * m) + 2k + 5 <= 2w
for m + 1 = 17 and 19, this requires k < 26.5
Similarly for their reduction algorithm, they work in bits, but I did
the math on the actual bounds.
Post by Kyle Butt
Assuming z_i = 2^63 - 1, after 1 reduction the max value is 2^(63 - l) -
1 + (t - c)
Post by Kyle Butt
t - c = (b - 1) * c, the max value carried from z_(i+1), assuming c < b,
2^(63 - 2*l) - 1 + c + (t - c). = 2^(63 - 2*l) + t - 1. The first c
occurs because t/b = c.
Post by Kyle Butt
The minimum value is easier to calculate. In this case, the worst case
carry is 0.
Post by Kyle Butt
so the min value is 2^(63 - 2 * l).
Upon multiplying these values are subtracted. If their difference is
ceil(log_2(3 * m) + 2k + 4 <= 2w
This does place a greater constraint on l, but it yields larger primes.
because we know that the result of the product takes 1 less bit.
This allows us to construct larger primes for m + 1 = 17, 19.
p = phi 18 t, where t = (2^3 - 1) * (2^24); b = 2^24; l = 24
Unfortunately, these primes are not fast square root primes, by
construction. It should be clear that p % (2^24) = 1. How much does this
matter? S is 24 in this case, so Tonelli Shanks should be reasonably fast.
Equality checking is also a little tricky due to the redundancy, but a
subtraction single round of modular carries can be done in parallel. Then
you can verify that all the limbs are the same value.
Post by Kyle Butt
I found an edwards curve that satisfies the safecurves criteria, the
It is the least d in absolute value such that x^2 + y^2 = 1 +
(d/b)x^2*y^2 has cofactor {4, 8} and so does its twist. The reason for
choosing such a d is that coefficient multiplication requires reduction,
which means that even for small constants, they must be in the montgomery
domain, taking up at least 2 limbs. But for a constant d/b it can be a
single limb.
Post by Kyle Butt
the least such d is: -56904
for that curve the trace is: 180587655261632913936542935860
4192390835234313183868353172046570215054410
Post by Kyle Butt
I prototyped the code in Haskell, for quick turnaround and so that I
have something where I can print the intermediate values as test vectors
for other implementations. I can tidy it up and share it if there's
interest. It's not written for speed but for verification.
Post by Kyle Butt
I plan to try and implement a cuda or opencl version, to try and take
advantage of the parallel nature of the primes.
Post by Kyle Butt
Thoughts or suggestions? Would it be worthwhile to write any of this up
in a more formal setting?
Post by Kyle Butt
I think at a minimum borrowing the half bit by computing tighter bounds
is pretty cool.
Post by Kyle Butt
Kyle
_______________________________________________
Curves mailing list
https://moderncrypto.org/mailman/listinfo/curves
Kyle Butt
2017-09-21 01:19:35 UTC
Post by Kyle Butt
Post by Mike Hamburg
I might possibly have written an ECC implementation that documents: âThe
following operations are not implemented for NIST P224 because I didnât
want to compute square roots: âŠâ
At least with only 2^24+1, you donât have to implement Sutherlandâs
algorithm. But the cost of Tonelli-Shanks is nontrivial, since it can
still take hundreds of multiplications.
I implemented it and added counting. For the first 10,000 residues the
average # of multiplies for the iterative portion is 183. 22 of those would
have to be done anyway if p were 3 mod 4.
So 161 extra field multiplications for square roots. Taking the square
root of u/v requires an additional 24 squares and 9 multiplies.
I at least figured out how to do equality checks in band without having to
reduce the value completely. My insight was to set the first component to t
+ t/2, and then do a parallel carry. If the value is 0, all the carries
will be done, and the you check that all components are the same.
I'll try to get some speed #'s but it's at least nice to know that
Tonelli-Shanks isn't too difficult to implement (at least in Haskell), and
that the # of multiplies isn't insurmountable.
For example, my phi 17 prime has p-1 = 2^22*q, 431 bits, and 136
mulitplies per field multiply or square. Comparing 24 vs 22, you should
expect 133 extra field multiplies per square root.
The Ed41417 prime paper lists 144 multiplies per square. Saving 8
mulitplies per field multiplication, with ~8 field multiplies per point
addition is 64 multiplies per point, saving a field multiply just over
It's easy to see that over a scalar multiply, you could theoretically make
it up.
I've never done CUDA/OpenCL, but these should all vectorize nicely. It
will be a worthwhile reason to learn vector programming.
So the half-cyclic convolution is a pain to vectorize, and the compiler
certainly won't do it for you.

I implemented multiply and square for avx2. On my workstation at work I get
163 cycles/multiply and 121/square. Which isn't competitive with the state
of the art, but it's close.

If you wanted differential power resistance, they may be competitive,
because they get it for (almost) free. (Instead of setting the accumulator
to 0, you broadcast a random value 0 t-1)

Someone with more vectorizing experience may be able to push them further,
but that's a lot of work.

If anyone is interested in the code, I can clean it up and post it.
Post by Kyle Butt
Post by Mike Hamburg
Anyway, Iâll be interested to know what performance you get.
Post by Kyle Butt
TLDR: How much do square roots matter for ECC primes?
I've been working on finding Granger Moss primes where t fits in a 32
bit integer.
Post by Kyle Butt
Along the way, I worked out tighter bounds for the min and max values
after reduction.
Post by Kyle Butt
In the paper, they aren't explicit about how much extra room is needed
to handle additions for curve operations. For edwards curves, you need to
account for a factor of 6. This changes their formula
Post by Kyle Butt
ceil(log_2(m / 2)) + 2k + 5 <= 2w
ceil(log_2(3 * m) + 2k + 5 <= 2w
for m + 1 = 17 and 19, this requires k < 26.5
Similarly for their reduction algorithm, they work in bits, but I did
the math on the actual bounds.
Post by Kyle Butt
Assuming z_i = 2^63 - 1, after 1 reduction the max value is 2^(63 - l)
- 1 + (t - c)
Post by Kyle Butt
t - c = (b - 1) * c, the max value carried from z_(i+1), assuming c <
2^(63 - 2*l) - 1 + c + (t - c). = 2^(63 - 2*l) + t - 1. The first c
occurs because t/b = c.
Post by Kyle Butt
The minimum value is easier to calculate. In this case, the worst case
carry is 0.
Post by Kyle Butt
so the min value is 2^(63 - 2 * l).
Upon multiplying these values are subtracted. If their difference is
ceil(log_2(3 * m) + 2k + 4 <= 2w
This does place a greater constraint on l, but it yields larger primes.
because we know that the result of the product takes 1 less bit.
This allows us to construct larger primes for m + 1 = 17, 19.
p = phi 18 t, where t = (2^3 - 1) * (2^24); b = 2^24; l = 24
Unfortunately, these primes are not fast square root primes, by
construction. It should be clear that p % (2^24) = 1. How much does this
matter? S is 24 in this case, so Tonelli Shanks should be reasonably fast.
Equality checking is also a little tricky due to the redundancy, but a
subtraction single round of modular carries can be done in parallel. Then
you can verify that all the limbs are the same value.
Post by Kyle Butt
I found an edwards curve that satisfies the safecurves criteria, the
It is the least d in absolute value such that x^2 + y^2 = 1 +
(d/b)x^2*y^2 has cofactor {4, 8} and so does its twist. The reason for
choosing such a d is that coefficient multiplication requires reduction,
which means that even for small constants, they must be in the montgomery
domain, taking up at least 2 limbs. But for a constant d/b it can be a
single limb.
Post by Kyle Butt
the least such d is: -56904
for that curve the trace is: 180587655261632913936542935860
4192390835234313183868353172046570215054410
Post by Kyle Butt
I prototyped the code in Haskell, for quick turnaround and so that I
have something where I can print the intermediate values as test vectors
for other implementations. I can tidy it up and share it if there's
interest. It's not written for speed but for verification.
Post by Kyle Butt
I plan to try and implement a cuda or opencl version, to try and take
advantage of the parallel nature of the primes.
Post by Kyle Butt
Thoughts or suggestions? Would it be worthwhile to write any of this up
in a more formal setting?
Post by Kyle Butt
I think at a minimum borrowing the half bit by computing tighter bounds
is pretty cool.
Post by Kyle Butt
Kyle
_______________________________________________
Curves mailing list
https://moderncrypto.org/mailman/listinfo/curves
Kyle Butt
2017-09-27 07:36:48 UTC
Post by Kyle Butt
Post by Kyle Butt
Post by Mike Hamburg
I might possibly have written an ECC implementation that documents: âThe
following operations are not implemented for NIST P224 because I didnât
want to compute square roots: âŠâ
At least with only 2^24+1, you donât have to implement Sutherlandâs
algorithm. But the cost of Tonelli-Shanks is nontrivial, since it can
still take hundreds of multiplications.
I implemented it and added counting. For the first 10,000 residues the
average # of multiplies for the iterative portion is 183. 22 of those would
have to be done anyway if p were 3 mod 4.
So 161 extra field multiplications for square roots. Taking the square
root of u/v requires an additional 24 squares and 9 multiplies.
I at least figured out how to do equality checks in band without having
to reduce the value completely. My insight was to set the first component
to t + t/2, and then do a parallel carry. If the value is 0, all the
carries will be done, and the you check that all components are the same.
I'll try to get some speed #'s but it's at least nice to know that
Tonelli-Shanks isn't too difficult to implement (at least in Haskell), and
that the # of multiplies isn't insurmountable.
For example, my phi 17 prime has p-1 = 2^22*q, 431 bits, and 136
mulitplies per field multiply or square. Comparing 24 vs 22, you should
expect 133 extra field multiplies per square root.
The Ed41417 prime paper lists 144 multiplies per square. Saving 8
mulitplies per field multiplication, with ~8 field multiplies per point
addition is 64 multiplies per point, saving a field multiply just over
It's easy to see that over a scalar multiply, you could theoretically
make it up.
I've never done CUDA/OpenCL, but these should all vectorize nicely. It
will be a worthwhile reason to learn vector programming.
So the half-cyclic convolution is a pain to vectorize, and the compiler
certainly won't do it for you.
I implemented multiply and square for avx2. On my workstation at work I
get 163 cycles/multiply and 121/square. Which isn't competitive with the
state of the art, but it's close.
If you wanted differential power resistance, they may be competitive,
because they get it for (almost) free. (Instead of setting the accumulator
to 0, you broadcast a random value 0 t-1)
Someone with more vectorizing experience may be able to push them further,
but that's a lot of work.
If anyone is interested in the code, I can clean it up and post it.
I figured I'd write a last follow-up on some of the work I'd done before
leaving it.

I realized that you could use division to reduce the coefficients by t
instead of montgomery style reduction. For an arbitrary t, that would be
expensive, but for t = (2^26 - 2) It's relatively cheap.
shift right by 26, that becomes a carry to the next coefficient, and add
back 2 times that as an error term. You have to do this twice because the
terms are bigger than t^2. You can find several t's
that are close enough to a power of 2 to use this method of reduction.
(2^26 - 2) is really nice in that 2 has hamming weight 1.

p = phi 19 t is 3 mod 4, for cheap square roots.

If you reduce this way, the reduction is exact, which is nice for 2 reasons:
Equality checking: Subtract, reduce once. All terms will be the same if the
values were equal.
Bits: Because the reduction is exact, you can use a slightly larger t
value. t can be slightly larger than 27 bits,
For instance t = 2^27 + 9, p = phi 19 t is 486 bits. 9 requires an extra
addition, but not an extra shift, because you have to compute the carry
anyway.

I did something else as well. I realized you could view the Granger Moss
construction as a Karatsuba level, and tried t = (2^112 - 8), and p = phi 5
t
I implemented a reduced carry version. You get the first quotient for free
because shifting right by 112 is just the upper results of the product.

I was able to get multiply reduce around 190 cycles implemented that way.
Reduction was 26 of those cycles, and auto-vectorized nicely.
I just used intel intrinsics for mulx and addc, no vector intrinsics.

I may try the 32 bit version on skylake to see how much more I can get out
of avx-512. The majority of the multiplies can fill a 512 bit vector,
lane masking helps with the ones that can't. It also has embedded
broadcast, which should help out a lot. The way I got performance from the
cyclic convolution was to tilt it so that half of each operand was a
broadcast. That along with the increased registers should lower the # of
spills.

The carry propagation runs in 3 passes instead of 5, and avx-512 has 64 bit
arithmetic right shift,
and 64-32 bit packing, used to finish the reduction.

I may also try Neon. The same 4-way vectorization that I did for avx2
splits into 2-way vectorization, with long multiply accumulate.
Post by Kyle Butt
Post by Kyle Butt
Post by Mike Hamburg
Anyway, Iâll be interested to know what performance you get.
Post by Kyle Butt
TLDR: How much do square roots matter for ECC primes?
I've been working on finding Granger Moss primes where t fits in a 32
bit integer.
Post by Kyle Butt
Along the way, I worked out tighter bounds for the min and max values
after reduction.
Post by Kyle Butt
In the paper, they aren't explicit about how much extra room is needed
to handle additions for curve operations. For edwards curves, you need to
account for a factor of 6. This changes their formula
Post by Kyle Butt
ceil(log_2(m / 2)) + 2k + 5 <= 2w
ceil(log_2(3 * m) + 2k + 5 <= 2w
for m + 1 = 17 and 19, this requires k < 26.5
Similarly for their reduction algorithm, they work in bits, but I did
the math on the actual bounds.
Post by Kyle Butt
Assuming z_i = 2^63 - 1, after 1 reduction the max value is 2^(63 - l)
- 1 + (t - c)
Post by Kyle Butt
t - c = (b - 1) * c, the max value carried from z_(i+1), assuming c <
2^(63 - 2*l) - 1 + c + (t - c). = 2^(63 - 2*l) + t - 1. The first c
occurs because t/b = c.
Post by Kyle Butt
The minimum value is easier to calculate. In this case, the worst case
carry is 0.
Post by Kyle Butt
so the min value is 2^(63 - 2 * l).
Upon multiplying these values are subtracted. If their difference is
ceil(log_2(3 * m) + 2k + 4 <= 2w
This does place a greater constraint on l, but it yields larger primes.
because we know that the result of the product takes 1 less bit.
This allows us to construct larger primes for m + 1 = 17, 19.
p = phi 18 t, where t = (2^3 - 1) * (2^24); b = 2^24; l = 24
Unfortunately, these primes are not fast square root primes, by
construction. It should be clear that p % (2^24) = 1. How much does this
matter? S is 24 in this case, so Tonelli Shanks should be reasonably fast.
Equality checking is also a little tricky due to the redundancy, but a
subtraction single round of modular carries can be done in parallel. Then
you can verify that all the limbs are the same value.
Post by Kyle Butt
I found an edwards curve that satisfies the safecurves criteria, the
It is the least d in absolute value such that x^2 + y^2 = 1 +
(d/b)x^2*y^2 has cofactor {4, 8} and so does its twist. The reason for
choosing such a d is that coefficient multiplication requires reduction,
which means that even for small constants, they must be in the montgomery
domain, taking up at least 2 limbs. But for a constant d/b it can be a
single limb.
Post by Kyle Butt
the least such d is: -56904
for that curve the trace is: 180587655261632913936542935860
4192390835234313183868353172046570215054410
Post by Kyle Butt
I prototyped the code in Haskell, for quick turnaround and so that I
have something where I can print the intermediate values as test vectors
for other implementations. I can tidy it up and share it if there's
interest. It's not written for speed but for verification.
Post by Kyle Butt
I plan to try and implement a cuda or opencl version, to try and take
advantage of the parallel nature of the primes.
Post by Kyle Butt
Thoughts or suggestions? Would it be worthwhile to write any of this
up in a more formal setting?
Post by Kyle Butt
I think at a minimum borrowing the half bit by computing tighter
bounds is pretty cool.
Post by Kyle Butt
Kyle
_______________________________________________
Curves mailing list
https://moderncrypto.org/mailman/listinfo/curves
Kyle Butt
2017-09-27 19:17:51 UTC
Hi Kyle,
Thanks for cc'ing me into this conversation. I'm glad you have been
looking at the potential for vectorisation, since this is an aspect that
has not been studied in detail yet for these primes, as far as I know.
I should explain what I did in a little detail then.
I flipped the order of summation for the odd terms. Instead of z_1 = (x_9 -
x_11)(y_11 - y_9) ... (x_1 - x_0)(y_0 - y_1)
This gives you a 2x vectorization of scalar vs vector z_0, z_1 = (x_1 -
(x_18,x_0))((y_18,y_0) - y_1) . . .
To get wider vectorization, you can work across the square diagonally.
16, 17 = (7,8)*9,
16,17,18 += (6,7,8)*10
16,17,18 += (5,6,7)*11
Because shuffling/permutation is relatively expensive, you can re-use the
12,13 = (5,6)*7
16,17,18 += (4,5,6)*12
12,13,14,15 += (4,5,6,7)*8
I realize now that I write this, that I could have flipped the even terms
instead, and then worked from the top down. It would probably be easier to
follow.

To get the shuffled vectors, you can either do an overlapping load/insert
and then shuffle. You can use those vectors 3 times before needing to
Intel multiplies the even slots in a widening multiply, so you can put the
values you'll need later in the odd slots

for instance to get the 3 vectors (6,7,8,9), (5,6,7,8), (4,5,6,7), you
construct the 8x32 vector
6 4 7 5 | 8 6 9 7.
The overlap is because moving values across lanes has higher latency.

You can also do a simple load and permute:
6 2 7 3 | 8 4 9 5

For multiplication, because there are 2 permutes to hide some of the
latency (3 cycles) and because
of the increased register pressure to perform the overlapped load vs a
straight load, the permute is faster.

To facilitate the overlapping loads, you allocate extra room for the
coefficients and then copy the low coefficients into the high area before
starting.
I just thought I should point out that the advantages you note for t near
a power of 2, such as t = 2^26 - 2, were already detailed in the original
paper of Chung and Hasan, "More Generalized Mersenne Numbers", Section 5.1
(reference  in my paper with Moss). It is only when t is not of such a
nice form that one should use the Montgomery representation. However, the
multiplication we introduced should be used in this case.
that point.
C doesn't have to be that small, a small safety margin (I'd have to do the
math, but a few bits) less than sqrt(t) should suffice, because you have to
reduce twice anyway, and the first reduction is overkill.
For multiprecision, you want c to be smaller so that the 2nd reduction is
as few words as possible.
If you use signed coefficients, c can be positive as well, which isn't in
the Chung Hasan paper.
All the best,
- Rob Granger
Thanks.

It means a lot to know I'm not wildly off the mark.
Kyle