add support for PCLMULQDQ instructions #396

bacelar · 2023-03-20T12:39:50Z

This PR adds support for the carry-less multiplication instructions (PCLMULQDQ and VPCLMULQDQ).

bgregoir · 2023-03-21T07:37:45Z

eclib/JModel.ec

+
+op PCLMULQDQ (v1 v2: W128.t) (k: int): W128.t =
+ let x1 = v1 \bits64 (k %% 2) in
+ let x2 = v2 \bits64 (k %% 16 %% 2) in


This line is suspicious, since k %% 16 %% 2 = k %% 2.
I think k %% 16 should be k %/ 16.
In that case k %% 2 will be bit 0 of k, and k %/ 16 %% 2 will be bit 4 of k.
Also, why is k an int? I was expecting a W8.t.
But since this k should be an immediate, it is maybe not important.

In fact, I think the extraction will fail, because in Jasmin the last argument is a w8 and in EC the last argument is a W8.t.

Yes, sorry about that. The index computation was intended to be (_ %/ _ %% _) -- but the whole EC portion was a last-minute addition, and definitely very unsatisfactory :-(. I'll properly revise and test it soon.
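The index remark above can be checked exhaustively over all 8-bit immediates. The following is a minimal Python sketch, not the EasyCrypt code: it assumes that `%%` and `%/` behave like Python's `%` and `//` on non-negative integers, and the helper names `buggy_index`, `intended_index`, and `bit` are hypothetical.

```python
def bit(k: int, i: int) -> int:
    """Return bit i of the non-negative integer k."""
    return (k >> i) & 1


def buggy_index(k: int) -> int:
    """Index as written in the first version of the patch: k %% 16 %% 2."""
    return k % 16 % 2


def intended_index(k: int) -> int:
    """Index as intended by the author: k %/ 16 %% 2."""
    return k // 16 % 2


# Check every possible 8-bit immediate.
for k in range(256):
    assert buggy_index(k) == k % 2          # reducing mod 16 first changes nothing
    assert k % 2 == bit(k, 0)               # selector for the first operand
    assert intended_index(k) == bit(k, 4)   # selector for the second operand
```

With the original expression, both operands would always be selected by bit 0 of the immediate, so an immediate such as 0x10 would pick the low quadword of the second operand instead of the high one.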

bgregoir · 2023-03-21T07:38:35Z

eclib/JModel.ec

+
+op VPCLMULQDQ_256 (v1 v2: W256.t) (k: int): W256.t =
+ let x1 = v1 \bits64 (k %% 2) in
+ let x2 = v2 \bits64 (k %% 16 %% 2) in


Same remark here

bgregoir · 2023-03-21T07:43:15Z

proofs/compiler/x86_instr_decl.v

+
+Definition Ox86_PCLMULQDQ_instr := 
+  mk_instr_pp "PCLMULQDQ" [:: sword U128; sword U128; sword U8] (w_ty U128)
+    [:: E 0; E 1; E 2] [:: E 0] MSB_CLEAR (@x86_VPCLMULQDQ U128)


MSB_CLEAR should be MSB_KEEP.

bgregoir · 2023-03-21T07:43:43Z

proofs/compiler/x86_instr_decl.v

+Definition Ox86_VPCLMULQDQ_instr := 
+ (fun sz =>
+   mk_instr (pp_sz "VPCLMULQDQ"%string sz) [:: sword sz; sword sz; sword U8] (w_ty sz)
+       [:: E 1; E 2; E 3] [:: E 0] MSB_CLEAR (@x86_VPCLMULQDQ sz)


MSB_CLEAR is ok here.
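The two flags discussed in these review comments can be illustrated on a 256-bit register modelled as a Python integer. This is only a sketch under an assumption: it reads MSB_KEEP as "the destination bits above the 128-bit result are preserved" (the behaviour of the legacy SSE encoding) and MSB_CLEAR as "those bits are zeroed" (the behaviour of the VEX encoding). The function names are hypothetical and are not part of the Jasmin sources.

```python
MASK128 = (1 << 128) - 1
MASK256 = (1 << 256) - 1


def write_xmm_msb_keep(old_ymm: int, result128: int) -> int:
    """Write a 128-bit result, preserving bits 255..128 of the destination."""
    upper = old_ymm & ~MASK128 & MASK256
    return upper | (result128 & MASK128)


def write_xmm_msb_clear(old_ymm: int, result128: int) -> int:
    """Write a 128-bit result, zeroing bits 255..128 of the destination."""
    return result128 & MASK128


old = (0xABCD << 128) | 0x1111
assert write_xmm_msb_keep(old, 0x2222) == (0xABCD << 128) | 0x2222
assert write_xmm_msb_clear(old, 0x2222) == 0x2222
```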

bgregoir

I have added some comments.
You should also add an entry to the changelog.

vbgl · 2023-03-21T12:31:53Z

Reading the manual, I understand that the 256-bit variant does two parallel multiplications. Is it true? I don’t have access to any hardware with this instruction.

bgregoir · 2023-03-21T12:42:50Z

I did not understand it that way; maybe I skipped a part of the manual. Where did you read that?

bacelar · 2023-03-21T13:42:11Z

My understanding was also that it zero-extends the result, but on a second look you are probably right. I'll double-check that.

(KL,VL) = (1,128), (2,256)
FOR i= 0 to KL-1:
    IF Imm8[0] = 0:
        TEMP1 := SRC1.xmm[i].qword[0]
    ELSE:
        TEMP1 := SRC1.xmm[i].qword[1]
    IF Imm8[4] = 0:
        TEMP2 := SRC2.xmm[i].qword[0]
    ELSE:
        TEMP2 := SRC2.xmm[i].qword[1]
    DEST.xmm[i] := PCLMUL128(TEMP1, TEMP2)
DEST[MAXVL-1:VL] := 0
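The pseudocode above can be transcribed into a small executable model. The sketch below is an illustration in Python, not the EasyCrypt or Coq semantics from the PR: registers are modelled as non-negative integers, and `clmul64` and `vpclmulqdq` are hypothetical names chosen for this example. The `lanes` parameter plays the role of KL (1 for xmm, 2 for ymm).

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def clmul64(a: int, b: int) -> int:
    """Carry-less product of two 64-bit values (polynomials over GF(2))."""
    a &= MASK64
    b &= MASK64
    acc = 0
    for i in range(64):
        if (b >> i) & 1:
            acc ^= a << i  # XOR instead of addition: no carries propagate
    return acc


def vpclmulqdq(src1: int, src2: int, imm8: int, lanes: int) -> int:
    """Per-lane carry-less multiply, following the manual's pseudocode."""
    sel1 = (imm8 >> 0) & 1  # Imm8[0] selects the quadword of SRC1
    sel2 = (imm8 >> 4) & 1  # Imm8[4] selects the quadword of SRC2
    dest = 0
    for i in range(lanes):
        lane1 = (src1 >> (128 * i)) & MASK128
        lane2 = (src2 >> (128 * i)) & MASK128
        temp1 = (lane1 >> (64 * sel1)) & MASK64
        temp2 = (lane2 >> (64 * sel2)) & MASK64
        dest |= clmul64(temp1, temp2) << (128 * i)
    return dest  # bits above 128 * lanes are zero by construction


# Two lanes: the high 128-bit lane gets its own product, not zeros.
src1 = 3 | (2 << 128)
src2 = 3 | (3 << 128)
assert vpclmulqdq(src1, src2, 0x00, lanes=2) == 5 | (6 << 128)
```

In this model the same immediate bits select the quadword in every lane, and each 128-bit lane of the destination receives an independent product, which matches the "parallel multiplications" reading rather than a zero-extended single result.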

bacelar · 2023-03-21T14:46:39Z

...but the description is, at best, misleading...

Instruction:		VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8
CPUID feature flag:	VPCLMULQDQ
Description:		Carry-less multiplication of one quadword of ymm2 by one quadword
			of ymm3/m256, stores the 128-bit result in ymm1. The immediate is
			used to determine which quadwords of ymm2 and ymm3/m256 should be
			used.

bgregoir

I assume that you are convinced that Vincent's comment is right (for ymm).

bgregoir · 2023-03-21T16:15:41Z

eclib/JModel.ec

- let x2 = v2 \bits64 (k %% 16 %% 2) in
- clmulq x1 x2.
+op PCLMULQDQ (v1 v2: W128.t) (k: W8.t): W128.t =
+ let x0 = v1 \bits64 (W8.to_uint k %% 2) in


Maybe k.[0] (i.e., bit 0) and k.[4] (bit 4) would be better now that we have W8.t.
But it is up to you.

yes, makes sense. I'll change it.

bacelar · 2023-03-21T23:16:37Z

I assume that you are convinced that Vincent's comment is right (for ymm).

Yes, I am convinced but wasn't able to test it -- in fact I presumed that the processor feature (VPCLMULQDQ) allowing the ymm variant of the instruction would be available in most processors supporting AVX2. But that does not seem to be the case -- if we exclude the high-end processors from Intel and AMD (that typically also support AVX512), I can only find the AMD Zen3 processor family.
So, I wonder if it makes sense to support it in Jasmin. I'm inclined to think it would make sense to inhibit it for now (that is, replace check_size_128_256 sz with assert (sz =? U128) ErrType), but keep the definition as it is to smooth a possible extension (if/when someone asks for it...). But I would like to hear your opinion.

bgregoir · 2023-03-22T05:32:43Z

I agree with you. I'll let you make the change, and then we can merge.

vbgl · 2023-03-22T05:39:14Z

If the sz argument has a single valid value, then it might not be needed.

bgregoir · 2023-03-22T05:42:35Z

OK, Vincent is right:
"This instruction computes in parallel 4 pclmul multiplications, i.e. carryless multiplications of binary polynomials of degree at most 63, stored in 4 quadwords in 512 bit registers, as seen above. Thus, four 128 bit results are stored in the zmm register ... "

bgregoir · 2023-03-22T06:59:38Z

I have no opinion on whether we should keep the 256-bit version.
If someone has a strong opinion against it, let them speak now or forever hold their peace.

bacelar · 2023-03-22T14:05:35Z

I have no opinion on whether we should keep the 256-bit version.
If someone has a strong opinion against it, let them speak now or forever hold their peace.

No strong opinion on that either. I realised that what I've suggested does not do what I intended -- so just keep it unchanged. Let's hope there's someone with a Zen3 processor who would make good use of it... :-)

* add support for PCLMULQDQ instructions
* fix the semantics of VPCLMULQDQ
* cosmetic change on the EC code

---------

Co-authored-by: José Bacelar Almeida <[email protected]>
(cherry picked from commit 3c783b6)

add support for PCLMULQDQ instructions

6995c3d

bacelar requested a review from bgregoir March 20, 2023 12:41

bgregoir reviewed Mar 21, 2023

fix the semantics of VPCLMULQDQ

d7c7b12

bgregoir reviewed Mar 21, 2023

cosmetic change on the EC code

cb91efa

bgregoir approved these changes Mar 22, 2023

bgregoir merged commit 3c783b6 into jasmin-lang:main Mar 22, 2023
