Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for PCLMULQDQ instructions #396

Merged
merged 3 commits into from
Mar 22, 2023

Conversation

bacelar
Copy link
Collaborator

@bacelar bacelar commented Mar 20, 2023

This PR adds support for the carry-less multiplication instructions (PCLMULQDQ and VPCLMULQDQ).

@bacelar bacelar requested a review from bgregoir March 20, 2023 12:41
eclib/JModel.ec Outdated

op PCLMULQDQ (v1 v2: W128.t) (k: int): W128.t =
let x1 = v1 \bits64 (k %% 2) in
let x2 = v2 \bits64 (k %% 16 %% 2) in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is suspicious, since k %% 16 %% 2 = k %% 2
I think k%%16 should be k%/ 16.
I that case (k%%2) will be bit 0 of k and k %/ 16 %2 bit 4 of k.
Also why k is an int I was expected a W8.t.
But this k should be an immediate it is maybe not important.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I think the extraction will fail. Because
In jasmin the last argument is a w8 and in EC the last argument is a W8.t.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sorry about that. The index computation was intended to be (_ %/ _ %% _) -- but the whole EC portion was a last minute addition, and definitely very unsatisfactory :-(. I'll properly revise and test it soon.

eclib/JModel.ec Outdated

op VPCLMULQDQ_256 (v1 v2: W256.t) (k: int): W256.t =
let x1 = v1 \bits64 (k %% 2) in
let x2 = v2 \bits64 (k %% 16 %% 2) in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark here


Definition Ox86_PCLMULQDQ_instr :=
mk_instr_pp "PCLMULQDQ" [:: sword U128; sword U128; sword U8] (w_ty U128)
[:: E 0; E 1; E 2] [:: E 0] MSB_CLEAR (@x86_VPCLMULQDQ U128)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MSB_CLEAR should be MSB_KEEP.

Definition Ox86_VPCLMULQDQ_instr :=
(fun sz =>
mk_instr (pp_sz "VPCLMULQDQ"%string sz) [:: sword sz; sword sz; sword U8] (w_ty sz)
[:: E 1; E 2; E 3] [:: E 0] MSB_CLEAR (@x86_VPCLMULQDQ sz)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MSB_CLEAR is ok here.

Copy link
Contributor

@bgregoir bgregoir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added some comment.
You should also add something in the change log

@vbgl
Copy link
Member

vbgl commented Mar 21, 2023

Reading the manual, I understand that the 256-bit variant does two parallel multiplications. Is it true? I don’t have access to any hardware with this instruction.

@bgregoir
Copy link
Contributor

I did not understand that, maybe I have skip a part of the manual. Where did you understand that?

@bacelar
Copy link
Collaborator Author

bacelar commented Mar 21, 2023

my understanding was also the it zero-extends the result, but on a second look you are probably right. I'll double check on that

(KL,VL) = (1,128), (2,256)
FOR i= 0 to KL-1:
    IF Imm8[0] = 0:
        TEMP1 := SRC1.xmm[i].qword[0]
    ELSE:
        TEMP1 := SRC1.xmm[i].qword[1]
    IF Imm8[4] = 0:
        TEMP2 := SRC2.xmm[i].qword[0]
    ELSE:
        TEMP2 := SRC2.xmm[i].qword[1]
    DEST.xmm[i] := PCLMUL128(TEMP1, TEMP2)
DEST[MAXVL-1:VL] := 0

@bacelar
Copy link
Collaborator Author

bacelar commented Mar 21, 2023

...but the description is, at best, misleading...

Instruction:		VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8
CPUID feature flag:	VPCLMULQDQ
Description:		Carry-less multiplication of one quadword of ymm2 by one quadword
			of ymm3/m256, stores the 128-bit result in ymm1. The immediate is
			used to determine which quadwords of ymm2 and ymm3/m256 should be
			used.     

Copy link
Contributor

@bgregoir bgregoir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that your are convince that Vincent's comment is write (for ymm)

eclib/JModel.ec Outdated
let x2 = v2 \bits64 (k %% 16 %% 2) in
clmulq x1 x2.
op PCLMULQDQ (v1 v2: W128.t) (k: W8.t): W128.t =
let x0 = v1 \bits64 (W8.to_uint k %% 2) in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe k.[0] (i.e bit 0) and k.[4] (bit 4) will be better now that we have W8.t.
But it is upto you

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, makes sense. I'll change it.

@bacelar
Copy link
Collaborator Author

bacelar commented Mar 21, 2023

I assume that your are convince that Vincent's comment is write (for ymm)

yes, I am convinced but wasn't able to test it -- in fact I presumed that the processor feature (VPCLMULQDQ) allowing the ymm variant of the instruction would be available in most processors supporting AVX2. But that does not seem to be the case -- if we exclude the high-end processors from intel and amd (that typically also support AVX512), I can only find the AMD Zen3 processor family.
So, I wonder if it makes sense to support it in Jasmin. I'm inclined to think it would make sense to inhibit it for now (that is, replace check_size_128_256 sz with assert (sz =? U128) ErrType), but keep the definition as it is to smooth a possible extension (if/when someone asks for it...). But would like to ear your opinion.

@bgregoir
Copy link
Contributor

Agree with you, I let you make the change, and thus we can merge.

@vbgl
Copy link
Member

vbgl commented Mar 22, 2023

If the sz argument has a single valid value, then it might not be needed.

@bgregoir
Copy link
Contributor

Ok Vincent is right:
"This instruction computes in parallel 4 pclmul multiplications, i.e. carryless multiplications of binary polynomials of degree at most 63, stored in 4 quadwords in
512 bit registers, as seen above. Thus, four 128 bit results are stored in the zmm register ... "

@bgregoir
Copy link
Contributor

I have not opinion on: should we keep the 256 version.
If someone has a strong opinion against it let them speak now or forever hold their peace.

@bacelar
Copy link
Collaborator Author

bacelar commented Mar 22, 2023

I have not opinion on: should we keep the 256 version.
If someone has a strong opinion against it let them speak now or forever hold their peace.

no strong opinion on that either. I realised that what I've suggested does not do what I intended -- so just keep it unchanged. Lets hope there's someone with a Zen3 processor that would make a good use of it... :-)

@bgregoir bgregoir merged commit 3c783b6 into jasmin-lang:main Mar 22, 2023
vbgl pushed a commit that referenced this pull request May 30, 2023
* add support for PCLMULQDQ instructions

* fix the semantics of VPCLMULQDQ

* cosmetic change on the EC code

---------

Co-authored-by: José Bacelar Almeida <[email protected]>
(cherry picked from commit 3c783b6)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants