-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for PCLMULQDQ instructions #396
Conversation
eclib/JModel.ec
Outdated
|
||
op PCLMULQDQ (v1 v2: W128.t) (k: int): W128.t = | ||
let x1 = v1 \bits64 (k %% 2) in | ||
let x2 = v2 \bits64 (k %% 16 %% 2) in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is suspicious, since k %% 16 %% 2 = k %% 2
I think k%%16 should be k%/ 16.
I that case (k%%2) will be bit 0 of k and k %/ 16 %2 bit 4 of k.
Also why k is an int I was expected a W8.t.
But this k should be an immediate it is maybe not important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, I think the extraction will fail. Because
In jasmin the last argument is a w8 and in EC the last argument is a W8.t.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sorry about that. The index computation was intended to be (_ %/ _ %% _) -- but the whole EC portion was a last minute addition, and definitely very unsatisfactory :-(. I'll properly revise and test it soon.
eclib/JModel.ec
Outdated
|
||
op VPCLMULQDQ_256 (v1 v2: W256.t) (k: int): W256.t = | ||
let x1 = v1 \bits64 (k %% 2) in | ||
let x2 = v2 \bits64 (k %% 16 %% 2) in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same remark here
|
||
Definition Ox86_PCLMULQDQ_instr := | ||
mk_instr_pp "PCLMULQDQ" [:: sword U128; sword U128; sword U8] (w_ty U128) | ||
[:: E 0; E 1; E 2] [:: E 0] MSB_CLEAR (@x86_VPCLMULQDQ U128) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MSB_CLEAR should be MSB_KEEP.
Definition Ox86_VPCLMULQDQ_instr := | ||
(fun sz => | ||
mk_instr (pp_sz "VPCLMULQDQ"%string sz) [:: sword sz; sword sz; sword U8] (w_ty sz) | ||
[:: E 1; E 2; E 3] [:: E 0] MSB_CLEAR (@x86_VPCLMULQDQ sz) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MSB_CLEAR is ok here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added some comment.
You should also add something in the change log
Reading the manual, I understand that the 256-bit variant does two parallel multiplications. Is it true? I don’t have access to any hardware with this instruction. |
I did not understand that, maybe I have skip a part of the manual. Where did you understand that? |
my understanding was also the it zero-extends the result, but on a second look you are probably right. I'll double check on that
|
...but the description is, at best, misleading...
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that your are convince that Vincent's comment is write (for ymm)
eclib/JModel.ec
Outdated
let x2 = v2 \bits64 (k %% 16 %% 2) in | ||
clmulq x1 x2. | ||
op PCLMULQDQ (v1 v2: W128.t) (k: W8.t): W128.t = | ||
let x0 = v1 \bits64 (W8.to_uint k %% 2) in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe k.[0] (i.e bit 0) and k.[4] (bit 4) will be better now that we have W8.t.
But it is upto you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, makes sense. I'll change it.
yes, I am convinced but wasn't able to test it -- in fact I presumed that the processor feature (VPCLMULQDQ) allowing the ymm variant of the instruction would be available in most processors supporting AVX2. But that does not seem to be the case -- if we exclude the high-end processors from intel and amd (that typically also support AVX512), I can only find the AMD Zen3 processor family. |
Agree with you, I let you make the change, and thus we can merge. |
If the |
Ok Vincent is right: |
I have not opinion on: should we keep the 256 version. |
no strong opinion on that either. I realised that what I've suggested does not do what I intended -- so just keep it unchanged. Lets hope there's someone with a Zen3 processor that would make a good use of it... :-) |
* add support for PCLMULQDQ instructions * fix the semantics of VPCLMULQDQ * cosmetic change on the EC code --------- Co-authored-by: José Bacelar Almeida <[email protected]> (cherry picked from commit 3c783b6)
This PR adds support for the carry-less multiplication instructions (PCLMULQDQ and VPCLMULQDQ).