Add riscv float32 gemm #4903
Looking forward to your riscv gemm! This task is quite hard.
Following the arm64 version of gemm, I now have a basic float32 implementation that passes the test_gemm and test_gemm1 tests.
Many intrinsics no longer compile with newer gcc, so compatibility handling is needed here. See the errors in CI for details.
Codecov Report
@@ Coverage Diff @@
## master #4903 +/- ##
==========================================
- Coverage 94.90% 94.81% -0.10%
==========================================
Files 779 769 -10
Lines 223166 239834 +16668
==========================================
+ Hits 211795 227394 +15599
- Misses 11371 12440 +1069
Is nT being initialized to a random value?
Rewrote some intrinsics; performance is OK now.
qemu's timing is far too inaccurate...
src/layer/riscv/riscv_usability.h
Outdated
@@ -86,6 +86,284 @@ static inline vfloat32m8_t vle32_v_f32m8_f32m1(const float* ptr)
return vloxei32_v_f32m8(ptr, bindex, vl);
}

#define VL 4
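Don't define VL 4 directly in the .h file.
It will pollute all other code that includes this header.
Looking at the places where transpose8x8_ps is called, nearly all of them have a load before it and a store after it. So perhaps transpose8x8_ps isn't needed at all?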
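On the VL point, a minimal sketch of one way to keep such a constant out of the global macro namespace. It is plain C++ with no RISC-V intrinsics so it compiles anywhere; sum_in_chunks, kChunk, and caller_owns_vl are hypothetical names for illustration and do not come from the PR.

```cpp
#include <cstddef>

// Hypothetical header-style helper: sums n floats in chunks of kChunk.
static inline float sum_in_chunks(const float* ptr, size_t n)
{
    // Function-local constant instead of a header-wide "#define VL 4".
    // It is invisible to any file that includes this header.
    const size_t kChunk = 4;

    float acc = 0.f;
    size_t i = 0;
    for (; i + kChunk <= n; i += kChunk)
    {
        for (size_t j = 0; j < kChunk; j++)
            acc += ptr[i + j];
    }
    // Tail elements that do not fill a whole chunk.
    for (; i < n; i++)
        acc += ptr[i];
    return acc;
}

// Code that includes the header can still use VL as an ordinary identifier.
// With "#define VL 4" in the header, the declaration below would expand to
// "int 4 = 8;" and fail to compile.
static inline int caller_owns_vl()
{
    int VL = 8;
    return VL;
}
```

If a macro is really wanted, another common option is to define it in the .cpp file that uses it, or to #undef it at the end of the header.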
That does seem feasible for the transpose8x8_ps and transpose4x4_ps cases.
This hasn't been fixed yet; I'll change it when I have time.
This requires that the memory the vector registers are stored to after the transpose be contiguous; then the tmp array inside transpose8x8_ps would no longer be needed. But looking through gemm, not many of the existing transposes satisfy this condition.
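The idea of folding the transpose into the store can be sketched in scalar C++ as below. This is only a model under stated assumptions: strided_store and transpose_by_strided_store are hypothetical names, the tile is assumed to be n x n row-major, and the destination tile is assumed to be one contiguous block, which is exactly the condition noted above as rarely met in gemm. On RVV, the scalar strided_store would presumably map to a strided vector store from the vsse32 family, but that mapping is an assumption and is not taken from the PR.

```cpp
#include <cstddef>

// Scalar stand-in for a strided vector store: writes vl elements of row
// to dst[0], dst[stride], dst[2 * stride], ...
static inline void strided_store(float* dst, size_t stride, const float* row, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i * stride] = row[i];
}

// Transposes an n x n row-major tile from src into the contiguous tile dst
// without an intermediate tmp array: row r of src is written as column r of dst.
static inline void transpose_by_strided_store(const float* src, float* dst, size_t n)
{
    for (size_t r = 0; r < n; r++)
        strided_store(dst + r, n, src + r * n, n);
}
```

Each source row is loaded once and written once, so no scratch buffer is involved; the cost moves into the strided memory access, whose speed depends on the hardware.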
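If you consider this complete, please remove WIP from the title.
Actually the fp16 gemm still isn't done, but I've been busy the past couple of days and want to wait until the weekend to take another look. I'll assess it, and if it really can't be finished I'll rename the current PR to "Add riscv float32 gemm".
I couldn't find an intrinsic in riscv similar to vfmlalq_laneq_low_f16, which multiplies f16 by f16 and accumulates the result into f32. If f32 is converted to f16 before the computation but f16 still has to be converted back to f32 during the computation, does this approach bring enough benefit?
vfwmul_vf_f32m2: you can refer to the fp16s part of convolution_packn_fp16s.h for how to write it.
I probably won't have time to write this before the National Day holiday, but I should be able to work on it during the break and complete the goal of "optimizing gemm_riscv.cpp with the risc-v vector and zfh (fp16) extensions, tested with qemu".
Thanks for your contribution!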
No description provided.