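Optimization results
I tried several optimizations; the fastest version so far runs in about 100 ms.
work2.cpp produces the my2 result: 10.33x faster than the original source built with the same compiler options, and 16x faster than the original project.
work.cpp produces the my result: 6.77x faster than the original source built with the same compiler options, and 10.55x faster than the original project.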
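Optimization approach
Compiler options
The first thing that came to mind was the floating-point math. Since the original algorithm guarantees that no NaN is ever produced, I enabled
-ffast-math -march=native
without hesitation. Compared with the baseline, the speedup is clearly noticeable.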
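AOS-->SOA
The original data layout is a natural OOP-style encapsulation, but a struct of 7 floats (28 bytes) is not 8-byte aligned and is hard for the compiler to optimize automatically. Switching to SOA made the code faster.
On godbolt, the generated assembly clearly contains more SIMD instructions.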
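A minimal sketch of the layout change described above. The field names, the star count `kStars`, and the helper `advance_positions` are assumptions made for illustration, not code taken from work.cpp; only the "7 floats per star" shape comes from the text.

```cpp
#include <cstddef>

// Assumed star count (hypothetical); a multiple of 8 keeps 256-bit loops tail-free.
constexpr std::size_t kStars = 48;

// AOS: one struct per star. 7 floats = 28 bytes, so the same field of
// neighbouring stars is 28 bytes apart in memory.
struct StarAoS {
    float px, py, pz;
    float vx, vy, vz;
    float mass;
};

// SOA: one contiguous array per field, so a loop over stars reads each
// field with unit stride.
struct StarsSoA {
    alignas(32) float px[kStars];
    alignas(32) float py[kStars];
    alignas(32) float pz[kStars];
    alignas(32) float vx[kStars];
    alignas(32) float vy[kStars];
    alignas(32) float vz[kStars];
    alignas(32) float mass[kStars];
};

// Position update written against the SOA layout (hypothetical helper).
inline void advance_positions(StarsSoA &s, float dt) {
    for (std::size_t i = 0; i < kStars; ++i) {
        s.px[i] += s.vx[i] * dt;
        s.py[i] += s.vy[i] * dt;
        s.pz[i] += s.vz[i] * dt;
    }
}
```

With this layout, each statement in the loop touches three plain float arrays, which is the pattern auto-vectorizers usually handle well.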
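Copying the data to remove data dependencies
The step function is essentially a doubly nested loop. When the compiler optimizes it, star and other are both iterators into the same array, so it has to assume that a write through star may change what is read through other, and the instruction pipeline is prone to read/write conflicts. The fix is to copy the original data and use the copy as other; the algorithm only ever reads from other, so working from a copy is safe. With the possible read/write dependency removed, the compiler optimizes far more effectively, and performance essentially reaches that of work.cpp.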
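The idea can be sketched as follows. This is an illustration under assumptions, not the code from work.cpp: the struct `Stars`, the function `update_velocities`, the star count, and the constants `kG`, `kEps`, `kDt` are all hypothetical.

```cpp
#include <cmath>
#include <cstddef>

constexpr std::size_t kCount = 48;  // assumed star count
// Assumed physical constants, not taken from the original project.
constexpr float kG = 0.001f;
constexpr float kEps = 0.001f;
constexpr float kDt = 0.01f;

struct Stars {
    float px[kCount], py[kCount], pz[kCount];
    float vx[kCount], vy[kCount], vz[kCount];
    float mass[kCount];
};

// Velocity update that reads only from a snapshot and writes only to `stars`.
inline void update_velocities(Stars &stars) {
    const Stars other = stars;  // local read-only copy, never written below
    for (std::size_t i = 0; i < kCount; ++i) {
        float ax = 0.0f, ay = 0.0f, az = 0.0f;
        for (std::size_t j = 0; j < kCount; ++j) {
            const float dx = other.px[j] - other.px[i];
            const float dy = other.py[j] - other.py[i];
            const float dz = other.pz[j] - other.pz[i];
            float d2 = dx * dx + dy * dy + dz * dz + kEps * kEps;
            d2 *= std::sqrt(d2);  // d2 now holds the cubed softened distance
            const float w = other.mass[j] * kG * kDt / d2;
            ax += dx * w;
            ay += dy * w;
            az += dz * w;
        }
        // The only writes go to `stars`, which the inner loop never reads.
        stars.vx[i] += ax;
        stars.vy[i] += ay;
        stars.vz[i] += az;
    }
}
```

Because `other` is a local object whose address never escapes, the compiler can see that stores to `stars` cannot affect the values loaded in the inner loop.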
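sqrt function
Within the loop body of step, the most expensive operation is undoubtedly sqrt. Switching from the C sqrt, which computes in double, to
std::sqrt
avoids wasting time on double-precision instructions and lets the compiler optimize faster and better. Another option is a faster square-root trick: I tried the well-known fast inverse square root function from Quake III, which works reasonably well and is slightly faster than std::sqrt; with it, the -ffast-math option can be turned off.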
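For reference, a sketch of the Quake III style trick, which approximates 1/sqrt(x) rather than sqrt(x). The function names `fast_rsqrt` and `inv_r_cubed` are hypothetical, the code assumes a 32-bit IEEE 754 float, and it may differ from the variant actually used in this PR. It uses `std::memcpy` instead of the original pointer cast to stay within defined behaviour.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");

// Approximate 1/sqrt(x) for x > 0: magic-constant initial guess
// followed by one Newton-Raphson refinement step.
inline float fast_rsqrt(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);       // reinterpret float as integer
    bits = 0x5f3759dfu - (bits >> 1);          // initial guess
    float y;
    std::memcpy(&y, &bits, sizeof y);          // reinterpret back as float
    return y * (1.5f - 0.5f * x * y * y);      // one Newton-Raphson step
}

// Approximate 1/r^3 given the squared distance d2 = r^2 (hypothetical helper).
inline float inv_r_cubed(float d2) {
    const float inv_r = fast_rsqrt(d2);
    return inv_r * inv_r * inv_r;
}
```

In the force loop, a factor like `inv_r_cubed(d2)` would replace the division by `d2 * sqrt(d2)`, at the cost of a small relative error from the approximation.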
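Hand-written SIMD
I also tried hand-written SIMD, unrolling directly at the outer loop. Because of hardware limits, only 256-bit operations are available. Even so, it is really fast, reaching about 100 ms.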
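One way to vectorize the outer loop with 256-bit AVX is sketched below: eight stars are processed per outer iteration, and each other star is broadcast across the lanes. This is an illustration under assumptions, not the code from work2.cpp; `Soa`, `add_velocity`, `add_velocity_scalar`, and the constants are hypothetical. A scalar fallback is used when the compiler is not targeting AVX.

```cpp
#include <cmath>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

constexpr std::size_t kN = 48;   // assumed star count, a multiple of 8
constexpr float kGdt = 1e-5f;    // assumed value of G * dt
constexpr float kEps2 = 1e-6f;   // assumed softening term eps^2

struct alignas(32) Soa {
    float px[kN], py[kN], pz[kN];
    float vx[kN], vy[kN], vz[kN];
    float mass[kN];
};

// Scalar reference: s receives the velocity update, o is a read-only snapshot.
inline void add_velocity_scalar(Soa &s, const Soa &o) {
    for (std::size_t i = 0; i < kN; ++i) {
        for (std::size_t j = 0; j < kN; ++j) {
            const float dx = o.px[j] - s.px[i];
            const float dy = o.py[j] - s.py[i];
            const float dz = o.pz[j] - s.pz[i];
            const float d2 = dx * dx + dy * dy + dz * dz + kEps2;
            const float w = o.mass[j] * kGdt / (d2 * std::sqrt(d2));
            s.vx[i] += dx * w;
            s.vy[i] += dy * w;
            s.vz[i] += dz * w;
        }
    }
}

// Same update with the outer loop vectorized: 8 stars per iteration.
inline void add_velocity(Soa &s, const Soa &o) {
#if defined(__AVX__)
    const __m256 eps2 = _mm256_set1_ps(kEps2);
    const __m256 gdt = _mm256_set1_ps(kGdt);
    for (std::size_t i = 0; i < kN; i += 8) {
        const __m256 xi = _mm256_loadu_ps(s.px + i);
        const __m256 yi = _mm256_loadu_ps(s.py + i);
        const __m256 zi = _mm256_loadu_ps(s.pz + i);
        __m256 ax = _mm256_setzero_ps();
        __m256 ay = _mm256_setzero_ps();
        __m256 az = _mm256_setzero_ps();
        for (std::size_t j = 0; j < kN; ++j) {
            // Broadcast star j to all 8 lanes.
            const __m256 dx = _mm256_sub_ps(_mm256_set1_ps(o.px[j]), xi);
            const __m256 dy = _mm256_sub_ps(_mm256_set1_ps(o.py[j]), yi);
            const __m256 dz = _mm256_sub_ps(_mm256_set1_ps(o.pz[j]), zi);
            const __m256 d2 = _mm256_add_ps(
                _mm256_add_ps(_mm256_mul_ps(dx, dx), _mm256_mul_ps(dy, dy)),
                _mm256_add_ps(_mm256_mul_ps(dz, dz), eps2));
            const __m256 r3 = _mm256_mul_ps(d2, _mm256_sqrt_ps(d2));
            const __m256 w = _mm256_div_ps(
                _mm256_mul_ps(_mm256_set1_ps(o.mass[j]), gdt), r3);
            ax = _mm256_add_ps(ax, _mm256_mul_ps(dx, w));
            ay = _mm256_add_ps(ay, _mm256_mul_ps(dy, w));
            az = _mm256_add_ps(az, _mm256_mul_ps(dz, w));
        }
        _mm256_storeu_ps(s.vx + i, _mm256_add_ps(_mm256_loadu_ps(s.vx + i), ax));
        _mm256_storeu_ps(s.vy + i, _mm256_add_ps(_mm256_loadu_ps(s.vy + i), ay));
        _mm256_storeu_ps(s.vz + i, _mm256_add_ps(_mm256_loadu_ps(s.vz + i), az));
    }
#else
    add_velocity_scalar(s, o);  // fallback when AVX is not enabled
#endif
}
```

Building with `-march=native` (or `-mavx`) on an AVX-capable machine selects the intrinsic path; keeping the scalar version around makes it easy to check that both produce matching results.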