-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dev #114
base: dev
Are you sure you want to change the base?
Dev #114
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks for the PR! I really like it :-)
I raised a number of comments, but it seems that there's nothing serious.
For most of the algorithms I think we could rewrite them to not use non-SIMD operations in the prologue and epilogue at all. We could do a single unaligned load that overlaps with the main aligned SIMD body, do the computations and then do unaligned store that also overlaps with the main aligned SIMD body. This would be faster in most cases, as the scalar code is multiple times slower than SIMD.
@@ -0,0 +1,69 @@ | |||
/* Copyright (C) 2018 Povilas Kanapickas <[email protected]> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't be shy to use your name :-) It's you who wrote the code, the copyrights belong to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I think we could share the copyright. Anyway its your lib :) and i just add new capabilities
simdpp/algorithm/all_of.h
Outdated
#include <simdpp/algorithm/helper_input_range.h> | ||
|
||
namespace simdpp { | ||
namespace SIMDPP_ARCH_NAMESPACE { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation: Please don't indent the namespace blocks. Also, the library uses 4 space indent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. I switch between different computer (linux,windows) and default indentation are not the same between my ide.
simdpp/algorithm/all_of.h
Outdated
template<typename T, typename UnaryPredicate> | ||
bool all_of(T const* first, T const* last, UnaryPredicate pred) | ||
{ | ||
#ifndef NDEBUG //precondition debug mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could add something like SIMDPP_DEBUG and use it throughout the library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed switch between NDEBUG->SIMDPP_DEBUG. Now we have to decide when/where we activate these flag regarding input configurations
simdpp/algorithm/all_of.h
Outdated
throw std::runtime_error("all_of - null ptr last."); | ||
#endif | ||
|
||
using simd_type_T = typename typetraits<T>::simd_type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer type_traits. Is there any conflict that prevents this name by chance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK i could switch to type_traits or simd_type_traits to avoid any conflicts
simdpp/algorithm/all_of.h
Outdated
|
||
//prologue | ||
auto lastprologue = first + size_prologue_loop; | ||
if(!std::all_of(first, lastprologue, pred)) return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space after if
, return on next line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. And also fixed for any_of
test/insn/all_of.cc
Outdated
const auto predEqualTen = UnaryPredicateEqualValue<T>((T)10); | ||
const auto predEqualFive = UnaryPredicateEqualValue<T>((T)5); | ||
{ //test prologue | ||
vector_t ivect = { (T)10,(T)10 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for higher level algorithms we need more exhaustive testing. Like all small array lengths, plus all combinations of alignments of both sequence begin and end pointers. Probably makes sense to create some kind of generic generator and use it in all tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK i will think about this. And provide something asap
simdpp/algorithm/transform_reduce.h
Outdated
//---prologue | ||
for (; i < size_prologue_loop; ++i) | ||
{ | ||
init = binary_op(init,unary_op(*first++)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to do possibly overlapping SIMD operations for the prologue and epilogue and then mask out the extra lanes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry i don't understand this point. Could you make an exemple?
simdpp/algorithm/transform.h
Outdated
auto i = 0u; | ||
|
||
//---prologue | ||
for (; i < size_prologue_loop; ++i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to do possibly overlapping SIMD operations for the prologue and epilogue. Some elements will be computed twice, but the code will be much faster this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idem
@@ -84,10 +84,10 @@ inline Arch get_arch_string_list(const char* const strings[], int count, const c | |||
return res; | |||
#endif | |||
|
|||
int prefixlen = std::strlen(prefix); | |||
for (int i = 0; i < count; ++i) { | |||
auto prefixlen = std::strlen(prefix); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer the explicit type for the simple numeric types. Auto makes the code less readable in these cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK use size_t instead
test/utils/test_helpers.h
Outdated
@@ -90,6 +90,7 @@ class TestData { | |||
TestData& operator=(const TestData& other) | |||
{ | |||
data_ = other.data_; | |||
return (*this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we drop the parentheses? I thing they're not needed in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Thanks !
May i miss something but as i spotted above, we also need prologue if data lenght is too small to fit in simd registers( eg 7 uint , in this way we could use transparently simddp::function everywhere ) |
Yes, if the total length is less than the width of the register, then the scalar part is needed. My point was that if you have, say, a range of 15 uint16 elements to process, then it's faster to just process two overlapping 8 element pairs instead of doing 1 full wave and 7 element scalar prologue/epilogue. |
Follow review * fix indent * add "fuzzing" tests for all algorithm * add TEST_EQUAL_COLLECTIONS * add nrt helpers for generating data (to be moved elsewhere ?)
* Try to fix visual 2013/2015 compilation issues * enforce const/inline and noexcept for predicate
what is the purpose of having SIMDPP_NOEXECPT and changing inline to a custom macro? |
Regards |
* Add google benchmark as ExternalProject * Add three bench suite transform "unary", reduce "unary", load/store Todo: * strange behavior on transform bench suite. STD seems faster than SIMD on MSVC2017 <--- to be checked on gcc>5 * add other cases
First pull request around isue #107
Add all_of,any_of,copy,copy_n,count,count_if,equal,fill,find,find_if,find_if_not,lexicographical_compare,max,max_element,min,min_element,none_of,reduce,replace,replace_if,transform,transform_reduce "STL" like algorithm
Provide non regressions tests (Validated on Visual2017 and GCC) and documentation
Please pay attention on workaround i make around masktype. May i miss something and better approach exist
I think a preliminary refactoring could be to move some usefull Unary/Binary predicate in a dedicated header
Other fix and/or proposal
fix TestData& operator=(const TestData& other) assignment operator
reduce warning (of course last commit coud be dropped)