[PyOV] Tensor constructor rework and data casting, adjust bf16 behavior #23263
Conversation
@slyalin could you please take a look, especially at the bf16->fp32 upcast part? @KodiaqQ is it possible to test it functionally on your side from the branch?
Do we still have the ability to take the original bf16 tensor without any upcasts? Am I right that numpy still lacks bf16 support (numpy/numpy#19808) and that this is the main reason for adding this bf16 special behavior (and not because the user simply doesn't expect bf16 as model input/output, preferring it to be an internal implementation detail)?
@slyalin yes, bfloat16 is not supported and I do not see it being supported in the near future. For example, Torch addresses this problem by making the Tensor class the primary way to handle data. When data is returned as float32, the resulting numpy array can be safely operated on without the need for specialized code that creates views on the data. This is the main reason for adding this behavior. Similar policies may apply to packed types in the future (i.e. automatic unpacking of u1 to uint8).
For synchronous inference there is no easy way of doing that. Again, this raises the topic of adding a ReturnPolicy (or adding yet another …)
@@ -133,7 +144,7 @@ def infer(
         self,
         inputs,
         is_shared=share_inputs,
-    ), share_outputs=share_outputs, decode_strings=decode_strings))
+    ), share_outputs=share_outputs, decode_strings=decode_strings, cast_bf16=cast_bf16))
Why should infer be responsible for data conversion?
I think plugins need to support bf16 and accept it as input. The current flag looks like a workaround (WA).
Please read the supporting tickets and this comment: #23263 (comment)
It's not a WA, it's a solution for a data type that does not exist in numpy. This data conversion is also optional.
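For context, a minimal standalone sketch (plain C++, not code from this PR; the helper name bf16_bits_to_float is illustrative) of why the bf16 -> f32 upcast cannot lose information: bfloat16 keeps the top 16 bits of an IEEE-754 float32, so widening is a pure bit shift.

#include <cstdint>
#include <cstring>

// Decode a bf16 bit pattern into a float32 value. The low 16 mantissa bits
// are simply restored as zeros, so every representable bf16 value maps to
// exactly one float32 value.
inline float bf16_bits_to_float(uint16_t bits) {
    const uint32_t widened = static_cast<uint32_t>(bits) << 16;
    float result;
    std::memcpy(&result, &widened, sizeof(result));  // bit-level reinterpretation without UB
    return result;
}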
if (dst_dtype == ov::element::boolean) {
    auto tmp = ov::op::v0::Constant(dst_dtype,
                                    array_helpers::get_shape(array),
                                    array_helpers::array_as_vector_bool<T>(array));
Here we perform a double copy:
- from the array to a vector
- from the vector to the Constant

Why do we copy instead of sharing the data?
If sharing is not possible, the double copy can be avoided with:
auto src_tensor = ov::Tensor(dst_type, array_helpers::get_shape(array), array.data());
This makes a view, not a copy; then just memcpy into the destination tensor.
Check whether this can work for the bool type.
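A rough sketch of that suggestion, assuming dst_dtype and array from the snippet above; the destination tensor name dst_tensor is illustrative, and whether numpy's 1-byte bool layout matches ov::element::boolean still needs the check mentioned above.

// Wrap the numpy buffer in a non-owning Tensor view (no copy)...
auto src_tensor = ov::Tensor(dst_dtype,
                             array_helpers::get_shape(array),
                             const_cast<void*>(array.data()));
// ...then copy its bytes once into the destination tensor.
std::memcpy(dst_tensor.data(), src_tensor.data(), src_tensor.get_byte_size());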
},
R"(
Gets all outputs tensors of this InferRequest.

Note: All string-based data is decoded by default.
Note: All bf16-based data is upcasted to fp32 by default.
What if I set the output tensor as bf16? Will it be ignored and a new one created?
If yes, it's against the API rules.
This function returns a dict whose values are copied numpy arrays -- the cast can simply be switched off. However, it was decided that strings will be decoded here -- let's try to have some uniformity between decisions.
Nothing changes in Model/CompiledModel/InferRequest while calling this.
:param dtype: Desired type of Tensor. Leave undefined to inherit the dtype
              from numpy array. Casting will always result in copy!
              i.e. when creating Type.bf16 tensors or creating Type.f32 tensor
              from numpy.float64 array, data sharing will be disabled by defult.
I'm against the tensor converting data.
Data conversion is part of the graph execution process.
Python bindings are only allowed to bind the C++ and Python APIs, not to perform extra computations outside of the main infer.
I think that extending a user-friendly/Pythonic interface (or following well-known examples) and allowing proper data manipulation is more critical.
Let's take an example: https://pytorch.org/docs/stable/generated/torch.from_numpy.html
As you can see, there is no way to create a bfloat16 Tensor that way; the approach is torch.from_numpy(array).to(torch.bfloat16).
Here is the definition of to: https://pytorch.org/docs/stable/generated/torch.Tensor.to.html#torch.Tensor.to
"Otherwise, the returned tensor is a copy of self with the desired torch.dtype and torch.device."
@KodiaqQ can we please have your insight here? The Python (and thus NNCF) perspective can be fundamentally different from the C++ one.
> I think that extending a user-friendly/Pythonic interface (or following well-known examples) and allowing proper data manipulation is more critical.

Here you extend the computation graph, e.g. introduce pre/post-processing capabilities outside of the main device graph.
In the case of CUDA / PyTorch, you can trace your model with from_numpy / to, etc. and trace a CUDA graph which will contain such conversions as part of the model (and hence they are optimized), but in OpenVINO that's not true, because you perform data conversions via slow ov::Constant implementations.
From NNCF's perspective, it might be very useful to have methods for data conversion somewhere in the Python API.
@ilya-lavrenov so what is your proposal to address it in OpenVINO? How should bf16 data be correctly interpreted (from any possible array, i.e. int16/float16/float32, with regard to values and not their binary representation) and used to populate Tensors?
I don't see the full problem from the tickets. They just provide some code which doesn't work, but I don't understand how it is related to the fact that we store bf16 as fp16, or why we need to change this approach.
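To make the disagreement concrete, a small illustration (assuming ov::float16 exposes from_bits the same way ov::bfloat16 does) of why the same 16 bits mean different values when viewed as f16 versus decoded as bf16 -- which is why the f16-typed array returned today has to be reinterpreted rather than used directly.

#include <cstdint>
#include "openvino/core/type/bfloat16.hpp"
#include "openvino/core/type/float16.hpp"

// 0x3FC0 is the bf16 encoding of 1.5.
const uint16_t bits = 0x3FC0;
const float decoded_as_bf16 = static_cast<float>(ov::bfloat16::from_bits(bits));  // 1.5
const float viewed_as_f16 = static_cast<float>(ov::float16::from_bits(bits));     // 1.9375 -- a different value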
    } else {
        tmp = ov::op::v0::Constant(t).cast_vector<T>();
    }
    return py::array(dtype, t.get_shape(), tmp.data());
Does it perform a copy of the tmp array? If yes, is it possible to fill the data directly into the py::array?
Maybe it is possible to implement it like:
-    return py::array(dtype, t.get_shape(), tmp.data());
+    return py::array(dtype, t.get_shape(), t.data());
If py::array accepts a void*, it looks like T and dtype are the "same" type, so the tensor data should be interpreted correctly from the void*.
            tmp.emplace_back(static_cast<T>(ov::bfloat16::from_bits(*(reinterpret_cast<uint16_t*>(t.data()) + i))));
        }
    } else {
        tmp = ov::op::v0::Constant(t).cast_vector<T>();
Here we create two temporary objects, which is very suboptimal.
Should we extend Constant to handle the bf16 case as well?
Constant::cast_vector supports bf16.
auto tmp = std::make_shared<std::vector<T>>(ov::op::v0::Constant(t).cast_vector<T>());
Then, is it possible for py::array to extend the lifetime of tmp and just wrap it to provide the correct interface?
return py::array(dtype, t.get_shape(), tmp);
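One way the lifetime extension asked about here could look -- a sketch only, under the assumption that a capsule (rather than a bare shared_ptr) is what lets py::array take ownership of the temporary vector:

// The array owns the converted vector, so only the cast_vector copy remains;
// py::array wraps the existing buffer instead of copying it again.
auto* tmp = new std::vector<T>(ov::op::v0::Constant(t).cast_vector<T>());
py::capsule owner(tmp, [](void* ptr) { delete static_cast<std::vector<T>*>(ptr); });
return py::array(dtype, t.get_shape(), tmp->data(), owner);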
@@ -716,12 +722,13 @@ void regclass_InferRequest(py::module m) {
    cls.def_property_readonly(
        "results",
        [](InferRequestWrapper& self) {
-            return Common::outputs_to_dict(self, false, true);
+            return Common::outputs_to_dict(self, false, true, true);
Why are results converted to fp32 by default?
> However, it was decided that strings will be decoded here -- let's try to have some uniformity between decisions.
} else if (dst_dtype == ov::element::f32) {
    return array_from_tensor_t<float>(std::move(t), py::dtype("float32"));
}
if (dst_dtype == ov::element::f16) {
-    if (dst_dtype == ov::element::f16) {
+    else if (dst_dtype == ov::element::f16) {
?
py::array array_from_constant(ov::op::v0::Constant&& c, bool is_shared) {
    // Get actual dtype from OpenVINO type:
    auto ov_type = c.get_element_type();
-    auto ov_type = c.get_element_type();
+    const auto& ov_type = c.get_element_type();
} else if (element_type == ov::element::u8 || element_type == ov::element::u1 || element_type == ov::element::u4 ||
           element_type == ov::element::nf4) {
    // WA for u1, u4, nf4, all returned as packed uint8 arrays
    return _get_byte_strides<uint8_t>(shape);
The fundamental type for these types, except u8, is int8_t, so maybe use the same type as in the C++ implementation?
if (dst_dtype == ov::element::bf16 || type_helpers::get_ov_type(array) != dst_dtype) {
    tensor_helpers::fill_tensor(tensor, array, dst_dtype);
    return tensor;
}
tensor_helpers::fill_tensor(tensor, array);
-    if (dst_dtype == ov::element::bf16 || type_helpers::get_ov_type(array) != dst_dtype) {
-        tensor_helpers::fill_tensor(tensor, array, dst_dtype);
-        return tensor;
-    }
-    tensor_helpers::fill_tensor(tensor, array);
+    if (dst_dtype == ov::element::bf16 || type_helpers::get_ov_type(array) != dst_dtype) {
+        tensor_helpers::fill_tensor(tensor, array, dst_dtype);
+    } else {
+        tensor_helpers::fill_tensor(tensor, array);
+    }
std::vector<size_t> byte_strides;
std::vector<size_t> element_strides = ov::row_major_strides(s);
for (auto v : element_strides) {
    byte_strides.push_back(static_cast<size_t>(v) * sizeof(T));
}
return byte_strides;
-    std::vector<size_t> byte_strides;
-    std::vector<size_t> element_strides = ov::row_major_strides(s);
-    for (auto v : element_strides) {
-        byte_strides.push_back(static_cast<size_t>(v) * sizeof(T));
-    }
-    return byte_strides;
+    auto byte_strides = ov::row_major_strides(s);
+    for (auto&& stride : byte_strides) {
+        stride *= sizeof(T);
+    }
+    return byte_strides;
static_cast<size_t>(c.get_element_type().size()),   /* Size of one scalar */
std::string(1, 'H'),                                 /* Python struct-style format descriptor */
static_cast<size_t>(shape.size()),                   /* Number of dimensions */
std::vector<size_t>{shape.begin(), shape.end()},     /* Buffer dimensions */
Why make a copy? ov::Shape can be cast to this kind of vector, since that vector type is the base class of Shape.
    result.reserve(array.size());

    for (long int i = 0; i < array.size(); i++) {
        result.emplace_back(*(static_cast<T*>(const_cast<void*>(array.data())) + i) != 0 ? 1 : 0);
Why is the const_cast required? Isn't array.data() pointing to the bool type?
Is it possible to change this loop to something like:
std::transform(array.begin(), array.end(), std::back_inserter(result), [](const bool v){ return static_cast<char>(v); });
Casting the bool value will return 0 or 1.
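As a sketch of that suggestion (py::array itself has no begin()/end(), so raw pointers are used here; T and result are assumed to be the helper's template parameter and output vector):

// Normalize every element to 0/1 while copying, without the per-element const_cast.
const T* src = static_cast<const T*>(array.data());
std::transform(src, src + array.size(), std::back_inserter(result),
               [](const T value) { return static_cast<char>(value != 0); });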
template <typename T>
std::vector<T> array_as_vector(py::array& array) {
    T* ptr = static_cast<T*>(const_cast<void*>(array.data()));
Why is the const_cast required? The data is only read, so it can stay const.
py::array array_from_tensor_t(ov::Tensor&& t, py::dtype&& dtype) {
    std::vector<T> tmp;
    if (t.get_element_type() == ov::element::bf16) {
        for (size_t i = 0; i < t.get_size(); i++) {
Can this be converted to std::transform?
Get a pointer to the tensor data before the loop and use it:
auto t_ptr = reinterpret_cast<const uint16_t*>(t.data());
instead of taking it on every iteration.
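A sketch of that reshuffle (tmp, t and T as in the helper under review):

// Hoist the data pointer out of the loop and let std::transform drive the
// bf16 -> T conversion.
const auto* t_ptr = reinterpret_cast<const uint16_t*>(t.data());
tmp.reserve(t.get_size());
std::transform(t_ptr, t_ptr + t.get_size(), std::back_inserter(tmp),
               [](const uint16_t bits) { return static_cast<T>(ov::bfloat16::from_bits(bits)); });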
auto element_type = self.get_element_type();
auto shape = self.get_shape();
-    auto element_type = self.get_element_type();
-    auto shape = self.get_shape();
+    const auto& element_type = self.get_element_type();
+    const auto& shape = self.get_shape();
The strides could be calculated like:
auto strides = ov::row_major_strides(s);
// then multiply each stride in strides by element_type.size()
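Spelled out, the comment above amounts to the following sketch (element_type and self as in the snippet being reviewed):

// Compute byte strides directly from the element type's size at runtime,
// so no per-type template instantiation is needed.
auto byte_strides = ov::row_major_strides(self.get_shape());
for (auto& stride : byte_strides) {
    stride *= element_type.size();
}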
This PR will be closed in a week because of 2 weeks of no activity.
Closed in favor of #23771
Details:
- cast_data: a function that can up/down-cast data in the available numpy formats. I.e. Tensor.data of bf16 returns an array in f16 that needs to be reinterpreted from bytes -- cast_data allows casting it to f32, so the conversion does not lose any information and provides an easy interface to get the data in numpy format.
- The cast_bf16 flag determines what result will be produced. To keep backward compatibility, the default value is False and the returned array is of np.float16 type, to be reinterpreted. By setting cast_bf16=True the data will be converted to np.float32 type; however, it will lose sharing capabilities, i.e. when the user sets share_outputs=True, bf16 data will always be a copy.
- Tensor(data, *, dtype, share) and cast_data.

Tickets: