
[PyOV] Tensor constructor rework and data casting, adjust bf16 behavior #23263

Closed
wants to merge 11 commits

Conversation


@jiwaszki jiwaszki commented Mar 4, 2024

Details:

  • Added a new constructor for the Tensor class that resolves issues with up/down-casting of data, including the custom bf16 type.
  • Added a cast_data function that can up/down-cast data between available numpy formats. For example, Tensor.data of a bf16 tensor returns an f16 array whose bytes need to be reinterpreted; cast_data can cast it to f32, so the conversion does not lose any information and provides an easy interface to get the data in numpy format.
  • Adjusted values returned from inference calls so that bf16 outputs can optionally be upcast to f32. The cast_bf16 flag determines which result is produced. To keep backward compatibility, the default value is False and the returned array is of np.float16 type, to be reinterpreted. Setting cast_bf16=True converts the data to np.float32, but it loses sharing capabilities, i.e. when the user sets share_outputs=True, bf16 data will always be a copy. (A usage sketch follows this list.)
  • Adjusted the data dispatcher to utilize the new Tensor capabilities.
  • TODO: tests for the new constructor Tensor(data, *, dtype, share) and cast_data.
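
A minimal usage sketch of the proposed API, assuming the signatures described above (the Tensor(data, *, dtype, share) constructor, a cast_data helper, and the cast_bf16 inference flag); exact names and locations may differ from the final implementation:

import numpy as np
import openvino as ov

data = np.arange(6, dtype=np.float32).reshape(2, 3)

# Proposed keyword-only constructor: creating a bf16 tensor always copies.
t = ov.Tensor(data, dtype=ov.Type.bf16, share=False)

# Tensor.data of a bf16 tensor is exposed via a 16-bit buffer; the proposed
# cast_data helper (exact module path assumed) upcasts it losslessly to f32:
# f32_array = cast_data(t, ov.Type.f32)

# Proposed inference flag: upcast bf16 outputs to np.float32 (always a copy).
# results = request.infer(inputs, share_outputs=True, cast_bf16=True)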

Tickets:

  • 129930, part of 125433, 127292

@jiwaszki jiwaszki added category: Python API OpenVINO Python bindings WIP work in progress labels Mar 4, 2024
@jiwaszki jiwaszki added this to the 2024.1 milestone Mar 4, 2024
@jiwaszki jiwaszki self-assigned this Mar 4, 2024
@mlukasze
Contributor

@slyalin could you please take a look, especially at the bf16->fp32 upcast part?

@KodiaqQ is it possible to test it functionally on your side from the branch?

@slyalin
Contributor

slyalin commented Mar 11, 2024

Adjusted returning values from inference calls to be automatically upcasted to f32 in cases of bf16 outputs.

Do we still have the ability to take the original bf16 tensor without any upcasts?

Am I right that we are still lacking numpy/numpy#19808 in numpy, and that this is the main reason for adding this special bf16 behavior (and not because the user simply doesn't expect bf16 as a model input/output, preferring it to be an internal implementation detail)?

@jiwaszki
Author

jiwaszki commented Mar 11, 2024

Adjusted returning values from inference calls to be automatically upcasted to f32 in cases of bf16 outputs.

Do we still have the ability to take the original bf16 tensor without any upcasts?

Am I right that we are still lacking numpy/numpy#19808 in numpy, and that this is the main reason for adding this special bf16 behavior (and not because the user simply doesn't expect bf16 as a model input/output, preferring it to be an internal implementation detail)?

@slyalin yes, bfloat16 is not supported and I do not see it being supported in the near future.

For example, Torch addresses this problem by making the Tensor class the primary way to handle data.
For OV it's not that simple, as a commitment has already been made to numpy arrays being the default way of communicating with Python (ov.Tensor is more of a wrapper around the memory than a standalone class with various functionalities). Also, inference functions return numpy arrays instead of ov.Tensors.

When data is returned as float32, the resulting numpy array can be operated on safely, without the need to write specialized code for views on the data. This is the main reason for adding this behavior. Similar policies may apply to packed types in the future (i.e. automatic unpacking of u1 to uint8).
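
For reference, a minimal sketch of the manual reinterpretation a user otherwise has to write when bf16 data arrives as a raw 16-bit buffer (standard bf16 bit layout; the values are illustrative):

import numpy as np

raw = np.array([16256, 16384, 16448], dtype=np.uint16)  # bf16 bit patterns for 1.0, 2.0, 3.0

# bf16 is the upper 16 bits of an IEEE float32, so shifting left by 16 and
# reinterpreting the bits yields an exact float32 representation.
f32 = (raw.astype(np.uint32) << 16).view(np.float32)
print(f32)  # [1. 2. 3.]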

Do we still have the ability to take the original bf16 tensor without any upcasts?

For synchronous inference there is no easy way of doing that. Again, this raises the topic of adding a ReturnPolicy (or yet another casting=True/False flag) to inference calls to address such cases (EDIT: added cast_bf16 to ensure backward compatibility).
It is possible to get an ov.Tensor by calling get_tensor() on the InferRequest and reinterpreting its data. For asynchronous inference this way is already preferred, so the concern is only about the sync version.
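
A minimal sketch of that workaround, assuming an InferRequest (here called request) whose first output is bf16:

import numpy as np

# request.infer(inputs)
out_tensor = request.get_output_tensor(0)   # original ov.Tensor, no upcast

# Tensor.data exposes the bf16 payload as a 16-bit buffer; recover the f32
# values with the same bit-shift reinterpretation shown in the earlier sketch.
bits = out_tensor.data.view(np.uint16)
f32 = (bits.astype(np.uint32) << 16).view(np.float32)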

@jiwaszki jiwaszki marked this pull request as ready for review March 14, 2024 14:53
@jiwaszki jiwaszki requested a review from a team as a code owner March 14, 2024 14:53
@jiwaszki
Author

@slyalin @akuporos Please review, thanks!

@@ -133,7 +144,7 @@ def infer(
self,
inputs,
is_shared=share_inputs,
), share_outputs=share_outputs, decode_strings=decode_strings))
), share_outputs=share_outputs, decode_strings=decode_strings, cast_bf16=cast_bf16))
Contributor

Why should infer be responsible for data conversion?
I think plugins need to support bf16 and accept it as input. The current flag looks like a WA (workaround).

Author

Please read the supporting tickets and this comment: #23263 (comment)

It's not a WA, it's a solution to a data type that does not exist in numpy. This data conversion is also optional.
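
For context, a minimal illustration of the missing dtype (numpy has no native bfloat16, which is what numpy/numpy#19808 tracks):

import numpy as np

try:
    np.dtype("bfloat16")
except TypeError as err:
    print(err)  # data type 'bfloat16' not understood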

if (dst_dtype == ov::element::boolean) {
auto tmp = ov::op::v0::Constant(dst_dtype,
array_helpers::get_shape(array),
array_helpers::array_as_vector_bool<T>(array));
Contributor

Here we perform a double copy:

  • from array to vector
  • from vector to constant

Why do we copy and not share the data?

Contributor

If sharing is not possible, the double copy can be avoided by:

auto src_tensor = ov::Tensor(dst_type, array_helpers::get_shape(array), array.data());

This makes a view, not a copy; then just memcpy into the destination tensor.
Check whether this can work for the bool type.

},
R"(
Gets all outputs tensors of this InferRequest.

Note: All string-based data is decoded by default.
Note: All bf16-based data is upcasted to fp32 by default.
Contributor

What if I set the output tensor as bf16? Will it be ignored and a new one created?
If yes, it's against the API rules.

Author

This function returns a dict whose values are copied numpy arrays -- it can simply be switched off. However, it was decided that strings will be decoded here -- let's try to have some uniformity between decisions.

Nothing changes in Model/CompiledModel/InferRequest while calling this.

:param dtype: Desired type of Tensor. Leave undefined to inherit the dtype
from numpy array. Casting will always result in copy!
i.e. when creating Type.bf16 tensors or creating Type.f32 tensor
from numpy.float64 array, data sharing will be disabled by default.
Contributor

I'm against the tensor converting data.
Data conversion is part of the graph execution process.

Python bindings are only allowed to bind the C++ and Python APIs, not to perform extra computations outside of the main infer.

Author

I think that extending a user-friendly/Pythonic interface (or following well-known examples) and allowing proper data manipulation is more critical.

Let's take an example: https://pytorch.org/docs/stable/generated/torch.from_numpy.html
As you can see, there is no way to create a bfloat16 Tensor directly; the approach is torch.from_numpy(array).to(torch.bfloat16).
Here is the definition of to: https://pytorch.org/docs/stable/generated/torch.Tensor.to.html#torch.Tensor.to

"Otherwise, the returned tensor is a copy of self with the desired torch.dtype and torch.device."

@KodiaqQ can we please have your insight here? The Python (and thus NNCF) perspective can be fundamentally different from the C++ one.

Contributor

I think that extending a user-friendly/Pythonic interface (or following well-known examples) and allowing proper data manipulation is more critical.

Here you extend the computation graph, i.e. introduce pre/post-processing capabilities outside of the main device graph.

In the case of CUDA / PyTorch, you can trace your model with from_numpy / to, etc. and capture a CUDA graph, which will contain such conversions as part of the model (and hence they are optimized), but in OpenVINO this is not true, because you perform data conversions via slow ov::Constant implementations.

Contributor

From NNCF's perspective, it might be very useful to have methods for data conversion somewhere in the Python API.

Author

@jiwaszki jiwaszki Mar 18, 2024

@ilya-lavrenov so what is your proposal to address it in OpenVINO? How should bf16 data be correctly interpreted (from any possible array, i.e. int16/float16/float32, with regard to values and not its binary representation) and used to populate Tensors?

Contributor

I don't see the full problem from the tickets. They just provide some code which doesn't work, but I don't understand how it's related to the fact that we store bf16 as fp16, or why we need to change this approach.

} else {
tmp = ov::op::v0::Constant(t).cast_vector<T>();
}
return py::array(dtype, t.get_shape(), tmp.data());
Contributor

Does this perform a copy of the tmp array? If yes, is it possible to fill the data directly into the py::array?

Contributor

Maybe it is possible to implement it like this:

Suggested change
return py::array(dtype, t.get_shape(), tmp.data());
return py::array(dtype, t.get_shape(), t.data());

If py::array accepts void*, it looks like T and dtype are the "same" type, so the tensor data should be interpreted correctly from void*.

tmp.emplace_back(static_cast<T>(ov::bfloat16::from_bits(*(reinterpret_cast<uint16_t*>(t.data()) + i))));
}
} else {
tmp = ov::op::v0::Constant(t).cast_vector<T>();
Contributor

Here we create 2 temporary objects, which is very suboptimal.

Should we extend Constant to handle the bf16 case as well?

Contributor

Constant::cast_vector supports bf16.

auto tmp = std::make_shared<std::vector<T>>(ov::op::v0::Constant(t).cast_vector<T>());

Then, is it possible for py::array to extend the lifetime of tmp and just wrap it in a py::array to provide the correct interface?

return py::array(dtype, t.get_shape(), tmp);

@@ -716,12 +722,13 @@ void regclass_InferRequest(py::module m) {
cls.def_property_readonly(
"results",
[](InferRequestWrapper& self) {
return Common::outputs_to_dict(self, false, true);
return Common::outputs_to_dict(self, false, true, true);
Contributor

Why are results converted to fp32 by default?

Author

#23263 (comment)

However, it was decided that strings will be decoded here -- let's try to have some uniformity between decisions.

:param dtype: Desired type of Tensor. Leave undefined to inherit the dtype
from numpy array. Casting will always result in copy!
i.e. when creating Type.bf16 tensors or creating Type.f32 tensor
from numpy.float64 array, data sharing will be disabled by default.
Contributor

I think that extending a user-friendly/Pythonic interface (or following well-known examples) and allowing proper data manipulation is more critical.

Here you extend the computation graph, i.e. introduce pre/post-processing capabilities outside of the main device graph.

In the case of CUDA / PyTorch, you can trace your model with from_numpy / to, etc. and capture a CUDA graph, which will contain such conversions as part of the model (and hence they are optimized), but in OpenVINO this is not true, because you perform data conversions via slow ov::Constant implementations.

@jiwaszki jiwaszki removed the WIP work in progress label Mar 18, 2024
} else if (dst_dtype == ov::element::f32) {
return array_from_tensor_t<float>(std::move(t), py::dtype("float32"));
}
if (dst_dtype == ov::element::f16) {
Contributor

Suggested change
if (dst_dtype == ov::element::f16) {
else if (dst_dtype == ov::element::f16) {

?


py::array array_from_constant(ov::op::v0::Constant&& c, bool is_shared) {
// Get actual dtype from OpenVINO type:
auto ov_type = c.get_element_type();
Contributor

Suggested change
auto ov_type = c.get_element_type();
const auto& ov_type = c.get_element_type();

} else if (element_type == ov::element::u8 || element_type == ov::element::u1 || element_type == ov::element::u4 ||
element_type == ov::element::nf4) {
// WA for u1, u4, nf4, all returned as packed uint8 arrays
return _get_byte_strides<uint8_t>(shape);
Contributor

The fundamental type for these types, except u8, is int8_t, so maybe use the same type as in the C++ implementation?

Comment on lines +497 to +501
if (dst_dtype == ov::element::bf16 || type_helpers::get_ov_type(array) != dst_dtype) {
tensor_helpers::fill_tensor(tensor, array, dst_dtype);
return tensor;
}
tensor_helpers::fill_tensor(tensor, array);
Contributor

Suggested change
if (dst_dtype == ov::element::bf16 || type_helpers::get_ov_type(array) != dst_dtype) {
tensor_helpers::fill_tensor(tensor, array, dst_dtype);
return tensor;
}
tensor_helpers::fill_tensor(tensor, array);
if (dst_dtype == ov::element::bf16 || type_helpers::get_ov_type(array) != dst_dtype) {
tensor_helpers::fill_tensor(tensor, array, dst_dtype);
} else {
tensor_helpers::fill_tensor(tensor, array);
}

Comment on lines +120 to +125
std::vector<size_t> byte_strides;
std::vector<size_t> element_strides = ov::row_major_strides(s);
for (auto v : element_strides) {
byte_strides.push_back(static_cast<size_t>(v) * sizeof(T));
}
return byte_strides;
Contributor

Suggested change
std::vector<size_t> byte_strides;
std::vector<size_t> element_strides = ov::row_major_strides(s);
for (auto v : element_strides) {
byte_strides.push_back(static_cast<size_t>(v) * sizeof(T));
}
return byte_strides;
auto byte_strides = ov::row_major_strides(s);
for (auto&& stride : byte_strides){
stride *= sizeof(T);
}
return byte_strides;

static_cast<size_t>(c.get_element_type().size()), /* Size of one scalar */
std::string(1, 'H'), /* Python struct-style format descriptor */
static_cast<size_t>(shape.size()), /* Number of dimensions */
std::vector<size_t>{shape.begin(), shape.end()}, /* Buffer dimensions */
Contributor

Why make a copy? ov::Shape can be cast to this kind of vector, since this vector type is the base class of Shape.

result.reserve(array.size());

for(long int i = 0; i < array.size(); i++) {
result.emplace_back(*(static_cast<T*>(const_cast<void*>(array.data())) + i) != 0 ? 1 : 0);
Contributor

Why is const_cast required?
Is array.data() pointing to a bool type?
Is it possible to change this loop to something like:

std::transform(array.begin(), array.end(), std::back_inserter(result), [](const bool v){ return static_cast<char>(v);});

Casting a bool value will return 0 or 1.


template <typename T>
std::vector<T> array_as_vector(py::array& array){
T *ptr = static_cast<T*>(const_cast<void*>(array.data()));
Contributor

Why is const_cast required?
The data is only read, so it can still be const.

py::array array_from_tensor_t(ov::Tensor&& t, py::dtype&& dtype) {
std::vector<T> tmp;
if (t.get_element_type() == ov::element::bf16) {
for (size_t i = 0; i < t.get_size(); i++) {
Contributor

Can this be converted to std::transform?
Get a pointer to the tensor data before the loop and use it:
auto t_ptr = reinterpret_cast<const uint16_t*>(t.data());
instead of taking it every iteration.

Comment on lines +380 to +381
auto element_type = self.get_element_type();
auto shape = self.get_shape();
Contributor

Suggested change
auto element_type = self.get_element_type();
auto shape = self.get_shape();
const auto& element_type = self.get_element_type();
const auto& shape = self.get_shape();

The strides could be calculated like:

auto strides = ov::row_major_strides(s)
// then multiply each stride in strides by element_type.size()

Contributor

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Apr 15, 2024
@jiwaszki jiwaszki closed this Apr 18, 2024
@jiwaszki
Author

Closed in favor of #23771
