Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rdataframe to awkward #1448

Closed
wants to merge 132 commits into from
Closed

rdataframe to awkward #1448

wants to merge 132 commits into from

Conversation

ianna
Copy link
Collaborator

@ianna ianna commented Apr 27, 2022

No description provided.

@ianna ianna marked this pull request as draft April 27, 2022 14:11
@codecov
Copy link

codecov bot commented Apr 27, 2022

@ianna
Copy link
Collaborator Author

ianna commented Apr 29, 2022

@jpivarski - would the following guarantee that the result does not go away? It's in ROOT src/bindings/pyroot_legacy/_rdf_utils.py

import numpy


class ndarray(numpy.ndarray):
    """
    A wrapper class that inherits from numpy.ndarray and allows to attach the
    result pointer of the `Take` action in an `RDataFrame` event loop to the
    collection of values returned by that action. See
    https://docs.scipy.org/doc/numpy/user/basics.subclassing.html for more
    information on subclassing numpy arrays.
    """
    def __new__(cls, numpy_array, result_ptr):
        """
        Dunder method invoked at the creation of an instance of this class. It
        creates a numpy array with an `RResultPtr` as an additional
        attribute.
        """
        obj = numpy.asarray(numpy_array).view(cls)
        obj.result_ptr = result_ptr
        return obj

    def __array_finalize__(self, obj):
        """
        Dunder method that fills in the instance default `result_ptr` value.
        """
        if obj is None: return
        self.result_ptr = getattr(obj, "result_ptr", None)

@jpivarski
Copy link
Member

That class doesn't do anything to attach itself to a RDataSource by itself, but presumably NumPy arrays require special handling because np.ndarray is an extension type. By subclassing it, they've made an equivalent of an np.ndarray that would work around any issues due to np.ndarray not being an ordinary Python type. Our arrays and Lookup are ordinary Python classes.

But you're right that RNumpyDS is going to have exactly the same problem that we do: the data is owned by a NumPy object, so that NumPy object has to be kept in scope as long as the RNumpyDS is. Here's the pointer:

https://github.com/root-project/root/blob/4f9759c604d05a8304bf0b55f5738a479441e151/bindings/pyroot/pythonizations/inc/RNumpyDS.hxx#L60-L63

and maybe that PyObject* is actually an ndarray subclass, rather than NumPy's own np.ndarray, for technical reasons, but what's important for us is that they do hold such an object.

The RNumpyDS does Py_INCREF and Py_DECREF the fPyRVecs with no special handling for the Python GIL:

https://github.com/root-project/root/blob/4f9759c604d05a8304bf0b55f5738a479441e151/bindings/pyroot/pythonizations/inc/RNumpyDS.hxx#L129-L150

There are no other references to fPyRVecs in the RNumpyDS implementation. If this is the right way to do it, then we should do the same. Our RDataSources run under the same conditions as RNumpyDS; we should be doing the same memory management.

If this is right, then I think that means that ROOT functions that might create or delete RNumpyDS instances are holding the Python GIL. That surprises me, since it would prevent speed-ups in parallel processing. Maybe it's okay because the parallel processing only happens in a fork-join pattern in which the RNumpyDS is never created or destroyed?

It's still worth asking As it turns out, @etejedor has already answered my question here: #1446 (comment). I'm just getting the messages out of order. Okay, so it is safe to Py_INCREF and Py_DECREF freely in the constructor and destructor, without any special GIL-handling, because they aren't called from __release_gil__ functions.

@ianna ianna force-pushed the ianna/rdataframe-to-awkward branch from 692b33d to 9e1ac32 Compare May 4, 2022 14:01
@ianna
Copy link
Collaborator Author

ianna commented May 4, 2022

@jpivarski - it looks like the failing pre-commit is unrelated to this PR;

flake8...................................................................Failed
- hook id: flake8
- exit code: 1

src/awkward/config.py:99:5: T201 print found.
src/awkward/_v2/_connect/numba/arrayview.py:16:9: T201 print found.
src/awkward/_v2/_connect/numba/arrayview.py:17:9: T201 print found.
src/awkward/_connect/_numba/arrayview.py:20:9: T201 print found.
src/awkward/_connect/_numba/arrayview.py:21:9: T201 print found.

@jpivarski
Copy link
Member

src/awkward/config.py:99:5: T201 print found.

We should get rid of anything related to "config.py". It's for compiling dependents, which is something we're giving up on on v2.

> ```shell
> src/awkward/_v2/_connect/numba/arrayview.py:16:9: T201 print found.
> src/awkward/_v2/_connect/numba/arrayview.py:17:9: T201 print found.
> src/awkward/_connect/_numba/arrayview.py:20:9: T201 print found.
> src/awkward/_connect/_numba/arrayview.py:21:9: T201 print found.
> ```

There shouldn't be print statements in these files. Remove them.

How did they get there and why is this showing up now? I have no idea.

@ianna
Copy link
Collaborator Author

ianna commented May 4, 2022

```shell
src/awkward/config.py:99:5: T201 print found.

We should get rid of anything related to "config.py". It's for compiling dependents, which is something we're giving up on on v2.

src/awkward/_v2/_connect/numba/arrayview.py:16:9: T201 print found.
src/awkward/_v2/_connect/numba/arrayview.py:17:9: T201 print found.
src/awkward/_connect/_numba/arrayview.py:20:9: T201 print found.
src/awkward/_connect/_numba/arrayview.py:21:9: T201 print found.

There shouldn't be print statements in these files. Remove them.

How did they get there and why is this showing up now? I have no idea.

I think, print here is not ignored anymore:

def code_to_function(code, function_name, externals=None, debug=False):
    if debug:
        print("################### " + function_name)  # noqa: T001
        print(code)  # noqa: T001

@jpivarski
Copy link
Member

Apparently, what happened is that the number changed from T001 to T201. I would have thought that those numbers wouldn't change—why else would they be short codes like that? Oh well. Since this might be run under different versions of flake8-print, it's probably safest to ignore both:

# noqa: T001,T201

I don't know what the new "T001" is, but we're only losing the protection of this test on a few isolated lines, so it's probably fine.

@ianna ianna mentioned this pull request May 17, 2022
@ianna
Copy link
Collaborator Author

ianna commented May 17, 2022

replaced by #1474

@ianna ianna closed this May 17, 2022
@jpivarski jpivarski deleted the ianna/rdataframe-to-awkward branch September 23, 2022 00:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants