Add TVM backend #232
To add some documentation on what is happening: we had a first version of TVM integrated. For tree models only GEMM worked; afterwards we saw that there was a typing error. This PR fixes that, and the related issue explains the problem in detail. Now we have GEMM and the two tree traversal strategies working on TVM.
One thing I noticed is that models coming from Hummingbird are extremely slow to compile on TVM. The following test with data size (20, 10) doesn't finish compiling in 30 min on my laptop. The Relay model doesn't look complicated, and for standard ImageNet models compilation doesn't take more than 1 min. I have a feeling that there is an infinite loop going on inside TVM codegen. Do you have a similar observation?
I was actually testing it myself this morning and I had trouble with the LightGBM notebook (it crashed the kernel; I didn't have the chance to debug further, but I guess it's the same problem). The blog notebook worked well. I will try with your test and let you know. Any idea on what is going on?
I have no idea, but since conversion to Relay is instant, this is not a frontend problem but something deep inside codegen. It's likely that workloads from Hummingbird are novel ones that TVM hasn't seen. It would be really interesting if debugging this problem uncovers a bug in TVM codegen.
On my Mac it actually worked. It took a few minutes (probably 3 or 4).
Interesting, but still, taking a few minutes for such a simple model doesn't seem right. I've seen a report saying that compiling Mask RCNN takes 20 min: https://discuss.tvm.ai/t/vm-slow-compilation-of-tf-object-detection-models/7479/8 I'm assuming making the dataset bigger also makes the trees, and hence the converted models, bigger? I wonder how long it would take to compile for a real-world dataset. (I read your OSDI paper yesterday and it says TVM compilation is slow.)
Yes, for some models on GPU I think we saw something in the ballpark of 40 minutes. That was with a GBDT with 500 trees on a dataset with a few hundred features. In the paper, however, we used custom Relay implementations (not the PyTorch frontend, which should make no difference), a few custom operators, and TVM 0.6. Our goal for the moment is to replicate the same speedups we saw with the custom models using your frontend, and fix the compilation time while we do this :)
Now I am having the same problem. I increased the number of trees to 10 and the input dataset to 2000x28 and it is still compiling after 10 minutes. I tried with the dataset we were using in the lgbm notebook (200k records) and compilation throws a segfault.
Ok, still running after 1 hour. My setting:
@masahi do you think we can have someone looking into this? It gets unmanageable quite fast; this was not happening with our internal Relay models on 0.6.
We can open an issue there once we understand what is going on inside codegen. I have no idea yet, but I know where to look. If this is something I can fix, I'll do it. One thing worth checking now is comparing the output of the PyTorch frontend and the hand-written one you have.
Getting that version will take a while, but I will try. I will keep you posted once I get there; please share if you find anything in the meantime.
I did some digging. It seems the problem is due to the very aggressive operator fusion TVM does and the very large inlined expressions that result. First, this is the Relay graph for a dataset of size (5, 10).
This is the same graph after operator fusion. Pay particular attention to the first fused op, named %7, which has gather, where, take, etc. This fused op has 7 ops inside it. If we increase the size of the dataset, we of course end up with more operators in the graph, but all of these added ops will be fused into this first fused function. A concrete example follows below.
This is the low-level TVM IR for the first fused function above with 7 operators in it. Gather, where, add, reshape, take, etc. are fused into a one-liner expression. It is not complicated, and compilation is instant.
Now, if we increase the size of the dataset to (6, 10), we get this fused graph. Note that the first fused function now contains more than 20 ops.
This is the corresponding low-level IR for the first fused function with 20 ops. The 20 ops are inlined into a one-liner expression, and it is huge.
If the dataset size is (20, 10), I get a fused function with 70-90 ops inside. The resulting one-liner expression is so huge it takes forever to compile. What I don't understand yet is why TVM generates such a huge, messy expression from your subgraph. My guess is that it is due to the complex indexing logic that is unique to your use case. Usually the operations involved in operator fusion are simple elementwise operations whose indexing logic is trivial, so the aggressive op fusion done by TVM is always beneficial.
This is really interesting, thanks for digging into it! I have a few questions:
I opened a thread in their forum to discuss this issue https://discuss.tvm.ai/t/aggressive-operator-fusion-and-its-consequence-of-huge-inlined-tir-expression/7687
I'm pretty sure the op fusion logic hasn't changed between 0.6 and 0.7. But I do know that there has been a lot of effort around refactoring their lower-level IR. I'm not sure if that change is related to this problem.
Yes. If there is an op which is foreign to TVM, TVM will treat it as an "opaque" op and break the fusion there.
I think disabling op fusion is not possible because later passes depend on op fusion being applied. It is certainly possible to make the maximum size of fused functions configurable; we don't have that at the moment. Actually there is a hardcoded constant https://github.com/apache/incubator-tvm/blob/master/src/relay/transforms/fuse_ops.cc#L82 that limits the size of fused functions. I tried changing it to something like 10 but somehow it didn't change anything. I'll look more into this later today.
Thanks for opening the thread on the forum! Looking forward to a possible generic solution (e.g., without having to resort to custom ops). For the models where compilation works, the numbers look on par with what we saw using the custom model implementation, so this is exciting!
According to https://discuss.tvm.ai/t/aggressive-operator-fusion-and-its-consequence-of-huge-inlined-tir-expression/7687/2, the reason for the code blow-up is that indexing operations get duplicated at the lower IR level. I think this makes a lot of sense, although it's not obvious to me exactly where the duplication is happening. Do you have an idea where this could happen? If the graph has a diamond-like pattern (i.e. a node with multiple incoming edges going back to a common ancestor), this could certainly happen.
Ok, I think that diamond pattern appears because the model actually iterates over the trees (all batched together) in a level-by-level fashion, so indices computed at the i-th level are used at the (i+1)-th level (code here). Is there a way we can tell TVM to fuse all ops within a level, but not across levels? That is, do fusion within each for-loop iteration but not across iterations (I think that's why we are seeing this diamond pattern).
No, because at the Relay IR level the loop is unrolled, so TVM cannot find a loop boundary. Tracing by Torch already does the unrolling (TorchScript cannot generate JIT modules with for loops). What we can do immediately to unblock your development is to make one of the ops non-fusable; you can do that by changing its kBroadcast pattern to one that breaks fusion. With this change, I can compile for a dataset of (500, 10) on my laptop. With (1000, 10) I got a segfault.
Is the segfault due to the same problem? Ok, I will do that and see what this unblocks.
@interesaaat I think the reason why we didn't hit this issue in the prototype system is the use of a custom WhereOp, which broke the fusion at every tree depth level. Back then there was no support for Where in Relay. From some previous experiments, I remember operator fusion was the main reason for low runtimes in TVM compared to TorchScript. When I disabled optimizations in TVM it was on par with TorchScript, which is expected. But operator fusion only at the tree-depth level gave some good performance. Across levels, we anyway need to bring in new weights and indices tensors from memory. In the ideal case, if TVM supports a configurable fusion depth, we can have heuristic values for the tree translation that take into account the shape of the trees.
The configurable fuse depth is supported in apache/tvm#6327. With this change there is no need to change the op pattern. But note that, due to the way fusion works in TVM, the resulting size of fused subgraphs is not "tight" with respect to the max fuse depth, in the sense that if the max depth is 20 and in the middle of the fusion pass there are two consecutive subgraphs of size 11 each, they will remain as size-11 subgraphs. We won't move nodes from one subgraph to another to make one of them have the maximum possible number of nodes. So you should treat the max fuse depth as a conservative bound. To enable this config, you can do something like the sketch below.
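A minimal sketch, assuming the option added by apache/tvm#6327 is exposed as the `relay.FuseOps.max_depth` PassContext config key (treat the exact key and values as assumptions, and `mod`/`params` as coming from the frontend conversion discussed in this thread):

```python
import tvm
from tvm import relay

# Cap the maximum number of ops fused into a single function.
with tvm.transform.PassContext(opt_level=3,
                               config={"relay.FuseOps.max_depth": 20}):
    lib = relay.build(mod, target="llvm", params=params)
```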
To dump the graph after fusion, you can do something like the following.
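A sketch of one way to inspect the fused Relay graph; the pass names are real, but running them manually like this (outside the normal build pipeline) is just an illustration:

```python
import tvm
from tvm import relay

# Run type inference followed by operator fusion on the Relay module produced
# by the PyTorch frontend, then print the fused module.
seq = tvm.transform.Sequential([
    relay.transform.InferType(),
    relay.transform.FuseOps(fuse_opt_level=3),
])
with tvm.transform.PassContext(opt_level=3):
    fused_mod = seq(mod)
print(fused_mod)
```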
@interesaaat It seems the reason I got the segfault with the big dataset is that there is a very large concatenation of parameter tensors at the beginning of the graph. For example, for the dataset of size (10, 10), there is this function that can be evaluated at compile time (since all param tensors are constant):
The %0 above is a tuple of 10 (1, 50) tensors; all the function above does is create a (10, 50) tensor by concatenating the tuple of tensors. For a dataset of size (1000, 10), the tuple contains 1000 (1, 50) tensors, and TVM segfaults during evaluation of the concatenate. Do you know where this concatenation of param tensors happens? Since this is completely a compile-time operation, it's better to do the concat in Torch and pass the concatenated tensor to TVM.
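A minimal sketch of that idea (a hypothetical module, not Hummingbird's actual converter code):

```python
import torch

class TreeParams(torch.nn.Module):
    # Hypothetical container for per-tree parameter tensors of shape (1, 50).
    def __init__(self, per_tree_tensors):
        super().__init__()
        # Concatenate once at construction time, so the traced graph (and TVM)
        # sees a single (num_trees, 50) constant instead of a large tuple that
        # has to be concatenated at compile time.
        self.register_buffer("all_params", torch.cat(per_tree_tensors, dim=0))

    def forward(self, node_idx):
        # Illustrative lookup into the pre-concatenated parameter matrix;
        # node_idx is a LongTensor of row indices.
        return torch.index_select(self.all_params, 0, node_idx)
```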
@masahi I think it is coming from here: https://github.com/microsoft/hummingbird/blob/master/hummingbird/ml/operator_converters/_tree_implementations.py#L224 We rely on the PyTorch input to get the batch size, as the TorchScript runtime can support any batch size.
Thanks @masahi! I am going to try with your PR and see what we get. From your comments, I see that this pretty much solves the problem now, and we only need to fix the batch size as @scnakandala suggested (I will send a fix shortly). Question: do you think we can fix the diamond pattern issue? Setting the limits is good to unblock us, but it is suboptimal. I am more than happy to help if needed.
Ok, I have pushed a commit fixing the max depth and the batch size, and now TVM works, even with large datasets :)
Do you mean you want to fuse only the inner loop (one tree level)? That would create many smaller functions, so compile time is certainly faster, but the runtime perf might be slower than having big functions whose size is bounded by the max depth param. On the other hand, big fused functions generated by TVM would contain many duplicated expressions, so the performance relies on CSE being done by LLVM or NVCC.
Ok so bottom line, this is the best we can do.
Yes, I think so. We should certainly benchmark the runtime performance with different fuse depths, to see how much fusion helps and whether CSE by LLVM and NVCC is done as expected.
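A rough sketch of such a sweep, reusing the assumed `relay.FuseOps.max_depth` config key from above; `mod`, `params`, `data`, and the input name are assumptions, and the depth values are arbitrary (in older TVM versions `graph_executor` was called `graph_runtime`):

```python
import time
import tvm
from tvm import relay
from tvm.contrib import graph_executor

for depth in [10, 20, 50, 100]:
    start = time.time()
    with tvm.transform.PassContext(opt_level=3,
                                   config={"relay.FuseOps.max_depth": depth}):
        lib = relay.build(mod, target="llvm", params=params)
    compile_s = time.time() - start

    dev = tvm.cpu()
    rt = graph_executor.GraphModule(lib["default"](dev))
    rt.set_input("input0", data)  # `data`: input array with the fixed batch size
    run_ms = rt.module.time_evaluator("run", dev, number=10)().mean * 1e3
    print(f"max_depth={depth}: compile {compile_s:.1f}s, run {run_ms:.2f}ms")
```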
Once I finish covering the other operators I can work on the benchmark. One question: to remove warnings such as
Same 1.7s.
It would be interesting to see the difference between the custom op and the open source version in more detail. If they are indeed equivalent, the lowered IR should also be equivalent. We need to debug; I think such a large perf delta shouldn't exist. To me it is also a bit surprising to see prediction on 10000 rows taking more than 1 second.
That's for 500 trees of depth 8. Regular
I want to take a look at this if you are open sourcing this stuff. If you like, I'd start by looking at the lowered IR by uncommenting this line.
Ok trying now. Open sourcing the code can be a bit problematic because I need to take a cut of the whole code and most likely the cut won't work end to end. Is this the expected output? I can do the same for the custom code.
Hmm, isn't that for the custom op version? For the open source one we should have more ops fused into one function, so the lowered IR would be more complicated. For the custom op version the IR makes sense.
Sorry, I didn't paste the whole output. This is the whole output.
Ok I finally managed to get it working with tvm 0.6 and custom ops.
Interesting. One thing that is immediately obvious is that there are a lot of memory buffers (the placeholders). Are you sure the input PyTorch modules are the same? Or does having a custom op allow representing parameters more compactly? Please try comparing the number and size of the parameter tensors passed to TVM.
Also, if possible, it would be interesting to run your custom op version with the current TVM. As you can see from how the lowered IR looks in the two versions, there has been a major change in how the lower-level IR is represented since v0.6. We should make sure the perf regression comes solely from the custom op and not from internal changes in the lower-level IR.
Do you think the placeholders are there because of the reshape ops that copy all the time? There are no reshapes in the custom ops. I will start with your first suggestion, since I already tried to switch to 0.7 months back and failed miserably. I am quite sure that the params are the same.
No, since reshape ops are fused into one memory movement operation. I'm pretty sure they come from the tree parameters. In the Relay graph of the custom version, there are also many parameters represented as
The two
and
I don't see any major difference besides the fact that we were previously casting by
This is a print for
I realized that, since in the custom version you are not converting from PyTorch, comparing the two PyTorch codes doesn't tell us much. What I want to see is the
I get this
Also, I want to see the Relay graph after fusion. You can dump that by
In the current HB, there should be one big function, something like this:
This is the function that gets translated to the lowered IR function with the many placeholders we discussed above. Note that this function takes as parameters tensors of different sizes (like p02, p12, etc.). These parameters become placeholders in the lowered IR, so the output from the current HB makes sense to me. Since I don't know why the lowered IR of the custom version doesn't have many placeholders, I want to see what the corresponding Relay function looks like before lowering (for example, how many parameters it has).
Does the |
Ah right, for the current HB they make sense. Yes, I want to see the params for the custom version. I don't need the values of the params; just the shapes are enough.
Yes, that was quite verbose :) I am trying to print it for the other case as well, but it takes a while... Question: do you want to have a call and go over this live? Maybe we can find a solution with fewer iterations :)
:) ok I'll send an email to you (this thread is already too long) |
This is the print of
I don't think there is an easy solution for the |
I don't think having a custom PyTorch op would help. We can at least remove the last reshape here (it doesn't do anything): https://github.com/microsoft/hummingbird/blob/master/hummingbird/ml/operator_converters/_tree_implementations.py#L328 If we need to keep the indices around as 1D, can we do the indexing directly on the flattened tensors? Also, do the indices need to be 64-bit? If 32-bit is enough, that would cut bandwidth in half, and I think that would be a good deal. Does the custom one also use 64-bit indices? From the parameters you posted above, they seem to be 32-bit.
Yeah, I also saw that one. Will remove.
I see. Let me try to make everything a 1D array. Hopefully it will work.
I think the indices are supposed to be Long in PyTorch.
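For reference, a minimal sketch of the flattened (1D) indexing idea being discussed; all names and shapes here are made up for illustration, not Hummingbird's actual tensors:

```python
import torch

# Hypothetical shapes: `table` holds per-tree values, `idx` holds per-tree
# node indices for a batch of inputs.
num_trees, num_nodes, batch = 4, 7, 3
table = torch.randn(num_trees, num_nodes)
idx = torch.randint(num_nodes, (batch, num_trees))   # int64 by default

# Offset each tree's indices into a single flat table, then do one 1D
# index_select instead of a 2D gather followed by a reshape.
offsets = torch.arange(num_trees) * num_nodes         # (num_trees,)
flat_idx = (idx + offsets).reshape(-1)                 # (batch * num_trees,)
out = torch.index_select(table.reshape(-1), 0, flat_idx).reshape(batch, num_trees)
```

Note that PyTorch still requires the index tensor to be int64 (Long), which is why dropping to 32-bit indices isn't straightforward on the Torch side.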
Unfortunately there is no solution to this because we cannot |
@masahi We tried to get rid of the reshape. Unfortunately, since we are using
This version is actually about 2x slower than the previous one. Even without the reshape I don't see any major operator fusion happening.
Actually, is it because of the extra gathers? Although the original version without
Yes, exactly. The new version has more gathers and that doesn't seem to help performance. The segfault is likely due to the same issue we hit previously: you need to fix the batch size at compile time and pass the concatenated parameters to TVM. In general, I recommend setting the fuse depth to something like 50, even if you can compile without it. It seems too much fusing hurts performance (for a similar reason that too much loop unrolling hurts).
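Putting the recommendations in this thread together, a sketch of an end-to-end build with a fixed batch size and a capped fuse depth might look like this. The model and input names are assumptions, `relay.FuseOps.max_depth` is the config key assumed above, and in older TVM versions the runtime module was called `graph_runtime`:

```python
import torch
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# `torch_model` is assumed to be the Hummingbird-generated PyTorch model.
batch_size, num_features = 10000, 28               # fixed at compile time
example = torch.randn(batch_size, num_features)
traced = torch.jit.trace(torch_model, example)

# Convert the traced module to Relay with a fixed input shape.
mod, params = relay.frontend.from_pytorch(traced, [("input0", example.shape)])

with tvm.transform.PassContext(opt_level=3,
                               config={"relay.FuseOps.max_depth": 50}):
    lib = relay.build(mod, target="llvm", params=params)

dev = tvm.cpu()
rt = graph_executor.GraphModule(lib["default"](dev))
rt.set_input("input0", tvm.nd.array(example.numpy()))
rt.run()
prediction = rt.get_output(0)
```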
closed with #236 |