
ICudaEngine::getTensorShape returns a model output dimension that contains -1 #4120

Open
demuxin opened this issue Sep 11, 2024 · 34 comments
Labels
triaged Issue has been triaged by maintainers

Comments

@demuxin

demuxin commented Sep 11, 2024

Description

After building the model with TensorRT, I get an ICudaEngine object and query the model's output dimensions with ICudaEngine::getTensorShape.

Sometimes the output dimensions contain -1, and sometimes they contain the correct concrete values instead.

What determines the value of the output dimensions, and under what circumstances does a -1 appear in them?

This is my onnx model info:

$ polygraphy inspect model codetr_op17_sim.onnx
[I] ==== ONNX Model ====
    Name: torch_jit | ONNX Opset: 17
    
    ---- 1 Graph Input(s) ----
    {input [dtype=float32, shape=(1, 3, 832, 1440)]}
    
    ---- 1 Graph Output(s) ----
    {output [dtype=float32, shape=(1, 1000, 6)]}
    
    ---- 881 Initializer(s) ----
    
    ---- 3581 Node(s) ----

But getTensorShape returns an output shape of [1, -1, 6].

Environment

TensorRT Version: 10.3

NVIDIA GPU: RTX 3090

NVIDIA Driver Version: 535.183.01

CUDA Version: 12.2

Operating System: Ubuntu 22.04

PyTorch Version (if applicable): 1.13.0

@moraxu
Collaborator

moraxu commented Sep 11, 2024

The presence of -1 in the output dimension indicates a dynamic shape.
The actual dimension value is determined at inference time, based on the input data and model operations.
If you query the shape before inference or without input bindings, -1 will remain because TRT cannot infer the actual value without the input data.

In your specific case, to resolve the dynamic shape, try to run inference with actual input data and then query the output shape again.

@moraxu added the triaged and Topic: Dynamic Shape labels on Sep 11, 2024
@demuxin
Author

demuxin commented Sep 12, 2024

In your specific case, to resolve the dynamic shape, try to run inference with actual input data and then query the output shape again.

TensorRT inference needs to be given the addresses of the input and output buffers.

If I don't know the dimensions of the output data, how can I allocate memory for the output buffer?

In other words, I have to know the output dimensions before I run inference.

Could you please show a specific usage example?

@moraxu
Collaborator

moraxu commented Sep 12, 2024

How about this example? (EDIT: Sorry, the "try to run inference with actual input data" might have been confusing without an example)

// Create execution context
IExecutionContext* context = engine->createExecutionContext();

// Set dynamic input dimensions
Dims inputDims = Dims4(1, 3, 832, 1440); // Example input dimensions
context->setBindingDimensions(inputIndex, inputDims);

// Get output dimensions after setting input dimensions
Dims outputDims = context->getBindingDimensions(outputIndex);

// Allocate device memory for input and output buffers based on the resolved dimensions
size_t inputSize = volume(inputDims) * sizeof(float);
size_t outputSize = volume(outputDims) * sizeof(float);
void* inputBuffer = nullptr;
void* outputBuffer = nullptr;
cudaMalloc(&inputBuffer, inputSize);
cudaMalloc(&outputBuffer, outputSize);

// Set up buffer pointers and perform inference
void* buffers[2] = {inputBuffer, outputBuffer};
context->enqueueV2(buffers, stream, nullptr);

// Output results are now in outputBuffer

@demuxin
Author

demuxin commented Sep 12, 2024

The getBindingDimensions API was removed in TensorRT 10.3 and replaced with getTensorShape.

Going back to the original problem: getTensorShape reports an output dimension containing -1.

How can I get the true output dimensions of the model?

@demuxin
Author

demuxin commented Sep 12, 2024

I found that I don't have this problem with ONNX models exported using torch 1.12 and opset 16,

but I do have it with ONNX models exported using torch 1.13 and opset 17.

@moraxu
Collaborator

moraxu commented Sep 12, 2024

Are you able to get the true output dimensions of the model after this sequence of calls?

Dims inputDims = Dims4(1, 3, 832, 1440); // or whatever your actual input dimensions are
context->setInputShape("input", inputDims);          // TRT 10 APIs take tensor names
Dims outputDims = context->getTensorShape("output");

If not, please share the original ONNX model here and I'll file an internal bug for someone to take over.

@demuxin
Author

demuxin commented Sep 12, 2024

I can't get the true output dimensions of the model using your sequence of calls.

This is the model link:
https://drive.google.com/file/d/17IrTEbNLAmq1J2Ax_2QJdR26KdrAEuBj/view?usp=sharing

@demuxin
Author

demuxin commented Sep 17, 2024

Hi @moraxu, has there been any progress on this issue?

@moraxu
Collaborator

moraxu commented Sep 17, 2024

Apologies for the late reply; there were company holidays last week. TRT may also need certain optimizations enabled to fully resolve dynamic shapes; you can enable them by using optimization profiles when building the engine.

For example, when I ran:

polygraphy run codetr_op17_sim.onnx --trt --input-shapes input:[1,3,832,1440]
[I] RUNNING | Command: /home/mguzek/.local/bin/polygraphy run codetr_op17_sim.onnx --trt --input-shapes input:[1,3,832,1440]
[I] Will generate inference input data according to provided TensorMetadata: {input [shape=(1, 3, 832, 1440)]}
[I] trt-runner-N0-09/17/24-11:49:51     | Activating and starting inference
[I] Configuring with profiles:[
        Profile 0:
            {input [min=[1, 3, 832, 1440], opt=[1, 3, 832, 1440], max=[1, 3, 832, 1440]]}
    ]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
    Flags                  | []
    Engine Capability      | EngineCapability.STANDARD
    Memory Pools           | [WORKSPACE: 20082.12 MiB, TACTIC_DRAM: 20082.12 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
    Tactic Sources         | [EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [PROFILE_SHARING_0806]
[I] Finished engine building in 62.939 seconds
[I] trt-runner-N0-09/17/24-11:49:51    
    ---- Inference Input(s) ----
    {input [dtype=float32, shape=(1, 3, 832, 1440)]}
[I] trt-runner-N0-09/17/24-11:49:51    
    ---- Inference Output(s) ----
    {output [dtype=float32, shape=(1, 1000, 6)]}
[I] trt-runner-N0-09/17/24-11:49:51     | Completed 1 iteration(s) in 643.8 ms | Average inference time: 643.8 ms.
[I] PASSED | Runtime: 67.502s | Command: /home/mguzek/.local/bin/polygraphy run codetr_op17_sim.onnx --trt --input-shapes input:[1,3,832,1440]

It seems the output shapes were resolved, so maybe try using those specific profiles:

        Profile 0:
            {input [min=[1, 3, 832, 1440], opt=[1, 3, 832, 1440], max=[1, 3, 832, 1440]]}

by passing them to the config.

@demuxin
Author

demuxin commented Sep 18, 2024

I also have no problem using the polygraphy command.

I only get this problem when using the TensorRT C++ API. Have you tried the TensorRT C++ API?

@moraxu
Collaborator

moraxu commented Sep 19, 2024

Sorry, I had to do more digging because I wasn't familiar with that part of the codebase. To unblock you for now, you can use getMaxOutputSize to query the maximum possible size (but it can be huge in certain cases).

To query the exact size, one would have to use a class implementing nvinfer1::IOutputAllocator, as in this sample code https://github.com/NVIDIA/TensorRT/blob/release/10.4/samples/common/sampleDevice.h#L483, but the actual example of how to populate the dims via notifyShape() is too tightly coupled with our internal code. I have to ask another colleague about this; I will get back to you.
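In the meantime, a minimal sketch of the getMaxOutputSize route might look like the following (assuming TensorRT 10's name-based tensor APIs and enqueueV3; the function and variable names are placeholders and error handling is omitted):

#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Pre-allocate the data-dependent output with the engine-derived upper bound, then run inference.
void runWithUpperBoundOutput(nvinfer1::IExecutionContext* context, cudaStream_t stream,
                             char const* inputName, void* inputDevicePtr,
                             char const* outputName)
{
    // Upper bound, in bytes, for the data-dependent output.
    int64_t const maxBytes = context->getMaxOutputSize(outputName);

    void* outputDevicePtr = nullptr;
    cudaMalloc(&outputDevicePtr, static_cast<size_t>(maxBytes));

    context->setTensorAddress(inputName, inputDevicePtr);
    context->setTensorAddress(outputName, outputDevicePtr);

    context->enqueueV3(stream);
    cudaStreamSynchronize(stream);

    // outputDevicePtr now holds the output data (at most maxBytes of it); the exact
    // shape still has to come from an IOutputAllocator::notifyShape callback.
    cudaFree(outputDevicePtr);
}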

@demuxin
Author

demuxin commented Sep 19, 2024

Thank you for taking this so seriously.

@moraxu
Collaborator

moraxu commented Sep 20, 2024

@demuxin, I believe for now you will have to use getMaxOutputSize. If the memory allocation turns out to be too large, let me know and I can CC someone else here.

@demuxin
Author

demuxin commented Sep 21, 2024

The value I get via getMaxOutputSize() is 24064, but this value is a bit strange.

The output dimension is [1, 1000, 6], so the number of bytes should be either 1000 * 6 * sizeof(float) = 24000 or 1000 * 6 * sizeof(half) = 12000.

@moraxu
Collaborator

moraxu commented Sep 21, 2024

Can you paste your full standalone C++ snippet, or at least the part where you build the engine and query getMaxOutputSize()? I'll then file an internal bug for someone to take a look at.

@demuxin
Author

demuxin commented Sep 21, 2024

Our code is fairly heavily encapsulated, so I've extracted the key parts. The excerpt below can't run on its own, but that shouldn't matter much.

{
    std::shared_ptr<nvinfer1::IBuilder> builder(nvinfer1::createInferBuilder(gLogger), destroyNV<nvinfer1::IBuilder>);
    std::shared_ptr<nvinfer1::IBuilderConfig> config(builder->createBuilderConfig(), destroyNV<nvinfer1::IBuilderConfig>);
    std::shared_ptr<nvinfer1::INetworkDefinition> network;
    network = std::shared_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(flag), destroyNV<nvinfer1::INetworkDefinition>);

    std::shared_ptr<nvonnxparser::IParser> onnxParser;
    onnxParser.reset(nvonnxparser::createParser(*network, gLogger), destroyNV<nvonnxparser::IParser>);

    onnxParser->parse(onnxmodel, model_size);
    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1LU << 31);

    std::shared_ptr<nvinfer1::IHostMemory> seridata(builder->buildSerializedNetwork(*network, *config), destroyNV<nvinfer1::IHostMemory>);

    std::shared_ptr<nvinfer1::IRuntime> runtime_ = std::shared_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(gLogger), destroyNV<nvinfer1::IRuntime>);
    std::shared_ptr<nvinfer1::ICudaEngine> engine_ = std::shared_ptr<nvinfer1::ICudaEngine>(runtime_->deserializeCudaEngine(seridata->data(), seridata->size()), destroyNV<nvinfer1::ICudaEngine>);
    std::shared_ptr<nvinfer1::IExecutionContext> context_ = std::shared_ptr<nvinfer1::IExecutionContext>(engine_->createExecutionContext(), destroyNV<nvinfer1::IExecutionContext>);

    int output_tensor_id = 1;
    const char* tensor_name = engine_->getIOTensorName(output_tensor_id);

    printf("--------->>>>>> %ld\n", context_->getMaxOutputSize(tensor_name));

    auto dims = engine_->getTensorShape(tensor_name);

    for (int j = 0; j < dims.nbDims; ++j) {
        printf("%d ", dims.d[j]);  // output: [1, -1, 6]
    }
}

@moraxu
Collaborator

moraxu commented Sep 23, 2024

@demuxin I also got 24064 bytes on the output but just realized that the output type is float32 and so 1000 * 6 * 4 (sizeof(float32)) = 24000 as expected?

@demuxin
Author

demuxin commented Sep 24, 2024

Yes, but when I set half precision, I still get 24064 via getMaxOutputSize().

@moraxu
Collaborator

moraxu commented Sep 24, 2024

This is an upper bound like I said, so feel free to divide it by 2 when using half precision.

@demuxin
Author

demuxin commented Sep 25, 2024

OK. Do you have any ideas about the -1 dimension problem?

@moraxu
Collaborator

moraxu commented Sep 25, 2024

The "-1" in the output dimensions is caused by the NonZero operator in your ONNX model, whose output shape depends on the actual data passed into the model during inference. Operators like this produce variable-sized outputs because, for instance, the number of non-zero elements changes with the input data.

For such cases it is simply not possible to provide a fixed output shape at build time, since the shape is only determined during execution. That is why you are seeing dynamic shapes (-1) in the output shape: the dimension can only be resolved after the input data is known.
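As a plain C++ illustration (not TensorRT code) of why such an operator cannot have a fixed build-time shape: two inputs with identical shapes can produce NonZero outputs of different lengths.

#include <cstdio>
#include <vector>

// The number of non-zero elements (the length of NonZero's output) depends on the
// values in the tensor, not on its shape, so it cannot be known at engine build time.
int countNonZero(std::vector<float> const& v)
{
    int n = 0;
    for (float x : v)
        n += (x != 0.0f);
    return n;
}

int main()
{
    std::printf("%d\n", countNonZero({0.0f, 5.0f, 0.0f, 3.0f})); // prints 2
    std::printf("%d\n", countNonZero({1.0f, 1.0f, 1.0f, 1.0f})); // prints 4, same shape as above
    return 0;
}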

@moraxu
Collaborator

moraxu commented Sep 25, 2024

Please simply use getMaxOutputSize in this case.

@demuxin
Author

demuxin commented Sep 25, 2024

Thank you for your work. But the output of getMaxOutputSize is 24064, and it is difficult to derive the specific dimension corresponding to -1 from 24064. Can you explain where the value 24064 comes from?

@demuxin
Author

demuxin commented Sep 25, 2024

Can I solve this problem by editing the output dimensions of the ONNX model?

@moraxu
Collaborator

moraxu commented Sep 25, 2024

Can you explain where the value 24064 comes from?

getMaxOutputSize provides the maximum required size for the output buffer in bytes, accounting for dynamic shapes and ensuring the buffer is large enough for the largest possible output.

It is difficult to derive the specific dimension corresponding to -1 from 24064.

OK, I'll tag my colleague here who maintains the part of the codebase that handles shape inference at runtime. He's currently on a short sick leave; apologies for the inconvenience.

Can I solve this problem by editing the output dimensions of the ONNX model?

No. It's just that your ONNX model exhibits data-dependent shapes because of the NonZero operator in it.

@demuxin
Author

demuxin commented Sep 25, 2024

I'll tag my colleague here who maintains the part of the codebase that handles shape inference at runtime.

Thank you in advance for your response.

@jhalakpatel

jhalakpatel commented Sep 29, 2024

Thanks @demuxin for your patience, and thanks @moraxu for helping out.

There are 3 possible categories of output shapes:

  1. Static shapes: Output shapes are statically inferred at compile time (i.e. engine build time) since input shapes are statically known.
  2. Dynamic shapes: Output shapes are resolved at runtime when runtime input shapes are provided (using setInputShape API for inputs)
  3. Data-dependent shapes: Output shapes are resolved during engine execution - THIS IS THE CURRENT CASE.

Data-dependent shapes are unknown until a layer has executed, because the output shape depends on the input data rather than the input shape. Since the input data is not known until a certain point in execution, the shape cannot be reported earlier.

There are two approaches:

  1. Deferred output allocation: implement the nvinfer1::IOutputAllocator interface.
    https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#dynamic-shaped-output should guide you through how to implement it.

To answer the specifics:
a) You can skip calling setTensorAddress for the output with a -1 shape.
b) Register an IOutputAllocator instance for that output.
c) Implement the reallocateOutput callback to perform the actual allocation; your application can store the allocated pointer for later use.
d) Implement the notifyShape callback; it is invoked with the exact output shape, which your application can store for later use.

API doc: https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1v__1__0_1_1_i_output_allocator.html

  2. Pre-allocate memory: since the exact output shape cannot be known until execution, TensorRT derives an upper bound for such shapes internally. As @moraxu suggested, this can sometimes result in very impractical bounds. This is good for quick experimentation IMO, but you should rely on the first approach for final deployment.

Also, the reason the max output size is 24064 rather than 24000 is the GPU memory granularity of 512 bytes: 24000 rounded up to a multiple of 512 is 47 * 512 = 24064, so 24064 % 512 == 0.

CUASSERT(cudaDeviceGetAttribute(&textureAlignment, cudaDevAttrTextureAlignment, device));

You can read more about cudaDevAttrTextureAlignment here: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html.
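As a quick, purely illustrative check of that rounding:

#include <cstdint>
#include <cstdio>

// Round n up to the next multiple of alignment.
constexpr uint64_t roundUp(uint64_t n, uint64_t alignment)
{
    return (n + alignment - 1) / alignment * alignment;
}

int main()
{
    // 1 * 1000 * 6 float32 elements = 24000 bytes, rounded up to 512-byte granularity.
    std::printf("%llu\n", (unsigned long long) roundUp(1000 * 6 * 4, 512)); // prints 24064
    return 0;
}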

Let us know if this resolves issues at your end. Thanks again @moraxu for your help!

@demuxin
Author

demuxin commented Sep 30, 2024

Thanks for your detailed answer, I'll try your method when my vacation is over.

@demuxin
Author

demuxin commented Oct 8, 2024

Hi @jhalakpatel, I read the dev guide and I don't understand how the allocatorMap is being used.

std::unordered_map<std::string, MyOutputAllocator> allocatorMap;

for (const char* name : names of outputs)
{
    Dims extent = context->getTensorShape(name);
    void* ptr;
    if (engine->getTensorLocation(name) == TensorLocation::kDEVICE)
    {
        if (extent.d contains a -1)
        {
            auto allocator = std::make_unique<MyOutputAllocator>();
            context->setOutputAllocator(name, allocator.get());
            allocatorMap.emplace(name, std::move(allocator));
        }
        else
        {
            ptr = allocate device memory per extent and format
        }
    }
    else
    {
        ptr = allocate cpu memory per extent and format
    }
    context->setTensorAddress(name, ptr);
}

Then the reallocateOutput function needs to be passed a size parameter. Who calls this function?

If it's me, how do I know what the value of size should be, and if it's TensorRT, what determines the value of size?

@jhalakpatel

jhalakpatel commented Oct 15, 2024

@demuxin Yes, TensorRT computes the size parameter during inference. Your implementation of reallocateOutput could inspect the size parameter and the current memory pointer to perform a reallocation if required.

Your implementation would look like this:

void* OutputAllocator::reallocateOutputAsync(
    char const* tensorName, void* currentMemory, uint64_t size,
    uint64_t alignment, cudaStream_t stream) noexcept {

  assert(currentMemory == mCurrentMemory && "output buffer mismatch");
  assert(strcmp(tensorName, mTensorName) == 0 && "tensor name mismatch");
  assert(!mReallocateOutputCalled && "duplicate call to reallocateOutput");
  mReallocateOutputCalled = true;
  // Some memory allocators return nullptr when allocating zero bytes, but
  // TensorRT requires a non-null ptr even for empty tensors, so allocate a
  // dummy byte.
  size = std::max(size, static_cast<uint64_t>(1));

  // Check if reallocation is required.
  if (size > mOutputSize) {
    size = roundUp(size, alignment);

    if (mOutputPtr) {
      // Free up existing memory before allocating a larger one.
      cudaFree(mOutputPtr);
    }

    mOutputPtr = nullptr;
    mOutputSize = 0;

    void *memory;
    cudaMalloc(&memory, size);
    mOutputPtr = memory;
    if (mOutputPtr != nullptr) {
      // Record new memory size for the newly allocated memory
      mOutputSize = size;
    }
    return mOutputPtr;
  }

  // Otherwise return existing memory buffer pointer.
  return mCurrentMemory;
}

where the overall class could look like:

class OutputAllocator : public nvinfer1::IOutputAllocator {
public:
  OutputAllocator(const char* tensorName, void* memory, uint64_t size)
      : mOutputSize(size), mTensorName(tensorName), mCurrentMemory(memory) {}
  ~OutputAllocator() override = default;

  /// Reallocate output memory asynchronously.
  void* reallocateOutputAsync(char const* tensorName, void* currentMemory,
                              uint64_t size, uint64_t alignment,
                              cudaStream_t /*stream*/) noexcept override;

  /// Notify the shape of the tensor.
  void notifyShape(char const* tensorName,
                   nvinfer1::Dims const& dims) noexcept override;

  void* mOutputPtr{nullptr}; ///< nullptr if memory could not be allocated
  uint64_t mOutputSize{0};   ///< Size of allocation pointed to by output.
  bool mReallocateOutputCalled{
      false}; ///< Flag indicating if reallocateOutput was called
  bool mNotifyShapeCalled{false}; ///< Flag indicating if notifyShape was called
  nvinfer1::Dims mOutputDims{}; ///< Dimensions of tensor.

private:
  const char* mTensorName; ///< Name of the tensor
  void* mCurrentMemory;    ///< Current memory pointer
};

You could instantiate the class as:

 auto allocator = std::make_unique<OutputAllocator>("tensor name", nullptr, 0);
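And a hedged sketch of how it might then be wired into the execution context (assuming enqueueV3-style execution; context, stream, and the input device pointer are placeholders, and "input"/"output" stand for the engine's actual I/O tensor names):

// Register the allocator for the data-dependent output (instead of calling setTensorAddress for it).
context->setOutputAllocator("output", allocator.get());

// Inputs are still bound the usual way.
context->setTensorAddress("input", inputDevicePtr);

context->enqueueV3(stream);
cudaStreamSynchronize(stream);

// After execution the allocator holds both the device pointer and the exact shape.
void* outputDevicePtr = allocator->mOutputPtr;      // filled by reallocateOutputAsync
nvinfer1::Dims outputDims = allocator->mOutputDims; // filled by notifyShape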

Let me know how it goes.

@abysslover

This solution seems to work, but the problem is that a reallocation occurs on every inference, i.e., reallocateOutputAsync is executed for each inference.

@jhalakpatel

@abysslover You only need to reallocate if the new size is larger than the already allocated memory size.

  // Check if reallocation is required.
  if (size > mOutputSize) {
  ...
 }

@abysslover

abysslover commented Nov 1, 2024

@jhalakpatel Unfortunately, your suggestion did not answer my question and, in turn, did not solve the problem. I have tested the code and compared the pointer addresses before and after executing the code you initially provided with minor modifications. I will provide the solution for future readers of these comments.

class OutputAllocator : public nvinfer1::IOutputAllocator {

(...)
// Inside reallocateOutputAsync, after the allocation:
if (mOutputPtr != nullptr) {
    // Record new memory size for the newly allocated memory
    mOutputSize = size;
}
if (mOutputPtr == mCurrentMemory) {
    mReallocateOutputCalled = true;
}
// Keep mCurrentMemory in sync so the comparison works on the next call.
mCurrentMemory = mOutputPtr;
return mOutputPtr;

(...)

/// Notify the shape of the tensor.
void DynamicOutputAllocator::notifyShape(char const* tensorName, nvinfer1::Dims const& dims) noexcept {
    mOutputDims = dims;
}

@fettahyildizz

fettahyildizz commented Nov 6, 2024

(quoting @jhalakpatel's earlier reply with the reallocateOutputAsync implementation and the OutputAllocator class example)

Hello @jhalakpatel, I read the docs and your replies thoroughly, but I still fail to run inference with dynamic-shaped outputs. My pipeline goes like this:

(for input bindings)
{
Malloc(buffer[input_index], input_size)
}
(for output bindings)
{
std::unique_ptr<OutputAllocator> output_allocator = std::make_unique<OutputAllocator>(name.c_str(), nullptr, 0)
context_->setOutputAllocator(name.c_str(), output_allocator.get())
allocator_map_.emplace(name, std::move(output_allocator));
}

---- DURING ENQUEUE ---
cudaMemcpyAsync(buffer[input_index], input.data(), input_size, cudaMemcpyHostToDevice, stream)
context_->enqueueV2(buffers_.data(), stream_, nullptr);
??????????????

That's where I don't get it. After enqueue, the output entries of buffers_ are null because I never allocated memory for them with Malloc. Is the output data stored in the memory pointed to by the allocator's mOutputPtr? @demuxin, if you have any insight, it would be highly appreciated as well.
