My case is that with a Llama TensorRT engine, I need to run inference in a while loop.
In my first implementation I created the I/O tensors in every iteration, which turned out to be too slow because of the repeated allocations:
```python
def allocate_buffer(self, shape_dict=None, engine=None, context=None):
    # Build a fresh set of I/O tensors for this inference step.
    tensors = OrderedDict()
    for binding in range(engine.num_io_tensors):
        name = engine.get_tensor_name(binding)
        if shape_dict and name in shape_dict:
            shape = shape_dict[name]
        else:
            shape = context.get_tensor_shape(name)
        dtype = trt.nptype(engine.get_tensor_dtype(name))
        if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
            context.set_input_shape(name, shape)
        # Allocate a new device tensor for every binding on every call.
        tensor = torch.empty(tuple(shape), dtype=numpy_to_torch_dtype_dict[dtype]).to(device=self.device)
        tensors[name] = tensor
    return tensors
```
Then I tried to pre-allocate global tensors once when initializing the model, like below:
```python
def initial_global_tensors(self):
    # Allocate maximum-size buffers once; later iterations slice into them.
    tensor_shapes = {
        "position": (1, 2000),
        "inputs_embeds": (1, 2000, 1792),
        "lm_logits": (1, 1, 8212),
    }
    for i in range(12):
        tensor_shapes[f"past_key_in{i}"] = (1, 16, 2000, 112)
        tensor_shapes[f"past_value_in{i}"] = (1, 16, 2000, 112)
        tensor_shapes[f"past_key{i}"] = (1, 16, 2000, 112)
        tensor_shapes[f"past_value{i}"] = (1, 16, 2000, 112)
    self.global_tensors = {
        name: torch.zeros(shape, dtype=torch.int64 if 'position' in name else torch.float32).to(device=self.device)
        for name, shape in tensor_shapes.items()
    }
```
Then, in each iteration, I slice these tensors down to the actual input shape:
```python
def set_shape(self, shape_dict=None, engine=None, context=None):
    tensors = OrderedDict()
    for binding in range(engine.num_io_tensors):
        name = engine.get_tensor_name(binding)
        if shape_dict and name in shape_dict:
            shape = shape_dict[name]
        else:
            shape = context.get_tensor_shape(name)
        if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
            context.set_input_shape(name, shape)
        # Take a view of the pre-allocated buffer restricted to the current shape.
        original_tensor = self.global_tensors[name]
        sliced_tensor = original_tensor
        for dim, size in enumerate(shape):
            sliced_tensor = sliced_tensor.narrow(dim, 0, size)
        tensors[name] = sliced_tensor
    return tensors
```
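For reference, `narrow()` returns a view into the pre-allocated buffer rather than a copy, and that view is generally not contiguous in memory once a middle dimension has been shortened. A minimal standalone check, with hypothetical sizes matching the `past_key` buffers above:

```python
import torch

buf = torch.zeros(1, 16, 2000, 112)         # full-size global buffer
view = buf.narrow(2, 0, 5)                  # slice the sequence dimension as set_shape does

print(view.shape)                           # torch.Size([1, 16, 5, 112])
print(view.is_contiguous())                 # False: strides still step over the full 2000 rows
print(view.data_ptr() == buf.data_ptr())    # True: no data was copied
```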
However, I found that the results with the sliced global tensors differ from the ones I get when creating tensors in each iteration, and I have no idea why.
I found the solution: we need to make `sliced_tensor` contiguous.
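A minimal sketch of that change, applied to the slicing loop in `set_shape` above; the extra `.contiguous()` call materializes the narrowed view into a dense tensor, so the engine reads the memory layout it expects:

```python
original_tensor = self.global_tensors[name]
sliced_tensor = original_tensor
for dim, size in enumerate(shape):
    sliced_tensor = sliced_tensor.narrow(dim, 0, size)
# Materialize the view as a dense buffer before handing it to the engine.
tensors[name] = sliced_tensor.contiguous()
```

Note that `contiguous()` allocates and copies whenever the view is not already dense, so this reintroduces part of the per-iteration allocation cost in exchange for correct results.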