onnx_implementation added #43
base: main
Conversation
Now running
I have split the forward function into two parts: forward and postprocessing_encoder_output:
Heya @polodealvarado! This is looking awesome already! I did some experiments of my own with your code, and I can reproduce your good validation results. Regarding the naming issue: I am also open to e.g. renaming the inputs/outputs, for example prepending them with a prefix. I've tested it out briefly, and I get:
Also, I must admit that I haven't yet figured out how to actually run the produced ONNX model. Thanks for working on this, I appreciate it! This is definitely heading in the right direction.
Given that "num_words", "num_marker_pairs", ... etc. are not computed by the onnx graph, I think it does not affect the performance, but I will run some tests just in case. Great! I like your approach. Let's use "out_" for these variables.
The first draft for the SpanMarkerOnnxPipeline is ready. For some reason the onnx model only processes one batch when the input batch size is different from the dummy input batch size used during the onnx conversion.
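For readers hitting the same symptom: this behaviour typically appears when the batch axis is not declared dynamic at export time. Below is a minimal sketch of an export call with dynamic axes; the TinyEncoder stand-in, file name, and input/output names are illustrative assumptions, not the code in onnx_implementation.py.

```python
import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Stand-in for the real transformer encoder, just to show the export call."""

    def __init__(self, vocab_size: int = 1000, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden_states = self.proj(self.embed(input_ids))
        return hidden_states * attention_mask.unsqueeze(-1)


encoder = TinyEncoder().eval()
dummy_ids = torch.randint(0, 1000, (2, 16))
dummy_mask = torch.ones(2, 16, dtype=torch.long)

# Declaring the batch and sequence axes as dynamic is what lets the exported
# graph accept batch sizes other than the dummy input's.
torch.onnx.export(
    encoder,
    (dummy_ids, dummy_mask),
    "encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        "last_hidden_state": {0: "batch_size", 1: "sequence_length"},
    },
    opset_version=17,
)
```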
Mmmm, I have found a similar issue here: Actually, we have torch.cat implemented in the forward function.
Hmm, that is odd. I can also reproduce your findings, i.e. that it works for the short sample, but not for the longer one.
Hello @tomaarsen. I have figured out how to make it work. Tomorrow I will share with you how to generate the ONNX models. Now I am facing the challenge of how to improve the inference time with ONNX because it is a bit slower than Torch (at least on my M1 Pro).
Awesome! I also noticed when running the still-flawed ONNX model that it was roughly as fast on GPU (on my RTX 3090) and roughly 2x as fast on CPU. I'd be very curious to see your findings, and to observe the ONNX graph. Perhaps there are some model inefficiencies, for example the ugly for-loop. Thank you for all of your work on this!
Here I have a first solution; however, it doesn't improve the inference time compared with the original torch model. To avoid the use of this for-loop inside the onnx model, I propose to split the whole "forward" process into two onnx models: one just for the encoder and another for the classifier. The postprocessing code between self.encoder() and self.classifier(), where the torch.cat function is, remains unchanged. However, the execution of the onnx encoder model is still slower than torch. I am still looking into possible optimizations. @tomaarsen Thank you!
Results
Batch size: 30
System Info
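As a rough illustration of that encoder/classifier split (the file names, input/output names, and the placeholder post-processing below are assumptions, not the PR's actual code):

```python
import onnxruntime as ort
import torch


def run_split_pipeline(input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative only: the encoder and the classifier are two separate ONNX
    graphs, and the torch.cat-based post-processing stays in plain torch."""
    encoder_sess = ort.InferenceSession("encoder.onnx", providers=["CPUExecutionProvider"])
    classifier_sess = ort.InferenceSession("classifier.onnx", providers=["CPUExecutionProvider"])

    # 1) Encoder graph
    last_hidden_state = encoder_sess.run(
        ["last_hidden_state"],
        {"input_ids": input_ids.numpy(), "attention_mask": attention_mask.numpy()},
    )[0]
    last_hidden_state = torch.from_numpy(last_hidden_state)

    # 2) Post-processing in torch. Placeholder step: in SpanMarker this is where
    # the start/end marker embeddings are gathered and concatenated (torch.cat)
    # into the feature vectors for the classifier head.
    feature_vector = torch.cat((last_hidden_state, last_hidden_state), dim=-1)

    # 3) Classifier graph
    logits = classifier_sess.run(["logits"], {"feature_vector": feature_vector.numpy()})[0]
    return torch.from_numpy(logits)
```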
Tomorrow I will share with you the other solution I have developed. This one tries to improve the forward code.
I'll run the script in about 10 hours tomorrow morning and report back my results! I'll run it both on CPU and GPU if I can.
These are my results. I reran everything twice to make sure that there wasn't any random fluke:
CPU:
2nd time:
CUDA:
Not the same results :( Okay, I've narrowed it down. The difference was a slight change in the scores, so I defined:

```python
def strip_score_from_results(results):
    # Drop the "score" key from every predicted entity so that the torch and
    # ONNX outputs can be compared on labels and spans alone.
    return [[{key: value for key, value in ent.items() if key != "score"} for ent in ents] for ents in results]
```

and then used:

```python
print(f"Results are the same: {strip_score_from_results(torch_result) == strip_score_from_results(onnx_result)}")
```

And now it says:
Another thing to consider is that the normal
and for CPU we get:
Another interesting quirk: my ONNX is sometimes a bit slower if I set two providers:
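For context, the provider list passed to onnxruntime is a priority list, and mixing providers can introduce per-node fallbacks and host/device copies, which is one plausible reason the two-provider setup is sometimes slower than CUDA alone. A small sketch, with a placeholder model path:

```python
import onnxruntime as ort

# Provider order is a priority list; ops that the first provider cannot run
# fall back to the next one, which may add extra host/device transfers.
session_two_providers = ort.InferenceSession(
    "spanmarker_encoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
session_cuda_only = ort.InferenceSession(
    "spanmarker_encoder.onnx",
    providers=["CUDAExecutionProvider"],
)
print(session_two_providers.get_providers())  # shows the providers actually in use
```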
Interesting. I have been reading some issues in the onnxruntime repo and I have found several reports of problems running the onnxruntime engine on Apple Silicon. I will run the same test on my Linux GPU server. On the other hand, based on your results we got an improvement, at least for this first test, of:
@tomaarsen Setting

Results

Without document-level context
CPU
CUDA

With document-level context
CPU
CUDA

System Info
Architecture: x86_64

Python libraries
Package Version
aiohttp 3.8.6
Oh that is very interesting! I would not have expected a change in performance between those two cases.
Hello! No worries! No major updates from my side, though I did inspect the third script that you created. I see that it doesn't use optimum, for example. I also tried to run it, though I get an error with
…dealvarado/SpanMarkerNER into feature/onnx_implementation
Hello Tom. I am back. Tomorrow I will tell you about the latest progress I made before I was away these past two weeks. On the other hand, could you run the scripts "onnx_implementation_with_torch.py" and "onnx_implementation_with_optimum.py" and share your results?
Hello! Here are my outputs when running your scripts, and also when modifying the scripts to use CUDA (My RTX 3090) instead:
One notable consideration is that with CUDA, there seems to be some "warmup" necessary: the first inference is always slower than the others. In practice, that means just running inference on some dummy text before doing the measurements. So that's about an 11% speedup at this batch size. And an 80% speedup for CPU at these settings!
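A minimal sketch of that warmup-before-timing pattern (the model path and input names are placeholders):

```python
import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("spanmarker_encoder.onnx", providers=["CUDAExecutionProvider"])
feed = {
    "input_ids": np.ones((4, 128), dtype=np.int64),
    "attention_mask": np.ones((4, 128), dtype=np.int64),
}

# Warmup: the first CUDA run pays for kernel/initialization overhead,
# so it is excluded from the measurement.
session.run(None, feed)

start = time.perf_counter()
for _ in range(100):
    session.run(None, feed)
elapsed = (time.perf_counter() - start) / 100
print(f"mean latency: {elapsed * 1000:.2f} ms")
```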
Hello @tomaarsen. I just uploaded the SpanMarkerOnnx class to the onnx folder within span_marker. I've commented the class and functions; when you can, take a look and let me know. Here is a brief summary: the tokenizer, as in any onnx model pipeline, is necessary.
When exporting the for loop we get the following error: "Dynamic slicing on data-dependent value is not supported" (you can reproduce the error by running the file onnx_with_torchdynamo.py). I have tried different solutions but none have worked for me. Until we find a solution, I think we could launch this first version with the encoder and classifier separately, since the execution times compared to the base model are noticeably better. I have left a file "test_onnx.py" so you can run several tests comparing results and execution times between an onnx model and a base model (by passing provider=["CUDAExecutionProvider"] you can run it on CUDA). Another thing I would like to do is run these tests on different operating systems, especially Windows and Linux.
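As a self-contained illustration of the kind of pattern the exporter rejects (this is not the PR's forward code, just a minimal stand-in for data-dependent slicing inside a loop):

```python
import torch
from torch import nn


class DataDependentSlice(nn.Module):
    """Minimal illustration: the slice bound depends on a tensor *value*
    (num_marker_pairs-style counts), not on a static shape, which is the
    pattern the dynamo-based ONNX export cannot handle."""

    def forward(self, hidden: torch.Tensor, counts: torch.Tensor) -> torch.Tensor:
        outputs = []
        for i in range(hidden.size(0)):
            n = int(counts[i])                      # value read out of a tensor
            outputs.append(hidden[i, :n].sum(dim=0))  # slice length varies per sample
        return torch.stack(outputs)


module = DataDependentSlice()
hidden = torch.rand(2, 10, 4)
counts = torch.tensor([3, 7])
print(module(hidden, counts).shape)  # torch.Size([2, 4]) in eager mode
```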
span_marker/modeling.py
Outdated
```python
embeddings = []
sequence_length_last_hidden_state = last_hidden_state.size(2) * 2
# Pre-allocates the necessary space for feature_vector
feature_vector = torch.zeros(batch_size, sequence_length // 2, sequence_length_last_hidden_state)
```
Suggested change:

```diff
- feature_vector = torch.zeros(batch_size, sequence_length // 2, sequence_length_last_hidden_state)
+ feature_vector = torch.zeros(batch_size, sequence_length // 2, sequence_length_last_hidden_state, device=self.device)
```
This is required for it to run well on non-CPU devices.
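A tiny illustration of why (shapes here are arbitrary): torch.zeros allocates on the CPU by default, so concatenating it with a CUDA tensor fails unless the device is passed explicitly.

```python
import torch

if torch.cuda.is_available():
    hidden = torch.rand(2, 8, device="cuda")              # e.g. encoder output already on the GPU
    cpu_buffer = torch.zeros(2, 8)                        # torch.zeros defaults to the CPU
    # torch.cat((hidden, cpu_buffer), dim=-1)             # RuntimeError: tensors on different devices
    gpu_buffer = torch.zeros(2, 8, device=hidden.device)  # allocate on the same device instead
    merged = torch.cat((hidden, gpu_buffer), dim=-1)
    print(merged.device)  # cuda:0
```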
Perhaps even also dtype=last_hidden_state.dtype, e.g. if someone loads the SpanMarker model in lower precision: SpanMarkerModel.from_pretrained("...", torch_dtype=torch.float16)
I'm also not sure if it's possible to run the ONNX model in lower precision, e.g. float16. For regular torch, it makes a very big difference (e.g. 2 or 3x as fast).
On the other hand, it is possible to run onnx in lower precision. Link. I can include a function that applies such a transformation.
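One possible way to apply such a transformation (not necessarily the approach behind the link above) is onnxconverter-common's float16 helper; a minimal sketch, with the file names assumed:

```python
import onnx
from onnxconverter_common import float16

# Convert an exported FP32 graph to FP16; keep_io_types leaves the model's
# inputs/outputs in FP32 so the calling code does not have to change.
model_fp32 = onnx.load("encoder.onnx")
model_fp16 = float16.convert_float_to_float16(model_fp32, keep_io_types=True)
onnx.save(model_fp16, "encoder_fp16.onnx")
```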
@tomaarsen There are still some problems with ONNX Runtime when running fp16 models, and it mostly depends on the hardware and the onnxruntime versions being used. Could you run some tests with higher batch size and reps values using the test_onnx.py file?
I ran some tests: CUDA
CPU
I have achieved similar results to what I expected, based on the cases I found in the ONNX repository. Backlog for future versions:
What do you think?
Hello Tom!
Here is the first draft of the onnx implementation.
I propose the following to-do list to get the ONNX format for SpanMarkerModel, mainly based on the Optimum documentation from Hugging Face:
Requirements
How to get the onnx model
If you run the script onnx_implementation.py, you get the model exported into onnx format.
The model is generated correctly, but we should fix a problem with the input/output variable names.
Error response
There is a problem with the names of the generated input/output nodes because ONNX requires each node name to be unique. In the forward function, the following input names reappear as outputs when the SpanMarkerOutput object is created: "start_marker_indices", "num_marker_pairs", "num_words", "document_ids", "sentence_ids". By default, ONNX disambiguates the corresponding input names by adding '.1' at the end (you can check the error response at the bottom of the script).
Reviewing the code, I noticed that within the 'forward' function there are 'post-processing' stages that run after the encoder output is obtained. One thing we could do is split this code, leaving 'forward' solely for the execution of the encoder, and then apply the post-processing steps afterwards so the input/output names are not affected. This way the input/output names stay unique.
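One quick way to check that no '.1' suffixes remain after such a split is to inspect the exported graph's input/output names (the file name here is an assumption):

```python
import onnx

model = onnx.load("spanmarker_encoder.onnx")
input_names = [inp.name for inp in model.graph.input]
output_names = [out.name for out in model.graph.output]
print("inputs: ", input_names)
print("outputs:", output_names)

# If an input name had also been used for an output, the exporter would have
# disambiguated one of them with a ".1"-style suffix.
renamed = [name for name in input_names + output_names if name.endswith(".1")]
print("renamed nodes:", renamed)  # expected to be empty after the split
```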