Continue updating the Apple backend #741

Open
2 tasks
freedomtan opened this issue Jun 27, 2023 · 18 comments

Comments

@freedomtan
Contributor

Some ideas we can try to improve the Apple backend.

@RSMNYS
Contributor

RSMNYS commented Oct 9, 2023

@freedomtan for further improvements, should we use the saved models from your repo (MobileBERT, MobileDet)? Or could we use models from TensorFlow Hub (at least for MobileBERT: https://tfhub.dev/tensorflow/mobilebert_en_uncased_L-24_H-128_B-512_A-4_F-4_OPT)? As far as I can see, the saved models are TF 1 models. However, the inputs of the new model differ from ours.

@freedomtan
Contributor Author

I am not proud of my repo :-)
For MobileBERT: whatever we do, it should be compatible with (and mathematically equivalent to) what Google colleagues contributed at https://github.com/mlcommons/mobile_open/tree/main/language/bert.
For MobileDet: see https://github.com/mlcommons/mobile_open/tree/main/vision/mobilenet

We should check the accuracies of models.

As far as I can tell, the model at https://tfhub.dev/tensorflow/mobilebert_en_uncased_L-24_H-128_B-512_A-4_F-4_OPT is not for SQuAD (hence not compatible).

@RSMNYS
Contributor

RSMNYS commented Oct 30, 2023

Hi guys! I've converted MobileBERT to the *.mlpackage format using coremltools 7 and TensorFlow 2.12, and also optimised the model using quantization. Currently I have a problem using the *.mlpackage format in our application. The problem arises during on-device compilation to produce the mlmodelc. I've tried compiling on the Mac itself and then using the compiled model, but then there is an issue with loading its content. I'm working to resolve this so we can see how accurate the optimised model is.

While working on the task I found the following issues/possible improvements:

  1. For on-device model compilation we use a deprecated method, which compiles the model synchronously. There are alternatives that are async or take a callback, so it would be good to revise this. The problem here is that the app expects a configured CoreMLExecutor after init.
  2. Flutter has some issues when debugging on a device with iOS 17 (so we can't debug directly). Some patches exist, but the Flutter version recommended in the doc is not the official one and can't be updated. Maybe we can revise this as well and try to update Flutter to the latest version; at least it will be easier to maintain in the future.

@freedomtan
Contributor Author

freedomtan commented Oct 31, 2023

@RSMNYS I don't really get what you ran into. In my experience, if we can make //flutter/cpp/binary:main work on macOS, the app will mostly work on iOS.

And for performance, please first check whether you get a latency improvement in the Core ML Performance Report in Xcode / Instruments.

@RSMNYS
Contributor

RSMNYS commented Oct 31, 2023

The thing is, it works for main (when running tests), and it loads the ML program with no issues. But in the app, the error says it can't read the spec. I will continue with this today.

@RSMNYS
Contributor

RSMNYS commented Nov 6, 2023

Hi guys! Here are the inference results for the original MobileBERT (mlmodel) and the newly converted models (mlpackage and optimized mlpackage). For the optimized one we used the default int8 quantized data type. As we can see, the converted mlpackage has worse results than the original mlmodel; we need to check what the problem could be. As for the quantized model, all seems correct: we used a lower-precision data type (int8 rather than float16), which is why the results are worse.

An MLPackage is a directory and not a single file, so to have a fingerprint for it we need to archive it. To correctly handle the archive with the mlpackage I've adjusted the archive_cached_helper, so now the app can load the mlpackage and run inference. There are still some difficulties with the model path after the app restarts, because the logic returns only the path to the archive's folder.

In our case we have: https://github.com/RSMNYS/mobile_models/raw/main/v3_0/CoreML/MobileBERT.zip. After download and unarchiving, the model is saved to ../raw/main/v3_0/CoreML/MobileBERT/MobileBERT.mlpackage. After an app restart (the app uses cached resources) the app returns this model_path: ../raw/main/v3_0/CoreML/MobileBERT, which is not correct. I think we can resolve this by introducing a new property in the pbtxt settings, model_name, so we can compose the model_path correctly and support model types that are not a single file but a package (directory). Let's discuss.
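
To make the proposal concrete, here is a minimal sketch of the two pieces involved: archiving a package directory so it can be fingerprinted, and resolving the model path inside the unarchived folder. It is written in Python for brevity and is not the app's actual helper code; the function names are hypothetical, and `model_name` stands in for the proposed pbtxt property.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def archive_package(package_dir: Path, out_dir: Path) -> Path:
    """Zip a package directory, keeping the directory name inside the archive."""
    package_dir = package_dir.resolve()
    base = out_dir.resolve() / package_dir.name  # e.g. out/MobileBERT.mlpackage
    archive = shutil.make_archive(
        str(base), "zip", root_dir=package_dir.parent, base_dir=package_dir.name
    )
    return Path(archive)  # e.g. out/MobileBERT.mlpackage.zip


def fingerprint(path: Path) -> str:
    """SHA-256 of a single file, such as the zipped package."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def resolve_model_path(unpacked_dir: Path, model_name: Optional[str] = None) -> Path:
    """Return the model inside an unarchived folder.

    If model_name (hypothetical setting) is given, use it. Otherwise fall back
    to the single *.mlpackage entry in the folder, if there is exactly one.
    """
    if model_name:
        return unpacked_dir / model_name
    candidates = sorted(unpacked_dir.glob("*.mlpackage"))
    if len(candidates) == 1:
        return candidates[0]
    return unpacked_dir  # nothing to disambiguate: return the folder unchanged
```

With this shape, a cached folder such as ../raw/main/v3_0/CoreML/MobileBERT would resolve to the MobileBERT.mlpackage inside it both on first download and after a restart. Note that a zip archive records timestamps, so the fingerprint is only stable for the same archive file, not for re-zipped copies.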

[Image: optimization_results]

@freedomtan
Contributor Author

freedomtan commented Nov 6, 2023

@RSMNYS Please check model performance with the Xcode Performance tab and/or Core ML Instruments first. For a performance benchmark, it's hard to ask people to believe that we have an "improved" model which is 1 - (92.14/121.71) = 24% slower than the original one.
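
The arithmetic above can be checked with a short sketch. It assumes the two numbers are throughput figures (higher is better) taken from the results image, with 121.71 for the original model and 92.14 for the converted one; the function names are only illustrative.

```python
def relative_drop(baseline: float, candidate: float) -> float:
    """Fraction by which candidate falls short of baseline (higher-is-better metric)."""
    return 1.0 - candidate / baseline


def extra_latency(baseline: float, candidate: float) -> float:
    """Extra time per query implied by the same two throughput numbers."""
    return baseline / candidate - 1.0


drop = relative_drop(baseline=121.71, candidate=92.14)
extra = extra_latency(baseline=121.71, candidate=92.14)
print(f"throughput drop: {drop:.1%}")  # throughput drop: 24.3%
print(f"extra latency:   {extra:.1%}")  # extra latency:   32.1%
```

Under that assumption, a 24% drop in throughput corresponds to roughly 32% more time per query, so it matters which of the two figures is quoted.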

@anhappdev
Collaborator

@RSMNYS Can you try renaming ‘MobileBERT.zip’ to ‘MobileBERT.mlpackage.zip’?

@freedomtan
Contributor Author

@RSMNYS please share your forked repo or the .mlpackage model.

@RSMNYS
Contributor

RSMNYS commented Nov 16, 2023

@freedomtan
Contributor Author

freedomtan commented Nov 16, 2023

@freedomtan here is the forked repo: https://github.com/RSMNYS/mobile_models/raw/main/v3_0/CoreML/MobileBERT.mlpackage.zip

Let's check something basic.

  1. Did you try to open the model you converted in Xcode and run it in the Performance tab and/or Core ML Instruments? When I tried to run your .mlpackage model with "CPU and ANE", I got error messages. Most likely, something is wrong. And if you compare running my .mlmodel and your .mlpackage, you can see that the .mlmodel is faster because it runs on CPU+ANE while your .mlpackage runs on GPU+ANE.
  2. I converted the .mlmodel I had at https://github.com/freedomtan/coreml_models_for_mlperf/tree/main/mobilebert to .mlpackage by opening it in Xcode 15.0, clicking the "Edit" button, and then accepting the conversion. I got a model in .mlpackage format, and running it with CPU and ANE is roughly as good as running the .mlmodel one.
  3. If you want to debug the converted model, maybe you can start from its MIL: https://apple.github.io/coremltools/docs-guides/source/model-intermediate-language.html

@RSMNYS
Contributor

RSMNYS commented Nov 16, 2023

@freedomtan here is my test with the converted model:
Screenshot 2023-11-16 at 11 28 04
Screenshot 2023-11-16 at 11 31 13

Sometimes Xcode fails to create the report, but I believe this is an Xcode issue, because it sometimes shows me the wrong operating system, although everything is listed correctly in the result.

Can you share the General tab for the converted model, the results, and the Xcode version, please?

@freedomtan
Contributor Author

@RSMNYS I meant "CPU and ANE". "GPU and ANE" is the reason why your model is slower.

@freedomtan
Contributor Author

@freedomtan to post profiling results showing how the old Core ML model works on a couple of devices.

@freedomtan
Contributor Author

@RSMNYS
With the MobileBERT.mlmodel here: https://github.com/freedomtan/coreml_models_for_mlperf/tree/main/mobilebert
Comparing my .mlmodel and your .mlpackage in Instruments, you can see that, as I said, the GPU takes much longer than the CPU.

With coremltools's converter, you can try to convert a TF model to MIL by setting the convert_to parameter to milinternal https://apple.github.io/coremltools/source/coremltools.converters.convert.html.
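
A minimal sketch of that conversion, assuming coremltools is installed and the TensorFlow model is available locally. The wrapper name `tf_to_mil` and the injectable `convert` argument are illustrative only; the injection exists so the call can be exercised without coremltools.

```python
from typing import Any, Callable, Optional


def tf_to_mil(model: Any, convert: Optional[Callable[..., Any]] = None) -> Any:
    """Convert a TensorFlow model to a MIL program instead of a Core ML model.

    `convert` defaults to coremltools.convert; it is a parameter only so the
    call can be tested without coremltools installed.
    """
    if convert is None:
        import coremltools as ct  # imported lazily; requires coremltools

        convert = ct.convert
    # convert_to="milinternal" asks the converter to return the MIL program
    # rather than a serialized model.
    return convert(model, source="tensorflow", convert_to="milinternal")
```

Usage would look like `prog = tf_to_mil("path/to/saved_model")` followed by `print(prog)` to inspect the ops in text form; the path here is a placeholder.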

freedom's .mlmodel Sergie's .mlpackage

@freedomtan
Contributor Author

freedomtan commented Dec 4, 2023

@RSMNYS

I dug into it a bit over the past weekend. Here is some information that may be useful.

  • We can check graphs of both .mlmodel and .mlpackage with netron.
    • for .mlmodel: simply netron MobileBERT.mlmodel works
    • for .mlpackage: there is model.mlmodel in MobileBERT.mlpackage/Data/com.apple.CoreML/
  • we can get MIL programs matching graphs
    • for .mlmodel: use convert_to = 'milinternal' as mentioned above
    • for .mlpackage: xcrun coremlcompiler compile MobileBERT.mlpackage /tmp, then we can find model.mil in /tmp/MobileBERT.mlmodelc/
  • With the graphs and MIL programs, it's possible to check the first 10 ops and 8 ops of the .mlmodel and .mlpackage, respectively.

And then, it should be possible to tweak the .mil program.
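
The file locations mentioned above can be captured in a small sketch. It assumes the Core ML compiler is invoked as `xcrun coremlcompiler compile` on a Mac with Xcode installed; the helper names are hypothetical, and nothing is executed here.

```python
from pathlib import Path
from typing import List


def inner_mlmodel(package: Path) -> Path:
    """Path of the model.mlmodel inside a .mlpackage, which Netron can open."""
    return package / "Data" / "com.apple.CoreML" / "model.mlmodel"


def compile_command(package: Path, out_dir: Path) -> List[str]:
    """Command line that compiles the package into <out_dir>/<name>.mlmodelc."""
    return ["xcrun", "coremlcompiler", "compile", str(package), str(out_dir)]


def compiled_mil(package: Path, out_dir: Path) -> Path:
    """Where model.mil is expected after compilation, per the list above."""
    return out_dir / (package.stem + ".mlmodelc") / "model.mil"


pkg = Path("MobileBERT.mlpackage")
print(inner_mlmodel(pkg))
print(" ".join(compile_command(pkg, Path("/tmp"))))
print(compiled_mil(pkg, Path("/tmp")))
```

On macOS the command list could be passed to `subprocess.run(..., check=True)`, after which the model.mil path can be opened in a text editor.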

@RSMNYS
Contributor

RSMNYS commented Jan 4, 2024

@freedomtan I did some more testing with MobileBERT.mlpackage. I've set different precisions for the model: Float16, and Float32 and here are the results:

FLOAT16

All units: 8.33 ms - 1900 operations on NE, 8 ops on GPU
CPU: can't run
CPU & GPU: 31.45 ms - all operations run on GPU
CPU & NE: can't run

FLOAT32

All units: 31.11 ms - all operations run on GPU only
CPU only: 69.25 ms
CPU & GPU: 30.7 ms - all operations run on GPU only
CPU & NE: 69.13 ms - all operations run on CPU only

I also found this description: "ML programs use a GPU runtime that is backed by the Metal Performance Shaders Graph framework." So could it be that the mlpackage is optimised to run operations on the GPU (to utilise parallel execution)? Since NLP models are sequential in nature, running on the GPU may not be so beneficial (in terms of QPS). We can check other (vision) models to see whether the operations are faster in that case. Checking more.

@freedomtan
Contributor Author

@RSMNYS
The float16 and float32 results don't surprise me at all. As far as I know,

CPU: fp16, fp32 (and maybe bf16)
GPU: fp16, bf16, and fp32
ANE: fp16.
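
As a small illustration of why the Float32 model never lands on the Neural Engine, the list above can be written as a lookup. The table only restates this comment ("as far as I know"), not an official specification, and bf16 on CPU is left out because it was marked "maybe".

```python
from typing import Dict, List, Set

# Precisions each compute unit is believed to support, as listed above.
SUPPORTED: Dict[str, Set[str]] = {
    "CPU": {"fp16", "fp32"},
    "GPU": {"fp16", "bf16", "fp32"},
    "ANE": {"fp16"},
}


def units_for(precision: str) -> List[str]:
    """Compute units that can run a model stored in the given precision."""
    return [unit for unit, precisions in SUPPORTED.items() if precision in precisions]


print(units_for("fp32"))  # ['CPU', 'GPU']
print(units_for("fp16"))  # ['CPU', 'GPU', 'ANE']
```

This matches the Float32 measurements, where every configuration ran on the GPU or CPU only.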

I recommend

  1. read https://machinelearning.apple.com/research/neural-engine-transformers,
  2. watch https://developer.apple.com/videos/play/wwdc2023/10047/, and
  3. read https://apple.github.io/coremltools/docs-guides/source/performance-impact.html

For the .mlmodel and .mlpackage we discussed:

mlmodel mlpackage

As you can see, running on GPU is slower.

With the MIL programs and netron, we can find what the 10 and 8 ops are in the mlmodel and mlpackage, respectively:

  • mlmodel:
    • 0: buffer conversion
    • 1-3: the 2 nodes after 'segment_ids'
    • 4-5: the 2 nodes after 'input_mask'
    • 6-7: the 3 nodes after 'input_ids'
    • 8-9: loadConst and the add after 3
  • mlpackage:
    • 0-2: buffer conversion
    • 3: the node after 'input_ids'
    • 4: the gather after 'segment_ids' → cast, cast folded
    • 5, 6 : the cast and reshape after 'input_mask'
    • 7: the gather after 3, the expand_dims, cast folded

Maybe we can change the ML program manually to check what stopped the 8 ops from running on CPU.
