[GSoC] Update `tune` API for LLM hyperparameters optimization #2393
Signed-off-by: helenxie-bit <[email protected]>
For the test example, please refer to this example in the proposal.
Ref: #2339
Thank you, adding a few initial comments.
@andreyvelich When I pushed my latest changes, only three of the end-to-end tests succeeded, while the others failed because of a Katib deployment issue. The main cause appears to be the following error:
error: timed out waiting for the condition on pods/katib-mysql-77b9495867-q6txk
NAME                                 READY   STATUS    RESTARTS      AGE
katib-controller-dbc9cc-bmtf4        1/1     Running   0             2m
katib-db-manager-67b8c998f4-mmljn    1/1     Running   1 (59s ago)   2m
katib-mysql-77b9495867-q6txk         0/1     Pending   0             2m
training-operator-86d756f697-5scdr   1/1     Running   0             2m2s
Containers:
katib-mysql:
Image: mysql:8.0.29
Port: 3306/TCP
Host Port: 0/TCP
Args:
--datadir
/var/lib/mysql/datadir
Liveness: exec [/bin/bash -c mysqladmin ping -u root -p${MYSQL_ROOT_PASSWORD}] delay=10s timeout=1s period=5s #success=1 #failure=10
Readiness: exec [/bin/bash -c mysql -D ${MYSQL_DATABASE} -u root -p${MYSQL_ROOT_PASSWORD} -e 'SELECT 1'] delay=10s timeout=1s period=5s #success=1 #failure=10
Startup: exec [/bin/bash -c mysqladmin ping -u root -p${MYSQL_ROOT_PASSWORD}] delay=0s timeout=1s period=15s #success=1 #failure=60
Environment:
MYSQL_ROOT_PASSWORD: <set to the key 'MYSQL_ROOT_PASSWORD' in secret 'katib-mysql-secrets'> Optional: false
MYSQL_ALLOW_EMPTY_PASSWORD: true
MYSQL_DATABASE: katib
Mounts:
/var/lib/mysql from katib-mysql (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mgm9j (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
katib-mysql:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: katib-mysql
ReadOnly: false
kube-api-access-mgm9j:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m default-scheduler 0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
I have run my code locally, and everything works fine. I also tried reverting my changes and pushing the code from before the latest update to the CI/CD pipeline, but the same error occurred. I suspect this is caused by resource limitations or a Minikube configuration issue.
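For anyone triaging a similar CI failure, the pod listing above can be checked mechanically. The following is only a minimal sketch, assuming the plain-text output of `kubectl get pods` has been captured as a string; the helper name `find_unready_pods` and the `SAMPLE` constant are hypothetical and are not part of Katib or this PR.

```python
def find_unready_pods(kubectl_output: str) -> list[dict]:
    """Return pods whose STATUS is not Running or whose READY count is incomplete.

    Assumes the default column order of `kubectl get pods`:
    NAME, READY, STATUS, RESTARTS, AGE.
    """
    unready = []
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, total = (int(n) for n in ready.split("/"))
        if status != "Running" or ready_count < total:
            unready.append({"name": name, "ready": ready, "status": status})
    return unready


# Pod listing copied from the comment above.
SAMPLE = """\
NAME READY STATUS RESTARTS AGE
katib-controller-dbc9cc-bmtf4 1/1 Running 0 2m
katib-db-manager-67b8c998f4-mmljn 1/1 Running 1 (59s ago) 2m
katib-mysql-77b9495867-q6txk 0/1 Pending 0 2m
training-operator-86d756f697-5scdr 1/1 Running 0 2m2s
"""

if __name__ == "__main__":
    for pod in find_unready_pods(SAMPLE):
        print(f"{pod['name']}: {pod['status']} ({pod['ready']} ready)")
```

Note that this simple check would also flag pods in the `Completed` state, so it fits long-running deployments such as the ones listed here better than one-off jobs.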
@tenzen-y Any thoughts? @kubeflow/wg-training-leads Do you remember why we are not using a Kind cluster for the Katib E2Es? Is it because we need a PVC with ReadWriteMany for the PBT Suggestion?
@andreyvelich I have updated the Minikube version, and it is working well now. Thank you very much! Please review the latest changes when you have time 😃
Yes, the reason is RWX PersistentVolume. KinD does not support RWX PV.
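To make the RWX constraint concrete, here is a small sketch of how one might check whether a PersistentVolumeClaim manifest requests the `ReadWriteMany` access mode. The manifests, their names, and the `requires_rwx` helper are hypothetical illustrations; they are not taken from the Katib repository.

```python
def requires_rwx(pvc_manifest: dict) -> bool:
    """Return True if the PVC manifest requests the ReadWriteMany access mode."""
    access_modes = pvc_manifest.get("spec", {}).get("accessModes", [])
    return "ReadWriteMany" in access_modes


# Hypothetical claim that several pods on different nodes need to mount read-write.
SHARED_PVC = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "example-shared-claim"},  # hypothetical name
    "spec": {
        "accessModes": ["ReadWriteMany"],
        "resources": {"requests": {"storage": "1Gi"}},
    },
}

# Hypothetical claim that only needs to be mounted read-write by a single node.
SINGLE_NODE_PVC = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "example-single-claim"},  # hypothetical name
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "1Gi"}},
    },
}

if __name__ == "__main__":
    for pvc in (SHARED_PVC, SINGLE_NODE_PVC):
        print(pvc["metadata"]["name"], "needs RWX:", requires_rwx(pvc))
```

A claim like the first one stays unbound on a cluster whose storage provisioner cannot satisfy `ReadWriteMany`, which is the limitation described in the comment above.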
Thanks for this @helenxie-bit!
I left a few comments.
/assign @kubeflow/wg-training-leads @deepanker13
…rn type for helper functions Signed-off-by: helenxie-bit <[email protected]>
@andreyvelich Thank you for your comments! I have made the updates accordingly. Please review them when you have time.
Thanks for this great contribution @helenxie-bit 🎉
/lgtm
/assign @deepanker13 @johnugeorge @tenzen-y
Please take a look at the final changes
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: johnugeorge
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
What this PR does / why we need it:
This PR implements the initial functionality for LLM hyperparameter optimization in the `tune` API.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #
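As background on what hyperparameter optimization means in this context, the sketch below runs a plain random search over a single learning-rate range. It is a conceptual illustration only and does not use the Katib SDK or the `tune` API; the search space, the `toy_objective` stand-in for a fine-tuning run, and every function name are assumptions made for the example.

```python
import math
import random

# Illustrative search space: hyperparameter name -> (min, max). Values are assumptions.
SEARCH_SPACE = {"learning_rate": (1e-5, 5e-5)}


def sample_params(space: dict, rng: random.Random) -> dict:
    """Draw one value per hyperparameter, log-uniformly within its range."""
    return {
        name: math.exp(rng.uniform(math.log(low), math.log(high)))
        for name, (low, high) in space.items()
    }


def toy_objective(params: dict) -> float:
    """Stand-in for a fine-tuning run: a loss that is lowest at learning_rate = 3e-5."""
    return (math.log10(params["learning_rate"]) - math.log10(3e-5)) ** 2


def random_search(space: dict, objective, max_trials: int, seed: int = 0):
    """Run max_trials independent trials and return the best parameters and loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_trials):
        params = sample_params(space, rng)
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


if __name__ == "__main__":
    params, loss = random_search(SEARCH_SPACE, toy_objective, max_trials=20)
    print(f"best learning_rate={params['learning_rate']:.2e}, loss={loss:.4f}")
```

In a real setup each trial would be a full fine-tuning job reporting an objective metric, and the search algorithm would typically be configurable rather than fixed to random search.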
Checklist: