Sdk tests with papermill #2448

Open
wants to merge 23 commits into base: master

Conversation

yehudit1987

What this PR does / why we need it:
This PR adds E2E tests that run the Katib notebook examples with papermill.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2417

Checklist:

  • Docs included if any changes are user facing

Yehudit Kerido added 5 commits October 27, 2024 12:49
Signed-off-by: Yehudit Kerido <[email protected]>
Signed-off-by: Yehudit Kerido <[email protected]>
Signed-off-by: Yehudit Kerido <[email protected]>

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Electronic-Waste
Member

/rerun-all

@Electronic-Waste
Member

Electronic-Waste commented Oct 28, 2024

@yehudit1987 Can you please fix these CI errors?

@Electronic-Waste
Member

@yehudit1987 Can you sign off your commits with git commit -s? The DCO check failed because the sign-off is missing.
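
For context, `git commit -s` appends a `Signed-off-by: Name <email>` trailer to the commit message. The sketch below illustrates the kind of check involved, assuming the check looks for a trailer that matches the commit author. The helper name `has_matching_sign_off` and the sample identity are hypothetical, and this is not the DCO bot's actual implementation.

```python
import re

# Hypothetical helper for illustration; not part of Katib or the DCO bot.
SIGN_OFF = re.compile(
    r"^Signed-off-by: (?P<name>.+) <(?P<email>[^<>]+)>$", re.MULTILINE
)


def has_matching_sign_off(message: str, author_name: str, author_email: str) -> bool:
    """Return True if the message carries a Signed-off-by trailer for the author."""
    for match in SIGN_OFF.finditer(message):
        same_name = match["name"].strip() == author_name
        same_email = match["email"].strip().lower() == author_email.lower()
        if same_name and same_email:
            return True
    return False


# Made-up commit messages: one written with `git commit -s`, one without.
signed = "Add notebook tests\n\nSigned-off-by: Jane Doe <jane@example.com>\n"
unsigned = "Add notebook tests\n"

print(has_matching_sign_off(signed, "Jane Doe", "jane@example.com"))
print(has_matching_sign_off(unsigned, "Jane Doe", "jane@example.com"))
```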

@Electronic-Waste
Member

FYI, you can check this reference: https://github.com/kubeflow/katib/pull/2448/checks?check_run_id=32215445282

Yehudit Kerido added 10 commits October 29, 2024 21:29
Signed-off-by: Yehudit Kerido <[email protected]>
Signed-off-by: Yehudit Kerido <[email protected]>
Signed-off-by: Yehudit Kerido <[email protected]>
Signed-off-by: Yehudit Kerido <[email protected]>
Signed-off-by: Yehudit Kerido <[email protected]>
Signed-off-by: Yehudit Kerido <[email protected]>
Signed-off-by: Yehudit Kerido <[email protected]>
Signed-off-by: Yehudit Kerido <[email protected]>
@Electronic-Waste
Member

/rerun-all

Signed-off-by: Yehudit Kerido <[email protected]>
@yehudit1987 yehudit1987 marked this pull request as ready for review November 3, 2024 11:16
Member

@Electronic-Waste left a comment

Sorry for the late response @yehudit1987. I left a few comments for you.

I'm busy with work right now, so I'll review the notebooks later :)

Comment on lines 39 to 48
if [ -x "$(command -v apt-get)" ]; then
  echo "Upgrading Podman using apt-get..."
  sudo apt-get update
  sudo apt-get install -y podman
elif [ -x "$(command -v dnf)" ]; then
  echo "Upgrading Podman using dnf..."
  sudo dnf upgrade podman -y
else
  echo "Package manager not found. Skipping upgrade."
fi

Member

Could you please tell me why we need to use podman?

Member

It would be better to rename the directory from template-notebook-test to template-e2e-notebook-test, to be consistent with the other directories :)

@Electronic-Waste
Member

/rerun-all

Yehudit Kerido added 2 commits November 5, 2024 10:41
Signed-off-by: Yehudit Kerido <[email protected]>
Signed-off-by: Yehudit Kerido <[email protected]>
Yehudit Kerido added 2 commits November 5, 2024 11:55
Signed-off-by: Yehudit Kerido <[email protected]>
Signed-off-by: Yehudit Kerido <[email protected]>
@Electronic-Waste
Member

/rerun-all

Member

@Electronic-Waste left a comment

Sorry for the late review @yehudit1987. I was very busy recently.

I left some comments for you. Thanks for your great contributions!

Comment on lines 351 to 376
"from kubernetes import client, config\n",
"\n",
"# Initialize KatibClient\n",
"kclient = KatibClient(namespace=namespace)\n",
"\n",
"# Load kubeconfig\n",
"config.load_kube_config()\n",
"\n",
"# Kubernetes API for managing namespaces\n",
"core_v1_api = client.CoreV1Api()\n",
"\n",
"# Function to add label to namespace if it doesn't have the required one\n",
"def add_katib_label_to_namespace(namespace):\n",
"    ns = core_v1_api.read_namespace(namespace)\n",
"    labels = ns.metadata.labels or {}\n",
"    if labels.get(\"katib.kubeflow.org/metrics-collector-injection\") != \"enabled\":\n",
"        print(f\"Adding label to namespace {namespace}...\")\n",
"        labels[\"katib.kubeflow.org/metrics-collector-injection\"] = \"enabled\"\n",
"        body = {\"metadata\": {\"labels\": labels}}\n",
"        core_v1_api.patch_namespace(namespace, body)\n",
"        print(f\"Label added to namespace {namespace}.\")\n",
"    else:\n",
"        print(f\"Namespace {namespace} already has the required label.\")\n",
"\n",
"# Add the required label to the namespace\n",
"add_katib_label_to_namespace(namespace)\n",

Member

Why do we need to add these lines?

Comment on lines 9 to 12
papermill-args-yaml:
  description: 'Additional arguments to pass to Papermill in yaml format'
  required: false
  default: ""

Member

It seems that we don't pass any parameters to these two notebook examples. I'm not sure whether this meets our requirements.

cc👀 @andreyvelich @tenzen-y @yehudit1987

@@ -65,10 +65,12 @@ echo "Deploying Katib"
cd ../../../../../ && WITH_DATABASE_TYPE=$WITH_DATABASE_TYPE make deploy && cd -

# Wait until all Katib pods is running.
-TIMEOUT=120s
+TIMEOUT=180s

Member

Why do we need to change TIMEOUT to 180s?

kubectl wait --for=condition=ContainersReady=True --timeout=${TIMEOUT} -l "katib.kubeflow.org/component in ($WITH_DATABASE_TYPE,controller,db-manager,ui)" -n kubeflow pod ||
(kubectl get pods -n kubeflow && kubectl describe pods -n kubeflow && exit 1)
+echo "Waiting for pods to be ready for $TIMEOUT seconds..."
+sleep $TIMEOUT

Member

Is this necessary, given that we already have the kubectl wait instruction? Please let me know your thoughts.
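
To illustrate the point: `kubectl wait` already blocks until the pods are ready or the timeout expires, so an extra unconditional sleep only adds delay. Below is a minimal sketch of driving that same wait from Python. The helper names, the injectable `run` argument, and the `mysql` value in the selector are assumptions for illustration, not code from this PR.

```python
import subprocess
from typing import Callable, List


def kubectl_wait_args(selector: str, namespace: str, timeout: str) -> List[str]:
    """Build the argv for the kubectl wait call used in the setup script."""
    return [
        "kubectl", "wait",
        "--for=condition=ContainersReady=True",
        f"--timeout={timeout}",
        "-l", selector,
        "-n", namespace,
        "pod",
    ]


def wait_for_pods(
    selector: str,
    namespace: str = "kubeflow",
    timeout: str = "120s",
    run: Callable = subprocess.run,
) -> bool:
    """Return True once kubectl reports the pods ready, False on failure or timeout."""
    result = run(kubectl_wait_args(selector, namespace, timeout))
    return result.returncode == 0


# Demonstration with a fake runner so the sketch works without a cluster.
def fake_run(args):
    print(" ".join(args))
    return subprocess.CompletedProcess(args, 0)


# "mysql" is an assumed database type used only for this demo.
selector = "katib.kubeflow.org/component in (mysql,controller,db-manager,ui)"
print(wait_for_pods(selector, run=fake_run))
```

No `sleep` is needed after the call: it returns as soon as the condition holds, or fails once the timeout has elapsed.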

Comment on lines 96 to 99
if ! kubectl get namespaces | grep -q "kubeflow-user-example-com"; then
  kubectl create namespace kubeflow-user-example-com
fi

Member

Maybe we can use the default namespace instead of creating a new one :)

Comment on lines 26 to 35
- name: Install dependencies
  shell: bash
  run: |
    python -m pip install --upgrade pip
    pip install papermill kubeflow-katib jupyter ipykernel
    python -m ipykernel install --user --name python3 --display-name "Python 3"

- name: Setup Minikube Cluster
  shell: bash
  run: ./test/e2e/v1beta1/scripts/gh-actions/setup-minikube.sh true true "" "" "cmaes"

Member

Don't we need to create a minikube cluster between these two steps?

- name: Setup Minikube Cluster
  uses: medyagh/[email protected]
  with:
    network-plugin: cni
    cni: flannel
    driver: none
    kubernetes-version: ${{ inputs.kubernetes-version }}
    minikube-version: 1.31.1
    start-args: --wait-timeout=120s

Comment on lines 30 to 40
function check_minikube() {
  if minikube status >/dev/null 2>&1; then
    echo "Minikube is already running."
  else
    echo "Minikube is not running. Starting Minikube..."
    minikube start
  fi
}

echo "Checking Minikube Kubernetes Cluster"
check_minikube

Member

I guess we do not check the status of the minikube cluster here.

Member

It'll be better if you do not delete the original content.

And I think it's necessary to rewrite this example since TFJobClient() is outdated in the newest SDK in training-operator. WDYT👀 @kubeflow/wg-automl-leads

Author

I didn't make many changes to that notebook besides, as you mentioned, replacing TFJobClient with TrainingClient and specifying the job type. For some reason it looks as if I rewrote the whole example. Anyway, I fixed the "corrupted" notebook. I also fixed the other issues you pointed out.

Regarding the default namespace, it failed to create experiments without specifying the namespace.

Member

Thanks @yehudit1987!

I know some code lines were generated by pycharm or vscode. The "original content" I meant is some output and images in the notebook, not those code lines.

As for the namespace, could we specify the default namespace for the experiment, e.g. namespace=default? It would also be better if we could pass the namespace to papermill as a parameter, as in kubeflow/training-operator#2274.

Signed-off-by: Yehudit Kerido <[email protected]>
Yehudit Kerido added 2 commits November 18, 2024 07:59
Signed-off-by: Yehudit Kerido <[email protected]>
Signed-off-by: Yehudit Kerido <[email protected]>
Member

@Electronic-Waste left a comment

A lot of effort @yehudit1987 ! Thanks for your contribution.

I left some comments for you. cc👀 @kubeflow/wg-automl-leads

@@ -671,4 +671,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}

Member

Suggested change
}
}

-kubectl wait --for=condition=ContainersReady=True --timeout=${TIMEOUT} -l "katib.kubeflow.org/component in ($WITH_DATABASE_TYPE,controller,db-manager,ui)" -n kubeflow pod ||
-(kubectl get pods -n kubeflow && kubectl describe pods -n kubeflow && exit 1)
+kubectl wait --for=condition=ContainersReady=True --timeout=${TIMEOUT} -l "katib.kubeflow.org/component in ($WITH_DATABASE_TYPE,controller,db-manager,ui)" -n kubeflow pod || (kubectl get pods -n kubeflow && kubectl describe pods -n kubeflow && exit 1)

Member

I think it would be better if we could adjust the format of this line.

Member

(Recover its original state)

Comment on lines +101 to 112
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Experiment name and namespace.\n",
"namespace = \"kubeflow-user-example-com\"\n",
"namespace = \"kubeflow\"\n",
"experiment_name = \"cmaes-example\"\n",
"\n",
"metadata = V1ObjectMeta(\n",

Member

Can we add a parameters tag to the cell metadata so that papermill arguments can override these values, as in kubeflow/training-operator#2274?
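
As a rough illustration of this suggestion: papermill looks for a code cell tagged `parameters` and injects a new cell with the overriding values right after it, so a value such as the namespace could then be passed with something like `papermill input.ipynb output.ipynb -p namespace default`. The sketch below adds that tag to a notebook's JSON using only the standard library. The helper name `tag_parameters_cell`, the `marker` heuristic, and the tiny in-memory notebook are assumptions made up for the example.

```python
import json


def tag_parameters_cell(notebook: dict, marker: str = "namespace =") -> bool:
    """Tag the first code cell whose source contains `marker` with "parameters".

    Returns True if a cell was found, False otherwise.
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat stores cell source either as a string or a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        if marker in text:
            tags = cell.setdefault("metadata", {}).setdefault("tags", [])
            if "parameters" not in tags:
                tags.append("parameters")
            return True
    return False


# Minimal in-memory notebook (hypothetical) mirroring the cell under review.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "source": ["import kubeflow.katib as katib\n"],
        },
        {
            "cell_type": "code",
            "metadata": {"pycharm": {"name": "#%%\n"}},
            "outputs": [],
            "source": [
                "namespace = \"kubeflow\"\n",
                "experiment_name = \"cmaes-example\"\n",
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

tag_parameters_cell(notebook)
print(json.dumps(notebook["cells"][1]["metadata"], indent=2))
```

With the tag in place, the values in that cell act as defaults and the CI workflow can override them per run instead of hard-coding a namespace in the notebook.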

@@ -314,7 +342,8 @@
"\n",
"# Start the Katib Experiment.\n",
"exp_name = \"tune-mnist\"\n",
"katib_client = katib.KatibClient()\n",
"namespace=\"kubeflow\"\n",

Member

Same as above.

Comment on lines +444 to +446
"import time\n",
"time.sleep(120)\n",
"status = katib_client.is_experiment_succeeded(exp_name, namespace=namespace)\n",

Member

Can we replace fixed-time sleep with wait_for_experiment_condition()?

def wait_for_experiment_condition(
    self,
    name: str,
    namespace: Optional[str] = None,
    expected_condition: str = constants.EXPERIMENT_CONDITION_SUCCEEDED,
    timeout: int = 600,
    polling_interval: int = 15,
    apiserver_timeout: int = constants.DEFAULT_TIMEOUT,
):
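
A minimal sketch of what the replacement could look like in the notebook cell, using only the two calls that appear in this thread. The wrapper `wait_and_check` and the `FakeKatibClient` stand-in are hypothetical and exist only so the example runs without a cluster. How the real client behaves when the timeout elapses is not shown here and should be checked against the SDK.

```python
def wait_and_check(katib_client, exp_name: str, namespace: str, timeout: int = 600) -> bool:
    """Wait for the Experiment instead of sleeping for a fixed time, then check it."""
    # Polls until the Experiment reaches the default expected condition
    # (Succeeded, per the signature quoted above) or the timeout is hit.
    katib_client.wait_for_experiment_condition(
        name=exp_name,
        namespace=namespace,
        timeout=timeout,
    )
    return katib_client.is_experiment_succeeded(exp_name, namespace=namespace)


class FakeKatibClient:
    """Hypothetical stand-in for KatibClient that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def wait_for_experiment_condition(self, name, namespace=None, timeout=600, **kwargs):
        self.calls.append(("wait", name, namespace, timeout))

    def is_experiment_succeeded(self, name, namespace=None):
        self.calls.append(("check", name, namespace))
        return True


fake = FakeKatibClient()
print(wait_and_check(fake, "tune-mnist", "kubeflow"))
print(fake.calls)
```

In the notebook itself, `katib_client` would be the real `katib.KatibClient()` instance, and the `time.sleep(120)` line would no longer be needed.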

@Electronic-Waste
Member

/rerun-all

Development

Successfully merging this pull request may close these issues.

[Test] E2e Tests for Notebook Examples
3 participants