[GSoC] Update tune API for LLM hyperparameters optimization #2393

Merged
Changes from all commits (89 commits)
2a882d7
update tune api for llm hyperparameters optimization
helenxie-bit Jul 21, 2024
0c3e067
resolve conflict
helenxie-bit Jul 21, 2024
158c8f3
resolve conflict
helenxie-bit Jul 21, 2024
f4a0d4e
fix the problem of dependency
helenxie-bit Jul 21, 2024
7e7dd56
fix the format of import statement
helenxie-bit Jul 21, 2024
62ad385
adjust the blank lines
helenxie-bit Jul 21, 2024
3f36740
delete the trainer to reuse it in Training Operator
helenxie-bit Jul 22, 2024
9d20253
update constants
helenxie-bit Jul 22, 2024
dfbe793
update metrics format
helenxie-bit Jul 25, 2024
290a249
update the type of and
helenxie-bit Jul 29, 2024
aba2606
update the message of 'ImportError'
helenxie-bit Jul 29, 2024
eaf0193
add TODO of PVC creation
helenxie-bit Jul 29, 2024
62355a2
update the name of pvc
helenxie-bit Jul 29, 2024
7b2b40e
reuse constants from Training Operator
helenxie-bit Jul 29, 2024
acd1dcf
keep 'parameters' and update validation
helenxie-bit Jul 30, 2024
10b057d
update for test
helenxie-bit Jul 31, 2024
5a87eb0
reuse 'get_container_spec' and 'get_pod_template_spec' from Training …
helenxie-bit Aug 7, 2024
8387e67
resolve conflicts
helenxie-bit Aug 7, 2024
71605b4
format with black
helenxie-bit Aug 7, 2024
35acedb
fix Lint error
helenxie-bit Aug 7, 2024
af534b3
fix Lint errors
helenxie-bit Aug 7, 2024
c7f6e10
delete types
helenxie-bit Aug 7, 2024
9fdbdb7
fix format
helenxie-bit Aug 7, 2024
ddd5153
update format
helenxie-bit Aug 7, 2024
b31e820
update format
helenxie-bit Aug 7, 2024
dad3831
fix e2e test error
helenxie-bit Aug 7, 2024
1afe56d
add TODO
helenxie-bit Aug 8, 2024
ad7bce8
format with max line length
helenxie-bit Aug 8, 2024
7e58c94
format docstring
helenxie-bit Aug 8, 2024
61dc8ca
update format
helenxie-bit Aug 8, 2024
ba0d7d1
add helper functions
helenxie-bit Aug 8, 2024
2a1b008
update format
helenxie-bit Aug 8, 2024
b368521
update format
helenxie-bit Aug 8, 2024
3ccbdf9
run test again
helenxie-bit Aug 12, 2024
64e34e0
run test again
helenxie-bit Aug 12, 2024
dde724c
run test again
helenxie-bit Aug 12, 2024
1cccd4a
fix dict substitution in training_parameters
helenxie-bit Aug 14, 2024
510661d
fix typo
helenxie-bit Aug 17, 2024
f03c5ba
Merge remote-tracking branch 'origin/master' into helenxie/update_tun…
helenxie-bit Aug 18, 2024
f6b15a2
resolve conflicts and add check for case of no parameters
helenxie-bit Aug 18, 2024
6a3e046
fix format
helenxie-bit Aug 18, 2024
25541b9
fix format
helenxie-bit Aug 18, 2024
99e74d1
fix format
helenxie-bit Aug 18, 2024
96cf99c
fix flake8 error
helenxie-bit Aug 18, 2024
c568806
fix format
helenxie-bit Aug 18, 2024
6f65253
fix format
helenxie-bit Aug 18, 2024
ad17ac9
fix format
helenxie-bit Aug 18, 2024
9a1e2df
fix format
helenxie-bit Aug 18, 2024
dd12cc2
fix format
helenxie-bit Aug 19, 2024
160065a
update isort file to black and fix typo
helenxie-bit Aug 21, 2024
48a3ee0
modify the set of metrics format
helenxie-bit Aug 21, 2024
0f8a8ef
update tune API
helenxie-bit Aug 21, 2024
3bc3d87
add types.TrainerResources class
helenxie-bit Aug 21, 2024
4f6fc35
fix flake8 error
helenxie-bit Aug 21, 2024
038aeda
rerun tests
helenxie-bit Aug 22, 2024
62a6682
rerun tests
helenxie-bit Aug 23, 2024
d7dd567
resolve conflict
helenxie-bit Aug 23, 2024
95dfddd
resolve conflict
helenxie-bit Aug 23, 2024
7dbd0f5
Merge remote-tracking branch 'upstream/master' into helenxie/update_t…
helenxie-bit Aug 23, 2024
fe39051
rerun tests
helenxie-bit Aug 23, 2024
d20ea35
rerun tests
helenxie-bit Aug 23, 2024
ef27bf6
rerun tests
helenxie-bit Aug 23, 2024
466ca39
rerun tests
helenxie-bit Aug 23, 2024
741df8a
rerun tests
helenxie-bit Aug 23, 2024
e131636
rerun tests
helenxie-bit Aug 23, 2024
fe1348f
rerun tests
helenxie-bit Aug 23, 2024
2484e49
rerun tests
helenxie-bit Aug 23, 2024
f0453b0
rerun tests
helenxie-bit Aug 23, 2024
64ccbc7
rerun tests
helenxie-bit Aug 23, 2024
1ad05e6
delete properties of 'TrainerResources'
helenxie-bit Aug 27, 2024
1b054ac
fix format error
helenxie-bit Aug 27, 2024
5394113
update types
helenxie-bit Aug 29, 2024
dc3a104
fix format
helenxie-bit Aug 29, 2024
dc007b1
add import of 'TrainerResources' in '__init__.py' of katib
helenxie-bit Aug 29, 2024
3d7c9c2
rerun tests
helenxie-bit Aug 29, 2024
96db205
revert changes and rerun tests
helenxie-bit Aug 29, 2024
1a56c07
check pvc and pv status of katib deployments
helenxie-bit Aug 29, 2024
da2b6e0
check pvc and pv status of katib deployments
helenxie-bit Aug 29, 2024
970a592
recommit changes
helenxie-bit Aug 29, 2024
e529ec4
update minikube version when setup
helenxie-bit Aug 29, 2024
17f9dea
delete the code that disables formatting for the tune function
helenxie-bit Aug 30, 2024
1a2c1ad
update according to andrey's feedback
helenxie-bit Aug 30, 2024
5494925
add helper function in utils
helenxie-bit Aug 30, 2024
e1e710e
fix format
helenxie-bit Aug 30, 2024
c2df967
rerun tests
helenxie-bit Aug 30, 2024
9f69329
move metrics_collector_spec back & update helper functions & add retu…
helenxie-bit Aug 30, 2024
2374386
rerun tests
helenxie-bit Aug 30, 2024
233b582
fix some typos
helenxie-bit Aug 30, 2024
faa0f7f
simplify the definition of 'TrainerResources'
helenxie-bit Sep 2, 2024
2 changes: 1 addition & 1 deletion .github/workflows/template-setup-e2e-test/action.yaml
@@ -37,7 +37,7 @@ runs:
version: ${{ inputs.kubernetes-version }}

- name: Setup Minikube Cluster
uses: medyagh/setup-minikube@v0.0.16
uses: medyagh/setup-minikube@v0.0.18
with:
network-plugin: cni
cni: flannel
4 changes: 4 additions & 0 deletions hack/gen-python-sdk/post_gen.py
@@ -41,6 +41,10 @@ def _rewrite_helper(input_file, output_file, rewrite_rules):
if output_file == "sdk/python/v1beta1/kubeflow/katib/__init__.py":
lines.append("# Import Katib API client.\n")
lines.append("from kubeflow.katib.api.katib_client import KatibClient\n")
lines.append("# Import Katib TrainerResources class.\n")
lines.append(
"from kubeflow.katib.types.trainer_resources import TrainerResources\n"
)
lines.append("# Import Katib report metrics functions\n")
lines.append("from kubeflow.katib.api.report_metrics import report_metrics\n")
lines.append("# Import Katib helper functions.\n")
2 changes: 2 additions & 0 deletions sdk/python/v1beta1/kubeflow/katib/__init__.py
@@ -71,6 +71,8 @@

# Import Katib API client.
from kubeflow.katib.api.katib_client import KatibClient
# Import Katib TrainerResources class.
from kubeflow.katib.types.trainer_resources import TrainerResources
# Import Katib report metrics functions
from kubeflow.katib.api.report_metrics import report_metrics
# Import Katib helper functions.
567 changes: 419 additions & 148 deletions sdk/python/v1beta1/kubeflow/katib/api/katib_client.py

Large diffs are not rendered by default.
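
The tune API changes themselves live in this file, which GitHub does not render inline. The sketch below is an assumption-laden illustration of how the updated API is meant to be called for LLM hyperparameter optimization; argument names such as model_provider_parameters, dataset_provider_parameters, trainer_parameters, and resources_per_trial are inferred from the commit messages and the helper code elsewhere in this PR, not read from the rendered diff.

    # Illustrative sketch only; argument names are assumptions, not the rendered diff.
    import kubeflow.katib as katib
    import transformers
    from kubeflow.storage_initializer.hugging_face import (
        HuggingFaceDatasetParams,
        HuggingFaceModelParams,
        HuggingFaceTrainerParams,
    )
    from peft import LoraConfig

    client = katib.KatibClient(namespace="kubeflow")
    client.tune(
        name="llm-hp-tuning-example",
        # Pre-trained model and dataset pulled by the storage initializer.
        model_provider_parameters=HuggingFaceModelParams(
            model_uri="hf://google/bert_uncased_L-2_H-128_A-2",
            transformer_type=transformers.AutoModelForSequenceClassification,
        ),
        dataset_provider_parameters=HuggingFaceDatasetParams(repo_id="yelp_review_full"),
        # Hyperparameters to search are embedded in the HuggingFace trainer config.
        trainer_parameters=HuggingFaceTrainerParams(
            training_parameters=transformers.TrainingArguments(
                output_dir="results",
                learning_rate=katib.search.double(min=1e-05, max=5e-05),
            ),
            lora_config=LoraConfig(r=katib.search.int(min=8, max=32)),
        ),
        objective_metric_name="train_loss",
        objective_type="minimize",
        # Per-trial compute, using the new TrainerResources type added below.
        resources_per_trial=katib.TrainerResources(
            num_workers=2,
            num_procs_per_worker=1,
            resources_per_worker={"gpu": 1, "cpu": 4, "memory": "10G"},
        ),
    )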

5 changes: 5 additions & 0 deletions sdk/python/v1beta1/kubeflow/katib/constants/constants.py
@@ -60,3 +60,8 @@
BASE_IMAGE_MXNET = "docker.io/mxnet/python:1.9.1_native_py3"

DEFAULT_DB_MANAGER_ADDRESS = "katib-db-manager.kubeflow:6789"

# The default value for dataset and model storage PVC.
PVC_DEFAULT_SIZE = "10Gi"
# The default value for PVC access modes.
PVC_DEFAULT_ACCESS_MODES = ["ReadWriteOnce", "ReadOnlyMany"]
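
For illustration, here is a minimal sketch (not the code in katib_client.py, and the claim name is made up) of how these defaults could be applied with the Kubernetes client when building the dataset and model storage PVC:

    # Hypothetical usage of the new PVC defaults; the claim name is illustrative.
    from kubernetes import client

    from kubeflow.katib.constants import constants

    pvc_spec = client.V1PersistentVolumeClaim(
        api_version="v1",
        kind="PersistentVolumeClaim",
        metadata=client.V1ObjectMeta(name="tune-example-storage"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=constants.PVC_DEFAULT_ACCESS_MODES,
            resources=client.V1ResourceRequirements(
                requests={"storage": constants.PVC_DEFAULT_SIZE}
            ),
        ),
    )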
10 changes: 10 additions & 0 deletions sdk/python/v1beta1/kubeflow/katib/types/trainer_resources.py
@@ -0,0 +1,10 @@
class TrainerResources(object):
def __init__(
self,
num_workers=None,
num_procs_per_worker=None,
resources_per_worker=None,
):
self.num_workers = num_workers
self.num_procs_per_worker = num_procs_per_worker
self.resources_per_worker = resources_per_worker
142 changes: 140 additions & 2 deletions sdk/python/v1beta1/kubeflow/katib/utils/utils.py
@@ -12,15 +12,19 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import copy
import inspect
import json
import logging
import os
import textwrap
from typing import Any, Callable
from typing import Any, Callable, Dict, List, Optional, Union

from kubeflow.katib import models
from kubeflow.katib.constants import constants

logger = logging.getLogger(__name__)


def is_running_in_k8s():
return os.path.isdir("/var/run/secrets/kubernetes.io/")
@@ -85,7 +89,6 @@ def validate_metrics_value(value: Any):


def validate_objective_function(objective: Callable):

# Check if objective function is callable.
if not callable(objective):
raise ValueError(
@@ -129,3 +132,138 @@ class FakeResponse:

def __init__(self, obj):
self.data = json.dumps(obj)


class SetEncoder(json.JSONEncoder):
def default(self, obj):
if isinstance(obj, set):
return list(obj)
if isinstance(obj, type):
return obj.__name__
return json.JSONEncoder.default(self, obj)


def get_trial_substitutions_from_dict(
parameters: Dict[str, Any],
experiment_params: List[models.V1beta1ParameterSpec],
trial_params: List[models.V1beta1TrialParameterSpec],
) -> Dict[str, str]:
for p_name, p_value in parameters.items():
# If input parameter value is Katib Experiment parameter sample.
if isinstance(p_value, models.V1beta1ParameterSpec):
# Wrap value for the function input.
parameters[p_name] = f"${{trialParameters.{p_name}}}"

# Add value to the Katib Experiment parameters.
p_value.name = p_name
experiment_params.append(p_value)

# Add value to the Katib Experiment's Trial parameters.
trial_params.append(
models.V1beta1TrialParameterSpec(name=p_name, reference=p_name)
)
else:
# Otherwise, add value to the function input.
parameters[p_name] = p_value

return parameters


def get_trial_substitutions_from_trainer(
parameters: Union["TrainingArguments", "LoraConfig"], # noqa: F821
experiment_params: List[models.V1beta1ParameterSpec],
trial_params: List[models.V1beta1TrialParameterSpec],
) -> Dict[str, str]:
from peft import LoraConfig # noqa: F401
from transformers import TrainingArguments # noqa: F401

if isinstance(parameters, TrainingArguments):
parameters_dict = parameters.to_dict()
else:
parameters_dict = parameters.__dict__

for p_name, p_value in parameters_dict.items():
if not hasattr(parameters, p_name):
logger.warning(f"Training parameter {p_name} is not supported.")
continue

if isinstance(p_value, models.V1beta1ParameterSpec):
old_attr = getattr(parameters, p_name, None)
if old_attr is not None:
value = f"${{trialParameters.{p_name}}}"
setattr(parameters, p_name, value)
p_value.name = p_name
experiment_params.append(p_value)
trial_params.append(
models.V1beta1TrialParameterSpec(name=p_name, reference=p_name)
)
elif p_value is not None:
old_attr = getattr(parameters, p_name, None)
if old_attr is not None:
if isinstance(p_value, dict):
# Update the existing dictionary without nesting
value = copy.deepcopy(p_value)
else:
value = type(old_attr)(p_value)
setattr(parameters, p_name, value)

if isinstance(parameters, TrainingArguments):
parameters = json.dumps(parameters.to_dict())
else:
parameters = json.dumps(parameters.__dict__, cls=SetEncoder)

return parameters


def get_exec_script_from_objective(
objective: Callable,
input_params: Dict[str, Any] = None,
packages_to_install: Optional[List[str]] = None,
pip_index_url: str = "https://pypi.org/simple",
) -> str:
"""
Get executable script for container args from the given objective function and parameters.
"""
# Validate objective function.
validate_objective_function(objective)

# Extract objective function implementation.
objective_code = inspect.getsource(objective)

# Objective function might be defined in some indented scope
# (e.g. in another function). We need to dedent the function code.
objective_code = textwrap.dedent(objective_code)

# Wrap objective function to execute it from the file. For example:
# def objective(parameters):
# print(f'Parameters are {parameters}')
# objective({
# 'lr': '${trialParameters.lr}',
# 'epochs': '${trialParameters.epochs}',
# 'is_dist': False
# })
objective_code = f"{objective_code}\n{objective.__name__}({input_params})\n"

# Prepare execute script template.
exec_script = textwrap.dedent(
"""
program_path=$(mktemp -d)
read -r -d '' SCRIPT << EOM\n
{objective_code}
EOM
printf "%s" "$SCRIPT" > $program_path/ephemeral_objective.py
python3 -u $program_path/ephemeral_objective.py"""
)

# Add objective code to the execute script.
exec_script = exec_script.format(objective_code=objective_code)

# Install Python packages if that is required.
if packages_to_install is not None:
exec_script = (
get_script_for_python_packages(packages_to_install, pip_index_url)
+ exec_script
)

# Return executable script to execute objective function.
return exec_script
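
As a usage illustration (not part of this diff), the new helpers can be chained as follows; the objective function, search range, and package list are made up for the example:

    # Illustrative only: wire a simple objective through the new utils helpers.
    from kubeflow.katib import models
    from kubeflow.katib.utils import utils

    def objective(parameters):
        # A real objective would train a model; here we just print a metric
        # in the "name=value" format that Katib's stdout collector parses.
        print(f"loss={parameters['epochs']}")

    experiment_params = []
    trial_params = []

    # Replace Katib parameter specs with ${trialParameters.*} placeholders and
    # collect the corresponding Experiment and Trial parameter definitions.
    input_params = utils.get_trial_substitutions_from_dict(
        parameters={
            "lr": models.V1beta1ParameterSpec(
                parameter_type="double",
                feasible_space=models.V1beta1FeasibleSpace(min="0.01", max="0.1"),
            ),
            "epochs": 2,
        },
        experiment_params=experiment_params,
        trial_params=trial_params,
    )

    # Build the shell script that the trial container will execute.
    exec_script = utils.get_exec_script_from_objective(
        objective=objective,
        input_params=input_params,
        packages_to_install=["numpy"],
    )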
3 changes: 3 additions & 0 deletions sdk/python/v1beta1/setup.py
@@ -85,4 +85,7 @@
"Topic :: Software Development :: Libraries :: Python Modules",
],
install_requires=REQUIRES,
extras_require={
"huggingface": ["kubeflow-training[huggingface]==1.8.0"],
},
)
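
With this extra in place, the LLM tuning dependencies can be installed with pip install kubeflow-katib[huggingface], which pulls in kubeflow-training[huggingface]==1.8.0 from the Training Operator SDK.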