Skip to content

Commit

Permalink
chore: determined_master_host and friends helm support, better defaul…
Browse files Browse the repository at this point in the history
…ts (#10092)

When `determined_master_ip` is unsettable via Helm and defaults to the service IP, life
with proxies is hard. This change renames `determined_master_ip` to
`determined_master_host` everywhere with some backwards compatibility, defaults
`determined_master_host` to `<det_namespace>.<det_service_name>.svc.cluster.local`
instead of the service IP, and makes all of this overridable in Helm.
  • Loading branch information
stoksc authored Oct 25, 2024
1 parent 6755467 commit b3f928b
Show file tree
Hide file tree
Showing 17 changed files with 228 additions and 74 deletions.
4 changes: 2 additions & 2 deletions .circleci/devcluster/multi-k8s.devcluster.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ stages:
slot_resource_requests:
cpu: 1
kubeconfig_path: /tmp/defaultrm-kubeconf
determined_master_ip: $DOCKER_LOCALHOST
determined_master_host: $DOCKER_LOCALHOST
determined_master_port: 8080
internal_task_gateway:
gateway_name: contour
Expand All @@ -60,7 +60,7 @@ stages:
slot_resource_requests:
cpu: 1
kubeconfig_path: /tmp/additionalrm-kubeconf
determined_master_ip: $DOCKER_LOCALHOST
determined_master_host: $DOCKER_LOCALHOST
determined_master_port: 8080
resource_pools:
- pool_name: additional_pool
2 changes: 1 addition & 1 deletion .circleci/devcluster/single-k8s.devcluster.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -33,5 +33,5 @@ stages:
slot_resource_requests:
cpu: 1
kubeconfig_path: ~/.kube/config
determined_master_ip: $DOCKER_LOCALHOST
determined_master_host: $DOCKER_LOCALHOST
determined_master_port: 8080
9 changes: 9 additions & 0 deletions docs/release-notes/add-host-port-scheme-to-helm.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
:orphan:

**New Features**

- Helm: Support configuring ``determined_master_host``, ``determined_master_port``, and
``determined_master_scheme``. These control how tasks address the Determined API server and are
useful when installations span multiple Kubernetes clusters or there are proxies in between tasks
and the master. Also, ``determined_master_host`` now defaults to the service host,
``<det_namespace>.<det_service_name>.svc.cluster.local``, instead of the service IP.
Original file line number Diff line number Diff line change
Expand Up @@ -156,7 +156,7 @@ the same as the “cluster name” for a given cluster.

If an additional resource manager needs to connect to the Determined master through a gateway
requiring TLS, ``resource_manager.determined_master_scheme`` should be set to ``https``. If
``resource_manager.determined_master_scheme`` is not set ``determined_master_ip`` will assume
``resource_manager.determined_master_scheme`` is not set ``determined_master_host`` will assume
``https`` if the master is terminating TLS and ``http`` otherwise.

*******
Expand Down
15 changes: 15 additions & 0 deletions helm/charts/determined/templates/master-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -224,6 +224,21 @@ stringData:
fluent:
{{- toYaml .Values.fluent | nindent 8}}
{{- end }}
{{- if .Values.determinedMasterHost }}
{{- if .Values.determinedMasterScheme }}
determined_master_scheme: {{ .Values.determinedMasterScheme | quote }}
{{- else if .Values.tlsSecret }}
determined_master_scheme: "https"
{{- else }}
determined_master_scheme: "http"
{{- end }}
determined_master_host: {{ .Values.determinedMasterHost | quote }}
{{- if .Values.determinedMasterPort }}
determined_master_port: {{ .Values.determinedMasterPort }}
{{- else }}
determined_master_port: {{ .Values.masterPort }}
{{- end }}
{{- end }}
default_aux_resource_pool: {{.Values.defaultAuxResourcePool}}
default_compute_resource_pool: {{.Values.defaultComputeResourcePool}}
Expand Down
12 changes: 11 additions & 1 deletion helm/charts/determined/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -409,6 +409,16 @@ resourcePools:
## Configure the initial user password for the cluster
# initialUserPassword:

# determinedMasterHost configures the hostname that tasks launched by the primary resource manager use when
# communicating with our API server. This is useful when installations span multiple Kubernetes clusters and when there
# are proxies in between tasks and the master. It defaults to `<det_namespace>.<det_service_name>.svc.cluster.local`.
# determinedMasterHost:
# determinedMasterPort configures the port for the host above. It defaults to `masterPort` specified elsewhere. Must
# be used with determinedMasterHost or it is ineffective.
# determinedMasterPort:
# determinedMasterScheme configures the scheme for the host and port above. It defaults to `https` if our API server
# is deployed with TLS, else `http`. Must be used with determinedMasterHost or it is ineffective.
# determinedMasterScheme:
# additional_resource_managers:
# - resource_manager:
# type: kubernetes
Expand All @@ -417,7 +427,7 @@ resourcePools:
# default_namespace: default
# kubeconfig_secret_name: additionalrm
# kubeconfig_secret_value: config
# determined_master_ip: 10.11.12.13
# determined_master_host: 10.11.12.13
# determined_master_port: 8080
# resource_pools:
# - pool_name: additional_pool
Expand Down
103 changes: 73 additions & 30 deletions master/cmd/determined-master/root.go
Original file line number Diff line number Diff line change
Expand Up @@ -185,23 +185,20 @@ func getConfig(configMap map[string]interface{}) (*config.Config, error) {
}

func applyBackwardsCompatibility(configMap map[string]interface{}) (map[string]interface{}, error) {
// Preemption timeout moved from __internal to task_container_defaults
if internalMap, ok := configMap["__internal"].(map[string]interface{}); ok {
if oldPreemptTimeout, ok := internalMap["preemption_timeout"]; ok && oldPreemptTimeout != nil {
if preemptionDuration, err := time.ParseDuration(oldPreemptTimeout.(string)); err == nil {
preemptionTimeoutSeconds := int(preemptionDuration.Seconds())
if taskContainerMap, ok := configMap["task_container_defaults"].(map[string]interface{}); ok {
// Only set task_container_defaults from __internal if nil
if taskContainerTimeout, ok := taskContainerMap["preemption_timeout"]; !ok || taskContainerTimeout == nil {
taskContainerMap["preemption_timeout"] = preemptionTimeoutSeconds
configMap["task_container_defaults"] = taskContainerMap
}
}
}
}
delete(internalMap, "preemption_timeout")
err := shimOldRMConfig(configMap)
if err != nil {
return nil, err
}

shimPreemptionTimeout(configMap)
shimDeterminedMasterIP(configMap)
return configMap, nil
}

// Ensure we use either the old schema or the new one.
// Use configMap if RMs are not defined at all, or if they are defined using the new schema.
// If use the old schema, convert it to the new one.
func shimOldRMConfig(configMap map[string]interface{}) error {
const (
defaultVal = "default"
agentVal = "agent"
Expand All @@ -213,20 +210,17 @@ func applyBackwardsCompatibility(configMap map[string]interface{}) (map[string]i
vScheduler, schedulerExisted := configMap["scheduler"]
vProvisioner, provisionerExisted := configMap["provisioner"]

// Ensure we use either the old schema or the new one.
oldRMConfig := schedulerExisted || provisionerExisted
newRMConfig := rmExisted || rpsExisted
if newRMConfig && oldRMConfig {
return nil, errors.New(
return errors.New(
"cannot use the old and the new configuration schema at the same time",
)
}
if !oldRMConfig {
// Use configMap if RMs are not defined at all, or if they are defined using the new schema.
return configMap, nil
return nil
}

// If use the old schema, convert it to the new one.
newScheduler := map[string]interface{}{
"type": "priority",
"fitting_policy": "best",
Expand All @@ -237,38 +231,38 @@ func applyBackwardsCompatibility(configMap map[string]interface{}) (map[string]i
if schedulerExisted {
schedulerMap, ok := vScheduler.(map[string]interface{})
if !ok {
return nil, errors.New("wrong type for scheduler field")
return errors.New("wrong type for scheduler field")
}

if vFit, ok := schedulerMap["fit"]; ok {
newScheduler["fitting_policy"], ok = vFit.(string)
if !ok {
return nil, errors.New("wrong type for scheduler.fit field")
return errors.New("wrong type for scheduler.fit field")
}
}
if vType, ok := schedulerMap["type"]; ok {
newScheduler["type"], ok = vType.(string)
if !ok {
return nil, errors.New("wrong type for scheduler.type field")
return errors.New("wrong type for scheduler.type field")
}
}
if vRP, ok := schedulerMap["resource_provider"]; ok {
rpMap, ok := vRP.(map[string]interface{})
if !ok {
return nil, errors.New("wrong type for scheduler.resource_provider field")
return errors.New("wrong type for scheduler.resource_provider field")
}

vRPType, ok := rpMap["type"]
if ok {
switch vRPTypeStr, ok := vRPType.(string); {
case !ok:
return nil, errors.New("wrong type for scheduler.resource_provider.type field")
return errors.New("wrong type for scheduler.resource_provider.type field")
case vRPTypeStr == defaultVal:
newRM["type"] = agentVal
case vRPTypeStr == kubernetesVal:
newRM["type"] = kubernetesVal
default:
return nil, errors.New("wrong value for scheduler.resource_provider.type field")
return errors.New("wrong value for scheduler.resource_provider.type field")
}
} else {
newRM["type"] = agentVal
Expand Down Expand Up @@ -298,18 +292,18 @@ func applyBackwardsCompatibility(configMap map[string]interface{}) (map[string]i
if provisionerExisted {
provisionerMap, ok := vProvisioner.(map[string]interface{})
if !ok {
return nil, errors.New("wrong type for provisioner field")
return errors.New("wrong type for provisioner field")
}

newRP["provider"] = provisionerMap
if vProvider, ok := provisionerMap["provider"]; ok {
vProviderStr, ok := vProvider.(string)
if !ok {
return nil, errors.New("wrong type for provisioner.provider field")
return errors.New("wrong type for provisioner.provider field")
}

if vProviderStr != "aws" && vProviderStr != "gcp" {
return nil, errors.New("wrong value for provisioner.provider field")
return errors.New("wrong value for provisioner.provider field")
}

provisionerMap["type"] = provisionerMap["provider"]
Expand All @@ -321,6 +315,55 @@ func applyBackwardsCompatibility(configMap map[string]interface{}) (map[string]i

delete(configMap, "scheduler")
delete(configMap, "provisioner")
return nil
}

return configMap, nil
// Preemption timeout moved from __internal to task_container_defaults
// Only set task_container_defaults from __internal if nil.
func shimPreemptionTimeout(configMap map[string]interface{}) {
if internalMap, ok := configMap["__internal"].(map[string]interface{}); ok {
if oldPreemptTimeout, ok := internalMap["preemption_timeout"]; ok && oldPreemptTimeout != nil {
if preemptionDuration, err := time.ParseDuration(oldPreemptTimeout.(string)); err == nil {
preemptionTimeoutSeconds := int(preemptionDuration.Seconds())
if taskContainerMap, ok := configMap["task_container_defaults"].(map[string]interface{}); ok {
if taskContainerTimeout, ok := taskContainerMap["preemption_timeout"]; !ok || taskContainerTimeout == nil {
taskContainerMap["preemption_timeout"] = preemptionTimeoutSeconds
configMap["task_container_defaults"] = taskContainerMap
}
}
}
}
delete(internalMap, "preemption_timeout")
}
}

// Rename from `determined_master_ip` to `determined_master_host` in old KubernetesRM configs.
func shimDeterminedMasterIP(configMap map[string]interface{}) {
shimDeterminedMasterIPInRM := func(rm map[string]any) {
if rm["type"] != "kubernetes" {
return
}
if ip, ok := rm["determined_master_ip"]; ok {
if _, ok := rm["determined_master_host"]; !ok {
rm["determined_master_host"] = ip
delete(rm, "determined_master_ip")
} else {
log.Warn("ignoring duplicated configuration `determined_master_ip`")
delete(rm, "determined_master_ip")
}
}
}
if rmi, ok := configMap["resource_manager"]; ok {
if rm, ok := rmi.(map[string]any); ok {
shimDeterminedMasterIPInRM(rm)
}
}
if rmis, ok := configMap["additional_resource_managers"]; ok {
rms, ok := rmis.([]map[string]any)
if ok {
for _, rm := range rms {
shimDeterminedMasterIPInRM(rm)
}
}
}
}
83 changes: 83 additions & 0 deletions master/cmd/determined-master/root_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -228,6 +228,89 @@ func TestApplyBackwardsCompatibility(t *testing.T) {
},
},
},
{
name: "determined master ip to host rename, ip set",
before: map[string]interface{}{
"resource_manager": map[string]interface{}{
"type": "kubernetes",
"determined_master_ip": "10.0.0.1",
},
"additional_resource_managers": []map[string]any{
{
"type": "kubernetes",
"determined_master_ip": "10.0.0.2",
},
},
},
expected: map[string]interface{}{
"resource_manager": map[string]interface{}{
"type": "kubernetes",
"determined_master_host": "10.0.0.1",
},
"additional_resource_managers": []map[string]any{
{
"type": "kubernetes",
"determined_master_host": "10.0.0.2",
},
},
},
},
{
name: "determined master ip to host rename, both set",
before: map[string]interface{}{
"resource_manager": map[string]interface{}{
"type": "kubernetes",
"determined_master_host": "10.0.0.1",
"determined_master_ip": "10.0.0.1",
},
"additional_resource_managers": []map[string]any{
{
"type": "kubernetes",
"determined_master_host": "10.0.0.2",
"determined_master_ip": "10.0.0.2",
},
},
},
expected: map[string]interface{}{
"resource_manager": map[string]interface{}{
"type": "kubernetes",
"determined_master_host": "10.0.0.1",
},
"additional_resource_managers": []map[string]any{
{
"type": "kubernetes",
"determined_master_host": "10.0.0.2",
},
},
},
},
{
name: "determined master ip to host rename, host set get left alone",
before: map[string]interface{}{
"resource_manager": map[string]interface{}{
"type": "kubernetes",
"determined_master_host": "10.0.0.1",
},
"additional_resource_managers": []map[string]any{
{
"type": "kubernetes",
"determined_master_host": "10.0.0.2",
},
},
},
expected: map[string]interface{}{
"resource_manager": map[string]interface{}{
"type": "kubernetes",
"determined_master_host": "10.0.0.1",
},
"additional_resource_managers": []map[string]any{
{
"type": "kubernetes",
"determined_master_host": "10.0.0.2",
},
},
},
},
}
for ix := range tcs {
tc := tcs[ix]
Expand Down
Loading

0 comments on commit b3f928b

Please sign in to comment.