Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenStack: automatically populate RHCOS image #2473

Merged

Conversation

Fedosin
Copy link
Contributor

@Fedosin Fedosin commented Oct 8, 2019

This PR allows to automatically populate a Glance image if it was not pre-created by the user.

Goals:

  1. Automatically create a Glance image with the required version for the cluster.
  2. The name of the created image must be unique. We have chosen names in format "<InfraID>-rhcos".
  3. Delete the image together with the cluster.
  4. If OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE variable is set and it is not a URL, do not create an image in Glance, but reuse the existing one and do not delete it when deleting the cluster.
  5. If OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE variable is set and it is a URL - create a new Glance image from the URL ignoring rhcos.json file data, delete the Glance image along with the cluster.

Not included in goals of this work:
Generation of the URL from which we download the image file. It is obtained by the existing rhcos.OpenStack function based on the data from rhcos.json file.

Prerequisites:
In OpenShift 4.2, the name of the asset was hardcoded to "rhcos". Although it can be overridden with the env variable, users still have to create an image manually.
This is inconvenient for users and can lead to unpleasant consequences when OpenShift clusters of different versions can accidentally use the same RHCOS image.

Technical solution:
We consider 3 situations:

  1. OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE is not set (default behavior)
    RHCOS image file URL is generated from rhcos.json file by rhcos.OpenStack function. New Glance image called "<InfraID>-rhcos" is created by Terraform.
  2. OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE is set and contains a URL.
    RHCOS image file URL is provided by the variable, rhcos.json data is ignored. New Glance image called "<InfraID>-rhcos" is created by Terraform.
  3. OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE is set and contains a string (not a URL).
    Installer will reuse the existing Glance image defined by the variable, no new Glance images will be created.

Gotchas:

  1. When creating a Glance image, Terraform first downloads the image file to the local hard disk in the ~/.terraform/image-cache folder and does not delete it after. Terrafom won't download the same data again afterwards, but anyway this may cause the hard disk to overflow if images with different versions are downloaded frequently.
  2. When a file is loaded into Glance (has a status "saving"), the image cannot be deleted from OpenStack. You must either wait until the end of the download or explicitly interrupt it on the client side.

Future work:

  1. Verify that the image defined by OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE variable exists.
  2. Check that the image defined by the variable is the only one with such a name.
  3. Checksum validations.

@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 8, 2019
data/data/openstack/main.tf Outdated Show resolved Hide resolved
case azure.Name:
osimage, err = rhcos.VHD(ctx)
case baremetal.Name:
// Note that baremetal IPI currently uses the OpenStack image
// because this contains the necessary ironic config drive
// ignition support, which isn't enabled in the UPI BM images
osimage, err = rhcos.OpenStack(ctx)
case none.Name, vsphere.Name:
case none.Name, openstack.Name, vsphere.Name:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be clusterID.InfraID + "-rhcos" for the openstack case so that you don't have to check for empty value each time?

This isn't immediately obvious why we do things differently for OpenStack. Could you also explain why not using osimage, err = rhcos.OpenStack(ctx) ? I'm thinking it may be so that we can destroy cluster in a tenant without impacting other clusters in the same tenant. Would be nice to add a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clusterID is not generated yet here, so we don't know its infra id.

We do things differently for OpenStack, because image creation is not idempotent there. When we ask Terraform to create "rhcos" image in Glance 5 times it will create 5 different images, which is not true for other platforms.

rhcos.OpenStack(ctx) returns a url where we can find our binary file (like https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.3/43.80.20191002.1/x86_64/rhcos-43.80.20191002.1-openstack.x86_64.qcow2) But we expect osImage to return a Glance image name, "rhcos" for instance. So we can't use this function here directly. Instead, image names are generated later and there we know if this string is empty, then Terraform needs to create a new image "-rhcos".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is anything preventing us from generating the clusterID earlier then?
My point being, we should never have the image name being an empty string. It is either generated based on the clusterID or set via OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE.


// If baseImage is empty then it was not provided by the user and we need to create it.
// For doing this we get the image URL and give it to Terraform.
// Image name in this case will be: <InfraID>-rhcos, i.e. user-rdd6e-rhcos
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/user-rdd6e-rhcos/clustername-rdd6e-rhcos/

baseImage := string(*rhcosImage)
var baseImageURL string

// If baseImage is empty then it was not provided by the user and we need to create it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to leave the possibility to provide a base image, the installer pins the image for a reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an OCP requirement to give the possibility to provide a base image for the installer by setting OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE variable. Also this is good for CI, where we don't need to create new images every time, since we can reuse the existing one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough.

@Fedosin
Copy link
Contributor Author

Fedosin commented Oct 8, 2019

/retest

ctx, cancel := context.WithTimeout(context.TODO(), 30*time.Second)
defer cancel()

baseImageURL, err = rhcosplatforms.OpenStack(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we not calling this function in the rhcos image asset 8a4153b#diff-a5ccce89cb15a18ce4324bf3aa273a6fL81

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rhcosplatforms.OpenStack(ctx) returns a url ( https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.3/43.80.20191002.1/x86_64/rhcos-43.80.20191002.1-openstack.x86_64.qcow2), but we expect osImage function to return a Glance image name (i.e. "rhcos"), so we can't use the function there directly.

@cgwalters
Copy link
Member

Until now, no "production" use case depended on the "RHCOS builds API" at https://releases-art-rhcos.svc.ci.openshift.org/

One thing we probably should do is have the downloader try finding content first at https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/ or so.

@cgwalters
Copy link
Member

Also, is the download code verifying the download using the sha256?

@Fedosin
Copy link
Contributor Author

Fedosin commented Oct 9, 2019

/retest

@Fedosin Fedosin force-pushed the openstack_auto_image_2 branch 2 times, most recently from 6eb858f to a8f2005 Compare October 9, 2019 12:14
@Fedosin
Copy link
Contributor Author

Fedosin commented Oct 9, 2019

/test e2e-openstack

@Fedosin
Copy link
Contributor Author

Fedosin commented Oct 10, 2019

/retest

@Fedosin Fedosin force-pushed the openstack_auto_image_2 branch from a8f2005 to 8029723 Compare October 10, 2019 13:44
@Fedosin
Copy link
Contributor Author

Fedosin commented Oct 10, 2019

@cgwalters download urls are taken from rhcos.json file. I think we can update rhcos.OpenStackImageURL function in a separate PR to add alternative locations there

checksum validation has been added!

@mandre
Copy link
Member

mandre commented Oct 10, 2019

@cgwalters download urls are taken from rhcos.json file. I think we can update rhcos.OpenStackImageURL function in a separate PR to add alternative locations there

checksum validation has been added!

The tf verify_checksum option checks integrity of image in glance (it's enabled by default BTW, but your change is still welcome because better be explicit), we also need to check for integrity of the downloaded image (verify sha is the same as https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest/sha256sum.txt).

@Fedosin Fedosin force-pushed the openstack_auto_image_2 branch from 3203470 to 028678b Compare October 10, 2019 16:53
@Fedosin Fedosin force-pushed the openstack_auto_image_2 branch 2 times, most recently from a727416 to f0b2535 Compare October 10, 2019 21:24
@Fedosin Fedosin force-pushed the openstack_auto_image_2 branch 2 times, most recently from a26d0ba to 3b90814 Compare October 11, 2019 07:54

resource "openstack_images_image_v2" "base_image" {
// we need to create a new image only if the base image url has been provided + base image name is <cluster_id>-rhcos
count = var.openstack_base_image_url != "" && var.openstack_base_image == "${var.cluster_id}-rhcos"? 1 : 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.
terraform fmt wants to add a space before the question mark.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, done

@Fedosin Fedosin force-pushed the openstack_auto_image_2 branch from 3b90814 to 4801e00 Compare October 11, 2019 13:05
@cgwalters
Copy link
Member

we also need to check for integrity of the downloaded image (verify sha is the same as https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest/sha256sum.txt).

We already have that in the installer itself.

And more importantly, if the user has GPG verified their installer binary (or its checksum via another means), then that transitively covers the openstack image too. If we fetch a checksum file separately over HTTPS, we only have the TLS guarantee (any CA could have MITM'd).

@cgwalters
Copy link
Member

Just for cross-reference: supporting this was part of the idea in #1399
Also, if we ever implement openshift/os#381 then this code would be incorporated into that operator ideally (although tricky bootstrap issues with installer bootstrap).

@Fedosin
Copy link
Contributor Author

Fedosin commented Oct 11, 2019

@cgwalters Yeah, I saw that! The issue is that OpenStack Glance doesn't support sha256 checksums (md5 only), so we can't ask Terraform to verify it directly.
https://github.com/openstack/glance/blob/master/api-ref/source/v2/samples/schemas-image-show-response.json#L27

My intention was to merge this patch first, because we need the feature badly, and then figure out how we can add sha256 checksums support.

if baseImage == clusterID.InfraID+"-rhcos" {
var err error
ctx, cancel := context.WithTimeout(context.TODO(), 30*time.Second)
defer cancel()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this doing? It looks like it's always cancelling a context that is unused?

@cgwalters
Copy link
Member

OK, overall
/approve
However...my concern about this being the only "production" use case of the "RHCOS build endpoint" remains. I think this is OK to land in master, but if e.g. we were to backport this to 4.2 (and before a product release for 4.3) we should try hard to figure out how the installer can reliably reference artifacts hosted at e.g. https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/

One idea I had in a different PR is to have a URL endpoint like

https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos?sha256=0123abc...

I.e. we fetch by digest, same way we do container images today. So the installer would always try mirror.openshift.com first, but for in-development images that aren't mirrored automatically yet, the installer could fall back.

(And probably we should change things so that "released" installers only use mirror.openshift.com or so)

@abhinavdahiya
Copy link
Contributor

Still looking at the change, but I had some more fundamental questions.

The goal for OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE is to override the source of the image.

for Azure, it is the link of the vhd file in some Azure storage account.

before this the value for openstack was Glance image, and I think now that value should be the link the openstack QCOW.

One this to note is, these envs are internal and development only and I would be fine changing their behavior..
^^ maybe more insight/decision on this by @sdodson @crawford

=========

When going from proto-image (links) to platform-based Images, the machine objects will always point to the platform-based Image. So I would prefer the machine objects should be updated to use clusterid-rhcos rather the the osImage like we go for Azure and GCP

With this the terraform passing should be simple as you only provide the URL to the terraform and it creates the image corresponding/matches ^^

@Fedosin Fedosin force-pushed the openstack_auto_image_2 branch from 4801e00 to 723356a Compare October 14, 2019 14:23
@mandre
Copy link
Member

mandre commented Oct 14, 2019

/hold

Just tried this PR. Because of gophercloud limitation (?) this basically pulls the 1.8GB rhcos image locally before uploading it to glance and it does it for every deployment. It also keeps all images cached in ~/.terraform/image_cache/.
Unless you have fiber at home you're pretty much doomed. We need a way for developers and possibly CI to bypass this image upload.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 14, 2019
@Fedosin Fedosin force-pushed the openstack_auto_image_2 branch from 723356a to cfc511d Compare October 14, 2019 18:16
@sdodson
Copy link
Member

sdodson commented Oct 14, 2019

1. When creating a Glance image, Terraform first downloads the image file to the local hard disk in the ~/.terraform/image-cache folder and does not delete it after. This may cause the hard disk to overflow if the images are downloaded frequently.

This happens once per invocation of openshift-install or once each time the source image changes?

@Fedosin
Copy link
Contributor Author

Fedosin commented Oct 14, 2019

@sdodson once each time the source image changes, but it doesn't delete the data after... I will update the PR description to make it more clear.

@Fedosin
Copy link
Contributor Author

Fedosin commented Oct 14, 2019

/test e2e-openstack

@Fedosin
Copy link
Contributor Author

Fedosin commented Oct 14, 2019

/retest


resource "openstack_images_image_v2" "base_image" {
// we need to create a new image only if the base image url has been provided, plus base image name is <cluster_id>-rhcos
count = var.openstack_base_image_url != "" && var.openstack_base_image_name == "${var.cluster_id}-rhcos" ? 1 : 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be simplified to var.openstack_base_image_url != "" right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather leave it like that, just in case the user exports TF_VAR_openstack_base_image_url. We've had an issue in our CI where a previous version of this patch created a duplicate rhcos image, it caused all the jobs to fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is an optional check to prevent possible regressions that can break our ci

Comment on lines 186 to 193
imageName := clusterID.InfraID + "-rhcos"
// Here we check whether rhcosImage is a url or not. If this is the case, it means that the image was
// created by the installer and we have to use the universal name "<infraID>-rhcos". Otherwise, it means
// that we are given the name of the pre-created Glance image, which we should use for node provisioning.
_, err = url.ParseRequestURI(string(*rhcosImage))
if err != nil {
imageName = string(*rhcosImage)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it... Absolutely agree that it makes sense, but I don't know where to put this function properly. There is no a place for OpenStack util functions, unfortunately. I'll try to figure out where we can put it.


imageName := clusterID.InfraID + "-rhcos"
// Here we check whether rhcosImage is a url or not. If this is the case, it means that the image was
// created by the installer and we have to use the universal name "<infraID>-rhcos". Otherwise, it means
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: rhcosImage being an URL doesn't necessary mean it was created by the installer (it's possible that the user gives an URL to OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE) but it means it needs to be imported to glance under a unique name, clusterID.InfraID + "-rhcos"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly right... If rhcosImage is a URL, then the installer always creates an image from this URL with name clusterID.InfraID + "-rhcos". It doesn't matter how the URL was obtained, from rhcos.json or from OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE variable.


resource "openstack_images_image_v2" "base_image" {
// we need to create a new image only if the base image url has been provided, plus base image name is <cluster_id>-rhcos
count = var.openstack_base_image_url != "" && var.openstack_base_image_name == "${var.cluster_id}-rhcos" ? 1 : 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather leave it like that, just in case the user exports TF_VAR_openstack_base_image_url. We've had an issue in our CI where a previous version of this patch created a duplicate rhcos image, it caused all the jobs to fail.

description = "Name of the base image to use for the nodes."
}

variable "openstack_base_image_url" {
type = string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a default value, otherwise it fails with The input variable "openstack_base_image_url" has not been assigned a value. when exporting OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny thing, I did that but forgot to git commit --amend it and pushed the patch without this change. Agree, this is important.

@Fedosin Fedosin force-pushed the openstack_auto_image_2 branch from cfc511d to 545daea Compare October 15, 2019 10:08
@openshift-ci-robot
Copy link
Contributor

@Fedosin: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-aws-scaleup-rhel7 545daea link /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@Fedosin
Copy link
Contributor Author

Fedosin commented Oct 15, 2019

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 15, 2019
@mandre
Copy link
Member

mandre commented Oct 15, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 15, 2019
@abhinavdahiya
Copy link
Contributor

/approve

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, cgwalters, Fedosin, mandre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 15, 2019
@openshift-merge-robot openshift-merge-robot merged commit b635933 into openshift:master Oct 15, 2019
@Fedosin Fedosin deleted the openstack_auto_image_2 branch November 1, 2019 11:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants