diff --git a/UserGuide.pdf b/UserGuide.pdf
index 49dda4beb..e14c3f642 100644
Binary files a/UserGuide.pdf and b/UserGuide.pdf differ
diff --git a/documentation/userguide/docs/datasets.md b/documentation/userguide/docs/datasets.md
index 8a6016a40..b8e031678 100644
--- a/documentation/userguide/docs/datasets.md
+++ b/documentation/userguide/docs/datasets.md
@@ -91,15 +91,15 @@ On left pane choose **Datasets**, then click on the **Create** button. Fill the
| Confidentiality | Level of confidentiality: Unclassified, Official or Secret | Yes | Yes | Secret
| Topics | Topics that can later be used in the Catalog | Yes, at least 1 | Yes | Finance
| Tags | Tags that can later be used in the Catalog | Yes, at least 1 | Yes | deleteme, ds
-
+| Auto Approval | Whether shares for this dataset need approval from dataset owners/stewards | Yes (default `Disabled`) | Yes | Disabled, Enabled

## :material-import: **Import a dataset**

-If you already have data stored on Amazon S3 buckets in your data.all environment, data.all got you covered with the import feature. In addition to
+If you already have data stored on Amazon S3 buckets in your data.all environment, data.all has got you covered with the import feature. In addition to
the fields of a newly created dataset you have to specify the S3 bucket and optionally a Glue database and a KMS key Alias. If the Glue database is
left empty, data.all will create a Glue database pointing at the S3 Bucket. As for the KMS key Alias, data.all assumes that if nothing is specified
-the S3 Bucket is encrypted with SSE-S3 encryption.
+the S3 Bucket is encrypted with SSE-S3 encryption. Data.all performs a validation check to ensure the KMS Key Alias provided (if any) is the one that encrypts the S3 Bucket specified.

!!! danger "Imported KMS key and S3 Bucket policies requirements"
    Data.all pivot role will handle data sharing on the imported Bucket and KMS key (if imported). Make sure that

### KMS key policy

-In the KMS key policy we need to grant explicit permission to the pivot role. Note that this block is needed even if
-permissions for the principal `"AWS": "arn:aws:iam::111122223333:root"` are given.
+In the KMS key policy we need to grant explicit permission to the pivot role. At a minimum, the following permissions are needed for the pivotRole:

```
{
@@ -126,29 +125,45 @@ permissions for the principal `"AWS": "arn:aws:iam::111122223333:root"` are give
        "kms:ReEncrypt*",
        "kms:TagResource",
-        "kms:UntagResource"
+        "kms:UntagResource",
+        "kms:DescribeKey"
    ],
    "Resource": "*"
}
```

+!!!success "Update imported Datasets"
+    Imported keys are an addition of the V1.6.0 release. Any previously imported bucket will have a KMS Key Alias set to `Undefined`.
+    If that is the case and you want to update the Dataset and import a KMS key Alias, data.all lets you edit the Dataset in the
+    **Edit** window.
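+
+For reference, a complete key policy statement granting the pivot role the permissions listed above could look like the following sketch. This is an illustration rather than the exact policy data.all requires: the `Sid`, the example account ID, and the default pivot role name `dataallPivotRole` are placeholders, and only the actions visible in the snippet above are listed.
+
+```json
+{
+    "Sid": "AllowDataAllPivotRole",
+    "Effect": "Allow",
+    "Principal": {
+        "AWS": "arn:aws:iam::111122223333:role/dataallPivotRole"
+    },
+    "Action": [
+        "kms:ReEncrypt*",
+        "kms:TagResource",
+        "kms:UntagResource",
+        "kms:DescribeKey"
+    ],
+    "Resource": "*"
+}
+```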
-
-
-
+![import_dataset](pictures/datasets/import_dataset.png#zoom#shadow)

| Field | Description | Required | Editable |Example
|------------------------|-------------------------------------------------------------------------------------------------|----------|----------|-------------
| Amazon S3 bucket name | Name of the S3 bucket you want to import | Yes | No |DOC-EXAMPLE-BUCKET
-| Amazon KMS key Alias | Alias of the KMS key used to encrypt the S3 Bucket (do not include alias/, just ) | Yes | No |somealias
+| Amazon KMS key Alias | Alias of the KMS key used to encrypt the S3 Bucket (do not include alias/, just ) | No | No |somealias
| AWS Glue database name | Name of the Glue database that you want to import | No | No |anyDatabase

-!!!success "Update imported Datasets"
-    Imported keys is an addition of V1.6.0 release. Any previously imported bucket will have a KMS Key Alias set to `Undefined`.
-    If that is the case and you want to update the Dataset and import a KMS key Alias, data.all let's you edit the Dataset on the
-    **Edit** window.
-![import_dataset](pictures/datasets/import_dataset.png#zoom#shadow)
+### (Going Further) Support for Datasets with Externally-Managed Glue Catalog
+
+If the dataset you are trying to import relates to a Glue Database that is managed in a separate account, data.all's import dataset feature can also handle importing and sharing these types of datasets in data.all. Assuming the following pre-requisites are complete:
+
+- There exists an AWS Account (i.e. the Catalog Account) which is:
+    - Onboarded as a data.all environment (e.g. Env A)
+    - Contains the Glue Database with Location URI (as S3 Path from Dataset Producer Account) AND Tables
+    - Glue Database has a resource tag `owner_account_id=`
+    - Data Lake Location registered in LakeFormation with the role used to register having permissions to the S3 Bucket from Dataset Producer Account
+    - Resource Link created on the Glue Database to grant permission for the Dataset Producer Account on the Database and Tables
+
+- There exists another AWS Account (i.e. the Dataset Producer Account) which is:
+    - Onboarded as a data.all environment (e.g. Env B)
+    - Contains the S3 Bucket that contains the data (used as S3 Path in Catalog Account)
+
+The data.all producer, a member of Env B Team(s), would import the dataset specifying the S3 bucket name as the bucket that exists in the Dataset Producer Account and the Glue database name as the Glue DB resource link name in the Dataset Producer Account.
+
+This dataset will then be properly imported and can be discovered and shared the same way as any other dataset in data.all.

## :material-card-search-outline: **Navigate dataset tabs**

@@ -182,7 +197,7 @@ Moreover, AWS information related to the resources created by the dataset CloudF
AWS Account, Dataset S3 bucket, Glue database, IAM role and KMS Alias. You can also assume this IAM role to access
the S3 bucket in the AWS console by clicking on the **S3 bucket** button.

-Alternatively, click on **AWS Credentials** to obtain programmatic access to the S3 bucket.
+Alternatively, click on **AWS Credentials** to obtain programmatic access to the S3 bucket (only available if `modules.dataset.features.aws_actions` is set to `True` in the `config.json` used for deployment of data.all).

![overview](pictures/datasets/dataset_overview.png#zoom#shadow)

@@ -242,11 +257,6 @@ Catalog and then click on Synchronize as we did in step 3.
- Or directly, migrating from Hive Metastore.
- there are more for sure :)
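+
+As a minimal sketch of one of these options, a table created outside of data.all can be registered in the dataset's Glue database with the AWS CLI and then picked up with **Synchronize**. All names below are hypothetical, reusing the example values from the import form above:
+
+```bash
+# Register a hypothetical table in the dataset's Glue database,
+# then synchronize it from the data.all UI
+aws glue create-table \
+  --database-name anyDatabase \
+  --table-input '{
+    "Name": "my_table",
+    "StorageDescriptor": {
+      "Columns": [{"Name": "id", "Type": "string"}],
+      "Location": "s3://DOC-EXAMPLE-BUCKET/my_table/"
+    }
+  }'
+```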
abstract "Use data.all pipelines to register Glue tables" - data.all pipelines can be used to transform your data including the creation of Glue tables using - AWS Glue pyspark extension . - Visit the pipelines section for more details. - ### Folders diff --git a/documentation/userguide/docs/environments.md b/documentation/userguide/docs/environments.md index 70f2f55ba..a31bf1ca7 100644 --- a/documentation/userguide/docs/environments.md +++ b/documentation/userguide/docs/environments.md @@ -7,7 +7,7 @@ users store data and work with data.** !!! danger "One AWS account, One environment" To ensure correct data access and AWS resources isolation, onboard one environment in each AWS account. - Despite being possible, **we strongly discourage users to use the same AWS account for multiple environments**. + **We strongly discourage users to use the same AWS account for multiple environments**. ## :material-hammer-screwdriver: **AWS account Pre-requisites** *data.all* does not create AWS accounts. You need to provide an AWS account and complete the following bootstraping @@ -49,8 +49,7 @@ cdk bootstrap --trust DATA.ALL_AWS_ACCOUNT_NUMBER -c @aws-cdk/core:newStyleStac In the above command we define the `--cloudformation-execution-policies` to use the AdministratorAccess policy `arn:aws:iam::aws:policy/AdministratorAccess`. This is the default policy that CDK uses to deploy resources, nevertheless it is possible to restrict it to any IAM policy created in the account. -In V1.6.0 a more restricted policy is provided and directly downloadable from the UI. This more restrictive policy can be optionally used as -`--cloudformation-execution-policies` for the CDK Execution role. +In V1.6.0 and later, a more restricted policy is provided and directly downloadable from the UI. This more restrictive policy can be optionally passed in to the parameter `--cloudformation-execution-policies` instead of `arn:aws:iam::aws:policy/AdministratorAccess` for the CDK Execution role. ### 2. (For manual) Pivot role @@ -72,27 +71,13 @@ available in data.all UI to create the role named **dataallPivotRole**. If you have existing environments that were linked to data.all using a manually created Pivot Role you can still benefit from V1.5.0 `enable_pivot_role_auto_create` feature. You just need to update that parameter in the `cdk.json` configuration of your deployment. Once the CICD pipeline has completed: new linked environments - will contain the nested cdk-pivotRole stack (no actions needed) and existing environments can be updated by: a) manually, - by clicking on "update stack" in the environment>stack tab b) automatically, wait for the `stack-updater` ECS task that - runs daily overnight c) automatically, set the added `enable_update_dataall_stacks_in_cicd_pipeline` parameter to `true` in - the `cdk.json` config file. The `stack-updater` ECS task will be triggered from the CICD pipeline + will contain the nested cdk-pivotRole stack (no actions needed) and existing environments can be updated by: +
a) manually, by clicking on "update stack" in the environment --> stack tab +
b) automatically, wait for the `stack-updater` ECS task that runs daily overnight +
c) automatically, set the added `enable_update_dataall_stacks_in_cicd_pipeline` parameter to `true` in the `cdk.json` config file. The `stack-updater` ECS task will be triggered from the CICD pipeline

-### 3. (For new accounts) AWS Lake Formation Service role
-*data.all* relies on AWS Lake Formation to manage access to your structured data.
-If AWS Lake Formation has never been
-activated on your AWS account, you need to create
-a service-linked role, using the following command:
-```bash
-aws iam create-service-linked-role --aws-service-name lakeformation.amazonaws.com
-```
-
-!!! danger "Service link creation error"
-    If you receive: An error occurred (InvalidInput) when calling the CreateServiceLinkedRole operation: Service
-    role name AWSServiceRoleForLakeFormationDataAccess has been taken in this account, please try a different suffix.
-    You can skip this step, as this indicates the Lake formation service-linked role exists.
-
-### 4. (For Dashboards) Subscribe to Amazon Quicksight
+### 3. (For Dashboards) Subscribe to Amazon Quicksight

This is an optional step. To link environments with Dashboards enabled, you will also need a running Amazon QuickSight
subscription on the bootstrapped account. If you have not subscribed to Quicksight before, go to your AWS account and choose the
Enterprise option as shown below:

![quicksight](pictures/environments/boot_qs_2.png#zoom#shadow)

-### 5. (For ML Studio) Delete or adapt the default VPC
-If ML Studio is enabled, data.all checks if there is an existing SageMaker Studio domain. If there is an existing domain
-it will use it to create ML Studio profiles. If no pre-existing domain is found, data.all will create a new one.
+### 4. (For ML Studio) Specifying a VPC or using the default
+If ML Studio is enabled, data.all will create a new SageMaker Studio domain in your AWS Account and use the domain later on to create ML Studio profiles.

Prior to V1.5.0, data.all always used the default VPC to create a new SageMaker domain. The default VPC then had to be
customized to fulfill the networking requirements specified in the Sagemaker
[documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-and-internet-access.html) for VPCOnly domains.
-In V1.5.0 we introduce the creation of a suitable VPC for SageMaker as part of the environment stack. However, it is not possible to edit the VPC used by a SageMaker domain, it requires deletion and re-creation. To allow backwards
+In V1.5.0 we introduce the creation of a suitable VPC for SageMaker as part of the environment stack. However, it is not possible to edit the VPC used by a SageMaker Studio domain; changing it requires deletion and re-creation. To allow backwards
compatibility and not delete the pre-existing domains, in V1.5.0 the default behavior is still to use the default VPC.

-Data.all will create a SageMaker VPC:
-- For new environments: (link environment)
-    - if there is not a pre-existing SageMaker Studio domain
-    - if there is not a default VPC in the account
-- For pre-existing environments: (update environment)
-    - if all ML Studio profiles have been deleted (from CloudFormation as well)
-    - if there is not a pre-existing SageMaker Studio domain
-    - if the default VPC has been deleted in the account
+In V2.2.0, we introduced the ability to select the VPC ID and Subnet IDs in which to deploy the VPC-Only Sagemaker Studio domain.
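+
+To check whether a default VPC exists in the target account and region (relevant to the selection rules below), one convenience option is the AWS CLI; this is just an illustrative check, assuming the CLI is configured for that account:
+
+```bash
+# Returns the ID of the default VPC, if one exists in the configured region
+aws ec2 describe-vpcs --filters Name=isDefault,Values=true --query 'Vpcs[].VpcId'
+```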
+
+Data.all applies the following rules to establish which VPC to use for Sagemaker Studio domain creation:
+
+- If ML Studio is enabled with VPC and subnet IDs specified
+    - Use the specified VPC and subnet IDs
+- If ML Studio is enabled with no VPC/subnet IDs specified
+    - default VPC exists --> Uses the default VPC and all available subnets
+    - default VPC does not exist --> Creates a new VPC and uses its private subnets
+
+Pre-existing environments from older versions of data.all will have their Sagemaker Studio domain remain unchanged if already enabled. Users can get a better understanding of what VPC configuration is being used by navigating to the environment --> MLStudio Tab in the data.all UI once the environment stack is created.
+

## :material-new-box: **Link an environment**

### Necessary permissions

@@ -162,20 +150,19 @@ Navigate to your organization, click on the **Link Environment** button, and fil
| Short description | Short description about the environment | No | Yes |Finance department teams
| Account number | AWS bootstrapped account mapped to the environment | Yes | No |111111111111
| Region | AWS region | Yes | No |Europe (Ireland)
-| IAM Role name | Alternative name of the environment IAM role | No | No |anotherRoleName
+| IAM Role ARN | ARN of an alternative environment IAM role | No | No |arn:aws:iam::111111111111:role/anotherRoleName
| Resources prefix | Prefix for all AWS resources created in this environment. Only (^[a-z-]*$) | Yes | Yes |fin
| Team | Name of the group initially assigned to this environment | Yes | No |FinancesAdmin
| Tags | Tags that can later be used in the Catalog | Yes | Yes |finance, test
-| VPC Identifier | VPC provided to host the environment resources instead than the default one created by *data.all* | No | No | vpc-......
-| Public subnets | Public subnets provided to host the environment resources instead than the default created by *data.all* | No | No | subnet-....
-| Private subnets | Private subnets provided to host the environment resources instead than the default created by *data.all* | No | No | subnet-.....
+| ML Studio VPC ID | VPC to host the environment Sagemaker Studio domain (if ML Studio is enabled) instead of the default VPC or the VPC created by *data.all* | No | No | vpc-......
+| ML Studio Subnet ID(s) | Subnet(s) to host the environment Sagemaker Studio domain (if ML Studio is enabled) instead of the default subnets or the subnets created by *data.all* | No | No | subnet-....

**Features Management**

An environment is defined as a workspace and in this workspace we can flexibly activate or deactivate different
features, adapting the workspace to the teams' needs. If you want to use Dashboards, you need to complete the optional
-fourth step explained in the previous chapter "Bootstrap your AWS account".
+third step explained in the previous chapter "Bootstrap your AWS account".

!!! success "This is not set in stone!"
    Don't worry if you change your mind, features are editable. You can always update

@@ -192,6 +179,7 @@ the environment organization. There are several tabs just below the environment

- Overview: summary of environment information and AWS console and credential access.
- Teams: list of all teams onboarded to this environment.
- Datasets: list of all datasets owned by and shared with this environment
+- MLStudio: summary of Sagemaker Studio domain configuration (if enabled)
- Networks: VPCs created and owned by the environment
- Subscriptions: SNS topic subscriptions enabled or disabled in the environment
- Tags: editable key-value tags

@@ -238,11 +226,11 @@ Note that we can keep the environment CloudFormation stack. What is this for? Th
using the environment resources (IAM roles, etc) created by *data.all* but outside of *data.all*

### :material-plus-network-outline: **Create networks**
-Networks are VPCs created from *data.all* and belonging to an environment and team. To create a network, click in the
+Networks are pre-existing VPCs that are onboarded to *data.all* and belong to an environment and team. To create a network, click on the
**Networks** tab in the environment window, then click on **Add** and finally fill the following form.

-!!!abstract "I need an example!"
-    What is the advantage of using networks from *data.all*? ....[MISSING INFO]
+!!!abstract "Using Networks"
+    After onboarding your network(s) in *data.all*, users can easily select the VPC and Subnet information of that network to seamlessly deploy new resources in data.all that require VPC configurations, such as data.all Notebooks. For example, if a user wants to create a notebook in their environment after onboarding a network, the VPC and Subnet ID fields in the create notebook form on data.all will auto-populate with the VPC and subnet information for the user to easily select (rather than navigating to and from the AWS Console)!

![](pictures/environments/env_networks.png#zoom#shadow)

@@ -303,7 +291,8 @@ creator of the environment or invited to the environment). In the following pict
For the environment admin team and for each team invited to the environment *data.all* creates an IAM role. From the
**Teams** tab of the environment we can assume our team's IAM role to get access to the AWS Console or copy the credentials to the
-clipboard. Both options are under the "Actions" column in the Teams table.
+clipboard. Both options are under the "Actions" column in the Teams table (these options are only available if `core.features.env_aws_actions` is set to `True` in the `config.json` used for deployment of data.all).
+

![](pictures/environments/env_teams_3.png#zoom#shadow)

@@ -376,16 +365,19 @@ Any IAM role that exists in the Environment AWS Account can be added to
*data.all* can remove the consumption role.

+A window like the following will appear where you enter a name for the consumption role in data.all, the ARN of the IAM role, the Team that owns the consumption role, and whether data.all should manage the consumption role. Enabling "data.all managed" on the consumption role allows data.all to attach IAM policies to the role used for data.all related activities, such as sharing data, rather than having a user manually add those policies to the role.
+
+Only members of this team and tenants of *data.all* can edit or remove the consumption role.

![](pictures/environments/env_consumption_roles_2.png#zoom#shadow)

+![](pictures/environments/env_consumption_roles_3.png#zoom#shadow)
+
!!! success "Existing roles only"
    *data.all* checks whether that IAM role exists in the AWS account of the environment before adding it as a consumption role.

**Data Access**

- By default, a new consumption role does NOT have access to any data in *data.all*.
-- The team that owns the consumption role needs to open a share request for the consumption role as shown in the picture below. +- The team that owns the consumption role needs to open a share request for the consumption role as discussed more in the Discover --> Shares section. diff --git a/documentation/userguide/docs/notebooks.md b/documentation/userguide/docs/notebooks.md index 6c4692a93..5535f624d 100644 --- a/documentation/userguide/docs/notebooks.md +++ b/documentation/userguide/docs/notebooks.md @@ -28,7 +28,7 @@ To create a Notebook, go to Notebooks on the left pane and click on the **Create | Organization (auto-filled) | Organization of the environment | Yes | No | AnyCompany EMEA | Team | Team that owns the notebook | Yes | No |DataScienceTeam | VPC Identifier | VPC provided to host the notebook | No | No | vpc-...... -| Public subnets | Public subnets provided to host the notebook | No | No | subnet-.... +| Subnets | Subnets provided to host the notebook | No | No | subnet-.... | Instance type | [ml.t3.medium, ml.t3.large, ml.m5.xlarge] | Yes | No |ml.t3.medium | Volume size | [32, 64, 128, 256] | Yes | No |32 diff --git a/documentation/userguide/docs/pictures/datasets/dat_create_form.png b/documentation/userguide/docs/pictures/datasets/dat_create_form.png index 30bcbbea2..c4f50956c 100644 Binary files a/documentation/userguide/docs/pictures/datasets/dat_create_form.png and b/documentation/userguide/docs/pictures/datasets/dat_create_form.png differ diff --git a/documentation/userguide/docs/pictures/datasets/import_dataset.png b/documentation/userguide/docs/pictures/datasets/import_dataset.png index 6b290498f..5d14b7960 100644 Binary files a/documentation/userguide/docs/pictures/datasets/import_dataset.png and b/documentation/userguide/docs/pictures/datasets/import_dataset.png differ diff --git a/documentation/userguide/docs/pictures/environments/env_consumption_roles_2.png b/documentation/userguide/docs/pictures/environments/env_consumption_roles_2.png index f05d124dc..88e34ca80 100644 Binary files a/documentation/userguide/docs/pictures/environments/env_consumption_roles_2.png and b/documentation/userguide/docs/pictures/environments/env_consumption_roles_2.png differ diff --git a/documentation/userguide/docs/pictures/environments/env_consumption_roles_3.png b/documentation/userguide/docs/pictures/environments/env_consumption_roles_3.png new file mode 100644 index 000000000..2ce7182db Binary files /dev/null and b/documentation/userguide/docs/pictures/environments/env_consumption_roles_3.png differ diff --git a/documentation/userguide/docs/pictures/shares/share_reapply.png b/documentation/userguide/docs/pictures/shares/share_reapply.png new file mode 100644 index 000000000..924144592 Binary files /dev/null and b/documentation/userguide/docs/pictures/shares/share_reapply.png differ diff --git a/documentation/userguide/docs/pictures/shares/share_verify.png b/documentation/userguide/docs/pictures/shares/share_verify.png new file mode 100644 index 000000000..6605a158c Binary files /dev/null and b/documentation/userguide/docs/pictures/shares/share_verify.png differ diff --git a/documentation/userguide/docs/pictures/shares/share_verify_dataset.png b/documentation/userguide/docs/pictures/shares/share_verify_dataset.png new file mode 100644 index 000000000..ad5b5c756 Binary files /dev/null and b/documentation/userguide/docs/pictures/shares/share_verify_dataset.png differ diff --git a/documentation/userguide/docs/pictures/shares/shares_2_2.png 
b/documentation/userguide/docs/pictures/shares/shares_2_2.png
index b8f206744..e68acfd3f 100644
Binary files a/documentation/userguide/docs/pictures/shares/shares_2_2.png and b/documentation/userguide/docs/pictures/shares/shares_2_2.png differ
diff --git a/documentation/userguide/docs/shares.md b/documentation/userguide/docs/shares.md
index 69ec345e8..143fdff09 100644
--- a/documentation/userguide/docs/shares.md
+++ b/documentation/userguide/docs/shares.md
@@ -11,10 +11,12 @@ to create access permissions to tables, meaning that no data is copied between
Under-the-hood, folders are prefixes inside the dataset S3 bucket. To create sharing of folders in data.all, we create
an S3 access point per requester group to handle its access to specific prefixes in the dataset.

+data.all also supports sharing an entire S3 Bucket with requesters using IAM permissions and S3/KMS policies if desired.
+
**Concepts**

- Share request or Share Object: one for each dataset and requester team.
-- Share Item refers to the individual tables and folders that are added to the Share request.
+- Share Item refers to the individual tables, folders, or S3 Bucket that are added to the Share request.

**Sharing workflow**

@@ -54,7 +56,9 @@ The following window will open. Choose your target environment and team and opti

![share_request_form](pictures/shares/shares_2_1.png#zoom#shadow)

-If instead of to a team, you want to request access for a Consumption role, add it to the request as in the picture below.
+If, instead of a team, you want to request access for a Consumption role, add it to the request as in the picture below.
+
+NOTE: If the selected consumption role is not data.all managed, you will have the option to allow data.all to attach the share policies to the consumption role for this particular share object (if not enabled here, you will have to manually attach the share policies to gain access to the data).

![share_request_form](pictures/shares/shares_2_2.png#zoom#shadow)

@@ -122,6 +126,28 @@ regards to the dataset is `SHARED`.

![accept_share](pictures/shares/shares_dataset.png#zoom#shadow)

+## **Verify (and Re-apply) Items**
+
+As of V2.3 of data.all, share requesters or approvers can verify the health status of the share items within their share request from the data.all UI. Any set of share items in a shared state (i.e. `SHARE_SUCCEEDED` or `REVOKE_FAILED`) can be selected to start a verify share process.
+
+![share_verify](pictures/shares/share_verify.png#zoom#shadow)
+
+Upon completion of the verify share process, each share item's healthStatus is updated (i.e. `Healthy` or `Unhealthy`) along with a timestamp representing the last verification time. If a share item is in an `Unhealthy` health status, a health message is also included detailing what part of the share is in an unhealthy state.
+
+In addition to running a verify share process on particular items, dataset owners can run it on multiple share objects associated with a particular dataset. Navigating to the Dataset --> Shares Tab, dataset owners can start a verify process on multiple share objects. For each share object selected, the share items in a shared state will be verified and updated with a new health status.
+
+![share_verify](pictures/shares/share_verify_dataset.png#zoom#shadow)
+
+!!! 
success "Scheduled Share Verify Task" + The share verifier process is run against all share object items that are in a shared state every 7 days by default + as a scheduled task which runs in the background of data.all. + +If any share items do end up in an `Unhealthy` status, the data.all approver will have the option to re-apply the share for the selected items that are in an unhealthy state. + +![share_reapply](pictures/shares/share_reapply.png#zoom#shadow) + +Upon successful re-apply process, the share items health status will revert back to a `Healthy` status with an updated timestamp. + ## **Revoke Items** Both approvers and requesters can click on the button **Revoke items** to remove the share grant from chosen items. @@ -157,6 +183,13 @@ For example: aws s3 ls arn:aws:s3:::accesspoint/-/folder2/ ``` +For S3 bucket sharing, IAM policies, S3 bucket policies, and KMS Key policies (if applicable) are updated to enable sharing of the S3 Bucket resource. + +For example, access to the bucket would be similar to: +```json + aws s3 ls s3:// +``` + ## **Email Notification on share requests** In data.all, you can enable email notification to send emails to requesters and approvers of a share request. Email notifications