
[Bug] Dataset Events not publishing when AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS enabled #1363

Open
nishant-gupta-sh opened this issue Dec 4, 2024 · 3 comments
Labels
area:config Related to configuration, like YAML files, environment variables, or executor configuration area:datasets Related to the Airflow datasets feature/module bug Something isn't working triage-needed Items need to be reviewed / assigned to milestone

Comments

nishant-gupta-sh commented Dec 4, 2024

Astronomer Cosmos Version

12.1.1

dbt-core version

1.8.7

Versions of dbt adapters

No response

LoadMode

AUTOMATIC

ExecutionMode

AZURE_CONTAINER_INSTANCE

InvocationMode

None

Airflow version

2.10.2

Operating System

Astronomer Deployed

If you think it's a UI issue, what browsers are you seeing the problem on?

No response

Deployment

Astronomer

Deployment details

No response

What happened?

When AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS is set to True, tasks that previously produced dataset events and are not in a task group stop producing them. I suspect something is wrong with the following function:

def get_dataset_alias_name(dag: DAG | None, task_group: TaskGroup | None, task_id: str) -> str:

Relevant log output

How to reproduce

With AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS set to True, dataset events are no longer published when using the DbtRunLocalOperator or DbtSnapshotOperator. Dataset events are still published, however, when using the DbtTaskGroup.

All operators successfully publish dataset events when AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS is set to False.
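For reference, the toggle being flipped maps to the [cosmos] section of the Airflow config via Airflow's standard environment-variable naming; a minimal sketch of disabling it (the variable name is taken from this report, the default value is True per the bot's comment below):

```shell
# Disable Cosmos dataset aliases; this maps to enable_dataset_alias
# under the [cosmos] section of the Airflow configuration.
export AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS=False
```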

Anything else :)?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Contact Details

[email protected]

@nishant-gupta-sh nishant-gupta-sh added bug Something isn't working triage-needed Items need to be reviewed / assigned to milestone labels Dec 4, 2024
dosubot bot commented Dec 4, 2024

Hey, @nishant-gupta-sh! 👋 I'm here to help you with your bug. Let's squash it together!

The issue you're experiencing seems to be related to how dataset aliases are handled when AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS is set to True. Specifically, tasks not in a TaskGroup stop producing dataset events, while those in a TaskGroup continue to do so. This suggests a potential problem in the logic for managing dataset aliases for tasks outside of TaskGroups.

The dataset.py file determines whether a task is part of a TaskGroup by checking if the task_group parameter is not None. If task_group is provided, it uses task_group.dag_id and task_group.group_id to construct the dataset alias name. If task_group is None, it defaults to using the dag parameter to determine the DAG ID [1].

The AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS configuration controls the use of DatasetAlias in Cosmos. By default, it is set to True, enabling the use of DatasetAlias to support dynamic dataset creation during task execution. However, if you encounter issues like sqlalchemy.orm.exc.FlushError errors, you can set this configuration to False to avoid these errors [2].

Given this, the problem might be in how the dataset alias is constructed or retrieved for tasks not in a TaskGroup when the alias feature is enabled. You might want to review the logic in the dataset.py file related to alias handling for tasks outside of TaskGroups.
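To make the two branches concrete, here is a hypothetical sketch of the alias-name logic as described above. It is not the actual Cosmos implementation, and the exact naming scheme (double-underscore joining) is an assumption for illustration only:

```python
# Hypothetical sketch of get_dataset_alias_name as described in this thread.
# The real implementation lives in Cosmos's dataset helpers and may differ.
def get_dataset_alias_name(dag, task_group, task_id):
    """Build a unique dataset-alias name from the DAG, optional TaskGroup, and task id."""
    if task_group is not None:
        # A TaskGroup carries both the owning DAG id and its own group id.
        dag_id = task_group.dag_id
        group_id = task_group.group_id
    else:
        # Outside a TaskGroup only the DAG is available; per this report,
        # the suspected bug is somewhere on this branch.
        dag_id = dag.dag_id if dag is not None else None
        group_id = None
    # Join only the identifiers that are present.
    parts = [p for p in (dag_id, group_id, task_id) if p]
    return "__".join(parts)
```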

To continue talking to Dosu, mention @dosu.

@dosubot dosubot bot added area:config Related to configuration, like YAML files, environment variables, or executor configuration area:datasets Related to the Airflow datasets feature/module labels Dec 4, 2024
tatiana commented Dec 11, 2024

Hi @nishant-gupta-sh, which version of Cosmos are you using? You mentioned 12.1.1 in the ticket description, but the last release of Cosmos was 1.7.1.

Did you recently change from executing Cosmos with ExecutionMode.LOCAL to ExecutionMode.AZURE_CONTAINER_INSTANCE? That could explain the problem. Currently, Cosmos only supports emitting datasets when using ExecutionMode.LOCAL and ExecutionMode.VIRTUALENV:
https://github.com/astronomer/astronomer-cosmos/blob/main/cosmos/operators/local.py

I'll update our docs to make this more evident: https://astronomer.github.io/astronomer-cosmos/configuration/scheduling.html.

nishant-gupta-sh commented Dec 11, 2024


Hi Tatiana, apologies, we're using Cosmos 1.7.1 and the CeleryExecutor for the ExecutionMode.
Astronomer manages our Airflow deployment, and the only change that affects whether dataset events are emitted is toggling the AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS parameter.
