Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(cli): added cli option for ingestion source #11980

Merged

Conversation

kevinkarchacryl
Copy link
Contributor

@kevinkarchacryl kevinkarchacryl commented Nov 27, 2024

I'm adding a new CLI command called "datahub ingest list-source-runs" (name pending) that will give this result
image

Can also filter by URN and/or source name:
image

Do we like the resulting columns? Are there any we should add or get rid of?

I attempted to filter by platform but it seems like platform is always null in the graphql response.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@sgomezvillamor
Copy link
Contributor

sgomezvillamor commented Dec 12, 2024

These are output columns for datahub ingest list-runs as of today: runId, rows and created at.
In your proposal, columns are: runId, source, startTime, status and URN.
Assuming startTime is the same as created at, so this is adding source, status and URN.

Instead of adding a new option to the command line, what about extending table in list-runs with the new columns?

I hope this helps:

FYI: @hsheth2

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 12, 2024
@kevinkarchacryl
Copy link
Contributor Author

@sgomezvillamor The problem with the current datahub ingest list-runs today is that it's not working properly. The runId often shows 'no-run-id-provided' and the number of rows is off. This new command uses the graphql endpoint instead of elasticsearch to always get the proper runId. I think we decided that making a new working command was better than changing an old one.

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 12, 2024
@sgomezvillamor
Copy link
Contributor

I think we decided that making a new working command was better than changing an old one.

I was not aware this had been discussed before, sorry for the confusion I may have caused.

If you discussed this with someone else, feel free to invite them to join the review here.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 12, 2024
@hsheth2
Copy link
Collaborator

hsheth2 commented Dec 12, 2024

Going to weight in with my two cents here

Main question is if we consider the old behavior to have any real use cases, or if we'd just say the old behavior was buggy.

If the former -> let's do two commands as done here, and we can mark the old one as deprecated
If the latter -> we can just replace the old command, since there's no real reason to keep it around

@asikowitz
Copy link
Collaborator

Talked to Gabe about this separately. I could see the old behavior having use cases, e.g. if you are disciplined about setting run ids while using the API, that is not represented by the list of custom ingestion sources. In general I think run ids and ingestion source runs are different enough that we can have CLI commands for both, as long as they're distinguished esp in docs.

@asikowitz
Copy link
Collaborator

Another use case: a user produces data from an ingestion source, then hard deletes it. If they want to clean up data from that deleted ingestion source, they need to look at run ids, because the ingestion source no longer exists.

@kevinkarchacryl
Copy link
Contributor Author

kevinkarchacryl commented Dec 13, 2024

So it looks like we should keep both options. Do we want to change the help output for either one to make it more clear what they're doing? Any changes to docs necessary like adding deprecated?

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 13, 2024
@sgomezvillamor
Copy link
Contributor

So it looks like we should keep both options. Do we want to change the help output for either one to make it more clear what they're doing? Any changes to docs necessary like adding deprecated?

Keeping both options 👌

My suggestion:

  • Renaming list-runs to list-run-ids would be a breaking change, which I'd prefer to avoid. Let's maintain list-runs and introduce list-source-runs.
  • In addition to documenting the new list-source-runs, let's also add the missing documentation for list-runs to help users understand the differences.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 16, 2024
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 16, 2024
Copy link
Contributor

@sgomezvillamor sgomezvillamor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Dec 16, 2024
@kevinkarchacryl kevinkarchacryl merged commit d0b4f7a into datahub-project:master Dec 16, 2024
73 checks passed
sleeperdeep pushed a commit to sleeperdeep/datahub that referenced this pull request Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata pending-submitter-merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants