[Integration][Gitlab] Added support for gitlab member ingestion #767

Merged
merged 67 commits into from
Nov 14, 2024

@mk-armah mk-armah commented Jul 3, 2024

Description

Added support for GitLab member ingestion.

This PR introduces two new kinds: PROJECTWITHMEMBERS and GROUPWITHMEMBERS. These are enhanced versions of the original PROJECT and GROUP kinds, now including member data.

  • Public Email Visibility:

    • GitLab group and project members do not include users' emails by default, especially for free plans. To enable users on free plans to view user emails, we have added a flag includePublicEmail to the member selector.
    • Usage: If set to true, members are enriched with the public_email field from the /users endpoint. This is dependent on whether the user has allowed public email visibility in their GitLab account settings.
    • Note: It was necessary to call the /users endpoint because the members API does not include public_email.
    • Default Value: The default value for includePublicEmail is false.
  • Bot Members Filtering:

    • GitLab returns all members, including bots and tokens created as members. To filter out bots from actual members, we have added a flag includeBotMembers to the MemberSelector.
    • Usage: If set to false, bot members are excluded from the synchronization process.
    • Default Value: The default value for includeBotMembers is true, meaning bots are included by default.
    • Reasoning: Bot filtering is necessary when syncing members to ensure only actual user accounts are considered.
  • Inherited Members Inclusion:

    • GitLab allows inclusion of inherited members from parent groups. To control this behavior, we have added a flag includeInheritedMembers.
    • Usage: If set to true, members inherited from parent groups are included during synchronization.
    • Default Value: The default value for includeInheritedMembers is false, so inherited members are excluded by default.
    • Note: Including inherited members can provide a more comprehensive view of all members who have access to a project or group through group hierarchies.
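For illustration only, a minimal sketch of how the three flags described above could combine. It uses a plain dataclass rather than the integration's selector class; `MemberFlags`, `select_members`, and the boolean `bot` key on each member dict are assumptions made for this example, not the PR's code. The defaults follow the description above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class MemberFlags:
    # Hypothetical stand-in for the selector; defaults follow the PR description.
    include_public_email: bool = False
    include_bot_members: bool = True
    include_inherited_members: bool = False


def select_members(
    direct: list[dict[str, Any]],
    inherited: list[dict[str, Any]],
    public_emails: dict[int, Optional[str]],
    flags: MemberFlags,
) -> list[dict[str, Any]]:
    # Inherited members are only considered when the flag is on.
    members = direct + inherited if flags.include_inherited_members else list(direct)
    if not flags.include_bot_members:
        # Assumption: each member dict carries a boolean "bot" key.
        members = [m for m in members if not m.get("bot", False)]
    if flags.include_public_email:
        # "__public_email" mirrors the enrichment key used in this PR's mapping.
        members = [{**m, "__public_email": public_emails.get(m["id"])} for m in members]
    return members
```

With the default flags, bots stay in the result, inherited members are left out, and no email lookup is needed.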

Type of change

Please leave one option from the following and delete the rest:

  • New feature (non-breaking change which adds functionality)

Screenshots

(two screenshots, not reproduced here)

@github-actions github-actions bot added the size/L label Jul 3, 2024
@mk-armah mk-armah requested a review from a team July 3, 2024 12:14
@Tankilevitch Tankilevitch changed the title from "PORT-7708 | Added support for gitlab member ingestion" to "[Integration][Gitlab] Added support for gitlab member ingestion" Jul 7, 2024
Comment on lines 95 to 100
"publicEmail": {
"type": "string",
"title": "Public Email",
"description": "User's GitLab public email.",
"icon": "User",
"format": "user"
Contributor

In what case will a user have a public email? I think we might want to remove it by default.

Comment on lines 33 to 35
relations:
  gitlabGroup: '[.__groups[].full_path]'
  createdBy: .created_by.username
Contributor

In the blueprint relations you only have gitlabGroup, while in your mapping you have both createdBy and gitlabGroup. I am not sure users care who created the user, so let's remove it. Let me know if you think otherwise.

Comment on lines 559 to 568
async def check_group_membership(group: Group) -> Group | None:
    "check if the user is a member of the group"
    async with semaphore:
        try:
            await AsyncFetcher.fetch_single(group.members.get, member.get_id())
            return group
        except GitlabError as err:
            if err.response_code != 404:
                raise err
            return None
Contributor

This feels very inefficient: if I have 1 member and 1000 groups, I'll have to query this API 1000 times to find out which groups they are related to? Isn't there any other way to get that data?
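One possible alternative, sketched under assumptions: list the members of each group once and invert that mapping locally, so the number of requests grows with the number of groups rather than with members times groups. `groups_by_member` is a hypothetical helper, and the input mapping stands in for whatever the per-group member listing returns.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def groups_by_member(
    group_members: Mapping[str, Iterable[str]],
) -> dict[str, list[str]]:
    """Invert {group full_path: member usernames} into {username: group paths}."""
    index: dict[str, list[str]] = defaultdict(list)
    for full_path, usernames in group_members.items():
        for username in usernames:
            index[username].append(full_path)
    return dict(index)
```

The inversion itself makes no API calls; the only requests are the per-group listings that produce its input.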

Comment on lines 607 to 611
user_groups: List[dict[str, Any]] = [
    {"id": group.id, "full_path": group.full_path}
    async for groups in self.get_member_groups(user)
    for group in groups
]
Contributor

I am not sure this should be the default behavior when querying members.

Contributor

Maybe this should be part of a group-members kind that queries the list of members of each group?
Then instead of having a user -> groups relation, we would have a group -> users relation.

Comment on lines 581 to 601
async def get_all_group_members(
    self, group: Group
) -> typing.AsyncIterator[List[GroupMemberAll]]:

    logger.info(f"Fetching all members of group {group.name}")

    async for users_batch in AsyncFetcher.fetch_batch(
        fetch_func=group.members_all.list,
        validation_func=self.should_run_for_member,
        pagination="offset",
        order_by="id",
        sort="asc",
    ):
        members: List[GroupMemberAll] = typing.cast(
            List[GroupMemberAll], users_batch
        )
        logger.info(
            f"Queried {len(members)} members {[user.username for user in members]} from {group.name}"
        )
        yield members

Contributor

Suggested change
async def get_all_group_members(
    self, group: Group
) -> typing.AsyncIterator[List[GroupMemberAll]]:
    logger.info(f"Fetching all members of group {group.name}")
    async for users_batch in AsyncFetcher.fetch_batch(
        fetch_func=group.members_all.list,
        validation_func=self.should_run_for_member,
        pagination="offset",
        order_by="id",
        sort="asc",
    ):
        members: List[GroupMemberAll] = typing.cast(
            List[GroupMemberAll], users_batch
        )
        logger.info(
            f"Queried {len(members)} members {[user.username for user in members]} from {group.name}"
        )
        # attach the owning group to every member dict
        members_enriched_with_group = [
            {**member.asdict(), "group": group} for member in members
        ]
        yield members

async def enrich_member_with_groups_and_public_email(
    self, member
) -> dict[str, Any]:
    user: User = await self.get_user(member.id)
Contributor

This should sit behind a feature flag in the selector mapping, and be defined in the docs.

Comment on lines 212 to 222
@ocean.on_resync(ObjectKind.MEMBER)
async def resync_members(kind: str) -> ASYNC_GENERATOR_RESYNC_TYPE:
    for service in get_cached_all_services():
        for group in service.get_root_groups():
            async for members_batch in service.get_all_group_members(group):
                tasks = [
                    service.enrich_member_with_groups_and_public_email(member)
                    for member in members_batch
                ]
                members = await asyncio.gather(*tasks)
                yield members
Contributor

Suggested change
@ocean.on_resync(ObjectKind.GROUP_MEMBERS)
async def resync_members(kind: str) -> ASYNC_GENERATOR_RESYNC_TYPE:
    for service in get_cached_all_services():
        for group in service.get_root_groups():
            group_members = []
            async for members_batch in service.get_all_group_members(group):
                # collect every batch into one flat list for this group
                group_members.extend(members_batch)
            yield {"group": group, "group_members": group_members}

stream_async_iterators_tasks

Comment on lines 109 to 114
"gitlabGroup": {
"title": "Group",
"target": "gitlabGroup",
"required": false,
"many": true
}
Contributor

missing group

Comment on lines 69 to 74
"locked": {
"type": "string",
"title": "Locked",
"icon": "GitLab",
"description": "Indicates if the GitLab item is locked."
},
Contributor

how is this locked?

Comment on lines 62 to 68
"properties": {
"state": {
"title": "State",
"type": "string",
"icon": "GitLab",
"description": "The current state of the GitLab item (e.g., open, closed)."
},
Contributor

We are talking about a member, not a GitLab item. Please make sure it's readable and straightforward for users.

Comment on lines 110 to 123
"visibility": {
"icon": "Lock",
"title": "Visibility",
"type": "string",
"enum": [
"public",
"internal",
"private"
],
"enumColors": {
"public": "red",
"internal": "yellow",
"private": "green"
}
Contributor

add description

  publicEmail: .__public_email
relations:
  gitlabGroup: '[.__groups[].full_path]'
  createdBy: .created_by.username
Contributor

What is createdBy? We don't have that kind of relation, please remove it.

integrations/gitlab/gitlab_integration/ocean.py (outdated, resolved)
locked: .locked
link: .web_url
email: .email
publicEmail: .__public_email
Contributor

lets remove by default

Comment on lines 142 to 148
"relations": {
"members": {
"title": "Members",
"target": "member",
"required": false,
"many": true
}
Contributor

shouldn't be here

Member Author

relationship is group -> member

Comment on lines 16 to 17
User,
GroupMember,
Contributor

which one?

Member Author

both, depends on the context

Comment on lines 126 to 132
class MembersSelector(Selector):
    public_email_visibility: bool | None = Field(
        alias="publicEmailVisibility",
        default=False,
        description="If set to true, the integration will enrich members with public email field. Default value is false",
    )

Contributor

Define the class outside of GitlabMembersResourceConfig; that way we would be able to reuse it.

Comment on lines 332 to 334
cached_groups = event.attributes.setdefault(GROUPS_CACHE_KEY, {}).setdefault(
    self.gitlab_client.private_token, {}
)
Contributor

Let's not use this cache; we already have a prebuilt one in Ocean core.

Comment on lines 127 to 128
public_email_visibility: bool | None = Field(
alias="publicEmailVisibility",
Contributor

enrich_with_public_email

Comment on lines 146 to 150
filter_bots: bool | None = Field(
    alias="filterBots",
    default=False,
    description="If set to true, bots will be filtered out from the members list. Default value is false",
)
Contributor

this should be part of the group selector and not of all resources

Member Author

Both Members and Groups depend on this parameter. Removing from top level means I have to include it in integration for groups and members. Please confirm this

Member Author

Also, I placed it at the top level to keep a consistent behavior in this system since groups are related to user. I strictly expect the value of filterBots to be consistent for groups and members. What could go wrong is that a user might specify this parameter as false for groups kind and true for member kind. Due to the relationship between members and groups, the catalog will be populated with extra inconsistent data.

Contributor

I would add a comment above the filter_bots so other developers will understand your motivation.
Also I would rename it to include_member_bots and default should be false

GROUPS_CACHE_KEY = "__cache_all_groups"
MEMBERS_CACHE_KEY = "__cache_all_members"

MAX_CONCURRENT_TASKS = 30
Contributor

Why? How can we actually validate and handle it? We want to be able to handle rate limits. Most third parties return rate-limit headers, but we don't call the GitLab API directly; we go through the client.

Here are some notes on the rate limits

Contributor

Actually just found this one - https://python-gitlab.readthedocs.io/en/stable/api-usage-advanced.html#rate-limits which means that gitlab client handles this one for us, so we are good 👍

return

async def enrich_group_with_members(self, group: Group) -> dict[str, Any]:

Contributor

redundant line

Comment on lines 573 to 574
async for members_batch in AsyncFetcher.fetch_batch(
fetch_func=group.members.list,
Contributor

Why group.members.list and not group.members_all.list?
https://python-gitlab.readthedocs.io/en/stable/gl_objects/groups.html#id10
What do you think about adding an option to query all members rather than only one level of the hierarchy?

Comment on lines 226 to 232
if selector.public_email_visibility:
    yield [
        await service.enrich_member_with_public_email(member)
        for member in members
    ]
else:
    yield [member.asdict() for member in members]
Contributor

what happens when I have the same user in multiple groups? how would that behave? will I have to perform repeated upserts?

maybe in this method ^ we should use members = group.members_all.list(get_all=True) which will return all and reduce the amount of extra requests that we will have to perform?
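To make the repeated-upsert concern concrete, here is a minimal sketch of dropping members that were already yielded for an earlier group. `unique_members` is a hypothetical helper, and keying on the member `id` field is an assumption about what identifies a user across groups.

```python
from typing import Any, Iterable, Iterator


def unique_members(
    batches: Iterable[list[dict[str, Any]]],
) -> Iterator[list[dict[str, Any]]]:
    """Yield each batch with members already seen in earlier batches removed."""
    seen: set[int] = set()
    for batch in batches:
        fresh = []
        for member in batch:
            if member["id"] not in seen:
                seen.add(member["id"])
                fresh.append(member)
        if fresh:  # skip batches that contain nothing new
            yield fresh
```

This only avoids duplicate upserts of the member entity; it does not reduce the number of list requests sent to GitLab.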

Member Author

My reason for using members as opposed to members_all is that the members_all request returns not only the users in that group but also all inherited and invited members.

As a result, all groups end up with the same members, since members most commonly belong to the parent group.

Aside from this, how we retrieve members does not differ between members and members_all.

Contributor

ok i understand what you are saying, so if we use the members_all only for the members kind wouldn't it reduce the amount of requests by a lot? as we will only have to bring the members for the root groups rather than the subgroups as well.

also just making sure that you have tested subgroups as well. please confirm

Member Author

@mk-armah mk-armah Jul 29, 2024

Calling /members on root groups returns the same results as /members/all. Calling /members on subgroups returns less data than /members/all, because invited and inherited members are excluded. For root groups the concept of inherited members does not apply, so all members of the root groups are returned regardless of which endpoint we call.

I believe the optimization here was getting members from root groups instead of all groups (including subgroups), which would have taken more time.
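A toy model of the distinction discussed in this thread, under stated assumptions: the group hierarchy and memberships below are made up, invited members are ignored, and `members_all` is a local function that only mimics the shape of what /members/all returns (direct members plus those inherited from ancestors).

```python
from typing import Optional

# Hypothetical hierarchy: group full_path -> parent full_path (None for a root group).
PARENTS: dict[str, Optional[str]] = {"acme": None, "acme/platform": "acme"}
# Hypothetical direct memberships, as /members would report them.
DIRECT: dict[str, set[str]] = {"acme": {"alice", "bob"}, "acme/platform": {"carol"}}


def members_all(group: str) -> set[str]:
    """Direct members plus everyone inherited from ancestor groups."""
    result: set[str] = set()
    current: Optional[str] = group
    while current is not None:
        result |= DIRECT.get(current, set())
        current = PARENTS[current]
    return result
```

In this model a root group has no ancestors, so its direct and "all" member sets coincide, while a subgroup's "all" set is a superset of its direct members.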

Contributor

@Tankilevitch Tankilevitch left a comment

Also missing webhook handling for new members.

Comment on lines 31 to 32
relations:
  members: '[.__members[].username]'
Contributor

I think adding members should be optional for groups, so customers will be able to decide whether they want to have it or not.

Contributor

and need to check that creating a new group, will sync the members correctly

Member Author

fair, how about project -> group, same ?

Contributor

@Tankilevitch Tankilevitch left a comment

Nice job @mk-armah , added a few comments

integrations/gitlab/gitlab_integration/events/utils.py (outdated, resolved)
integrations/gitlab/gitlab_integration/git_integration.py (outdated, resolved)
Comment on lines 771 to 778
    user: User = await self.get_user(member.id)
    member_dict: dict[str, Any] = member.asdict()
    member_dict["__public_email"] = user.public_email
    return member_dict

async def get_user(self, user_id: str) -> User:
    async with semaphore:
        logger.info(f"fetching user {user_id}")
Contributor

is there more information that can be added to a member other than public email ?

Member Author

@mk-armah mk-armah Nov 13, 2024

Here is a sample ==> User

{
  "id": 15339857,
  "username": "tompo",
  "name": "Tom Kofi Tankilevitch",
  "state": "active",
  "locked": false,
  "avatar_url": "https://secure.gravatar.com/avatar/81e644d6312fa802fa1a11d605104ba745732a7054be8d3579d3024ff6168bcc?s=80&d=identicon",
  "web_url": "https://gitlab.com/tompo",
  "created_at": "2023-08-27T13:39:29.744Z",
  "bio": "",
  "location": "",
  "public_email": null,
  "skype": "",
  "linkedin": "",
  "twitter": "",
  "discord": "",
  "website_url": "",
  "organization": "",
  "job_title": "",
  "pronouns": null,
  "bot": false,
  "work_information": null,
  "followers": 0,
  "following": 0,
  "is_followed": false,
  "local_time": null
}

Member Author

@mk-armah mk-armah Nov 13, 2024

Member

{
  "id": 15339857,
  "username": "tompo",
  "name": "Tom Kofi Tankilevitch",
  "state": "active",
  "locked": false,
  "avatar_url": "https://secure.gravatar.com/avatar/81e644d6312fa802fa1a11d605104ba745732a7054be8d3579d3024ff6168bcc?s=80&d=identicon",
  "web_url": "https://gitlab.com/tompo",
  "access_level": 50,
  "created_at": "2023-08-27T13:39:29.744Z",
  "created_by": {
    "id": 13921399,
    "username": "matanlevi",
    "name": "Matan Levi",
    "state": "active",
    "locked": false,
    "avatar_url": "https://secure.gravatar.com/avatar/814e5edbb657a70c5c9bd09a4171768c245d18cabec5f3b440d43dbc26d36f88?s=80&d=identicon",
    "web_url": "https://gitlab.com/matanlevi"
  },
  "expires_at": null,
  "email": null,
  "membership_state": "active"
}

Member Author

just the socials, if its relevant

Comment on lines 723 to 724
obj_dict: dict[str, Any] = obj.asdict()
obj_dict["__members"] = members_list
Contributor

we are not consistent in the way that we are returning the enriched objects, in enrich_project you are returning objects while here you are returning dict, can you elaborate on why?

Member Author

In ProjectWithMembers, we call enrich_project_with_extras after which the results are passed down to enrich_object_with_members, for enrich_object_with_members to get an object, enrich_project_with_extras needed to return an object.

Would you like me to make enrich_object_with_members to also manipulate the object directly and return a dict as well ?

Comment on lines 229 to 238
members_tasks = [
    service.enrich_object_with_members(
        project,
        include_inherited_members,
        include_bot_members,
        include_public_email,
    )
    for project in projects_enriched_with_extras
]
projects_enriched_with_members = await asyncio.gather(*members_tasks)
Contributor

what is making sure we are not experiencing rate limits here?

we get 20 projects, and then spawn 20 tasks and in each project it might have multiple members and then for each member batch you are performing another request.
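One common way to keep such a fan-out in check is to route every request through a shared asyncio.Semaphore. The sketch below is illustrative only: `gather_bounded` is a hypothetical helper, and `MAX_CONCURRENT_TASKS` merely reuses the constant name seen earlier in this PR.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable

MAX_CONCURRENT_TASKS = 30  # same constant name as in this PR; the value is a tuning choice


async def gather_bounded(
    factories: Iterable[Callable[[], Awaitable[Any]]],
    limit: int = MAX_CONCURRENT_TASKS,
) -> list[Any]:
    """Run the coroutines with at most `limit` in flight; results keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def run(factory: Callable[[], Awaitable[Any]]) -> Any:
        # The coroutine is only created and awaited once a slot is free.
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(run(factory) for factory in factories))
```

The bound holds only for work that goes through this semaphore. Nested requests made inside each task (for example, per-member lookups) would need to share the same semaphore to be covered by the limit.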

@@ -45,4 +51,7 @@ class ObjectKind:
PIPELINE = "pipeline"
PROJECT = "project"
FOLDER = "folder"
MEMBER = "member"
Contributor

Doesn't seem to be used anymore.

@@ -451,11 +462,17 @@ async def get_group(self, group_id: int) -> Group | None:
else:
return None

async def get_all_groups(self) -> typing.AsyncIterator[List[Group]]:
@cache_iterator_result()
Contributor

this isn't consistent, on one hand you are using @cache_iterator_result() and for users you are specifically creating cache. Ideally lets use cache_iterator_result if possible

Member Author

cache_iterator_result works with generators, not coroutines...
I wrote one for coroutines; I'll push it to Ocean core utils so we can use it from there.
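For readers unfamiliar with the distinction, here is a minimal sketch of what a result cache for coroutines might look like. This is not the helper referred to above or Ocean's implementation; `cache_coroutine` is a hypothetical name, the cache lives for the lifetime of the process, and only hashable positional arguments are supported.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable


def cache_coroutine(
    func: Callable[..., Awaitable[Any]],
) -> Callable[..., Awaitable[Any]]:
    """Cache awaited results per positional-argument tuple (hypothetical helper)."""
    cache: dict[tuple[Any, ...], Any] = {}

    @functools.wraps(func)
    async def wrapper(*args: Any) -> Any:
        if args not in cache:
            # Store the awaited result, not the coroutine object itself,
            # because a coroutine can only be awaited once.
            cache[args] = await func(*args)
        return cache[args]

    return wrapper
```

Two concurrent first calls with the same arguments would both run the wrapped coroutine in this sketch; avoiding that would require caching a task or future instead of the result.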

Contributor

@Tankilevitch Tankilevitch left a comment

LGTM

Contributor

@Tankilevitch Tankilevitch left a comment

LGTM

@Tankilevitch Tankilevitch merged commit 18adb07 into main Nov 14, 2024
18 checks passed
@Tankilevitch Tankilevitch deleted the improvement/gitlab branch November 14, 2024 16:09