-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Integration][Gitlab] Added support for gitlab member ingestion #767
Conversation
"publicEmail": { | ||
"type": "string", | ||
"title": "Public Email", | ||
"description": "User's GitLab public email.", | ||
"icon": "User", | ||
"format": "user" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in what case the user will have a public email? I think we might want to remove it by default
relations: | ||
gitlabGroup: '[.__groups[].full_path]' | ||
createdBy: .created_by.username |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the blueprint relation you only have gitlabGroup, while in your mapping you have both createdBy and gitlabGroup. I am not sure it is of interest for users who created the user, lets remove it. Let me know if you think otherwise
async def check_group_membership(group: Group) -> Group | None: | ||
"check if the user is a member of the group" | ||
async with semaphore: | ||
try: | ||
await AsyncFetcher.fetch_single(group.members.get, member.get_id()) | ||
return group | ||
except GitlabError as err: | ||
if err.response_code != 404: | ||
raise err | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this feels very insufficient, eventually for all the if I have 1 member, and 1000 groups, I'll have to query this api 1000 times to find out what groups he related to? isn't there any other way to get that data?
user_groups: List[dict[str, Any]] = [ | ||
{"id": group.id, "full_path": group.full_path} | ||
async for groups in self.get_member_groups(user) | ||
for group in groups | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this should be the default behavior when quering members.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this should be something that is part of group members kind? that will query the list of members of each group.
and instead of having a relation of user -> groups, we will have relation between group -> users
async def get_all_group_members( | ||
self, group: Group | ||
) -> typing.AsyncIterator[List[GroupMemberAll]]: | ||
|
||
logger.info(f"Fetching all members of group {group.name}") | ||
|
||
async for users_batch in AsyncFetcher.fetch_batch( | ||
fetch_func=group.members_all.list, | ||
validation_func=self.should_run_for_member, | ||
pagination="offset", | ||
order_by="id", | ||
sort="asc", | ||
): | ||
members: List[GroupMemberAll] = typing.cast( | ||
List[GroupMemberAll], users_batch | ||
) | ||
logger.info( | ||
f"Queried {len(members)} members {[user.username for user in members]} from {group.name}" | ||
) | ||
yield members | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
async def get_all_group_members( | |
self, group: Group | |
) -> typing.AsyncIterator[List[GroupMemberAll]]: | |
logger.info(f"Fetching all members of group {group.name}") | |
async for users_batch in AsyncFetcher.fetch_batch( | |
fetch_func=group.members_all.list, | |
validation_func=self.should_run_for_member, | |
pagination="offset", | |
order_by="id", | |
sort="asc", | |
): | |
members: List[GroupMemberAll] = typing.cast( | |
List[GroupMemberAll], users_batch | |
) | |
logger.info( | |
f"Queried {len(members)} members {[user.username for user in members]} from {group.name}" | |
) | |
yield members | |
async def get_all_group_members( | |
self, group: Group | |
) -> typing.AsyncIterator[List[GroupMemberAll]]: | |
logger.info(f"Fetching all members of group {group.name}") | |
async for users_batch in AsyncFetcher.fetch_batch( | |
fetch_func=group.members_all.list, | |
validation_func=self.should_run_for_member, | |
pagination="offset", | |
order_by="id", | |
sort="asc", | |
): | |
members: List[GroupMemberAll] = typing.cast( | |
List[GroupMemberAll], users_batch | |
) | |
logger.info( | |
f"Queried {len(members)} members {[user.username for user in members]} from {group.name}" | |
) | |
members_enriched_with_group = [ {...member, group: group} for member in members ] | |
yield members | |
async def enrich_member_with_groups_and_public_email( | ||
self, member | ||
) -> dict[str, Any]: | ||
user: User = await self.get_user(member.id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feature flag in the selector mapping, and defined in docs.
@ocean.on_resync(ObjectKind.MEMBER) | ||
async def resync_members(kind: str) -> ASYNC_GENERATOR_RESYNC_TYPE: | ||
for service in get_cached_all_services(): | ||
for group in service.get_root_groups(): | ||
async for members_batch in service.get_all_group_members(group): | ||
tasks = [ | ||
service.enrich_member_with_groups_and_public_email(member) | ||
for member in members_batch | ||
] | ||
members = await asyncio.gather(*tasks) | ||
yield members |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ocean.on_resync(ObjectKind.MEMBER) | |
async def resync_members(kind: str) -> ASYNC_GENERATOR_RESYNC_TYPE: | |
for service in get_cached_all_services(): | |
for group in service.get_root_groups(): | |
async for members_batch in service.get_all_group_members(group): | |
tasks = [ | |
service.enrich_member_with_groups_and_public_email(member) | |
for member in members_batch | |
] | |
members = await asyncio.gather(*tasks) | |
yield members | |
@ocean.on_resync(ObjectKind.GROUP_MEMBERS) | |
async def resync_members(kind: str) -> ASYNC_GENERATOR_RESYNC_TYPE: | |
for service in get_cached_all_services(): | |
for group in service.get_root_groups(): | |
group_members = [] | |
async for members_batch in service.get_all_group_members(group): | |
group_memebers.append(members_batch) | |
yield { group: group, group_members: group_members } |
stream_async_iterators_tasks
"gitlabGroup": { | ||
"title": "Group", | ||
"target": "gitlabGroup", | ||
"required": false, | ||
"many": true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing group
"locked": { | ||
"type": "string", | ||
"title": "Locked", | ||
"icon": "GitLab", | ||
"description": "Indicates if the GitLab item is locked." | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is this locked?
"properties": { | ||
"state": { | ||
"title": "State", | ||
"type": "string", | ||
"icon": "GitLab", | ||
"description": "The current state of the GitLab item (e.g., open, closed)." | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are talking about member not gitlab item, please make sure its readable and straight forward for the users
"visibility": { | ||
"icon": "Lock", | ||
"title": "Visibility", | ||
"type": "string", | ||
"enum": [ | ||
"public", | ||
"internal", | ||
"private" | ||
], | ||
"enumColors": { | ||
"public": "red", | ||
"internal": "yellow", | ||
"private": "green" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add description
publicEmail: .__public_email | ||
relations: | ||
gitlabGroup: '[.__groups[].full_path]' | ||
createdBy: .created_by.username |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is createdBy
we don't have that kind of relation please remove
locked: .locked | ||
link: .web_url | ||
email: .email | ||
publicEmail: .__public_email |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets remove by default
"relations": { | ||
"members": { | ||
"title": "Members", | ||
"target": "member", | ||
"required": false, | ||
"many": true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't be here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relationship is group -> member
User, | ||
GroupMember, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both, depends on the context
…improvement/gitlab
… defaulting to false
class MembersSelector(Selector): | ||
public_email_visibility: bool | None = Field( | ||
alias="publicEmailVisibility", | ||
default=False, | ||
description="If set to true, the integration will enrich members with public email field. Default value is false", | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initialize the class outside of GitlabMembersResourceConfig
that way we would be able to re-use it
cached_groups = event.attributes.setdefault(GROUPS_CACHE_KEY, {}).setdefault( | ||
self.gitlab_client.private_token, {} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets not use this cache, we already have a prebuilt one in ocean core
public_email_visibility: bool | None = Field( | ||
alias="publicEmailVisibility", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enrich_with_public_email
filter_bots: bool | None = Field( | ||
alias="filterBots", | ||
default=False, | ||
description="If set to true, bots will be filtered out from the members list. Default value is false", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be part of the group selector and not of all resources
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both Members and Groups depend on this parameter. Removing from top level means I have to include it in integration for groups and members. Please confirm this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I placed it at the top level to keep a consistent behavior in this system since groups are related to user. I strictly expect the value of filterBots to be consistent for groups and members. What could go wrong is that a user might specify this parameter as false for groups kind and true for member kind. Due to the relationship between members and groups, the catalog will be populated with extra inconsistent data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a comment above the filter_bots
so other developers will understand your motivation.
Also I would rename it to include_member_bots
and default should be false
GROUPS_CACHE_KEY = "__cache_all_groups" | ||
MEMBERS_CACHE_KEY = "__cache_all_members" | ||
|
||
MAX_CONCURRENT_TASKS = 30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why? how can we actually validate and handle it? we want to be able to handle the rate limits most of third parties return headers, but we don't use gitlab
api straightforward but rather through the client.
- https://docs.gitlab.com/ee/administration/settings/rate_limit_on_projects_api.html
- https://docs.gitlab.com/ee/administration/settings/rate_limit_on_members_api.html
- https://forum.gitlab.com/t/ratelimit-headers/93078
Here are some notes on the rate limits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually just found this one - https://python-gitlab.readthedocs.io/en/stable/api-usage-advanced.html#rate-limits which means that gitlab client handles this one for us, so we are good 👍
return | ||
|
||
async def enrich_group_with_members(self, group: Group) -> dict[str, Any]: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant line
async for members_batch in AsyncFetcher.fetch_batch( | ||
fetch_func=group.members.list, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why group.member.list
and not group.member_all.list
?
https://python-gitlab.readthedocs.io/en/stable/gl_objects/groups.html#id10
what do you think about adding this as an option to query all rather than only in one hierarchy?
if selector.public_email_visibility: | ||
yield [ | ||
await service.enrich_member_with_public_email(member) | ||
for member in members | ||
] | ||
else: | ||
yield [member.asdict() for member in members] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens when I have the same user in multiple groups? how would that behave? will I have to perform repeated upserts?
maybe in this method ^ we should use members = group.members_all.list(get_all=True)
which will return all and reduce the amount of extra requests that we will have to perform?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my reason for using members as opposed to member_all was because the members_all request returns not the the user in that group but also all inherited and invited members.
Thereby resulting in all groups the same members since the members most commonly belong to the parent group - details
aside this, the behavior and how we will retrieve members does not differ from member and members_all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok i understand what you are saying, so if we use the members_all
only for the members kind wouldn't it reduce the amount of requests by a lot? as we will only have to bring the members for the root groups rather than the subgroups
as well.
also just making sure that you have tested subgroups as well. please confirm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
calling /members
on root groups returns the same results as /members/all
, calling /members
on subgroups comes with less data than members/all
, due to the exclusion of invited and inherited members, for root groups, concept of inherited members does not apply, all members of the root groups are returned regardless how we choose to call them.
I believe the optimization here was getting members from root groups instead of all groups (including subgroups), which would have taken more time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also missing webhook handling for new members.
relations: | ||
members: '[.__members[].username]' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding members should be optional for groups, so customers will be able to decide whether they want to have it or no
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and need to check that creating a new group, will sync the members correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair, how about project -> group, same ?
… into improvement/gitlab
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job @mk-armah , added a few comments
user: User = await self.get_user(member.id) | ||
member_dict: dict[str, Any] = member.asdict() | ||
member_dict["__public_email"] = user.public_email | ||
return member_dict | ||
|
||
async def get_user(self, user_id: str) -> User: | ||
async with semaphore: | ||
logger.info(f"fetching user {user_id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there more information that can be added to a member
other than public email ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a sample ==> User
{
"id": 15339857,
"username": "tompo",
"name": "Tom Kofi Tankilevitch",
"state": "active",
"locked": false,
"avatar_url": "https://secure.gravatar.com/avatar/81e644d6312fa802fa1a11d605104ba745732a7054be8d3579d3024ff6168bcc?s=80&d=identicon",
"web_url": "https://gitlab.com/tompo",
"created_at": "2023-08-27T13:39:29.744Z",
"bio": "",
"location": "",
"public_email": null,
"skype": "",
"linkedin": "",
"twitter": "",
"discord": "",
"website_url": "",
"organization": "",
"job_title": "",
"pronouns": null,
"bot": false,
"work_information": null,
"followers": 0,
"following": 0,
"is_followed": false,
"local_time": null
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Member
{
"id": 15339857,
"username": "tompo",
"name": "Tom Kofi Tankilevitch",
"state": "active",
"locked": false,
"avatar_url": "https://secure.gravatar.com/avatar/81e644d6312fa802fa1a11d605104ba745732a7054be8d3579d3024ff6168bcc?s=80&d=identicon",
"web_url": "https://gitlab.com/tompo",
"access_level": 50,
"created_at": "2023-08-27T13:39:29.744Z",
"created_by": {
"id": 13921399,
"username": "matanlevi",
"name": "Matan Levi",
"state": "active",
"locked": false,
"avatar_url": "https://secure.gravatar.com/avatar/814e5edbb657a70c5c9bd09a4171768c245d18cabec5f3b440d43dbc26d36f88?s=80&d=identicon",
"web_url": "https://gitlab.com/matanlevi"
},
"expires_at": null,
"email": null,
"membership_state": "active"
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just the socials, if its relevant
obj_dict: dict[str, Any] = obj.asdict() | ||
obj_dict["__members"] = members_list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are not consistent in the way that we are returning the enriched objects, in enrich_project you are returning objects while here you are returning dict, can you elaborate on why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ProjectWithMembers, we call enrich_project_with_extras
after which the results are passed down to enrich_object_with_members
, for enrich_object_with_members
to get an object, enrich_project_with_extras
needed to return an object.
Would you like me to make enrich_object_with_members
to also manipulate the object directly and return a dict as well ?
members_tasks = [ | ||
service.enrich_object_with_members( | ||
project, | ||
include_inherited_members, | ||
include_bot_members, | ||
include_public_email, | ||
) | ||
for project in projects_enriched_with_extras | ||
] | ||
projects_enriched_with_members = await asyncio.gather(*members_tasks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is making sure we are not experiencing rate limits here?
we get 20 projects, and then spawn 20 tasks and in each project it might have multiple members and then for each member batch you are performing another request.
@@ -45,4 +51,7 @@ class ObjectKind: | |||
PIPELINE = "pipeline" | |||
PROJECT = "project" | |||
FOLDER = "folder" | |||
MEMBER = "member" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't seems used anymore
@@ -451,11 +462,17 @@ async def get_group(self, group_id: int) -> Group | None: | |||
else: | |||
return None | |||
|
|||
async def get_all_groups(self) -> typing.AsyncIterator[List[Group]]: | |||
@cache_iterator_result() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't consistent, on one hand you are using @cache_iterator_result()
and for users
you are specifically creating cache. Ideally lets use cache_iterator_result
if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cache_iterator_results
works with generators, not coroutines...
I wrote one for coroutines here, I'll push to ocean core utils so we use it from there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
… into improvement/gitlab
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
Added support for GitLab member ingestion.
This PR introduces two new kinds:
PROJECTWITHMEMBERS
andGROUPWITHMEMBERS
. These are enhanced versions of the originalPROJECT
andGROUP
kinds, now including member data.Public Email Visibility:
includePublicEmail
to the member selector.true
, members are enriched with thepublic_email
field from the/users
endpoint. This is dependent on whether the user has allowed public email visibility in their GitLab account settings. Learn more./users
endpoint because the members API does not includepublic_email
.includePublicEmail
isfalse
.Bot Members Filtering:
includeBotMembers
to the MemberSelector.false
, bot members are excluded from the synchronization process.includeBotMembers
istrue
, meaning bots are included by default.Inherited Members Inclusion:
includeInheritedMembers
.true
, members inherited from parent groups are included during synchronization.includeInheritedMembers
isfalse
, so inherited members are excluded by default.Type of change
Please leave one option from the following and delete the rest:
Screenshots