
Reduce memory usage #1560

Open
jay0lee opened this issue Sep 14, 2022 · 7 comments
@jay0lee
Member

jay0lee commented Sep 14, 2022

Today GAM can use a lot of memory when running a command like:

gam report user

in a large domain. Any command that uses gapi.get_all_pages() may use a LOT of memory while downloading many pages from Google.

Rewriting every usage of get_all_pages to do all parsing after each page would require a LOT of work, but we can at least save some memory by changing the way we generate all_pages.

Rather than keeping all_pages as a normal in-memory list object, we can use the built-in Python shelve module to write the results to disk and save memory.
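
A minimal sketch of that idea, assuming a google-api-python-client style list method; get_all_pages_shelved and items_key are made-up names here, not GAM's actual gapi.get_all_pages() signature:

    # Sketch only: spill each page's items to a shelve file on disk instead of
    # accumulating them all in an in-memory list.
    import os
    import shelve
    import tempfile

    def get_all_pages_shelved(method, items_key='items', **kwargs):
      db_path = os.path.join(tempfile.mkdtemp(), 'all_pages')
      items = shelve.open(db_path)
      index = 0
      page_token = None
      while True:
        result = method(pageToken=page_token, **kwargs).execute()
        for item in result.get(items_key, []):
          items[str(index)] = item  # shelve keys must be strings, so items[0] won't work
          index += 1
        page_token = result.get('nextPageToken')
        if not page_token:
          break
      return items  # caller iterates items.values(), then closes and deletes the file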

@jay0lee jay0lee self-assigned this Sep 14, 2022
@jay0lee
Member Author

jay0lee commented Sep 14, 2022

@taers232c fyi

@jay0lee
Member Author

jay0lee commented Sep 15, 2022

I'm not totally convinced this is the way to go for low-memory handling. It works fine if the returned data is iterated over, but as soon as you try to do something like data[0] it fails, which will lead to unusual bugs.

An alternative approach would be to write a replacement for get_all_pages that takes a "callback" function to process each page, removing the need to store hundreds of thousands or sometimes millions of user, device, or other list objects. The challenge is that things like CSV output need to know the columns before rows are added, and we may be dynamically adding columns during later page callbacks.
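
Roughly what that could look like; process_all_pages and page_callback are illustrative names, and the paging call is simplified to a google-api-python-client style method rather than GAM's callGAPI wrapper:

    # Sketch only: hand each page to a caller-supplied callback and keep nothing.
    def process_all_pages(method, page_callback, items_key='items', **kwargs):
      page_token = None
      while True:
        result = method(pageToken=page_token, **kwargs).execute()
        page_callback(result.get(items_key, []))  # e.g. convert this page's items to CSV rows
        page_token = result.get('nextPageToken')
        if not page_token:
          break  # only one page was ever held in memory

    # A CSV callback would still have to buffer rows (or rewrite the header at the
    # end) to cope with columns that only show up in later pages.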

@taers232c
Contributor

I already do something like this, although not with callbacks. I get a page, process it, delete it, get the next page.

      printGettingAllEntityItemsForWhom(Ent.DRIVE_FILE_OR_FOLDER, user, i, count, query=DLP.fileIdEntity['query'])
      pageMessage = getPageMessageForWhom()
      pageToken = None
      totalItems = 0
      userError = False
      while True:
        # Fetch one page of Drive files at a time
        try:
          feed = callGAPI(drive.files(), 'list',
                          throwReasons=GAPI.DRIVE_USER_THROW_REASONS+[GAPI.NOT_FOUND, GAPI.TEAMDRIVE_MEMBERSHIP_REQUIRED],
                          retryReasons=[GAPI.UNKNOWN_ERROR],
                          pageToken=pageToken,
                          orderBy=OBY.orderBy,
                          fields=pagesFields, pageSize=GC.Values[GC.DRIVE_MAX_RESULTS], **btkwargs)
        except (GAPI.notFound, GAPI.teamDriveMembershipRequired) as e:
          entityActionFailedWarning([Ent.USER, user, Ent.SHAREDDRIVE_ID, fileIdEntity['shareddrive']['driveId']], str(e), i, count)
          userError = True
          break
        except (GAPI.serviceNotAvailable, GAPI.authError, GAPI.domainPolicy) as e:
          userSvcNotApplicableOrDriveDisabled(user, str(e), i, count)
          userError = True
          break
        # Process this page, then drop it before requesting the next one
        pageToken, totalItems = _processGAPIpagesResult(feed, 'files', None, totalItems, pageMessage, None, Ent.DRIVE_FILE_OR_FOLDER)
        if feed:
          extendFileTree(fileTree, feed.get('files', []), DLP, stripCRsFromName)
          del feed
        if not pageToken:
          # No more pages
          _finalizeGAPIpagesResult(pageMessage)
          break

@jay0lee
Member Author

jay0lee commented Sep 15, 2022 via email

@taers232c
Contributor

Agreed. Something to do in my spare time.

@ejochman
Contributor

Rather than having to provide a callback function, you may want to consider creating something like yield_page() as a Pythonic way of fetching pages of results and yielding them, as needed, to the caller. I've long wanted to fix this behavior of writing everything to memory before processing, but (as noted earlier) it will take a lot of work.

I would also be careful with using something like shelve without being explicit about its behavior to users. You may inadvertently be storing the organization's sensitive or PII data, unencrypted, on disk.

@jay0lee
Member Author

jay0lee commented Sep 16, 2022

yield is a good idea. It does mean we end up waiting for a page to be processed locally before we request the next page, so it's not necessarily faster. I really like the idea of a callback that fetches pages as fast as possible while parsing them locally in parallel, but maybe that additional complexity isn't worth the performance gain.

What makes sense to me would be to add a function like yield_all_pages() and then start moving get_all_pages() callers over as we can.
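
A rough sketch of what that generator could look like; the yield_all_pages name comes from this comment, but the signature is illustrative only and not GAM's actual gapi API:

    # Sketch only: yield items as they arrive so only the current page is in memory.
    def yield_all_pages(method, items_key='items', **kwargs):
      page_token = None
      while True:
        result = method(pageToken=page_token, **kwargs).execute()
        yield from result.get(items_key, [])
        page_token = result.get('nextPageToken')
        if not page_token:
          return

    # Callers could then migrate one call site at a time, e.g. from
    #   for user in gapi.get_all_pages(...):
    # to
    #   for user in yield_all_pages(...):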
