
Reduce memory usage #1560

Open
jay0lee opened this issue Sep 14, 2022 · 7 comments
@jay0lee
Member

jay0lee commented Sep 14, 2022

Today GAM can use a lot of memory when running a command like:

gam report user

in a large domain. Any command that uses gapi.get_all_pages() may use a LOT of memory while downloading many pages from Google.

Rewriting every usage of get_all_pages to do all parsing after each page would require a LOT of work, but we can at least save some memory by changing the way we generate all_pages.

Rather than keeping all_pages as a normal in-memory list object, we can use the built-in Python shelve module to write the results to disk and save memory.
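
A minimal sketch of that idea, assuming a google-api-python-client style list method; get_all_pages_shelved and items_key are made-up names here, not GAM's actual gapi.get_all_pages() signature:

    # Sketch only: spill each page's items to a shelve file on disk instead of
    # accumulating them all in an in-memory list.
    import os
    import shelve
    import tempfile

    def get_all_pages_shelved(method, items_key='items', **kwargs):
      db_path = os.path.join(tempfile.mkdtemp(), 'all_pages')
      items = shelve.open(db_path)
      index = 0
      page_token = None
      while True:
        result = method(pageToken=page_token, **kwargs).execute()
        for item in result.get(items_key, []):
          items[str(index)] = item  # shelve keys must be strings, so items[0] won't work
          index += 1
        page_token = result.get('nextPageToken')
        if not page_token:
          break
      return items  # caller iterates items.values(), then closes and deletes the file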

@jay0lee jay0lee self-assigned this Sep 14, 2022
@jay0lee
Member Author

jay0lee commented Sep 14, 2022

@taers232c fyi

@jay0lee
Member Author

jay0lee commented Sep 15, 2022

I'm not totally convinced this is the way to go for low-memory handling. It works fine if the returned data is iterated over, but as soon as you try to do something like data[0] it fails, which will lead to unusual bugs.

An alternative approach would be to write a replacement for get_all_pages that takes a "callback" function to process each page, removing the need to store hundreds of thousands or sometimes millions of user, device, or other list objects. The challenge is that things like CSV output need to know the columns before rows are added, and we may be dynamically adding columns during later page callbacks.
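
Roughly what that could look like; process_all_pages and page_callback are illustrative names, and the paging call is simplified to a google-api-python-client style method rather than GAM's callGAPI wrapper:

    # Sketch only: hand each page to a caller-supplied callback and keep nothing.
    def process_all_pages(method, page_callback, items_key='items', **kwargs):
      page_token = None
      while True:
        result = method(pageToken=page_token, **kwargs).execute()
        page_callback(result.get(items_key, []))  # e.g. convert this page's items to CSV rows
        page_token = result.get('nextPageToken')
        if not page_token:
          break  # only one page was ever held in memory

    # A CSV callback would still have to buffer rows (or rewrite the header at the
    # end) to cope with columns that only show up in later pages.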

@taers232c
Contributor

I already do something like this, although not with callbacks. I get a page, process it, delete it, get the next page.

      printGettingAllEntityItemsForWhom(Ent.DRIVE_FILE_OR_FOLDER, user, i, count, query=DLP.fileIdEntity['query'])
      pageMessage = getPageMessageForWhom()
      pageToken = None
      totalItems = 0
      userError = False
      while True:
        # Fetch one page of Drive files at a time
        try:
          feed = callGAPI(drive.files(), 'list',
                          throwReasons=GAPI.DRIVE_USER_THROW_REASONS+[GAPI.NOT_FOUND, GAPI.TEAMDRIVE_MEMBERSHIP_REQUIRED],
                          retryReasons=[GAPI.UNKNOWN_ERROR],
                          pageToken=pageToken,
                          orderBy=OBY.orderBy,
                          fields=pagesFields, pageSize=GC.Values[GC.DRIVE_MAX_RESULTS], **btkwargs)
        except (GAPI.notFound, GAPI.teamDriveMembershipRequired) as e:
          entityActionFailedWarning([Ent.USER, user, Ent.SHAREDDRIVE_ID, fileIdEntity['shareddrive']['driveId']], str(e), i, count)
          userError = True
          break
        except (GAPI.serviceNotAvailable, GAPI.authError, GAPI.domainPolicy) as e:
          userSvcNotApplicableOrDriveDisabled(user, str(e), i, count)
          userError = True
          break
        # Process this page, then drop it before requesting the next one
        pageToken, totalItems = _processGAPIpagesResult(feed, 'files', None, totalItems, pageMessage, None, Ent.DRIVE_FILE_OR_FOLDER)
        if feed:
          extendFileTree(fileTree, feed.get('files', []), DLP, stripCRsFromName)
          del feed
        if not pageToken:
          # No more pages
          _finalizeGAPIpagesResult(pageMessage)
          break

@jay0lee
Member Author

jay0lee commented Sep 15, 2022 via email

@taers232c
Contributor

Agreed. Something to do in my spare time.

@ejochman
Contributor

Rather than having to provide a callback function, you may want to consider creating something like yield_page() as a Pythonic way of fetching pages of results and yielding them, as needed, to the caller. I've long wanted to fix this behavior of writing everything to memory before processing, but (as noted earlier) it will take a lot of work.

I would also be careful with using something like shelve without being explicit about its behavior to users. You may inadvertently be storing the organization's sensitive or PII data, unencrypted, on disk.

@jay0lee
Member Author

jay0lee commented Sep 16, 2022

yield is a good idea. It does mean we end up waiting for a page to be processed locally before we request the next page, so it's not necessarily faster. I really like the idea of a callback that fetches pages as fast as possible while parsing them locally in parallel, but maybe that additional complexity isn't worth the performance gain.

What makes sense to me would be to add a function like yield_all_pages() and then start moving get_all_pages() callers over as we can.
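
A rough sketch of what that generator could look like; the yield_all_pages name comes from this comment, but the signature is illustrative only and not GAM's actual gapi API:

    # Sketch only: yield items as they arrive so only the current page is in memory.
    def yield_all_pages(method, items_key='items', **kwargs):
      page_token = None
      while True:
        result = method(pageToken=page_token, **kwargs).execute()
        yield from result.get(items_key, [])
        page_token = result.get('nextPageToken')
        if not page_token:
          return

    # Callers could then migrate one call site at a time, e.g. from
    #   for user in gapi.get_all_pages(...):
    # to
    #   for user in yield_all_pages(...):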
