Identify credits for the next year of gke.mybinder.org #463

Closed
1 of 3 tasks
choldgraf opened this issue Oct 30, 2021 · 69 comments

@choldgraf
Member

choldgraf commented Oct 30, 2021

Proposed change

Our annual allotment of credits for gke.mybinder.org runs out in late December (December 22nd, 2021, I believe). We won't have spent the credits down to 0 by then, but they will expire on that date.

We need to identify where another round of funding for gke.mybinder.org will come from.

Draft of two pager

See here for a draft two-pager to send to Karan

Action plan

We have 3 known options to power gke.mybinder.org:

  1. GCP credits, if they extend them by another year
  2. Jupyter Meets the Earth grant, if @fperez approves and considers it in scope - this is likely not an option because it only runs on AWS
  3. Pangeo cloud credits

Here's the current plan:

  • On Wednesday, December 22nd our credits on gke.mybinder.org run out.
  • We wait until 6PM US/Pacific on Monday the 21st for a response from GCP and/or to confirm that we can use the JMTE funds, and don't take technical action until then.
  • After then: If GCP gives us credits, then we direct the new billing account w/ credits to gke.mybinder.org, and we're golden.
  • If not, but we have JMTE approval, then we direct the JMTE billing account to gke.mybinder.org, and we're golden (for a month)
  • If no JMTE approval, then we immediately switch into "deploy with Pangeo funding" mode and hope/plan for this to be completed by Wednesday the 22nd.
  • If for some reason we don't think any of the above cases will work, then we should release a blog post saying that capacity will drastically shrink (here is a draft blog post). We should then shut down gke.mybinder.org (unless somebody wants to foot the bill; I cannot pay for this deployment from my credit card again). This will reduce mybinder.org's capacity by about 75%, but I think that's just the reality that we face. Unless somebody has access to a large amount of Google Cloud credits that we can hook into gke.mybinder.org, I'm not sure what else we can do.
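
The fallback order in the plan above can be sketched as a small decision function. This is only an illustration of the ordering: the function name, the boolean inputs, and the returned strings are made up for this sketch and do not correspond to any real tooling.

```python
def choose_funding_action(gcp_credits_granted: bool,
                          jmte_approved: bool,
                          pangeo_feasible: bool) -> str:
    """Return the next step, following the fallback order in the plan.

    All three inputs are hypothetical flags standing in for the answers
    the team was waiting on.
    """
    if gcp_credits_granted:
        # Preferred outcome: keep the existing deployment, swap billing.
        return "attach new GCP billing account"
    if jmte_approved:
        # Stop-gap: described in the plan as good for about a month.
        return "attach JMTE billing account"
    if pangeo_feasible:
        # Least preferred: requires redeploying gke.mybinder.org.
        return "redeploy with Pangeo funding"
    # Nothing worked: announce reduced capacity and shut down.
    return "publish blog post and shut down gke.mybinder.org"


print(choose_funding_action(False, True, True))
```

Each branch is only reached when every earlier option has failed, which mirrors the "if not, then" wording of the bullets.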

Regardless of all this, we need to release a blog post about the current situation, because it is clearly unsustainable (at least, for me it is unsustainable, and I assume for others as well)

Tasks to complete

  • Put together a 2-pager that demonstrates Binder's impact
  • Comments and feedback on the 2-pager (see link above)
  • Wait for Karan's feedback
@choldgraf
Member Author

Update: conversation with Karan and the GCP Research team

@consideRatio and I had a conversation with Karan from Google Cloud. He said that he was hopeful they'd be able to fund gke.mybinder.org for another round of cloud credits. In order to explore this, he'd need a 2-pager style document that demonstrated the impact of mybinder.org as well as the costs over time.

In particular, they care about things that demonstrate diverse and worldwide impact, like:

  • Usage from communities outside of the US/Europe
  • Total number of weekly sessions
  • Number of GitHub repositories served

I've updated the top comment with some next steps about putting together this 2-pager

@choldgraf choldgraf moved this to Todo 👍 in Sprint Board Nov 17, 2021
@choldgraf choldgraf self-assigned this Nov 17, 2021
@choldgraf
Member Author

I am going to try putting together a 2-pager ASAP that we can send to Karan, because 1.5 months is not that much time for us to get another round of funding. I would really appreciate any suggestions or help from others! Here are a few things that could be useful to help:

  • Analyze some Binder data from the last year and come up with plots that demonstrate impact along the categories described above
  • Come up with ideas for information we could put into a 2-pager that would demonstrate impact
  • Help with writing or editing drafts

@minrk
Member

minrk commented Nov 17, 2021

I'll work on gathering some analytics

@MridulS

MridulS commented Nov 17, 2021

@minrk I did some work here https://gist.github.com/MridulS/5accc696311c4f381c05cb70922d3624

[Image: Screenshot 2021-11-17 at 12.53.30]

@minrk
Member

minrk commented Nov 17, 2021

Nice! I'll look at getting region data from matomo

@sgibson91
Member

I wonder if we can deploy this to the federation too, to make info gathering easier in the future? https://github.com/bitnik/binder-launches (Unfortunately, the link to the instance at GESIS no longer seems to be up)

@minrk
Member

minrk commented Nov 17, 2021

Still working on analytics, since I've never really dug into matomo before, but here are monthly visits to GKE by continent:
[Image: plot of monthly visits to GKE by continent]

and summing all non-NA-EU together shows it's about 50% NA, 30% EU, 20% rest-of-world:
[Image: plot of monthly visits grouped as NA, EU, and rest-of-world]

https://gist.github.com/28e6c3aeb9e7a208e0986a67892e912d
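
For anyone wanting to redo the grouping step, here is a minimal sketch of collapsing per-continent visit counts into NA / EU / rest-of-world shares. The counts are made up (chosen only so the output matches the rough 50/30/20 split mentioned above), and the continent codes and function name are assumptions; the real numbers come from Matomo via the gist.

```python
def regional_shares(visits_by_continent):
    """Collapse per-continent visit counts into NA / EU / rest shares."""
    total = sum(visits_by_continent.values())
    if total == 0:
        raise ValueError("no visits to summarize")
    groups = {"NA": 0, "EU": 0, "rest": 0}
    for continent, visits in visits_by_continent.items():
        # Everything that is not North America or Europe is pooled.
        key = continent if continent in ("NA", "EU") else "rest"
        groups[key] += visits
    return {name: count / total for name, count in groups.items()}


# Hypothetical counts, NOT real mybinder.org data.
example = {"NA": 500, "EU": 300, "AS": 120, "SA": 40, "AF": 20, "OC": 20}
shares = regional_shares(example)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```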

@choldgraf
Member Author

Hey all - thanks for these very helpful graphs! I tried to reproduce some of them but I cannot figure out how to get the Matomo secret to access that data (here's an issue I opened as a result: #473 (comment)).

Can anybody help me get access to the Matomo data so that we can include country information in this report?

@choldgraf
Member Author

choldgraf commented Nov 20, 2021

Update: draft is ready

Hey all - I took some of the plots here (some directly, some as inspiration) and put together the 2-pager at the link below:

https://docs.google.com/document/d/1DvW8TYgEVWYvsgZKlr4JrmuhLQoYC-jTie0okgnIjp0/edit?usp=sharing

I'd love feedback from folks if they think this looks OK. The goal of the 2-pager is to demonstrate the impact and usage of Binder, but doesn't need to go into a ton of detail. Also note, I couldn't figure out how to get Matomo data myself, so I just went with copy/pasting Google Analytics images, but happy (and prefer) to use Matomo data if somebody can help me get access to it.

I've uploaded some archive launch data + the notebooks to visualize it here: https://github.com/choldgraf/binder-meta

If people would like to make any changes etc to those notebooks, PRs are welcome!

@choldgraf
Member Author

Update: sent to Karan for feedback

I know that this is a short turnaround, but we only have about a month before gke.mybinder.org runs out of credits, so I have sent the two-pager above to Karan for some feedback to see if we need to add anything to the 2-pager before he submits internally. I've cc'ed @minrk (as team lead) and @consideRatio (since he's been helping with the GCP Binder move lately) on the email. Will report back with relevant information.

@choldgraf choldgraf moved this from Todo 👍 to Waiting 🕛 in Sprint Board Nov 22, 2021
@betatim
Member

betatim commented Nov 24, 2021

Thanks a lot for putting the numbers together and adding words! I think it is good enough that we could send it already, so now we have a bit of time to make it even better.

I read the draft and left a few comments. Most of them are suggestions/nitpicks.

One thing I was wondering is if we can show/say something about an exciting/new area that is growing in terms of mybinder.org usage. The prime example that comes to my mind is things like executable books where we provide a crucial bit of infrastructure for courses/educational books from around the world that lets them do something that is otherwise super hard to do (executable sections in a text book). But I am not deeply enough into the executable books project/user base to know if there are a handful of neat projects we could point to. Not in detail but as a "this is a new area that is growing and super cool!"

@choldgraf
Member Author

Hey all - I have still not heard any specific response from Google, and so I want to start contingency planning for what to do if we do not get new credits in time. Here is what I propose:

Timeline for running out of credits

  • On Wednesday, December 22nd our credits on gke.mybinder.org run out.
  • If we don't have word by tomorrow (Tuesday, December 21st), we should follow this plan:
    1. make a blog post on the Jupyter Blog telling users what to expect. Here is a draft blog post.
    2. On Wednesday the 22nd, shut down gke.mybinder.org (unless somebody wants to foot the bill, I cannot pay for this deployment from my credit card again)

This will reduce mybinder.org's capacity by about 75%, but I think that's just the reality that we face. Unless somebody has access to a large amount of Google Cloud credits that we can hook into gke.mybinder.org, I'm not sure what else we can do.

@betatim
Member

betatim commented Dec 20, 2021

Sad times.

We will also need to find a new host for https://github.com/jupyterhub/mybinder.org-deploy/tree/master/images/federation-redirect and tweak its configuration so it will continue to work without GKE as the "prime". I think it makes sense to use the OVH deployment as the new prime site. I think these tasks need to happen to make the move:

  • deploy federation proxy to new cluster
  • double check the config
  • change the IP address that mybinder.org points to

I think we can run two instances of the federation proxy in parallel without weird stuff happening. This means it shouldn't be a huge interruption to users.
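
To illustrate why users should see little interruption once GKE is taken out of rotation, here is a rough sketch of weighted selection among federation members. This is not the actual federation-redirect code, whose logic may differ; the weights, the `healthy` flag, and the function name are assumptions for illustration only.

```python
import random


def pick_member(members, rng=random):
    """Pick a federation member at random, proportionally to its weight.

    `members` maps a member name to a dict with hypothetical keys
    "weight" (int) and "healthy" (bool).
    """
    eligible = {
        name: info["weight"]
        for name, info in members.items()
        if info["healthy"] and info["weight"] > 0
    }
    if not eligible:
        raise RuntimeError("no federation member available")
    names = list(eligible)
    weights = [eligible[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical weights; setting GKE's weight to 0 removes it from rotation.
members = {
    "gke": {"weight": 0, "healthy": True},
    "ovh": {"weight": 3, "healthy": True},
    "gesis": {"weight": 2, "healthy": True},
    "turing": {"weight": 1, "healthy": True},
}
print(pick_member(members))
```

With a scheme like this, launches simply stop landing on a member whose weight is zero, so the remaining members absorb the traffic without a change visible to users.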

Where/how should I give feedback on the blogpost draft?

Should we now tweet about the upcoming change? - If we do we give people (who read the tweet) only about 48h notice which isn't a lot. But hopefully there aren't too many people who rely on gke.mybinder.org explicitly or were planning big demos or some such. I think it would be a good idea to do so.

@choldgraf
Member Author

choldgraf commented Dec 20, 2021

We are scrambling a bit to see if we can make up any extra funding from a different source. I also hope to have a more definitive answer from GCP by the end of day US/Pacific. There are two potential other funding sources we might be able to use in a stop-gap fashion.

  • The Jupyter Meets the Earth grant (which we do not yet have explicit approval for, so need that first from @fperez if he thinks it's in-scope)
  • The Pangeo cloud grant (which we do have approval to use from @rabernat, but which would require some technical complexity because that funding is parked at Columbia)

In either case, it is not a long-term solution, more like a 1-month stopgap to keep the lights on.

Here's my proposed plan:

REMOVED here and added to the top comment above

I'll update the top comment with this plan for visibility

@betatim
Member

betatim commented Dec 20, 2021

What does "deploy with Pangeo funding" mean? Switching billing accounts, deploying to a new cluster, or some third option?

For anything beyond "Switch billing accounts" I think we should start moving the federation proxy as it will be good to have that somewhere else in either case. And it is something we can start doing instead of waiting for the clock to tick down. The closer we get to the lights going out the more hectic things will get, the more hectic things get the more mistakes we will make, the more mistakes we make the more hectic it will get, etc :D So I think starting to move now is worth it.

@choldgraf
Member Author

choldgraf commented Dec 20, 2021

@betatim yep, Pangeo has some grant funds parked at Columbia which are earmarked for a Binder deployment, and we can realistically say it is in-scope for that grant to pay for a short time of mybinder.org. However it'd require setting up a new project under the Columbia.edu cloud org, and re-deploying gke.mybinder.org there. This is why it is the least preferred option.

@betatim
Member

betatim commented Dec 20, 2021

(sorry I edited my last comment above for a long time without clicking "save")

@betatim
Member

betatim commented Dec 20, 2021

Has anyone asked the current members of the federation how much spare capacity we have there? Maybe we can increase our allocations there to make up for the lost capacity at GKE.

cc @MridulS for gesis, @sgibson91 for Turing (can you tag the right new person please?) and @mael-le-gal for OVH

@betatim
Member

betatim commented Dec 21, 2021

Making the disks smaller sounds like a good plan. One thing I have at the back of my mind is that IO performance is linked to disk size. So maybe that was the reason for having such large disks (and we seem to end up with spare credits at the end of the year anyway -> the time limit of the credits is a bigger factor than the amount). Worse performance is better than no performance though, so yay to smaller disks.

We could also ditch the "two disk" approach and use only the main disk to save even more money. I think OVH has been running in that mode for a while now. It needs a bit of a reconfiguration of the image GC to use an absolute size and not an inode based threshold.

@minrk
Member

minrk commented Dec 21, 2021

The local SSD is a relatively small cost, so I'm not sure it's an optimization worth making right now. Since the same capacity on the PD SSD is 5x as expensive, merging the two probably doesn't make sense.

It would be interesting if we could get the host docker onto another local SSD and lose the PD-SSD altogether. That would save tons at a cost of fixed capacity per node. Not sure if that's possible, though.

@choldgraf
Member Author

I just got off the phone with the Google OSPO office. They are working to find stop-gap funding (maybe 6 weeks or so) in order to keep Binder running through January (though, if we can bring down the costs, we might be able to extend this a few months). That would buy us some time to work out a longer-term solution that is more sustainable for us (and for them) than this "every 1 year we frantically email people we know at Google" approach thus far. No promises from them, but I'm hopeful we'll work something out and will report back here as I learn more.

@minrk
Member

minrk commented Dec 21, 2021

New quota increases from federation members have greatly reduced load on GKE prod. I've helped encourage scale-down a little with some cordoning. But I think between (ongoing) stale image deletions and load redistribution, we're looking at at least a few thousand dollars saved today on the monthly bill.

@minrk
Member

minrk commented Dec 21, 2021

One thing we probably still need to do to run in a cost-conscious way is help scale-down with manual cordoning of low-occupancy user nodes. We had 7 42-day-old nodes, which is capacity for ~600 users at our lowest traffic times, I think? That's definitely more than we needed.
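
A rough sketch of how one might pick candidates for manual cordoning, assuming we already have a count of user pods per node. The node names, the per-node capacity of 80, and the 20% occupancy threshold are all made up for illustration; the only real command referenced is `kubectl cordon <node>`, which marks a node unschedulable.

```python
def nodes_to_cordon(pods_per_node, capacity_per_node,
                    max_occupancy=0.2, keep_at_least=1):
    """Return low-occupancy node names, emptiest first.

    Always leaves at least `keep_at_least` nodes schedulable so new
    users still have somewhere to land.
    """
    occupancy = {
        node: pods / capacity_per_node
        for node, pods in pods_per_node.items()
    }
    candidates = sorted(
        (node for node, frac in occupancy.items() if frac < max_occupancy),
        key=lambda node: occupancy[node],
    )
    max_cordon = max(len(pods_per_node) - keep_at_least, 0)
    return candidates[:max_cordon]


# Hypothetical pod counts, not taken from the real cluster.
pods = {"user-a": 3, "user-b": 60, "user-c": 10}
for node in nodes_to_cordon(pods, capacity_per_node=80):
    print(f"kubectl cordon {node}")
```

Cordoned nodes stop receiving new pods, so they drain naturally as sessions end and the autoscaler can then remove them.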

@betatim
Member

betatim commented Dec 21, 2021

For the downscaling we should investigate the custom scheduler we use and if something has changed there. It used to work well :-/ An alternative we've discussed is using node preferences (similar to what we do for "sticky" build pods) where we work out the "least busy node" and then add an anti-preference for that node to a pod when creating it.
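
The anti-preference idea could look roughly like the sketch below, which builds the `affinity` section of a pod spec as a plain dict. The Kubernetes fields used (`nodeAffinity`, `preferredDuringSchedulingIgnoredDuringExecution`, the `NotIn` operator, the `kubernetes.io/hostname` label) are standard; the helper names, the pod counts, and the choice of weight are assumptions, and this is not how BinderHub currently does it.

```python
def least_busy_node(pods_per_node):
    """Return the node with the fewest pods (first one wins on ties)."""
    return min(pods_per_node, key=pods_per_node.get)


def anti_preference_affinity(node_name, weight=100):
    """Build a soft 'prefer anywhere but this node' affinity stanza.

    `weight` must be in the range 1-100 for preferred scheduling terms.
    """
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": "kubernetes.io/hostname",
                                "operator": "NotIn",
                                "values": [node_name],
                            }
                        ]
                    },
                }
            ]
        }
    }


# Hypothetical pod counts per node.
pods = {"user-a": 12, "user-b": 2, "user-c": 30}
affinity = anti_preference_affinity(least_busy_node(pods))
print(affinity)
```

Because the term is a preference rather than a requirement, the scheduler can still place pods on the least busy node when the others are full; it just steers new pods away so that node has a chance to empty out.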

@arnim

arnim commented Dec 21, 2021

If absolutely necessary and of help, GESIS would also be able to contribute ~$5k via Linode. For the image repository, we could use our existing ones.

@betatim
Member

betatim commented Dec 21, 2021

Somewhat off-topic but also not: does anyone know why https://jupyter.org/try links to https://mybinder.org/v2/gh/jupyterlab/jupyterlab-demo/HEAD?urlpath=lab/tree/demo which currently doesn't build? I thought this must have been a recent commit that broke it but turns out the last commit was in mid October. Maybe no one will notice/complain if we don't provide a seamless transition/switch it over to something a bit different (jupyterlite) given that it seems to have been broken for a while now?

@choldgraf
Member Author

choldgraf commented Dec 21, 2021

@betatim nope I don't know who controls that repo or that link, but definitely agree that this is another reason to just use JupyterLite. Here's the issue @minrk brought up to discuss that: jupyter/jupyter.github.io#513

To try and prep for that, I just made a PR to the JupyterLite docs to add a more "introductory" notebook for their links: jupyterlite/jupyterlite#432

@minrk
Member

minrk commented Dec 21, 2021

Looks like we got some interim credits from Google just in time, so we have a little slack while we work on the more permanent solution.

@minrk
Member

minrk commented Dec 21, 2021

which currently doesn't build?

I think a dependency must have updated out from under it. When I tested, it ran fine on GKE and didn't need a build. This is probably in our top 2 most popular images, so I think folks would notice. It must be getting assigned to turing more often with the recent changes, where it wasn't in the cache already.

pyyaml 6.0 recently dropped support for the long-deprecated, but still widely used yaml.load(str), which is the source of the failure. Fixed in jupyterlab/jupyterlab-demo#113
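
For reference, a minimal sketch of the kind of change that avoids this failure. It is not necessarily what the linked PR does; the helper name and sample document are made up. On PyYAML 6.0 and later, calling `yaml.load(text)` without a `Loader` argument raises a `TypeError`, so code needs either `yaml.safe_load(text)` or an explicit `Loader`.

```python
try:
    import yaml  # PyYAML, a third-party package
except ImportError:  # keeps this sketch importable where PyYAML is absent
    yaml = None


def parse_yaml(text):
    """Parse a YAML string in a way that works on PyYAML 5.x and 6.x."""
    if yaml is None:
        raise RuntimeError("PyYAML is not installed")
    # Equivalent to yaml.load(text, Loader=yaml.SafeLoader).
    return yaml.safe_load(text)


if yaml is not None:
    print(parse_yaml("name: binder\nreplicas: 2"))
```

`safe_load` only constructs plain Python types, which is usually what configuration files need.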

@choldgraf
Member Author

choldgraf commented Dec 21, 2021

As @minrk mentioned - we just got $10,000 of Google Cloud credits deposited into the same GCP billing account, and they expire in 6 months. This means that we don't need to make any changes and the service will keep running.
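
As a back-of-the-envelope check on how long credits like these last, here is a small sketch. The $10,000 and 6-month figures come from the comment above; the monthly costs are hypothetical, since the actual bill depends on how far the cost reductions go.

```python
def runway_months(credits, monthly_cost, months_until_expiry):
    """Months of service covered: limited by spend or by expiry."""
    if monthly_cost <= 0:
        raise ValueError("monthly_cost must be positive")
    return min(credits / monthly_cost, months_until_expiry)


# Hypothetical monthly costs, in USD.
for cost in (5_000, 2_500, 1_000):
    months = runway_months(10_000, cost, months_until_expiry=6)
    print(f"${cost}/month -> {months:g} months")
```

At a high monthly cost the credit amount is the limit; once costs drop far enough, the expiry date becomes the binding constraint instead.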

I'd like to write a short blog post about this episode to give transparency to our user community about what happened and what we're doing to try and improve things in the future. I imagine something like:

  1. Brief description of what happened
  2. Cost reduction efforts (e.g. image cleaning etc)
  3. Load balancing efforts (e.g., other federation members stepping up to carry more load)
  4. Stop-gap credits from Google
  5. Next step is exploring ways that we can further make progress on each of these items

Does anybody object to that plan? I'll try to get a draft ASAP while the experience is still fresh.

@betatim
Member

betatim commented Dec 22, 2021

Hadn't thought about dependencies changing :-/

I think a blog post is a good idea. I would lead with (5) though instead of a chronological/experience report order. My reasoning is that (5) is the most important thing out of all this for the reader and what we'd like the reader to help us with. This in turn makes me wonder what we are looking for and if we can express that in a couple of sentences. Some properties that I think we want: (0) someone who enjoys fundraising (1) multi year (2) GKE and (3) <$10000 per month for sure, maybe $5000 per month in credits. (4) a call to action "contact us via this thing" if you can help with funding or the effort to secure funding.

Other things we could be doing: onboard more federation members, increase the capacity at existing federation members. But I would put them in a separate blog post or further down in this post, to keep the focus on the above five points as the one thing people remember.

I would lead with our ask/next steps because no one reads stuff on the internet and even those who read, should get the most important point first.


Before writing I think we should sharpen what we are asking for and how people can reach us so that we have a very concrete idea for both. This will allow us to write a clear article with a concrete call to action. It also means that we have alignment on it within the team.

@manics
Member

manics commented Dec 22, 2021

Following on from @betatim's last point, also work out how much time (if any) we can devote to cases where compute can be contributed but not people (thinking of jupyterhub/mybinder.org-deploy#1772)

@betatim
Member

betatim commented Dec 22, 2021

I think for things like jupyterhub/mybinder.org-deploy#1772 we will not find out if they are a net good/bad without trying it. But someone has to have time and drive to keep moving it forward.

@choldgraf
Member Author

Regarding a blog post, I was thinking of just a minimal "what happened, what we did to resolve it, and thank you Google" post rather than a more future-looking post. My reasoning is that I worry raising the bar too much will increase the likelihood that no post will happen at all.

I definitely agree it's a good idea to come up with an action plan, call to action, etc but do we have bandwidth to do this? Is anybody willing to champion this?

I do think the Google team is interested in meeting further in January to find a more sustainable solution, which is where I'm going to put my cycles if I have them.

@betatim
Member

betatim commented Jan 5, 2022

I do think the Google team is interested in meeting further in January to find a more sustainable solution, which is where I'm going to put my cycles if I have them.

Is there a way to help with that? More generally I am wondering what the next steps are here.

@choldgraf
Member Author

choldgraf commented Jan 5, 2022

Thanks for following up @betatim - a few updates:

Blog post / CTA

Yesterday I put together a short draft that tried to incorporate some of the ideas shared above. I added a section for "what we need / what you can do" but I suspect it'll need a bit of iteration:

https://docs.google.com/document/d/1A2TDXlQ1ap1dM7ek2gRRfSL9O6xudgNgOQqNW3LUwPo/edit?usp=sharing

If folks are interested in having a dedicated brainstorm to think about sustainable pathways forward, I'd be happy to do so.

Google credits

I got in touch with Karan yesterday to check up on the status of the credit request we had originally put in. He said they'd like to meet on Thursday to discuss more sustainable pathways forward. He put a meeting on my calendar for 9:30AM US/Pacific time (EDIT: he just rescheduled to 12:30pm US/Pacific). I don't think we want to overwhelm them with a ton of people, but if anybody is interested in joining this conversation as well I'd love to have you there. Just let me know! Either way, I will report back what we discuss after that conversation.

@choldgraf
Member Author

choldgraf commented Jan 6, 2022

Update from conversation with Google

Just had a quick meeting with a few folks from Google Research. Here are the notes:

https://docs.google.com/document/d/1W5q3WLeT_sviLrW0zhpmo5DfPww8sA0w7tlBCeeAa6o/edit?usp=sharing

tl;dr: we have roughly two options that we can explore in parallel:

  1. Kick-off a credit request that is similar to the one they approved with fast turnaround in December. This could get mybinder.org credits for another year or so.
  2. Explore a more formal "in-kind sponsorship", which would help formalize the relationship between Binder and Google more clearly, and make it easier / less work to renew this collaboration each year rather than blindly reaching out to folks inside of Google

They're going to start the process for 1 right now, and for 2 we'll need to do two things:

  1. Engage in some kind of conversation with them in the next few months to agree on what an in-kind sponsorship would look like
  2. Decide for ourselves what it means to "support the Binder project". If we could provide some structure for them (e.g., a "sponsorships page" with clear criteria for different sponsorship levels, and a list of organizations that support the project at each level), then this would make things easier for them and others to support us. They can give us feedback if we want to work on this kind of thing.

I like the idea of defining for ourselves what sponsorship means, and then reaching out to Google (or others) for feedback and requests to sponsor us at particular levels. I think that might be a way that we can grow the network of sponsors beyond just Google. What do others think?

@choldgraf
Member Author

We've had a few conversations here and in the Matrix channel about sustainability opportunities to explore. Rather than ballooning this issue into a long thread about sustainability, I decided to update the top comment of #430 so that it captures a few of the ideas we've discussed for longer-term sustainability efforts.

Are folks OK taking the long-term sustainability conversation there, so we can focus this thread around extending our credit runway with Google in the short-term?

@choldgraf
Member Author

I'm going to close this one, as we have a resolution in jupyterhub/mybinder.org-deploy#2138 and we've also got a longer-term plan for credits in these two issues:

@choldgraf
Member Author

Debrief from Mary @ Google

I had a quick phone call to understand from one of the OSPO people what happened this time around. Here are some quick takeaways from that conversation:

  • Their annual budgeting / planning for the next year begins around September.
  • So we should begin the process of next year's credit request in September.
  • The most important thing is that we find a Googler who can "champion" us to the OSPO team
  • They believe the reason it took so long this time is a combination of COVID, unexpected personnel changes, and miscommunication between Googlers.
  • In the future, Mary Radomile is a good person to reach out to about process stuff.

I'll set a personal reminder to start asking Google for credits again in September, and will also connect Karan with her so that they can do some information sharing as well.

Just wanted to update y'all!

@minrk
Member

minrk commented May 6, 2022

Super useful, thanks for chasing that down, @choldgraf!


9 participants