Improve Scio on Jupyter #226

Open
wants to merge 2 commits into
base: main

anish749

I tried to make this more usable.

Since this had been lying unused for almost two years, I took the liberty of breaking some of the APIs.

TL;DR:

  • Upgrade to Scio 0.6.1.
  • Add an API to easily create multiple Scio contexts with the same PipelineOptions.
  • Add some helper functions that make Scio easier to use from Jupyter.

This primarily adds the ability to easily close and recreate Scio contexts with the same pipeline options and to run on Dataflow or other runners. It still doesn't make things very interactive, as outputs are still not in memory and it takes a few minutes to start a Dataflow job.
This PR does, however, make it easy to iteratively develop batch pipelines in Scio / Beam.
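
To illustrate the close-and-recreate workflow, here is a minimal sketch. It is not the API added in this PR: `pipelineArgs` and `newContext` are hypothetical names, and it assumes Scio 0.6.x (where `close()` submits the pipeline) with the default local runner on the classpath.

```scala
import com.spotify.scio.ScioContext
import org.apache.beam.sdk.options.PipelineOptionsFactory

// Hypothetical helper, not the API from this PR: keep the arguments once and
// build a fresh ScioContext from them for every iteration.
// Assumption: empty arguments select the default local runner; add real flags
// such as "--runner=DataflowRunner" and "--project=<your-project>" as needed.
val pipelineArgs = Seq.empty[String]

def newContext(): ScioContext = {
  // Re-parse the arguments so that each context gets its own options instance.
  val opts = PipelineOptionsFactory.fromArgs(pipelineArgs: _*).create()
  ScioContext(opts)
}

// First iteration: build a small pipeline, then close the context to run it.
val sc1 = newContext()
sc1.parallelize(1 to 3).map(_ * 2)
sc1.close()

// A closed context cannot be reused, so the next iteration starts from a new
// one created with the same settings.
val sc2 = newContext()
```

Re-parsing the arguments for each context avoids sharing one mutable options instance between jobs; whether a single instance could be reused safely probably depends on the runner.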

We can also use Taps to temporarily materialize an SCollection to the staging bucket and read the data back from there, which makes analysis somewhat easier.
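
A rough sketch of the Tap approach, assuming Scio 0.6.x (where `materialize` returns a `Future[Tap[T]]`) and the default local runner, so the data lands in a temporary location rather than a staging bucket. The values and names are illustrative only.

```scala
import com.spotify.scio._
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Assumption: default options, which run the pipeline locally.
val sc = ScioContext()

// Mark the collection for materialization before the pipeline runs.
val doubled = sc.parallelize(1 to 5).map(_ * 2)
val futureTap = doubled.materialize

// Closing the context runs the pipeline; the Tap becomes available afterwards.
sc.close()
val tap = Await.result(futureTap, Duration.Inf)

// Read the materialized values back into the notebook session.
val values = tap.value.toList
values.foreach(println)
```

On Dataflow the same pattern should apply, with the usual job start-up delay before the Tap can be read.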

anish749 and others added 2 commits September 18, 2018 14:18
* upgrade to scio 0.6.1
* Add api to easily create multiple scio context with same settings.
* Some helper functions to make use of Scio easier.
@anish749
Author

Hey @alexarchambault, would you kindly review this PR?

@alexarchambault
Member

Hey @anish749, sorry for the delay.

I can merge, then make a release, if you have an immediate use for this. But most development now happens on the develop branch. It supports Spark via this project that targets Ammonite, which it just extends a bit to get some extra Jupyter-specific niceties (progress bars, …).

For Scio, I guess a project similar to ammonite-spark could be written, adding Scio support to Ammonite. It could then be used as is from the upcoming version of jupyter-scala. I was thinking of maybe renaming ammonite-spark to something like ammonite-bigdata and adding support for Scio to it, among other things. Is that something that would be useful for you?

@anish749
Author

anish749 commented Sep 23, 2018

Hey @alexarchambault, I think the develop branch is quite far ahead of master. What is the plan for merging it and releasing the next version? If that is still a while away, it would be great to have a v0.4.3 release.

If the idea is to split the jupyter-scala repo into multiple repos, I feel it might be a good idea to have Scio support as ammonite-scio, separate from ammonite-spark, given that most users who plan to use Spark for interactive analysis would not be using Scio at the same time, and vice versa.

The problem with Beam / Scio is that it is not very well suited to interactive analysis at the moment, which narrows the use cases in Jupyter. There were times when I really felt the need for a notebook-based environment, and hence started experimenting with this.

I was also wondering about support for Almond in a Docker image. I have been testing this locally in a Docker image, which makes collaborative development easier. I was thinking of adding this to https://github.com/jupyter/docker-stacks as well. What do you think?

@alexarchambault
Member

@anish749 Sure, go for it for https://github.com/jupyter/docker-stacks, don't hesitate to ping me there for feedback.

FYI, @Atry already wrote and pushed a docker image for almond, see #214 (comment) (but it's not added to docker-stacks)

@Atry

Atry commented Oct 5, 2018

@anish749 You can find those docker images at https://hub.docker.com/r/popatry/almond-images/

3 participants