Add notebook with batch embeddings generation & search #171

Merged

Conversation

jvidhi
Contributor

@jvidhi jvidhi commented Dec 30, 2024

No description provided.


@@ -0,0 +1,868 @@
{
Member

@Deependra-Patel Deependra-Patel Jan 2, 2025

Not needed


Contributor Author

Standard setup for more custom BigQuery-based operations, if any are added.

@Deependra-Patel Deependra-Patel changed the title from "batch embedding generation" to "Add notebook with batch embeddings generation & search" Jan 6, 2025
@Deependra-Patel
Member

/gcbrun

@Deependra-Patel Deependra-Patel merged commit d459ed5 into GoogleCloudDataproc:master Jan 6, 2025
1 check passed
@@ -0,0 +1,772 @@
{
Contributor

@bradmiro bradmiro Jan 6, 2025

Line #1.    # Copyright 2023 Google LLC

Change to 2025 please.



@@ -0,0 +1,772 @@
{
Contributor

@bradmiro bradmiro Jan 6, 2025

Reworded: In this tutorial, you create a similarity search on Stack Overflow questions to identify similar topics, questions, and technologies being discussed. You leverage BigQuery and Dataproc Serverless for distributed prediction with deep learning models.

Also, can you link to product pages the first time each product is mentioned? Dataproc Serverless, BigQuery, Workbench etc.



@@ -0,0 +1,772 @@
{
Contributor

@bradmiro bradmiro Jan 6, 2025

Reworded: In this tutorial, you use Apache Spark for batch inference/prediction and BigQuery for Vector Search. You run Apache Spark using Dataproc Interactive Sessions inside Vertex AI Workbench.

The example uses open source Stack Overflow data and the open source Hugging Face text embedding model all-MiniLM-L12-v2. The model maps text data into a 384-dimensional dense vector space. The similarity search runs on a vector index created in BigQuery.

The Hugging Face transformers library is installed by default in Dataproc Serverless runtime version 2.2+. See the full list of Python libraries in the runtime.

--

Please also add the link to the Stack Overflow data.

--

I think you can delete "This tutorial uses the following Google Cloud ML services and resources:" and below.
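
To make the reworded overview more concrete for readers, here is a minimal sketch of what "similarity search over embeddings" means. It is not the notebook's code: it uses toy 3-dimensional vectors instead of the 384-dimensional all-MiniLM-L12-v2 embeddings, a brute-force scan instead of a BigQuery vector index, and cosine similarity as an assumed distance measure. The names `cosine_similarity`, `top_k_similar`, and the `corpus` entries are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(query, corpus, k=2):
    """Return the k (id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for question embeddings (hypothetical ids and values).
corpus = {
    "q1": [1.0, 0.0, 0.0],
    "q2": [0.9, 0.1, 0.0],
    "q3": [0.0, 1.0, 0.0],
}

results = top_k_similar([1.0, 0.0, 0.0], corpus, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 4))
```

A brute-force scan like this compares the query against every stored vector, which is why a vector index becomes useful at larger scale.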



@@ -0,0 +1,772 @@
{
Contributor

@bradmiro bradmiro Jan 6, 2025

I think you can delete this block.



@@ -0,0 +1,772 @@
{
Contributor

@bradmiro bradmiro Jan 6, 2025

Enable autoscaling by setting the following parameters in Spark properties:

spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.maxExecutors = 100
spark.dynamicAllocation.minExecutors = 5
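
One possible way to express these settings in Python is sketched below. The helper name `dynamic_allocation_properties` and its default bounds are assumptions chosen to mirror the values above; how the notebook actually passes properties to its Dataproc session is not shown in this thread.

```python
def dynamic_allocation_properties(min_executors=5, max_executors=100):
    """Build the Spark autoscaling properties as a dict of strings.

    Hypothetical helper: the property keys are standard Spark settings,
    but the function itself is only an illustration.
    """
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("expected 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


props = dynamic_allocation_properties()
for key, value in props.items():
    print(f"{key}={value}")
```

With PySpark, each key/value pair could be applied through `SparkSession.builder.config(key, value)` before the session is created.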



@@ -0,0 +1,772 @@
{
Contributor

@bradmiro bradmiro Jan 6, 2025

Please explain what BQ Magics do and why the user might find this useful.



@@ -0,0 +1,772 @@
{
Contributor

@bradmiro bradmiro Jan 6, 2025

Please expand on this. This isn't an easy topic for a user to understand so the more explanation the better.



@@ -0,0 +1,772 @@
{
Contributor

@bradmiro bradmiro Jan 6, 2025

Line #4.    # embeddings

You can probably delete this line.



@@ -0,0 +1,772 @@
{
Contributor

@bradmiro bradmiro Jan 6, 2025

Add a markdown cell explaining this code.



@@ -0,0 +1,772 @@
{
Contributor

@bradmiro bradmiro Jan 6, 2025

Add a hyperlink for how to delete a BigQuery dataset.



@bradmiro
Contributor

bradmiro commented Jan 6, 2025

@jvidhi I added my review. In general, I would add a lot more explanation throughout. This isn't the easiest topic for a user to grasp, so I would err towards overexplaining.


3 participants