Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1346547: Error when using async queries: "390112: Your session has expired. Please login again." #1119

Closed
williamhbaker opened this issue Apr 25, 2024 · 8 comments
Assignees
Labels
bug Erroneous or unexpected behaviour status-fixed_awaiting_release The issue has been fixed, its PR merged, and now awaiting the next release cycle of the connector. status-triage_done Initial triage done, will be further handled by the driver team

Comments

@williamhbaker
Copy link

Please answer these questions before submitting your issue.
In order to accurately debug the issue this information is required. Thanks!

  1. What version of GO driver are you using?
    1.9.0

  2. What operating system and processor architecture are you using?
    darwin/arm64

  3. What version of GO are you using?
    go version go1.22.0 darwin/arm64

4.Server version:* E.g. 1.90.1
8.16.0

  1. What did you do?
  • Configure the client with client_session_keep_alive set to true.
  • Run a long-lived program that continuously makes frequent requests using async queries, with some of the queries taking several minutes to complete at time
  1. What did you expect to see?

The program continue to run. Instead, it will occasionally crash with an error "390112: Your session has expired. Please login again."

In the driver code, I see here where sessions should be getting renewed. From what I can see, an async query will later end up in this loop. I'm wondering if a session is still valid on that first check but just about to expire, and the session ends up expired in the retry loop without being gracefully handled? This would race with the session renewal heartbeat process that is enabled with client_session_keep_alive.

  1. Can you set logging to DEBUG and collect the logs?

Not easily - this is currently being worked-around by always creating a new client when a batch of async queries need to be run. My guess is that this gets a "fresh" session each time, instead of letting the prior session expire.

  1. What is your Snowflake account identifier, if any? (Optional)
@williamhbaker williamhbaker added the bug Erroneous or unexpected behaviour label Apr 25, 2024
@github-actions github-actions bot changed the title Error when using async queries: "390112: Your session has expired. Please login again." SNOW-1346547: Error when using async queries: "390112: Your session has expired. Please login again." Apr 25, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Apr 26, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added the status-triage Issue is under initial triage label Apr 26, 2024
@sfc-gh-dszmolka
Copy link
Contributor

hi and thanks for raising this and the hints too! I'll take a look and see if i can reproduce.
speaking about which, do you by any chance have perhaps a reproduction which when run, leads to the issue you're experiencing? probably would speed up debugging. Thank you in advance !

@sfc-gh-dszmolka
Copy link
Contributor

trying to reproduce the issue, so far with no success. Using the following approach with go1.21 and gosnowflake 1.9.0..

  1. since the error message clearly indicates the authentication token expired, focusing on this direction. This token by default has lifetime of 4 hours. With session policies its possible to change that.

So implemented a 5-minute session policy instead of the 4-hour to speed up tests. Now my authentication token is only valid for 5m.

CREATE SESSION POLICY test_db.public.session_policy_5m
  SESSION_IDLE_TIMEOUT_MINS = 5
  SESSION_UI_IDLE_TIMEOUT_MINS = 5
  COMMENT = 'Session policy 5m'
;
ALTER ACCOUNT SET SESSION POLICY test_db.public.session_policy_5m;
  1. using the following simple repro program:
package main

import  (
    "context"
    "database/sql"
    "fmt"
    "log"
    "time"
    "strconv"
    sf "github.com/snowflakedb/gosnowflake"
)


func main() {
  connectionString := "username:password@myorg-myaccount/test_db/public?warehouse=COMPUTE_WH&role=ACCOUNTADMIN&tracing=TRACE&CLIENT_SESSION_KEEP_ALIVE=true&CLIENT_SESSION_KEEP_ALIVE_HEARTBEAT_FREQUENCY=3600" // default heartbeat is 3600 (1h) anyways
  noCachedResults := "ALTER SESSION SET USE_CACHED_RESULT = FALSE;"
  selectQuery := "SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.PARTSUPP;" // 800M rows, 35.1GB data
  sleepSeconds := 290

  db, err := sql.Open("snowflake", connectionString)
  if err != nil {
    log.Fatal(err)
  }
  defer db.Close()

  // enable async mode
  ctx := sf.WithAsyncMode(context.Background())
  conn, err := db.Conn(ctx)

  if err != nil {
    log.Fatalf("Failed to acquire connection. err: %v", err)
  }
  defer conn.Close()

  _, err = conn.ExecContext(ctx, noCachedResults)
  if err != nil {
      log.Fatalf("failed to execute ALTER SESSION. err: %v", err)
  }

  // sleep for a little less than 5 minutes (300s)
  // Session idle policy is set to 5 minutes in this test
  fmt.Println("Sleeping " + strconv.Itoa(sleepSeconds) + " seconds.")
  time.Sleep(time.Duration(sleepSeconds) * time.Second)

  // now run the big query in an async manner
  // this query surely should take more than 5 minutes on a x-small warehouse
  rows, err := conn.QueryContext(ctx, selectQuery)
  if err != nil {
    // 'handle' error 
    fmt.Println(err)
  }
  defer rows.Close()

  var counter int
  var v1, v2, v3, v4, v5 sql.NullString

  for rows.Next() {
    err = rows.Scan(&v1, &v2, &v3, &v4, &v5)
    if err != nil {
      log.Fatalf("failed to get result. err: %v", err)
    }
    // not really interested in query result, but still want it executed fully
    counter++
  }
  if rows.Err() != nil {
    fmt.Printf("ERROR: %v\n", rows.Err())
    return
  }
  fmt.Println(counter)
}

The query takes around 8.5 minutes on a x-small warehouse to execute. Here's what happens:

  • ALTER SESSION executes immediately then we wait 290 seconds
  • 10 seconds left from the token lifetime. We issue the big query (= token lifetime is now again 300s)
  • since the query is executed asynchronously, the driver automatically polls every 45 seconds to ask Snowflake if the query is done yet and results available. It is not done yet.
  • Every 45 seconds the polls happen
  • After 8+ minutes, the last poll encounters a SUCCESS response from Snowflake, and starts to download the query results.

Never encountered Your session has expired. Please login again error so far, hinting the authentication token expired, which makes sense since the query itself and the polls should renew it.

So at this point I'm a little baffled how to get to the issue and therefore to troubleshoot further, it would be really appreciated if you could please send a minimal viable reproduction program, which when run, would exhibit the issue you're having.

As a mitigation, you can perhaps consider adding CLIENT_SESSION_KEEP_ALIVE_HEARTBEAT_FREQUENCY=900 to the session parameters and see if it helps you, regardless of whether i can or cannot reproduce the issue ? (supposed to send a heartbeat every 15 minutes, instead of the default 1 hours)

Thank you in advance for the reproducible example !

@sfc-gh-dszmolka sfc-gh-dszmolka added the status-information_needed Additional information is required from the reporter label Apr 29, 2024
@williamhbaker
Copy link
Author

Here's a reproduction of the issue, which has up to this point been illusive to me. It has to run for 1 hour in order for the error to occur, seemingly regardless of what the session idle timeout is set to in the session policy. The 2 hour SYSTEM$WAIT just needs to be long enough to be longer than 1 hour apparently, since the program will exit with the "session expired" error after 1 hour has elapsed.

func main() {
	connectionString := "redacted"

	db, err := sql.Open("snowflake", connectionString)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()

	if err := db.PingContext(ctx); err != nil {
		log.Fatal(err)
	}

	res, err := db.ExecContext(sf.WithAsyncMode(ctx), "CALL SYSTEM$WAIT(2, 'HOURS');")
	if err != nil {
		log.Fatal(err)
	}

	if _, err := res.RowsAffected(); err != nil {
		log.Fatal(err)
	}
}

@sfc-gh-dszmolka sfc-gh-dszmolka removed the status-information_needed Additional information is required from the reporter label Apr 30, 2024
@sfc-gh-dszmolka
Copy link
Contributor

thank you for the hint and the repro - I'm taking a look.

@sfc-gh-dszmolka
Copy link
Contributor

looks like you were right; 1 hour is indeed relevant - because it looks like the heartbeat itself which is breaking things and the heartbeat happens after one hour. We'll fix this and I'll update this thread on the progress as any new information becomes available.

@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage labels Apr 30, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added the status-in_progress Issue is worked on by the driver team label Jun 12, 2024
@sfc-gh-dszmolka
Copy link
Contributor

fix PR under review at #1160

@sfc-gh-dszmolka sfc-gh-dszmolka added status-pr_pending_merge A PR is made and is under review and removed status-in_progress Issue is worked on by the driver team labels Jun 19, 2024
@sfc-gh-dszmolka
Copy link
Contributor

fix PR is merged and will be part of the next release

@sfc-gh-dszmolka sfc-gh-dszmolka added status-fixed_awaiting_release The issue has been fixed, its PR merged, and now awaiting the next release cycle of the connector. and removed status-pr_pending_merge A PR is made and is under review labels Jun 27, 2024
@sfc-gh-dszmolka
Copy link
Contributor

released with gosnowflake v1.11.0 in July 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Erroneous or unexpected behaviour status-fixed_awaiting_release The issue has been fixed, its PR merged, and now awaiting the next release cycle of the connector. status-triage_done Initial triage done, will be further handled by the driver team
Projects
None yet
Development

No branches or pull requests

3 participants