#incidence_summary temp table on DataBricks #66
It should be dropped. In temp table mode, though, the temp table is the final result that you query to get the output, but then the cleanup should be run to drop all the temp tables. Let me talk to you offline and figure out what's going on.
So, I've looked through the code; at least in the R context, the cleanup SQL is being invoked (CohortIncidence.R@175):

And the cleanup SQL is:

So between each execution of
Thanks @chrisknoll - it is unclear to me as well. Linking this to OHDSI/CohortGenerator#188 since these appeared in the same run.
I'm still unclear why this is happening, but I will close this out since it does not appear to be anything specific to this package.
Did other temp tables persist? I'm happy to trace through it to see if the drop table command isn't being executed.
Yes - there were other temp tables persisted from Cohort Generator, Characterization, and FeatureExtraction, but it is unclear why that happened. Re-running things has only made it more confusing, as the packages are properly dropping the temp tables.
One thing I can think of is an error during analysis execution where you 1) create the result tables up front, 2) run the analysis, and then 3) clean up: if it fails at step 2, step 3 never runs and the temp tables are left behind. This wouldn't be catastrophic since temp table names should be unique between connections, but if for some reason they are not, and you get the same session token prefixed to the temp tables, you could run into a situation where a table already exists.
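To make that concrete, a minimal sketch (not CohortIncidence's actual code; `runAnalysis` and the three SQL scripts are hypothetical, and `dropEmulatedTempTables()` assumes a recent DatabaseConnector version) of guaranteeing that step 3 runs even when step 2 throws:

```r
# Sketch only: make sure cleanup runs even if the analysis step throws, so
# emulated temp tables are not left behind.
library(DatabaseConnector)

runAnalysis <- function(connectionDetails, createSql, analysisSql, cleanupSql) {
  connection <- connect(connectionDetails)
  on.exit({
    # cleanupSql is a hypothetical script that drops the package's temp tables;
    # dropEmulatedTempTables() additionally removes any emulated temp tables
    # left behind in the tempEmulationSchema.
    tryCatch(executeSql(connection, cleanupSql), error = function(e) warning(e))
    tryCatch(dropEmulatedTempTables(connection), error = function(e) warning(e))
    disconnect(connection)
  })
  executeSql(connection, createSql)    # 1) create the result tables up front
  executeSql(connection, analysisSql)  # 2) run the analysis
}                                      # 3) cleanup happens in on.exit()
```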
Yes, I am thinking this is exactly the scenario that played out. The original error in this issue indicated that a temp table couldn't be created since it already exists. It's unclear to me where an error may have occurred.
After speaking with @schuemie, we should update our SQL so that we drop temp tables ahead of creating them, in an effort to prevent problems on those RDBMS platforms that use a tempEmulationSchema. Putting this here for reference to illustrate that point: https://github.com/OHDSI/CohortMethod/blob/main/inst/sql/CreateCohorts.sql#L33
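As a rough illustration of that pattern (a hedged sketch, not the CohortMethod SQL; the table name, columns, and scratch schema are made up):

```r
# Sketch only: a guarded create for a hypothetical temp table. When translated
# for a dialect without native temp tables, both statements target the same
# emulated table in the tempEmulationSchema, so a leftover table from a failed
# run no longer blocks re-execution.
library(SqlRender)

sql <- "
DROP TABLE IF EXISTS #incidence_summary;
CREATE TABLE #incidence_summary (target_cohort_definition_id BIGINT, person_count BIGINT);
"

translate(sql, targetDialect = "spark", tempEmulationSchema = "my_scratch_schema")
```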
Would like to offer some pushback on this @schuemie and @anthonysena: I understand the need to solve a problem, but what is happening here doesn't make sense, and I worry about applying a technique across all our SQL packages (and the future ones yet to be written) when it's not solving the actual problem, but obscuring the real one.

This is the real problem I think we are obscuring: when a connection is started, a unique session ID should be associated with it so that all temp tables are unique within the context of that connection. If we are running an analysis and it gets killed or errors out for some reason before reaching the point of manually dropping the temp tables, it should never be the case that if you start a new connection and re-run, you somehow get colliding temp table names. If that's happening (which I believe is happening here), it's a problem with your temp table handling, not a sign that we need to drop before creating. In fact, we want to see the error that the table exists; it's demonstrating the actual problem!

If we go this route (drop-before-create) and I'm right about the cause, this is what is going to happen: you will have 2 processes running an analysis. Each process is going to create a conflicting temp table name. The first process runs and is in the middle of generating results. The second one starts and drops and recreates the tables created by the first run. This will be a silent error, and we will have no way of reproducing it given the concurrency/race conditions of parallel tasks.

So, I'm imploring us to get to the underlying cause of why this 'table exists' error is happening, and to recognize that the error actually uncovered an issue; we shouldn't take steps to hide it.
Thanks @chrisknoll - just trying to dig into the details around why this might be happening:
SqlRender creates a unique session ID (see here) that is applied to each query translated during a given session; it is prefixed to emulated temp table names placed in the schema specified by the tempEmulationSchema argument.
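For illustration (a hedged sketch; the schema name is made up and the expected output shown is only approximate), this is roughly how a "#" temp table is rewritten for a platform that relies on a tempEmulationSchema:

```r
# Illustrative only: translating a "#" temp table for a dialect without native
# temp tables rewrites it as <tempEmulationSchema>.<sessionPrefix><table name>.
SqlRender::translate(
  sql = "CREATE TABLE #codesets (codeset_id INT);",
  targetDialect = "spark",
  tempEmulationSchema = "my_scratch_schema"
)
# The result is roughly:
#   CREATE TABLE my_scratch_schema.<prefix>codesets (codeset_id INT) ...
# where <prefix> is the session ID, e.g. "q4qwdvbt" in the reprex below.
```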
After reviewing the Java code, my assumption about the root cause of this issue is that the
Assuming that more than one process requires a new instance of the SqlTranslate class, we should not face the conflict you've described here. Again, this is just an assumption based on the description of the way the Random class is working.
Appreciate the pushback here - it is important to make sure we're not masking a larger issue. I think if there are any changes required (beyond dropping temp tables ahead of creating them), they would be in SqlRender. @schuemie please correct anything I may have gotten wrong in this write-up.
I think the primary issue here is that the session IDs shouldn't be managed by the rendering engine, but by the connection. I'm not sure of the difference between a database connection and a 'session', but if connections can be re-created within the same session, then our scheme of creating these unique session IDs for temp tables is flawed.
But this would obscure a temp table issue that is a real problem, which is the main basis for my objection.
Not sure we can assume this: when parallel execution forks R sessions, there could be process cloning (something that can happen on Unix and macOS, maybe even Windows) where it makes a copy of memory and state from the main session into the child sessions (this is so it can transfer state between the forked processes). So I'm not sure we can assume new processes will automatically yield new instances with new state; the opposite is more likely true.
No worries. Again, I'm just trying to avoid a fix that could either 1) not address the problem or 2) obscure an actual problem that we need to know about. I don't think we should pursue the drop table option; we should understand when the unique session ID is yielded/rotated/decached. I also think this might be a function of DatabaseConnector (i.e., when you invoke connect(), a unique session ID should be attached to it).

If we do want to make it a function of SqlRender, then we shouldn't depend on the automatic creation of the session (because it will never know when to reset it), but rather have the sessionId specifically allocated and applied for the series of SQL renders that are being performed in a given analysis. If it is the case that we have to go back to all our analytic queries to 'fix' them, I'd rather go through all the code and fix it by yielding a correct sessionID vs. drop-create in the SQL. IMO, the former is the correct approach, and we'd have to evaluate the same amount of code for the changes as if we had to apply a drop-create.
I'll let @schuemie comment on this point as he designed SqlRender & DatabaseConnector. Best I can tell, the reason to put this into SqlRender is that there is no guarantee of how the query is executed. For example, I render some SQL using SqlRender and then I want to paste it into my favorite query editor to run it.

Let me attempt to break this issue down into a few scenarios for discussion. For all of these scenarios, I'm assuming you are running this on a single machine using RStudio.

**The process stops unexpectedly**

Let's start with what I believe is the scenario that prompted the original issue: you run CI (or any package that creates temp tables) and the process stops unexpectedly. You then re-run and find that the temp table is still present, leading to the error noted in this issue. In this scenario, the R session was not restarted, thus leading to the problem of re-using the same temp table prefix. To illustrate this, here is a reprex that shows that SqlRender produces the same temp table prefix during a single R session:

```r
SqlRender::getTempTablePrefix()
#> [1] "q4qwdvbt"
SqlRender::getTempTablePrefix()
#> [1] "q4qwdvbt"
SqlRender::getTempTablePrefix()
#> [1] "q4qwdvbt"
```

Created on 2024-09-18 with reprex v2.1.1

Unfortunately I couldn't find a way to restart RStudio within a single reprex. So trust me that I restarted my R session and re-ran the code above to get:

```r
SqlRender::getTempTablePrefix()
#> [1] "rqsar57a"
SqlRender::getTempTablePrefix()
#> [1] "rqsar57a"
SqlRender::getTempTablePrefix()
#> [1] "rqsar57a"
```

Created on 2024-09-18 with reprex v2.1.1

So one option would be to restart your R session if something unexpectedly fails. I think that's too big a burden to put on the user, but it is an option. This at least demonstrates that if something fails but the R session is not restarted, you'll get the same temp table prefix.

**Prefix collisions in multiple processes**

Credit to @gowthamrao for this script, which he posted internally on this topic. Sharing it here since it nicely illustrates that the temp table prefix will be unique across R sessions using ParallelLogger:

```r
# All credit to @gowthamrao for this reprex!
# Setup parameters for the test
numberOfClusters <- 5
# Create a temporary directory for storing session information
path <- file.path(tempdir(), "ParallelLoggerTest", tempfile() |> basename())
dir.create(path = path, showWarnings = FALSE, recursive = TRUE)
# Generate a unique parent session ID (to differentiate from cluster session IDs)
parentSessionId <- SqlRender::getTempTablePrefix()
# Define a helper function to generate a session ID and save it to a CSV file
getSessionId <- function(path) {
sessionInformation <- dplyr::tibble(
sessionId = SqlRender::getTempTablePrefix(), # Generate a temporary session ID
sessionSysTime = Sys.time() # Record the current system time
)
# Create a random string to ensure unique file names
randomString <- tempfile() |> basename()
# Define the file path for saving session information
filePath <- file.path(path, paste0("sessionInformation_", randomString, ".csv"))
# Save the session information to the file
write.csv(sessionInformation, filePath, row.names = FALSE)
}
# Set up cluster configurations, assigning a path to each cluster node
clusterConfig <- list()
for (i in 1:numberOfClusters) {
clusterConfig[[i]] <- list(path = path)
}
# Initialize the cluster with the specified number of threads
cluster <- ParallelLogger::makeCluster(numberOfThreads = length(clusterConfig))
#> Initiating cluster with 5 threads
# Execute the session ID generation function across all cluster nodes
ParallelLogger::clusterApply(cluster = cluster, x = clusterConfig, fun = getSessionId)
#> | | | 0% | |============== | 20% | |============================ | 40% | |========================================== | 60% | |======================================================== | 80% | |======================================================================| 100%
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> NULL
#>
#> [[4]]
#> NULL
#>
#> [[5]]
#> NULL
# Stop the cluster after execution
ParallelLogger::stopCluster(cluster = cluster)
#> Stopping cluster
# Define a helper function to find and combine all session information CSV files
appendMatchingFiles <- function(path) {
# Match all files with the sessionInformation_*.csv pattern
filePattern <- file.path(path, "sessionInformation_*.csv")
matchingFiles <- Sys.glob(filePattern)
# Initialize an empty tibble to store all session data
allData <- dplyr::tibble()
# Read each matching file and append its data to the combined tibble
for (file in matchingFiles) {
fileData <- read.csv(file, stringsAsFactors = FALSE)
allData <- dplyr::bind_rows(allData, dplyr::as_tibble(fileData))
}
return(allData)
}
# Combine all session data from generated files
allSessions <- appendMatchingFiles(path = path) |>
dplyr::mutate(parentSessionId = parentSessionId) # Append the parent session ID for comparison
allSessions
#> # A tibble: 5 × 3
#> sessionId sessionSysTime parentSessionId
#> <chr> <chr> <chr>
#> 1 vme40vwd 2024-09-18 13:46:13.807492 r9zugdw1
#> 2 ymytk1gp 2024-09-18 13:46:13.803506 r9zugdw1
#> 3 hepypvac 2024-09-18 13:46:13.782942 r9zugdw1
#> 4 dnf9f64h 2024-09-18 13:46:13.751967 r9zugdw1
#> 5 rat9iomb 2024-09-18 13:46:13.767976 r9zugdw1
# Test that no session ID in the generated sessions matches the parent session ID
#testthat::expect_true(nrow(allSessions |> dplyr::filter(parentSessionId == sessionId)) == 0)
# Test that all generated session IDs are unique
#testthat::expect_true(length(unique(allSessions$sessionId)) == length(allSessions$sessionId))
# Clean up: remove the temporary directory and its contents
unlink(path, recursive = TRUE, force = TRUE)
```

Created on 2024-09-18 with reprex v2.1.1

This provides evidence that the session ID generated by SqlRender will be unique per R session, even in subprocesses. As a result, I think it would make sense to defensively code the SQL to drop those temp tables if they exist before creating them, since there is the possibility of re-using the same prefix during the same R session.
Just a reminder that temp table emulation has been working without issue for about 10 years now, on all supported operating systems and DBMSs. It does require cleaning up your temp tables. I know there's something to be said for having the prefix be unique to a connection, but that would require splitting translation into two steps, whereas now it is one (by SqlRender). Separate R threads will always have their own prefix. If not, that would mean the Java VM is somehow shared, in which case a duplicate prefix would be the least of our worries.
I think we should also consider that in those 10 years, 99.99999% of the environments have been supporting temp tables. It took us 2 weeks to uncover this issue when we went into an environment that doesn't support temp tables natively (DataBricks).
I think it would be a useful optimization of R process memory if JVMs were instantiated once and shared across R sessions, and this could possibly be a future direction of rJava, so I don't think we should assume that there will be unique JVMs per R session for all time. We may want to clean up our semantics of these session IDs (and I'm beginning to think that when we say 'session' we mean the R session, which could span multiple DB connections).

Thanks for the reprexes and detailed description. I think it slightly side-steps what my argument was by looking at how re-used session IDs would lead to problems. If I take your examples above and adjust them slightly:
By the way, if we drop-before-create as we're proposing, the parallel approach will blow away each other's intermediate temp tables. If I'm understanding you correctly, the

Ok, so that's the flaw: the temp table prefix can't be re-used across DB connections, else you get into the problem we're seeing here (this breaks the temp table semantics we're trying to emulate).

Solutions? Well, on one hand we have to write our code with more attention to temp table emulation. Personally, I thought this was handled under the covers, but after thinking it through, it's impossible for the render layer to handle this because it doesn't know the connection context that the renderings are occurring under. But the developer does, and the developer will have to be responsible for properly associating the correct session IDs with the series of queries that are being translated. I think this is an inescapable truth.

So, what can we do? One idea: provide a mechanism to reset the global session ID. Maybe in

Example: In CI, if you run in temp table mode, we actually do 3 steps:
These steps occur within the context of 1 connection, but steps 1, 2, and 4 involve their own translate, so they need to use the same session identifiers; otherwise you'll get different temp table names for the tables you set up in step 1 and cleaned up in step 4. We need a way to direct translate to get a new ID or use the existing one.

But, even better, I think how I will approach this is that I will go into the places where translate is called, fetch my own unique session ID, and use it across the

So, I propose the solution is that we expose the sessionID parameter to the translate calls at the R level, and direct developers to use this parameter correctly to handle the cases of emulated temporary tables: you need to fetch a new session ID to be used for the duration of your DB connection, and use that session ID in the calls to

Conclusion:
In the pseudo code above, when would you translate the query list and then execute in parallel? I'm not really sure where we are doing parallel SQL operations in HADES where this pattern might exist. In the scenario I put forward earlier, if we use ParallelLogger to carry out parallel SQL tasks (where you perform the translation in that subprocess), you'd have a guarantee that there are no temp table prefix conflicts between those processes (see the sketch below). I understand your desire to have control over how the temp table prefix is specified/recycled, but in my view that feels more complex than simply ensuring that, for a given R session, the temp table is dropped ahead of using it.
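A minimal sketch of that pattern, under assumed names (`runTask`, `tasks`, and the task fields are all hypothetical): the translation happens inside the worker, so each ParallelLogger subprocess derives its own temp table prefix.

```r
# Hypothetical sketch: each worker translates its own SQL, so the temp table
# prefix comes from that worker's own SqlRender/JVM instance.
runTask <- function(task) {
  sql <- SqlRender::translate(
    sql = task$renderedSql,
    targetDialect = task$dbms,
    tempEmulationSchema = task$tempEmulationSchema
  )
  connection <- DatabaseConnector::connect(task$connectionDetails)
  on.exit(DatabaseConnector::disconnect(connection))
  DatabaseConnector::executeSql(connection, sql)
}

# tasks: a list of per-task settings (assumed to exist for this sketch)
cluster <- ParallelLogger::makeCluster(numberOfThreads = 4)
ParallelLogger::clusterApply(cluster = cluster, x = tasks, fun = runTask)
ParallelLogger::stopCluster(cluster)
```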
The translate was done in buildBatchAnalysisQueries, so it was formulated within 1 R session (and therefore the translation occurred in the R session with a single session ID). The next part of the code was just showing how it would work sequentially, but in parallel you'd have the risk of colliding temp tables, where simultaneous queries would be dropping and creating each other's temp tables. In either the sequential or the parallel case, the query would be invoked over its own connection. In normal temp table semantics, each temp table would be isolated from the other connections. For our temp-table emulation, we need to take more care in doing it. Enforcing drop-before-create logic is not addressing it; it's just 'making it work until it doesn't.'

For an example in practice: the Characterization package introduced a sort of splitting up / incremental mode where one characterization design can be split up and run in parallel (I suggested that a choice be given whether you want the pieces to be run in sequence, to consume fewer resources on the server at a time, or in parallel, to maximize resource utilization). So the above example is not so much contrived, but based on a real-world example.
I understand, but as long as this suggestion keeps coming up, I'll continue to state how it will not solve the problem. The complexity at play here is handling the management of temp tables, and considering that not every DBMS supports them, it is not a simple thing. But if we're going to support platforms that do not have native temp tables, then we have to accept the complexity in the codebase. A simple drop-before-create solution isn't the way to do this. If you run the risk of one connection being able to touch a table that was created in another connection, then you are not providing the correct temp-table semantics that we are trying to emulate.

The way to do that is to provide a way to specify what the sessionID is going to be at translation time, and the developer (due to the requirement that they need to account for fake temp tables) will have to understand how to specify the sessionIDs during translation. But the R API doesn't let you pass in a session ID during translation, and I would prefer that it added this (see the hypothetical sketch below); then I'd be happy to alter code to assign the correct sessionID for the series of translations. But I'm not happy looking at a solution (drop-before-create) that does not solve the problem.

Final thought: we had some question about what you should do to clean up temp tables after an error. I was thinking that we have a current pattern where we do a connect() call and then add a handler after the function to do a
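To make the sessionID-at-translation-time proposal concrete, a purely hypothetical sketch (neither `newSessionId()` nor a `sessionId` argument to `translate()` exists in the current SqlRender R API; `renderedCreateSql`, `renderedAnalysisSql`, `renderedCleanupSql`, and `dbms` are assumed names):

```r
# Hypothetical API sketch only: neither newSessionId() nor the sessionId
# argument to translate() exists in the current SqlRender R API.
sessionId <- newSessionId()  # mint a fresh prefix for this DB connection

createSql   <- SqlRender::translate(renderedCreateSql,   targetDialect = dbms, sessionId = sessionId)
analysisSql <- SqlRender::translate(renderedAnalysisSql, targetDialect = dbms, sessionId = sessionId)
cleanupSql  <- SqlRender::translate(renderedCleanupSql,  targetDialect = dbms, sessionId = sessionId)

# All three scripts would then refer to the same emulated temp table names,
# and a new connection would mint a new sessionId, restoring per-connection
# isolation.
```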
Hey, everyone. After today's HADES call, we got to a point where there was some question about whether this is an actual problem, since forked processes will create their own session IDs and we won't have temp table conflicts in practice, so we don't really have to concern ourselves with reproducing the temp-table semantics. So, to try to illustrate my point, I'll flip the script a bit by asking a simple question:

**Why do we need to drop-and-create tables?**

In proper temp table semantics, we wouldn't need to do that. And if no one can see the problem with how we've implemented 'fake-temp-table' semantics (worked fine for 10 years), why do we have the need to drop-and-create now? Shouldn't I be able to execute

If we wrote these sorts of queries with fake-temp-table semantics in mind from the jump, it would look something like this:
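The snippet that followed here wasn't captured; as a stand-in, a hedged sketch of what such a query might look like (the scratch schema, table name, and columns are all assumptions): the "temp" table is an ordinary table whose unique name the caller supplies explicitly, rather than a "#" temp table.

```r
# Sketch only: build the "temp" table as a regular table in a scratch schema,
# with the unique prefix supplied explicitly by the caller.
tableName <- paste0(SqlRender::getTempTablePrefix(), "incidence_summary")

createSql <- SqlRender::render(
  "CREATE TABLE @scratch_schema.@table_name (target_cohort_definition_id BIGINT, person_count BIGINT);",
  scratch_schema = "my_scratch_schema",  # made-up schema name
  table_name = tableName
)
```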
The astute reader will recognize that this would leave fake temp tables littered around our temp schema, so we may also need to adopt the habit of cleaning up our temp space:
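Again, the original snippet isn't shown; a hedged continuation of the sketch above (`connection`, `analysisSql`, and the scratch schema/table names are assumed) that drops the explicitly named table even when the analysis step fails, provided the connection is still usable:

```r
# Sketch only: always attempt to drop the explicitly named scratch table,
# even if the analysis step throws an error.
tryCatch(
  DatabaseConnector::executeSql(connection, analysisSql),
  finally = DatabaseConnector::executeSql(
    connection,
    SqlRender::render(
      "DROP TABLE IF EXISTS @scratch_schema.@table_name;",
      scratch_schema = "my_scratch_schema",
      table_name = tableName
    )
  )
)
```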
The very astute reader may realize that the error may leave the connection in a closed state, so there's no way to go back and clean up the temp tables. This is just an unfortunate reality of trying to emulate temp tables: there are going to be some corner cases where we will have to rely on manual intervention (like a scheduled DB maintenance job to clean up orphaned tables).
Please let us leave all the theoretical and metaphysical concerns aside. The only real problem we have is this: if a user runs an analysis,

The current proposed solutions are:
@chrisknoll: Did you have any solutions you would like to add to this?

Solution 3 would introduce several problems, including that the translated SQL would then only work with DatabaseConnector. It also moves DatabaseConnector in the opposite direction of where we are going with DatabaseConnector 7, where we are trying to make it lighter and do less.

Solution 4 would require major rework of our code, requiring weeks of developer time, and would introduce many opportunities for errors. It would also add significantly to our code complexity, making maintenance harder.

My vote is for solution 2, which seems like the least amount of work, would keep our code clean, and readily solves the problem we have.
No, only that Solution 4 was my contribution to this discussion, and that it would only impact those queries that involve temp tables (which, admittedly, is a majority of our analytical queries). Solution 2 is the first I've heard of it, and it means we can target this behavior only at dialects that don't support native temp tables. We also don't have to change all of our queries around it, so that is one I'd compromise on. However, we're still not fixing the temp table semantics, because session IDs are allowed to be shared across connections, and that will continue to be my objection. When this theoretical/metaphysical situation becomes an actual problem, I reserve the right to an 'I told you so'.
Solution 2 from the earlier post works for me too. Thanks!
Running CohortIncidence via Strategus multiple times on DataBricks produced the following error:
Guessing that #incidence_summary is not dropped for some reason? Given the CohortIncidence build options here, I presumed it was not using temp tables, so I'm a bit confused as to why this would happen.