Allow blocks with partially resolved parentage information #94

todor-ivanov · 2023-03-08T12:55:06Z

Impact of the new feature
There could be many groups affected by such a change.

Is your feature request related to a problem?
This is a request for a feature change which is related to a recently found custom use case involving miscommunication between WMCore and DBS. Here is the related WMCore issue and description of the situation: dmwm/WMCore#11260 (comment)

Long story short, we need to allow uploading blocks with partially resolved parentage information. This means few of the files present in a block may have no relation to files from the parent dataset (missing or partially resolved parentage information).

The main question here is eventually to Core SW and Physics groups as well - How mandatory is this constraint and How badly would such a change affect other steps?

Describe the solution you'd like

Solution #1: We simply remove the check for fully resolved parentage information for all files in a given block, here: dbs/fileparents.go#L344-L358
Solution #2: We allow files from the child block to refer to a NO parent file by setting the parent file id to a predefined default value (in accordance with the DBS data structures), which should be an indication that these files are missing the upper level files which they have been produced from.

Describe alternatives you've considered

The alternative approach wold be for WMCore to start developing a mechanism for automatically invalidating all those files from DBS and Rucio, which will go silently in the background. So far we have no numbers to cite about how frequent this is and that's why we cannot tell if and to what extent of data reduction it could lead. But in any ways, we are not in favor of this alternative path.

Additional context
None

The text was updated successfully, but these errors were encountered:

vkuznet · 2023-03-08T13:11:00Z

@todor-ivanov , thanks for creating the issue. But I think neither solution is acceptable. For the solution 1 I do not know how it will impact other use-cases. For the solution 2 I do not know how to correctly do that in ORACLE DB, what is a default and why we should have it in file parent table. I think you are looking for solution 3:

client provides list of chiild, parent ids, e.g. [(1,null), (2,3)]
the first pair indicate that there is no parent for child id 1
we adjust DBS code to skip this pair during injection to DBS file parents table
since there is no parent record in file parent table for child id 1 then anyone who will look-up this lfn will no there is no parent for it.

How does it sound to you? @amaltaro please provide your input too.

todor-ivanov · 2023-03-08T13:41:40Z

Thanks @vkuznet,
Actually from your comment I can see we are in coherence on the way how we should modify the logic. I simply failed to clarify that the change should happen through a proper change of the data structure in the client call. But yes, we are indeed talking about the same thing. Thanks for clarifying this. And I agree this would be the minimal change on the WMCore end.

@amaltaro, what is your opinion on this?

vkuznet · 2023-03-08T14:29:40Z

@todor-ivanov , please provide additional information which DBS API is affected by this request and provide an example of HTTP POST input payload. For DBS API please inspect the corresponding python client to identify which DBS API it is using, here is DBS API documentation which lists all DBS APIs.

amaltaro · 2023-03-08T14:58:01Z

Valentin, Todor, this discussion is in agreement with what we discussed yesterday. Just to summarize:

we need to make minor changes to WMCore such that the child/parent tuple is provided to the DBS client API, but with a None/null refeference to the parent id. Make sure there is a log record for these cases.
we need to then pre-process this data on the DBS Server side, removing tuples pointing to a None/null parent file id. Such that no action is taken for that tuple.

For the record, the only - expected - users for this API is WMCore, especifically that ReqMgr2 cherrypy thread.

vkuznet · 2023-03-09T18:39:22Z

I can work on this if @klannon approve my time allocation for this task, or we can delegate this task to @d-ylee . Meanwhile here I provide list of action items which should be done within DBS codebase:

the affected DBS API is fileparents and it comes from fileparents.go module
we need to loose constrain on FileParents data structure, see https://github.com/dmwm/dbs2go/blob/master/dbs/fileparents.go#L63 and remove validate parameters for PARENT_FILE_ID such that it can accept null
the validate function should be adjusted to skip validation of PARENT_FILE_ID if it is not provided, but if it is provided it should still perform the validation.
the insert function should check PARENT_FILE_ID and if it is not provide just exit immediately
and, finally we need to provide new unit test in int_fileparents.go test file to test insertion of pair where file parent id is null.

amaltaro · 2023-03-24T01:37:12Z

@d-ylee Hi Dennis, did you manage to check the feasibility of the developments listed above? Please let us know if further clarification is required here.

d-ylee · 2023-03-24T18:50:33Z

@amaltaro I think the discussion above is clear to me. I can work on this after I get the updating physics group name completed. Thanks!

d-ylee · 2023-03-27T14:49:15Z

@vkuznet After looking through the code, I noticed that for the database schema:

dbs2go/static/schema/DDL/create-oracle-schema.sql

Lines 866 to 870 in 928dc25

    
           CREATE TABLE FILE_PARENTS ( 
        
               THIS_FILE_ID INTEGER CONSTRAINT NN_FP_THIS_FILE_ID NOT NULL, 
        
               PARENT_FILE_ID INTEGER CONSTRAINT NN_FP_PARENT_FILE_ID NOT NULL, 
        
               CONSTRAINT PK_FP PRIMARY KEY (THIS_FILE_ID, PARENT_FILE_ID) 
        
           )

FILE_PARENTS has FILE_PARENT_ID as not null, and also has FILE_PARENT_ID as part of the PRIMARY KEY constraint. Would it be an issue if the CONSTRAINT contains a NULL value?

vkuznet · 2023-03-27T15:11:19Z

Dennis, it is good question and in fact should be answered by @amaltaro , @todor-ivanov . For instance:

if we loose constrain in FILE_PARENTS table then we can have entry in it for file ids which does not have parents
if we will keep constrain then we'll only keep files with parent information.

So, the two approaches will reflect how we'll query the data and how we will interpret the results of such query. For example, if we stick to approach 1, then our fileparents API will return a file id and not parent id because file id will be recoreded in FILE_PARENTS with no parent id. But if we choose second approach then fileparents will not even list such file id, but it will means that it has no parent.

I suggest that @amaltaro , and @todor-ivanov should evaluate both pros and cons and side effects of this decision.

amaltaro · 2023-03-27T15:39:05Z

From the DBS perspective, would there be any impact and/or performance degradation if we allow NN_FP_PARENT_FILE_ID to be Null?

When using the fileparents API, this is an example of call:
https://cmsweb.cern.ch/dbs/prod/global/DBSReader/fileparents?logical_file_name=/store/mc/Phys14DR/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/AODSIM/TEST_PU20bx25_PHYS14_25_V1_TEST_Alan-v9/00000/3E27F2B3-D975-E411-8793-02163E010EE6.root

which returns the following data

[
{"logical_file_name":"/store/mc/Phys14DR/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/AODSIM/TEST_PU20bx25_PHYS14_25_V1_TEST_Alan-v9/00000/3E27F2B3-D975-E411-8793-02163E010EE6.root","parent_file_id":57020800,"parent_logical_file_name":"/store/mc/Fall13/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/GEN-SIM/POSTLS162_V1-v1/30000/AEF4E42B-1462-E311-8D2E-0025901D48B2.root"}
,{"logical_file_name":"/store/mc/Phys14DR/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/AODSIM/TEST_PU20bx25_PHYS14_25_V1_TEST_Alan-v9/00000/3E27F2B3-D975-E411-8793-02163E010EE6.root","parent_file_id":57020803,"parent_logical_file_name":"/store/mc/Fall13/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/GEN-SIM/POSTLS162_V1-v1/30000/A4668E29-1462-E311-AEB5-1CC1DE18CFEA.root"}
]

do I understand it right that, in case 1) above, we would get an output data as (just an example):

[
{"logical_file_name":"/store/mc/Phys14DR/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/AODSIM/TEST_PU20bx25_PHYS14_25_V1_TEST_Alan-v9/00000/3E27F2B3-D975-E411-8793-02163E010EE6.root","parent_file_id":none,"parent_logical_file_name":""}
]

while in case 2) the output response would be an empty list. Can you someone please confirm this behaviour?

vkuznet · 2023-03-27T16:19:49Z

Having or not parent file id in FILE_PARENTS table will not impact DBS performance as DBS APIs use streaming. But allowing null parent will increase table size. By how much it depends on how may null parents we will have for our files. Said that, your understanding of output of fileparents API is correct, for case 1 we'll have [{'logical_file_name': 'bla', "parent_logcial_file_name': null}], while in case 2 you'll get empty list since no record for file bla will be present in FILE_PARENTS table.

todor-ivanov · 2023-03-27T19:51:03Z

Hi @vkuznet @d-ylee @amaltaro,

To me option 2) is a NoGo. It is quite confusing and introduces more complexity and confusion rathar than simplicity. This is exactly the type of solutions relying on hidden logic and unwritten policies and agreement which are to be well forrgotten in really near future and the knowledge left to be forwarded through the generations only by the means of spoken wisdom. That will make the code from both sides quite unreadable only in a year from now. I am much more for the solution 1), where the parrent is explicitly pointed as NULL, meaning that DBS simply has no information for the origin of the child file, but does have all the rest of the information related to this file. Plus, IIUC, in many many cases it is much more important for physics to find the file itself.

d-ylee · 2023-04-04T14:42:43Z

@vkuznet @amaltaro @todor-ivanov

If we go with 1), would we need to update the database schema for the FILE_PARENTS table to

Allow NULL value for PARENT_FILE_ID
Redefine the PRIMARY KEY to not use PARENT_FILE_ID

For the PRIMARY KEY, I am still not clear how we should define this

From a past message:

1. if we loose constrain in FILE_PARENTS table then we can have entry in it for file ids which does not have parents

Would we need to have a entry in the FILES table that is for no parent files, but when getting from /fileparents, return none in the parent_file_id field if this entry ID is referenced?

Also, would there be cases in which allowing null parentage for FILES also affect DATASET_PARENTS and BLOCK_PARENTS?

vkuznet · 2023-04-04T15:03:35Z

Dennis, everything can be much simpler than you describe. How about using -1 for PARENT_FILE_ID when it is not provided? This way, we do not need to adjust schema, the -1 is acceptable integer and we may use it in other tables and APIs to indicate of no parent presence.

todor-ivanov · 2023-04-06T07:48:20Z

hi @vkuznet @d-ylee

The negative value for missing parent file id is a valid workaround, which would prevent the need of table redefinition. But we need to be sure what is about to be returned to the client as well. Because, once recorded as -1 I believe it will be also returned as -1 in subsequent client queries, rather than null, as originally recorded by the client. This is slightly confusing but may work, provided we check if any further change on how we digest the data from those APIs in WMCore would be needed.

Denis, can you list the APIs in DBS which would develop such a silent data type metamorphosis if we opt for this workaround, so that we can check where and when WMCore uses them?

FYI @amaltaro

d-ylee · 2023-04-17T17:02:24Z

@todor-ivanov

Not sure if this is what you are asking for, but the APIs that are related to parentage are these:

/fileparents
/blockparents
/datasetparents

todor-ivanov · 2023-05-08T12:54:35Z

Hi @vkuznet @d-ylee

Coming back to this, it does not seem that all those APIs you list in your previous message would be affected by
the workaround of returning negative value for missing parent file id. And so let me clarify, what it means affected in this case:

The client records null for the fileId field for a given run/lumi pair with missing parentage information
When later called to provide the file parentage information, the server returns -1 for the fileId field for run/lumi pairs with missing parentage information instead of the originally recorded value of null.

So to me, it seems only the APIs which return information on the run/lumi level granularity would be affected. So I'd bet only on /fileparents, but this was exactly the confirmation that we wanted from you actually.

So in the most straightforward use case, this might affect us is in the client's methods calls to list{ParentDataset,Block}Trio:
https://github.com/dmwm/WMCore/blob/31d65166dc62c7251edd52890bc4ac2f0b79d7fe/src/python/WMCore/Services/DBS/DBS3Reader.py#L966

https://github.com/dmwm/WMCore/blob/31d65166dc62c7251edd52890bc4ac2f0b79d7fe/src/python/WMCore/Services/DBS/DBS3Reader.py#L990

So this is the place I think we will have to rethink this mechanism again, but I see no other way to test it other than just making it in testbed .... and check what the server returns and how we aggregate the parentage information.

To me this again would be yet another workaround, which we will have to remember forever.

amaltaro · 2023-05-24T15:47:10Z

Just an update of this issue as an outcome of the Zoom discussion we just had (Valentin, Dennis and myself), this is the list of action items that we came up with:

1. Dennis: Check if null json gets converted to 0 in Go language.
2. Todor/Alan - IF NEEDED: Update WMCore to provide parent id = 0 instead of None
3. Dennis: Find what are the constraints/validation performed on the DBS Server side.
4. Dennis - IF NEEDED: update DBS Server to deal with partial block information
5. Dennis: create a CERN oracle account
6. Dennis: Get a dump of testbed or production
 * re-create the ownership of the table (or preserve the ownership)
7. Dennis: deploy DBSServer in a test k8s cluster and share the instance url
8. Todor: deploy ReqMgr2 in a test cluster pointing to Dennis DBS url
9. Todor: Inject one of these blocks in this test instance
  * note that ReqMgr2 should not mark a workflow as ParentageResolved = True
10. Todor: Validate the read APIs

amaltaro · 2023-05-25T14:39:49Z

@d-ylee Hi Dennis, I wonder if you managed to look into the null marshaling process? Such that we can decide whether we will have to make further changes in WMCore or not.

d-ylee · 2023-05-25T14:42:12Z

After some basic testing, null json values get converted to the nil value for a given type in Go.

package main

import (
        "fmt"
        "encoding/json"
)

type TestStruct struct {
        Test int64 `json:"test"`
}

func main() {
        d := "{'test': null}"
        var t TestStruct
        json.Unmarshal([]byte(d), &t)
        fmt.Printf("%v", t)
}

In this example, the value of t.Test is 0. If the type of TestStruct.Test is string, json.Unmarshal handles it by using empty string.

This behavior is reflected in the documentation: https://pkg.go.dev/encoding/json#Unmarshal

The JSON null value unmarshals into an interface, map, pointer, or slice by setting that Go value to nil. Because null is often used in JSON to mean “not present,” unmarshaling a JSON null into any other Go type has no effect on the value and produces no error.

Null values in JSON are given the nil or 0 value of a given type. For int64, the nil value is 0.

I can do additional testing by creating a simple HTTP server and sending a Python request with a None as the key's value to see if it will have the same result.

amaltaro · 2023-05-25T14:47:36Z

@d-ylee thank you for these details! I think it is good enough and I understand that WMCore can keep providing a tuple of (int_child_file_id, None) whenever the parent file is gone (and/or invalidated). So no further changes should be expected in WMCore.

Please let us know how your communication with the CERN DBA goes and where you need support on anything on this front.

vkuznet · 2023-05-25T14:56:29Z

@d-ylee thanks for confirmation. PLEASE keep in mind that None is pure Python data-type and python client MUST provide proper JSON, e.g.

>>> import json
>>> json.dumps({"bla":None})
'{"bla": null}'

So, on a wire, i.e. when HTTP request is send, the python code MUST provide null and not None, those are two different things. The null is null value in JSON standard, while None is Python none data-type.

@amaltaro , @todor-ivanov you must ensure that whenever Python code provides (int_child_file_id, None) it should be properly converted to JSON data format via json.dumps call, otherwise we'll get a problem on server side.

amaltaro · 2023-05-25T14:59:04Z

@vkuznet yes, the python object is properly encoded as a json object. So no problems on this front are expected.

d-ylee · 2023-05-25T17:07:13Z

I did a request from Python in the same way as what @vkuznet detailed using json.dumps on a Python dict with key that has the value of None. The result is the same: Go decodes/unmarshal this as 0 for int64.

amaltaro · 2023-06-09T15:53:56Z

@d-ylee Hi Dennis, it's been 2 weeks since we had our last chat, and I believe we only managed to cover points 1 and 2. from this comment so far:
#94 (comment)

do you have any updates on the other points? Do you know have a DBS dump such that we can try things in a dev environment?

In addition to that, can you please share what the current constraints are for updating block/file parentage data? It might be that Valentin already provided it in one of the tickets we discussed this, but I could not find it.

It's important to mention that this issue is currently keeping data unnecessarily unlocked in production, so we better fix it sooner than later.

todor-ivanov · 2023-06-19T11:28:30Z

hi @d-ylee would you want us to organize another meeting if you need some more information from our side so we can proceed on this? If you think there is some step that we first need to do and is currently blocking the progress, please let us know?

d-ylee · 2023-08-11T15:50:01Z

I spoke to @yuyiguo about this issue, and she explained to me some of the policies from the past and about this partial parentage issue.

Before DBS3, partial blocks were allowed, and these were able to be inserted into DBS when there were missing blocks. This was because we may have not had all files or we did not have the parent-child relationships yet.

This was an issue because we might forget about these partial blocks and end up not knowing who is the parent.

Yuyi mentioned that the policy in DBS3 was to only allow full blocks. Workflow management was to have its own database to keep unfinished blocks there, then insert into the production database.

If we want to change this policy, we need to have it written down and record who is responsible to complete the block.

Another concern is when a child has multiple parents. The current schema in the BLOCK_PARENTS database is the PARENT_BLOCK_ID and THIS_BLOCK_ID (the child). These two columns have a unique constraint. This means that there can't be more than one -1 allowed for a single child. If there are multiple parents with the same children and we use the -1 logic, the UNIQUE constraint would be broken.

Another question I have is, do you have the location of the written DBS3 design policies?

What are your thoughts?

vkuznet · 2023-08-14T13:49:40Z

@amaltaro , good design practices suggests to separate APIs for different use-cases rather to introduce complicated if/else logic. The later was used in Python server code which (if you had a chance to look at it) is very hard to read and understand. Therefore, I expressed my opinion based on this argument. Plus, you never know that one client will only the case. I can easily foresee when we may have other clients, plain script (instead of python code), or CRAB, or any new development we may have in a future. And, even using single client as it is right now you really address different use-cases. For maintenance reason it is better to address them via separate API rather complicate server code (even if we'll need make additional code-refactoring and wait a little bit longer).

amaltaro · 2023-08-14T15:44:00Z

@amaltaro , good design practices suggests to separate APIs for different use-cases rather to introduce complicated if/else logic.

to me this is exactly the opposite of what you are suggesting. Creating another API just to deal with a slightly different use case IS an over-complication, it is confusing and it causes mistakes on the client side.

The extra query string would be the best approach in my opinion, but that will require changes in:

DBS server side
DBS client side (new release has to be made as well and properly integrated into the WMCore stack)
WMCore

The later was used in Python server code which (if you had a chance to look at it) is very hard to read and understand.

The approach I am suggesting adds NO overcomplication and does not make the code more complicated, does not increase maintenance and is transparent to the end user. It also does not require the release of a new dbs3-client!
It literally is a change in the source code from something like:

files_passed_by_client == files_in_block

to something like

files_passed_by_client <= files_in_block

Anyways, I think we will spend the rest of our days disagreeing on this!
I'd rather let @d-ylee Dennis consider what is the best approach here, noting that at the moment we have 100s of workflows stuck in the system waiting for this development to be deployed in production.

vkuznet · 2023-08-14T19:15:28Z

Alan, I would like to take last change to convince you and the team to not add changes to the existing API. My last argument is the following:

by design (both Python and Go implementation) of DBS code provides comprehensive checks and input validation from the client
by using the same API and rely on different input we break the above requirements and open the door to handling input without cross check
if we'll implement the changes the way you propose DBS server will not have such ability to check if given input is right or wrong (i.e. it does or does not have parents) and it will just follow if/else logic without knowing and validating that client provide correct input

In my suggestion this is not possible by design since client will be forced to use either new API or provide external parameter to the API in question. Doing this way client a-priory will do a right job and DBS server will not rely on if/else logic of the input data. You may provide as many arguments as you want that such server implementation is more easy to do but you open a possibility that one day client will (by whatever mistake) provide incorrect input and DBS server will follow if/else logic to insert the data to the database. For instance, someone by mistake will not provide parents, or we may have a bug elsewhere in WMCore which may skip some parent, etc. In either case DBS server will follow the input if we will use the same API signature. To avoid this possibility we must enforce client to be explicit with the call (in other words client will know what it is doing by calling proper API or API with proper parameter rather providing an input only to DBS server).

I hope that you can take this arguments more seriously and properly evaluate the implication of implantation on DBS, but I'll not argue any more about how to implement this logic and leave it up to your and the rest of WMCore team to decide (here I have dual role, I belong to WMCore team and in my view the implementation should have separate API or API+external parameter and I implemented DBS logic). I think Dennis should implement whatever WMCore will agree.

Said that, I tag @todor-ivanov to evaluate my concern and you both should come to conclusion how the implementation should be done. Once you'll reach such agreement I think for Dennis it will be more clear how to implement it. At the end, it is just implementation but my concern that the choice may impact data consistency in DBS database in a long run.

todor-ivanov · 2023-08-15T13:14:23Z

hi @amaltaro @vkuznet @d-ylee,
It is difficult to say here what is the better solution. The arguments and concerns for both cases are far in the future e.g. "are we going to stay the only user of this API", "Will we affect other's code behavior" etc... But as a general rule of thumb I'd not rely on the current situation and conclude for the far future.

One really important thing to take into account when making such a decision is the knowledge if our current policy, for not inserting files into DBS which lack parentage information, is acknowledged and well understood by every other group that may be affected. And here I mean both sides:

Eventual users of WMCore modules or DBS apis directly (including CRAB), and also
Those who need to consume the output from DBS, such as physics

We should not talk only about WMCore and people responsible for DBS server. And this, I believe, is information we do not have at the moment. Obviously there is no policy document for DBS Server behavior..., ence our blindness on the applicability of this policy. There could easily be some proliferation of the reverse opinion among other groups who rely on the data returned by DBS - meaning they may rely on the parent-child relationship in DBS to be satisfied in any case. So we are actually in the state of lack of consensus on the policy itself.

Now about the possible solutions: The fix only on the server side is something that is less work and is an encapsulated change, but because of the above unknowns, I'd bet for the harder one - the one that forces the client to explicitly and consciously consciously stress which policy he would chose for uploading the files. It is not only backwards compatibility that we are talking about here, but also about bearing the responsibility for doing the one or the other. At the end we will not change the API, regardless of which solution we chose here - !!! we only change Server behavior !!!

Creating a new API is out of question - simply not worth it.
Setting an option to the already existing API for choosing the correct policy which would be defaulting to the current one, in my opinion is the way to go.

Yes this is going to affect/require changes from everybody, as @amaltaro said:

DBS server side
DBS client side (new release has to be made as well and properly integrated into the WMCore stack)
WMCore

In the case of WMCore this should be fairly small change - adding the new option here and here The tests and validation procedure, on the other hand, would stay as usually long and need to be a full one.... well it has its price.
For the rest, I cannot tell .

Something more. Since this situation (or what concerns WMCore) is an expected one, and our use cases are well identified in advance - as early as building the lists for the parentage resolution itself, I'd not further complicate the code by implementing extra calls to DBS for checking dbs errors and retries. This would only affect code readability and scalability both together without any benefit. So I'd go directly for just changing the policy on the WMCore side by using the newly provided option to the same API.

yuyiguo · 2023-08-22T15:08:29Z

I just back from my vacation and saw this discussion. Right before @d-ylee and I left for vacation, @amaltaro, Dennis, and I had a Zoom meeting on this. We discussed why the "-1 " would not work with the DBS DB schema. Alan pointed out that the missing parent files would be lost forever, in other words, the file parentage would not be updated in the future. So we concluded that it is better to just not put anything in DBS for the ones that lost their parents' files.
I think @vkuznet made a valued argument to protect data integrity. To consider both DBS and WMCore, I would propose below API and server upgrade:

Add missingfiles=0 to the existing API. This would be the default. So no effect on others except for WMcore.
When WMCore calls the API, it tells DBS how many missing files it has with missingfiles=n.
DBS server continues to verify the child-parent relationship. If the missing files does not match n, it will reject the input.
This way both client and server know exactly what should happen. This is my two cents.

d-ylee · 2023-08-23T18:25:02Z

@amaltaro @todor-ivanov If this is what we would like to do, I think we would need to incorporate the number of missing files to the check here:

dbs2go/dbs/fileparents.go

Line 352 in 928dc25

if !utils.Equal(utils.OrderedSet(fids), utils.OrderedSet(bfids)) {

I do see that the utils.Equal function also makes sure the file id and the block file ids match. This should probably also be validated during the insert request and may need to be another parameter.

Another question I have is that would DBS need to keep track of these missing files? Or would this be on WMCore's side? Both this question and the above validation seem to affect each other.

Thoughts?

amaltaro · 2023-08-23T21:19:00Z

@d-ylee Hi Dennis, Yuyi's idea looks good to me and having the client to provide how many files are missing in the json data is a good approach (defaulting it to 0, if not provided).

On what concerns keeping track of the missing files. I would say that the DBS server should at least log that a block with incomplete parentage action was performed (sort of an access log). On the WM side, we do log that a given file failed to find its parentage file.

Please let us know if there are any remaining open questions.

amaltaro · 2023-09-14T21:53:47Z

@d-ylee Dennis, in addition to what we have discussed today, I just wanted to update this thread with a way to migrate dataset from prod/global do your dev cluster (such that we can test things out):

from dbs.apis.dbsClient import DbsApi
dbsApi = DbsApi(url = 'https://cmsweb-test3.cern.ch/dbs/dev/global/DBSMigrate/')
dbsUrl = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader'
dataset = "my dataset name"
migrateArgs = {'migration_url': dbsUrl, 'migration_input': dataset}
dbsApi.submitMigration(migrateArgs)

I looked into the dbs3-client code and came up with the following curl standalone call:

curl -k --cert $X509_USER_CERT --key $X509_USER_KEY -vv -X POST "https://cmsweb-test3.cern.ch:8443/dbs/dev/global/DBSMigrate" -d '{"migration_url": "https://cmsweb.cern.ch/dbs/prod/global/DBSReader", "migration_input": "/DM_MonoZPrime_A_Mx50_Mv5000_gDM1gSM0p25_Zprime2p0_TuneCP5_13TeV_madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM"}' -H 'Content-Type: application/json' -H 'Accept: application/json' -H 'UserID: [email protected]' -H 'User-Agent: Alan-cURL'

but all my attempts failed with:

< Response-Status: 405 Method Not Allowed

and I cannot figure why. Perhaps @vkuznet has a working example with curl other than this:
https://github.com/dmwm/dbs2go/blob/master/docs/MigrationServer.md#examples
which relies in a different service/API, localhost and curl command which is not available in the dbs2go images(?)

vkuznet · 2023-09-15T13:09:41Z

Alan, could you please read closely documentation. It says:

# migration document
cat > m.json << EOF
{
    "migration_url": "https://.../dbs/prod/global/DBSReader",
    "migration_input": "/a/b/c#123"
}
EOF

# submit migration request
curl -H "Content-Type: application/json" -d@$PWD/m.json \
    http://localhost:9898/dbs2go-migrate/submit

while in your request your URL does not specify /submit API.

vkuznet · 2023-09-15T13:34:17Z

Moreover, each DBS server has its own set of APIs which you can easily find via /apis end-point, e.g. here is list of APIs for DBSMigrate which I took from testbed:

scurl https://cmsweb-testbed.cern.ch/dbs/int/global/DBSMigrate/apis
[{"api":"/dbs/int/global/DBSMigrate/blocks","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/submit","methods":["POST"]},{"api":"/dbs/int/global/DBSMigrate/remove","methods":["POST"]},{"api":"/dbs/int/global/DBSMigrate/blockparents","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/serverinfo","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/total","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/bulkblocks","methods":["POST"]},{"api":"/dbs/int/global/DBSMigrate/dummy","methods":["GET","POST"]},{"api":"/dbs/int/global/DBSMigrate/errors","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/help","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/status","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/datasetparents","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/healthz","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/metrics","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/apis","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/process","methods":["POST"]},{"api":"/dbs/int/global/DBSMigrate/cancel","methods":["POST"]}]

Similar /apis end-point exist for reader, writer servers and you can easily see which method each API is using.

amaltaro · 2023-09-15T14:13:30Z

@vkuznet Valentin, you probably missed a few important points:

we are trying to test dbs2go-global-m, which is the service that receives dataset migration requests
the server expected API endpoint is dbs/dev/global/DBSMigrate/submit, instead of dbs2go-migrate/submit. Unless you need to run a different service which provides a different set of APIs. If so, then you start deviating from what you actually run in production (given that you are testing a different setup)
none of the dbs2go images provide curl, so we need to deploy another service in the kubernetes cluster just so we can use curl to reach a the service that we need with dbs2go-global-m.dbs.svc. Which is inconvenient.

The apis endpoint will be useful though. Thanks

vkuznet · 2023-09-15T14:29:01Z

Alan, I understand what you are doing. The provided dbs2go documentation does not list our production/test/dev URLS and refers to actual service. Of course I would expect that you know your server endpoint, e.g. /dbs/dev/global/DBSMigrate and use it instead of listed in documentation dbs2go-migrate. The curl should not be inside of dbs2go images. Instead, as I pointed many times just deploy new curl image in your k8s and use it to access any service. Here its manifest file: https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb/services/curl.yaml And, you can deploy it as easy as

kubectl apply -f curl.yaml

Then curl pod is created in default namespace and you can login to it and use curl to access your services.

amaltaro · 2023-09-15T21:17:55Z

Okay, it looks like Dennis has some documentation work to be done at some point, such that a developer can go as closer to the production environment as possible, to have more meaningful tests.

I didn't remember we had a manifest only for curl, that will definitely be handy. Thanks.

@d-ylee Dennis, as we discussed yesterday, I finally got many json dumps for a StepChain workflow that you could play with. Note that they only provide dataset parent information, no file and or block parentage. You can find them here:
https://amaltaro.web.cern.ch/amaltaro/forDennis/

Their parentage relationship is:

  "ChainParentageMap": {
    "Step1": {
      "ParentDset": null,
      "ChildDsets": [
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-GenSimFull_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM"
      ]
    },
    "Step2": {
      "ParentDset": "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-GenSimFull_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM",
      "ChildDsets": [
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-Digi_2021_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM-DIGI-RAW"
      ]
    },
    "Step3": {
      "ParentDset": "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-Digi_2021_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM-DIGI-RAW",
      "ChildDsets": [
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_Sept2023_Val_Alanv4-v11/DQMIO",
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_Sept2023_Val_Alanv4-v11/MINIAODSIM",
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM-RECO",
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_Sept2023_Val_Alanv4-v11/NANOAODSIM"
      ]
    }
  },

Please let me know if this enables you to move forward with this development and testing.

d-ylee · 2023-09-18T17:17:30Z

@amaltaro Thanks for the json dump.

@vkuznet I was going through the fileparent insert API and noticed that we are validating the incoming json twice:

dbs2go/dbs/fileparents.go

Line 85 in e71c8a4

err = r.Validate()

dbs2go/dbs/fileparents.go

Line 368 in e71c8a4

err = r.Validate()

Was there a reason for this?

To implement the missingFiles, I need to modify this validation step to accept 0 for PARENT_FILE_ID or to handle the second error in the Validate function:

dbs2go/dbs/fileparents.go

Lines 209 to 222 in e71c8a4

    
           func (r *FileParents) Validate() error { 
        
           	if err := RecordValidator.Struct(*r); err != nil { 
        
           		return DecodeValidatorError(r, err) 
        
           	} 
        
           	if r.THIS_FILE_ID == 0 { 
        
           		msg := "missing this_file_id" 
        
           		return Error(InvalidParamErr, ParametersErrorCode, msg, "dbs.fileparents.Validate") 
        
           	} 
        
           	if r.PARENT_FILE_ID == 0 { 
        
           		msg := "missing parent_file_id" 
        
           		return Error(InvalidParamErr, ParametersErrorCode, msg, "dbs.fileparents.Validate") 
        
           	} 
        
           	return nil 
        
           }

vkuznet · 2023-09-18T17:31:56Z

Why do you think we are validating twice? Those are two different APIs, the former one is Insert in FileParents

// FileParents represents file parents DBS DB table
type FileParents struct {
	THIS_FILE_ID   int64 `json:"this_file_id" validate:"required,number,gt=0"`
	PARENT_FILE_ID int64 `json:"parent_file_id" validate:"required,number,gt=0"`
}

record. The corresponding json will looks like this:

{"this_file_id": 123, "parent_file_id" : 456}

While latter is from InsertFileParentsBlockTxt APIs which reads FileParentBlockRecord

type FileParentBlockRecord struct {
	BlockName         string    `json:"block_name"`
	ChildParentIDList [][]int64 `json:"child_parent_id_list"`
}

record. The corresponding JSON record will look like this:

{"block_name": "block#123", "child_parent_id_list": [123,456]}

Both records comes in different DBS requests (you may need to figure out which ones). Therefore both require their own validations.

Regarding validation procedure. The 0 is default value if it is not provided in JSON record for given data-type. Therefore, I'm not a big fan to change validation code for it. Instead, I suggested to use -1 to clearly indicate missing parent. I'll leave you to discuss this with Alan.

…d issue with parentage check Kept validation check Partial Parentage is indicated by providing -1 to ChildParentIDList Only insert valid parentage

d-ylee · 2023-09-27T20:48:00Z

@amaltaro and I decided to use -1 in order to keep the required tag for PARENT_FILE_ID

d-ylee · 2023-09-27T20:49:40Z

@vkuznet I made a PR and would like to deploy it to the CERN image registry. My plan is to use a conditional on the tag (adding -dev to the tag name) to not have it be pushed to preprod. Is it possible for me to publish to the image registry from my fork?

amaltaro · 2023-09-28T12:30:15Z

@d-ylee Dennis, provided that you can authenticate, yes, you should be able because I have done that for some WMCore images.
Manual build and upload should work as well, in case you prefer that option.

vkuznet · 2023-09-28T13:16:03Z

@d-ylee I second Alan reply. Yes, you can upload new image, and yes it would be useful to have -dev suffix to distinguish it, and yes you should use your new image for testing, and yes you can build it locally and use it too. Bottom line, in your dv k8s cluster you can do whatever you want. You may build new dbs2go executable and copy it too and restart your service within existing pod too. There are so many possibilities that it is up to you to pick up the one which gives you best way to do debugging.

d-ylee · 2023-10-02T17:29:10Z

@vkuznet I am having difficulty making the image based off the commands in the GitHub build action. I looked at the DockerFile in dmwm/CMSKubernetes, and it requires the tag to be in this repository: https://github.com/dmwm/CMSKubernetes/blob/5ca68d24585051156eb08b6d1d85ff8e2188f2e9/docker/dbs2go/Dockerfile#L27-L28

Instead of merging, should I pull it into this repository as a new branch, then tag it to v00.06.43-dev to trigger the GitHub action?

vkuznet · 2023-10-02T18:00:15Z

@d-ylee why do you need CI/CD to build your custom image. Instead do the following:

clone CMSKubernetes repo
cd to docker/dbs2go area
modify Dockerfile to include your specific tag, branch, etc.
build yoiur image using docker build -t ... command
upload manually your image to registry.cern.ch
iterate as many times as necessary until you satisfied with your branch and merge, then CI/CD will build official image.

d-ylee · 2023-10-20T18:59:53Z

@amaltaro has tested the changes on test3, and we have agreed to move it to preprod for further testing.

d-ylee · 2023-10-24T21:35:38Z

I have pushed the changes to dbs2go-global-w and dbs2go-phys03-w on testbed.

amaltaro · 2023-10-24T23:22:08Z

Awesome, thanks Dennis! I hope to be able to upgrade testbed ReqMgr2 later today, otherwise it will happen tomorrow afternoon.

BTW, should we close this ticket with #101 ?

amaltaro · 2023-10-30T13:33:38Z

@d-ylee kind ping (please see above)

d-ylee · 2023-11-03T19:37:56Z

Closing due to changes deployed to production.

todor-ivanov mentioned this issue Mar 8, 2023

StepChainParentageFixTask ReqMgr2 thread failing to resolve parentage for 10 workflows dmwm/WMCore#11260

Closed

vkuznet assigned d-ylee Mar 21, 2023

todor-ivanov mentioned this issue Mar 23, 2023

Adding Null parentage information for RunLumi pairs missing at the parent Dataset dmwm/WMCore#11520

Merged

amaltaro mentioned this issue Sep 12, 2023

StepChainParentageFixTask ReqMgr2 thread failing to resolve dataset with incomplete parentage information dmwm/WMCore#11715

Closed

d-ylee mentioned this issue Sep 27, 2023

Allow Partial Parentage #101

Merged

d-ylee closed this as completed Nov 3, 2023

Allow blocks with partially resolved parentage information #94

Allow blocks with partially resolved parentage information #94

Comments

todor-ivanov commented Mar 8, 2023 • edited Loading

vkuznet commented Mar 8, 2023

todor-ivanov commented Mar 8, 2023 • edited Loading

vkuznet commented Mar 8, 2023

amaltaro commented Mar 8, 2023

vkuznet commented Mar 9, 2023

amaltaro commented Mar 24, 2023

d-ylee commented Mar 24, 2023

d-ylee commented Mar 27, 2023

vkuznet commented Mar 27, 2023

amaltaro commented Mar 27, 2023

vkuznet commented Mar 27, 2023

todor-ivanov commented Mar 27, 2023

d-ylee commented Apr 4, 2023

vkuznet commented Apr 4, 2023

todor-ivanov commented Apr 6, 2023

d-ylee commented Apr 17, 2023

todor-ivanov commented May 8, 2023

amaltaro commented May 24, 2023

amaltaro commented May 25, 2023

d-ylee commented May 25, 2023

amaltaro commented May 25, 2023

vkuznet commented May 25, 2023 • edited Loading

amaltaro commented May 25, 2023

d-ylee commented May 25, 2023

amaltaro commented Jun 9, 2023

todor-ivanov commented Jun 19, 2023 • edited Loading

d-ylee commented Aug 11, 2023

vkuznet commented Aug 14, 2023

amaltaro commented Aug 14, 2023 • edited Loading

vkuznet commented Aug 14, 2023

todor-ivanov commented Aug 15, 2023

yuyiguo commented Aug 22, 2023

d-ylee commented Aug 23, 2023

amaltaro commented Aug 23, 2023

amaltaro commented Sep 14, 2023

vkuznet commented Sep 15, 2023

vkuznet commented Sep 15, 2023

amaltaro commented Sep 15, 2023

vkuznet commented Sep 15, 2023 • edited Loading

amaltaro commented Sep 15, 2023

d-ylee commented Sep 18, 2023

vkuznet commented Sep 18, 2023

d-ylee commented Sep 27, 2023

d-ylee commented Sep 27, 2023

amaltaro commented Sep 28, 2023

vkuznet commented Sep 28, 2023

d-ylee commented Oct 2, 2023

vkuznet commented Oct 2, 2023

d-ylee commented Oct 20, 2023

d-ylee commented Oct 24, 2023

amaltaro commented Oct 24, 2023

amaltaro commented Oct 30, 2023

d-ylee commented Nov 3, 2023

todor-ivanov commented Mar 8, 2023 •

edited

Loading

todor-ivanov commented Mar 8, 2023 •

edited

Loading

vkuznet commented May 25, 2023 •

edited

Loading

todor-ivanov commented Jun 19, 2023 •

edited

Loading

amaltaro commented Aug 14, 2023 •

edited

Loading

vkuznet commented Sep 15, 2023 •

edited

Loading