Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow blocks with partially resolved parentage information #94

Closed
todor-ivanov opened this issue Mar 8, 2023 · 61 comments
Closed

Allow blocks with partially resolved parentage information #94

todor-ivanov opened this issue Mar 8, 2023 · 61 comments
Assignees

Comments

@todor-ivanov
Copy link

todor-ivanov commented Mar 8, 2023

Impact of the new feature
There could be many groups affected by such a change.

Is your feature request related to a problem?
This is a request for a feature change which is related to a recently found custom use case involving miscommunication between WMCore and DBS. Here is the related WMCore issue and description of the situation: dmwm/WMCore#11260 (comment)

Long story short, we need to allow uploading blocks with partially resolved parentage information. This means few of the files present in a block may have no relation to files from the parent dataset (missing or partially resolved parentage information).

The main question here is eventually to Core SW and Physics groups as well - How mandatory is this constraint and How badly would such a change affect other steps?

Describe the solution you'd like

  • Solution #1: We simply remove the check for fully resolved parentage information for all files in a given block, here: dbs/fileparents.go#L344-L358
  • Solution #2: We allow files from the child block to refer to a NO parent file by setting the parent file id to a predefined default value (in accordance with the DBS data structures), which should be an indication that these files are missing the upper level files which they have been produced from.

Describe alternatives you've considered

The alternative approach wold be for WMCore to start developing a mechanism for automatically invalidating all those files from DBS and Rucio, which will go silently in the background. So far we have no numbers to cite about how frequent this is and that's why we cannot tell if and to what extent of data reduction it could lead. But in any ways, we are not in favor of this alternative path.

Additional context
None

@vkuznet
Copy link
Contributor

vkuznet commented Mar 8, 2023

@todor-ivanov , thanks for creating the issue. But I think neither solution is acceptable. For the solution 1 I do not know how it will impact other use-cases. For the solution 2 I do not know how to correctly do that in ORACLE DB, what is a default and why we should have it in file parent table. I think you are looking for solution 3:

  • client provides list of chiild, parent ids, e.g. [(1,null), (2,3)]
  • the first pair indicate that there is no parent for child id 1
  • we adjust DBS code to skip this pair during injection to DBS file parents table
  • since there is no parent record in file parent table for child id 1 then anyone who will look-up this lfn will no there is no parent for it.

How does it sound to you? @amaltaro please provide your input too.

@todor-ivanov
Copy link
Author

todor-ivanov commented Mar 8, 2023

Thanks @vkuznet,
Actually from your comment I can see we are in coherence on the way how we should modify the logic. I simply failed to clarify that the change should happen through a proper change of the data structure in the client call. But yes, we are indeed talking about the same thing. Thanks for clarifying this. And I agree this would be the minimal change on the WMCore end.

@amaltaro, what is your opinion on this?

@vkuznet
Copy link
Contributor

vkuznet commented Mar 8, 2023

@todor-ivanov , please provide additional information which DBS API is affected by this request and provide an example of HTTP POST input payload. For DBS API please inspect the corresponding python client to identify which DBS API it is using, here is DBS API documentation which lists all DBS APIs.

@amaltaro
Copy link

amaltaro commented Mar 8, 2023

Valentin, Todor, this discussion is in agreement with what we discussed yesterday. Just to summarize:

  • we need to make minor changes to WMCore such that the child/parent tuple is provided to the DBS client API, but with a None/null refeference to the parent id. Make sure there is a log record for these cases.
  • we need to then pre-process this data on the DBS Server side, removing tuples pointing to a None/null parent file id. Such that no action is taken for that tuple.

For the record, the only - expected - users for this API is WMCore, especifically that ReqMgr2 cherrypy thread.

@vkuznet
Copy link
Contributor

vkuznet commented Mar 9, 2023

I can work on this if @klannon approve my time allocation for this task, or we can delegate this task to @d-ylee . Meanwhile here I provide list of action items which should be done within DBS codebase:

  • the affected DBS API is fileparents and it comes from fileparents.go module
  • we need to loose constrain on FileParents data structure, see https://github.com/dmwm/dbs2go/blob/master/dbs/fileparents.go#L63 and remove validate parameters for PARENT_FILE_ID such that it can accept null
  • the validate function should be adjusted to skip validation of PARENT_FILE_ID if it is not provided, but if it is provided it should still perform the validation.
  • the insert function should check PARENT_FILE_ID and if it is not provide just exit immediately
  • and, finally we need to provide new unit test in int_fileparents.go test file to test insertion of pair where file parent id is null.

@amaltaro
Copy link

@d-ylee Hi Dennis, did you manage to check the feasibility of the developments listed above? Please let us know if further clarification is required here.

@d-ylee
Copy link
Contributor

d-ylee commented Mar 24, 2023

@amaltaro I think the discussion above is clear to me. I can work on this after I get the updating physics group name completed. Thanks!

@d-ylee
Copy link
Contributor

d-ylee commented Mar 27, 2023

@vkuznet After looking through the code, I noticed that for the database schema:

CREATE TABLE FILE_PARENTS (
THIS_FILE_ID INTEGER CONSTRAINT NN_FP_THIS_FILE_ID NOT NULL,
PARENT_FILE_ID INTEGER CONSTRAINT NN_FP_PARENT_FILE_ID NOT NULL,
CONSTRAINT PK_FP PRIMARY KEY (THIS_FILE_ID, PARENT_FILE_ID)
)

FILE_PARENTS has FILE_PARENT_ID as not null, and also has FILE_PARENT_ID as part of the PRIMARY KEY constraint. Would it be an issue if the CONSTRAINT contains a NULL value?

@vkuznet
Copy link
Contributor

vkuznet commented Mar 27, 2023

Dennis, it is good question and in fact should be answered by @amaltaro , @todor-ivanov . For instance:

  1. if we loose constrain in FILE_PARENTS table then we can have entry in it for file ids which does not have parents
  2. if we will keep constrain then we'll only keep files with parent information.

So, the two approaches will reflect how we'll query the data and how we will interpret the results of such query. For example, if we stick to approach 1, then our fileparents API will return a file id and not parent id because file id will be recoreded in FILE_PARENTS with no parent id. But if we choose second approach then fileparents will not even list such file id, but it will means that it has no parent.

I suggest that @amaltaro , and @todor-ivanov should evaluate both pros and cons and side effects of this decision.

@amaltaro
Copy link

From the DBS perspective, would there be any impact and/or performance degradation if we allow NN_FP_PARENT_FILE_ID to be Null?

When using the fileparents API, this is an example of call:
https://cmsweb.cern.ch/dbs/prod/global/DBSReader/fileparents?logical_file_name=/store/mc/Phys14DR/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/AODSIM/TEST_PU20bx25_PHYS14_25_V1_TEST_Alan-v9/00000/3E27F2B3-D975-E411-8793-02163E010EE6.root

which returns the following data

[
{"logical_file_name":"/store/mc/Phys14DR/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/AODSIM/TEST_PU20bx25_PHYS14_25_V1_TEST_Alan-v9/00000/3E27F2B3-D975-E411-8793-02163E010EE6.root","parent_file_id":57020800,"parent_logical_file_name":"/store/mc/Fall13/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/GEN-SIM/POSTLS162_V1-v1/30000/AEF4E42B-1462-E311-8D2E-0025901D48B2.root"}
,{"logical_file_name":"/store/mc/Phys14DR/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/AODSIM/TEST_PU20bx25_PHYS14_25_V1_TEST_Alan-v9/00000/3E27F2B3-D975-E411-8793-02163E010EE6.root","parent_file_id":57020803,"parent_logical_file_name":"/store/mc/Fall13/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/GEN-SIM/POSTLS162_V1-v1/30000/A4668E29-1462-E311-AEB5-1CC1DE18CFEA.root"}
]

do I understand it right that, in case 1) above, we would get an output data as (just an example):

[
{"logical_file_name":"/store/mc/Phys14DR/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/AODSIM/TEST_PU20bx25_PHYS14_25_V1_TEST_Alan-v9/00000/3E27F2B3-D975-E411-8793-02163E010EE6.root","parent_file_id":none,"parent_logical_file_name":""}
]

while in case 2) the output response would be an empty list. Can you someone please confirm this behaviour?

@vkuznet
Copy link
Contributor

vkuznet commented Mar 27, 2023

Having or not parent file id in FILE_PARENTS table will not impact DBS performance as DBS APIs use streaming. But allowing null parent will increase table size. By how much it depends on how may null parents we will have for our files. Said that, your understanding of output of fileparents API is correct, for case 1 we'll have [{'logical_file_name': 'bla', "parent_logcial_file_name': null}], while in case 2 you'll get empty list since no record for file bla will be present in FILE_PARENTS table.

@todor-ivanov
Copy link
Author

Hi @vkuznet @d-ylee @amaltaro,

To me option 2) is a NoGo. It is quite confusing and introduces more complexity and confusion rathar than simplicity. This is exactly the type of solutions relying on hidden logic and unwritten policies and agreement which are to be well forrgotten in really near future and the knowledge left to be forwarded through the generations only by the means of spoken wisdom. That will make the code from both sides quite unreadable only in a year from now. I am much more for the solution 1), where the parrent is explicitly pointed as NULL, meaning that DBS simply has no information for the origin of the child file, but does have all the rest of the information related to this file. Plus, IIUC, in many many cases it is much more important for physics to find the file itself.

@d-ylee
Copy link
Contributor

d-ylee commented Apr 4, 2023

@vkuznet @amaltaro @todor-ivanov

If we go with 1), would we need to update the database schema for the FILE_PARENTS table to

  1. Allow NULL value for PARENT_FILE_ID
  2. Redefine the PRIMARY KEY to not use PARENT_FILE_ID

For the PRIMARY KEY, I am still not clear how we should define this

From a past message:

1. if we loose constrain in FILE_PARENTS table then we can have entry in it for file ids which does not have parents

Would we need to have a entry in the FILES table that is for no parent files, but when getting from /fileparents, return none in the parent_file_id field if this entry ID is referenced?

Also, would there be cases in which allowing null parentage for FILES also affect DATASET_PARENTS and BLOCK_PARENTS?

@vkuznet
Copy link
Contributor

vkuznet commented Apr 4, 2023

Dennis, everything can be much simpler than you describe. How about using -1 for PARENT_FILE_ID when it is not provided? This way, we do not need to adjust schema, the -1 is acceptable integer and we may use it in other tables and APIs to indicate of no parent presence.

@todor-ivanov
Copy link
Author

hi @vkuznet @d-ylee

The negative value for missing parent file id is a valid workaround, which would prevent the need of table redefinition. But we need to be sure what is about to be returned to the client as well. Because, once recorded as -1 I believe it will be also returned as -1 in subsequent client queries, rather than null, as originally recorded by the client. This is slightly confusing but may work, provided we check if any further change on how we digest the data from those APIs in WMCore would be needed.

Denis, can you list the APIs in DBS which would develop such a silent data type metamorphosis if we opt for this workaround, so that we can check where and when WMCore uses them?

FYI @amaltaro

@d-ylee
Copy link
Contributor

d-ylee commented Apr 17, 2023

@todor-ivanov

Not sure if this is what you are asking for, but the APIs that are related to parentage are these:

  • /fileparents
  • /blockparents
  • /datasetparents

@todor-ivanov
Copy link
Author

Hi @vkuznet @d-ylee

Coming back to this, it does not seem that all those APIs you list in your previous message would be affected by
the workaround of returning negative value for missing parent file id. And so let me clarify, what it means affected in this case:

  • The client records null for the fileId field for a given run/lumi pair with missing parentage information
  • When later called to provide the file parentage information, the server returns -1 for the fileId field for run/lumi pairs with missing parentage information instead of the originally recorded value of null.

So to me, it seems only the APIs which return information on the run/lumi level granularity would be affected. So I'd bet only on /fileparents, but this was exactly the confirmation that we wanted from you actually.

So in the most straightforward use case, this might affect us is in the client's methods calls to list{ParentDataset,Block}Trio:
https://github.com/dmwm/WMCore/blob/31d65166dc62c7251edd52890bc4ac2f0b79d7fe/src/python/WMCore/Services/DBS/DBS3Reader.py#L966

https://github.com/dmwm/WMCore/blob/31d65166dc62c7251edd52890bc4ac2f0b79d7fe/src/python/WMCore/Services/DBS/DBS3Reader.py#L990

So this is the place I think we will have to rethink this mechanism again, but I see no other way to test it other than just making it in testbed .... and check what the server returns and how we aggregate the parentage information.

To me this again would be yet another workaround, which we will have to remember forever.

@amaltaro
Copy link

Just an update of this issue as an outcome of the Zoom discussion we just had (Valentin, Dennis and myself), this is the list of action items that we came up with:

1. Dennis: Check if null json gets converted to 0 in Go language.
2. Todor/Alan - IF NEEDED: Update WMCore to provide parent id = 0 instead of None
3. Dennis: Find what are the constraints/validation performed on the DBS Server side.
4. Dennis - IF NEEDED: update DBS Server to deal with partial block information
5. Dennis: create a CERN oracle account
6. Dennis: Get a dump of testbed or production
 * re-create the ownership of the table (or preserve the ownership)
7. Dennis: deploy DBSServer in a test k8s cluster and share the instance url
8. Todor: deploy ReqMgr2 in a test cluster pointing to Dennis DBS url
9. Todor: Inject one of these blocks in this test instance
  * note that ReqMgr2 should not mark a workflow as ParentageResolved = True
10. Todor: Validate the read APIs

@amaltaro
Copy link

@d-ylee Hi Dennis, I wonder if you managed to look into the null marshaling process? Such that we can decide whether we will have to make further changes in WMCore or not.

@d-ylee
Copy link
Contributor

d-ylee commented May 25, 2023

After some basic testing, null json values get converted to the nil value for a given type in Go.

package main

import (
        "fmt"
        "encoding/json"
)

type TestStruct struct {
        Test int64 `json:"test"`
}

func main() {
        d := "{'test': null}"
        var t TestStruct
        json.Unmarshal([]byte(d), &t)
        fmt.Printf("%v", t)
}

In this example, the value of t.Test is 0. If the type of TestStruct.Test is string, json.Unmarshal handles it by using empty string.

This behavior is reflected in the documentation: https://pkg.go.dev/encoding/json#Unmarshal

The JSON null value unmarshals into an interface, map, pointer, or slice by setting that Go value to nil. Because null is often used in JSON to mean “not present,” unmarshaling a JSON null into any other Go type has no effect on the value and produces no error.

Null values in JSON are given the nil or 0 value of a given type. For int64, the nil value is 0.

I can do additional testing by creating a simple HTTP server and sending a Python request with a None as the key's value to see if it will have the same result.

@amaltaro
Copy link

@d-ylee thank you for these details! I think it is good enough and I understand that WMCore can keep providing a tuple of (int_child_file_id, None) whenever the parent file is gone (and/or invalidated). So no further changes should be expected in WMCore.

Please let us know how your communication with the CERN DBA goes and where you need support on anything on this front.

@vkuznet
Copy link
Contributor

vkuznet commented May 25, 2023

@d-ylee thanks for confirmation. PLEASE keep in mind that None is pure Python data-type and python client MUST provide proper JSON, e.g.

>>> import json
>>> json.dumps({"bla":None})
'{"bla": null}'

So, on a wire, i.e. when HTTP request is send, the python code MUST provide null and not None, those are two different things. The null is null value in JSON standard, while None is Python none data-type.

@amaltaro , @todor-ivanov you must ensure that whenever Python code provides (int_child_file_id, None) it should be properly converted to JSON data format via json.dumps call, otherwise we'll get a problem on server side.

@amaltaro
Copy link

@vkuznet yes, the python object is properly encoded as a json object. So no problems on this front are expected.

@d-ylee
Copy link
Contributor

d-ylee commented May 25, 2023

I did a request from Python in the same way as what @vkuznet detailed using json.dumps on a Python dict with key that has the value of None. The result is the same: Go decodes/unmarshal this as 0 for int64.

@amaltaro
Copy link

amaltaro commented Jun 9, 2023

@d-ylee Hi Dennis, it's been 2 weeks since we had our last chat, and I believe we only managed to cover points 1 and 2. from this comment so far:
#94 (comment)

do you have any updates on the other points? Do you know have a DBS dump such that we can try things in a dev environment?

In addition to that, can you please share what the current constraints are for updating block/file parentage data? It might be that Valentin already provided it in one of the tickets we discussed this, but I could not find it.

It's important to mention that this issue is currently keeping data unnecessarily unlocked in production, so we better fix it sooner than later.

@todor-ivanov
Copy link
Author

todor-ivanov commented Jun 19, 2023

hi @d-ylee would you want us to organize another meeting if you need some more information from our side so we can proceed on this? If you think there is some step that we first need to do and is currently blocking the progress, please let us know?

@d-ylee
Copy link
Contributor

d-ylee commented Aug 11, 2023

I spoke to @yuyiguo about this issue, and she explained to me some of the policies from the past and about this partial parentage issue.

Before DBS3, partial blocks were allowed, and these were able to be inserted into DBS when there were missing blocks. This was because we may have not had all files or we did not have the parent-child relationships yet.

This was an issue because we might forget about these partial blocks and end up not knowing who is the parent.

Yuyi mentioned that the policy in DBS3 was to only allow full blocks. Workflow management was to have its own database to keep unfinished blocks there, then insert into the production database.

If we want to change this policy, we need to have it written down and record who is responsible to complete the block.

Another concern is when a child has multiple parents. The current schema in the BLOCK_PARENTS database is the PARENT_BLOCK_ID and THIS_BLOCK_ID (the child). These two columns have a unique constraint. This means that there can't be more than one -1 allowed for a single child. If there are multiple parents with the same children and we use the -1 logic, the UNIQUE constraint would be broken.

Another question I have is, do you have the location of the written DBS3 design policies?

What are your thoughts?

@vkuznet
Copy link
Contributor

vkuznet commented Aug 14, 2023

@amaltaro , good design practices suggests to separate APIs for different use-cases rather to introduce complicated if/else logic. The later was used in Python server code which (if you had a chance to look at it) is very hard to read and understand. Therefore, I expressed my opinion based on this argument. Plus, you never know that one client will only the case. I can easily foresee when we may have other clients, plain script (instead of python code), or CRAB, or any new development we may have in a future. And, even using single client as it is right now you really address different use-cases. For maintenance reason it is better to address them via separate API rather complicate server code (even if we'll need make additional code-refactoring and wait a little bit longer).

@amaltaro
Copy link

amaltaro commented Aug 14, 2023

@amaltaro , good design practices suggests to separate APIs for different use-cases rather to introduce complicated if/else logic.

to me this is exactly the opposite of what you are suggesting. Creating another API just to deal with a slightly different use case IS an over-complication, it is confusing and it causes mistakes on the client side.

The extra query string would be the best approach in my opinion, but that will require changes in:

  • DBS server side
  • DBS client side (new release has to be made as well and properly integrated into the WMCore stack)
  • WMCore

The later was used in Python server code which (if you had a chance to look at it) is very hard to read and understand.

The approach I am suggesting adds NO overcomplication and does not make the code more complicated, does not increase maintenance and is transparent to the end user. It also does not require the release of a new dbs3-client!
It literally is a change in the source code from something like:

files_passed_by_client == files_in_block

to something like

files_passed_by_client <= files_in_block

Anyways, I think we will spend the rest of our days disagreeing on this!
I'd rather let @d-ylee Dennis consider what is the best approach here, noting that at the moment we have 100s of workflows stuck in the system waiting for this development to be deployed in production.

@vkuznet
Copy link
Contributor

vkuznet commented Aug 14, 2023

Alan, I would like to take last change to convince you and the team to not add changes to the existing API. My last argument is the following:

  • by design (both Python and Go implementation) of DBS code provides comprehensive checks and input validation from the client
  • by using the same API and rely on different input we break the above requirements and open the door to handling input without cross check
  • if we'll implement the changes the way you propose DBS server will not have such ability to check if given input is right or wrong (i.e. it does or does not have parents) and it will just follow if/else logic without knowing and validating that client provide correct input

In my suggestion this is not possible by design since client will be forced to use either new API or provide external parameter to the API in question. Doing this way client a-priory will do a right job and DBS server will not rely on if/else logic of the input data. You may provide as many arguments as you want that such server implementation is more easy to do but you open a possibility that one day client will (by whatever mistake) provide incorrect input and DBS server will follow if/else logic to insert the data to the database. For instance, someone by mistake will not provide parents, or we may have a bug elsewhere in WMCore which may skip some parent, etc. In either case DBS server will follow the input if we will use the same API signature. To avoid this possibility we must enforce client to be explicit with the call (in other words client will know what it is doing by calling proper API or API with proper parameter rather providing an input only to DBS server).

I hope that you can take this arguments more seriously and properly evaluate the implication of implantation on DBS, but I'll not argue any more about how to implement this logic and leave it up to your and the rest of WMCore team to decide (here I have dual role, I belong to WMCore team and in my view the implementation should have separate API or API+external parameter and I implemented DBS logic). I think Dennis should implement whatever WMCore will agree.

Said that, I tag @todor-ivanov to evaluate my concern and you both should come to conclusion how the implementation should be done. Once you'll reach such agreement I think for Dennis it will be more clear how to implement it. At the end, it is just implementation but my concern that the choice may impact data consistency in DBS database in a long run.

@todor-ivanov
Copy link
Author

hi @amaltaro @vkuznet @d-ylee,
It is difficult to say here what is the better solution. The arguments and concerns for both cases are far in the future e.g. "are we going to stay the only user of this API", "Will we affect other's code behavior" etc... But as a general rule of thumb I'd not rely on the current situation and conclude for the far future.

One really important thing to take into account when making such a decision is the knowledge if our current policy, for not inserting files into DBS which lack parentage information, is acknowledged and well understood by every other group that may be affected. And here I mean both sides:

  • Eventual users of WMCore modules or DBS apis directly (including CRAB), and also
  • Those who need to consume the output from DBS, such as physics

We should not talk only about WMCore and people responsible for DBS server. And this, I believe, is information we do not have at the moment. Obviously there is no policy document for DBS Server behavior..., ence our blindness on the applicability of this policy. There could easily be some proliferation of the reverse opinion among other groups who rely on the data returned by DBS - meaning they may rely on the parent-child relationship in DBS to be satisfied in any case. So we are actually in the state of lack of consensus on the policy itself.

Now about the possible solutions: The fix only on the server side is something that is less work and is an encapsulated change, but because of the above unknowns, I'd bet for the harder one - the one that forces the client to explicitly and consciously consciously stress which policy he would chose for uploading the files. It is not only backwards compatibility that we are talking about here, but also about bearing the responsibility for doing the one or the other. At the end we will not change the API, regardless of which solution we chose here - !!! we only change Server behavior !!!

  • Creating a new API is out of question - simply not worth it.
  • Setting an option to the already existing API for choosing the correct policy which would be defaulting to the current one, in my opinion is the way to go.

Yes this is going to affect/require changes from everybody, as @amaltaro said:

  • DBS server side
  • DBS client side (new release has to be made as well and properly integrated into the WMCore stack)
  • WMCore

In the case of WMCore this should be fairly small change - adding the new option here and here The tests and validation procedure, on the other hand, would stay as usually long and need to be a full one.... well it has its price.
For the rest, I cannot tell .

Something more. Since this situation (or what concerns WMCore) is an expected one, and our use cases are well identified in advance - as early as building the lists for the parentage resolution itself, I'd not further complicate the code by implementing extra calls to DBS for checking dbs errors and retries. This would only affect code readability and scalability both together without any benefit. So I'd go directly for just changing the policy on the WMCore side by using the newly provided option to the same API.

@yuyiguo
Copy link
Member

yuyiguo commented Aug 22, 2023

I just back from my vacation and saw this discussion. Right before @d-ylee and I left for vacation, @amaltaro, Dennis, and I had a Zoom meeting on this. We discussed why the "-1 " would not work with the DBS DB schema. Alan pointed out that the missing parent files would be lost forever, in other words, the file parentage would not be updated in the future. So we concluded that it is better to just not put anything in DBS for the ones that lost their parents' files.
I think @vkuznet made a valued argument to protect data integrity. To consider both DBS and WMCore, I would propose below API and server upgrade:

  1. Add missingfiles=0 to the existing API. This would be the default. So no effect on others except for WMcore.
  2. When WMCore calls the API, it tells DBS how many missing files it has with missingfiles=n.
  3. DBS server continues to verify the child-parent relationship. If the missing files does not match n, it will reject the input.
    This way both client and server know exactly what should happen. This is my two cents.

@d-ylee
Copy link
Contributor

d-ylee commented Aug 23, 2023

@amaltaro @todor-ivanov If this is what we would like to do, I think we would need to incorporate the number of missing files to the check here:

if !utils.Equal(utils.OrderedSet(fids), utils.OrderedSet(bfids)) {

I do see that the utils.Equal function also makes sure the file id and the block file ids match. This should probably also be validated during the insert request and may need to be another parameter.

Another question I have is that would DBS need to keep track of these missing files? Or would this be on WMCore's side? Both this question and the above validation seem to affect each other.

Thoughts?

@amaltaro
Copy link

@d-ylee Hi Dennis, Yuyi's idea looks good to me and having the client to provide how many files are missing in the json data is a good approach (defaulting it to 0, if not provided).

On what concerns keeping track of the missing files. I would say that the DBS server should at least log that a block with incomplete parentage action was performed (sort of an access log). On the WM side, we do log that a given file failed to find its parentage file.

Please let us know if there are any remaining open questions.

@amaltaro
Copy link

@d-ylee Dennis, in addition to what we have discussed today, I just wanted to update this thread with a way to migrate dataset from prod/global do your dev cluster (such that we can test things out):

from dbs.apis.dbsClient import DbsApi
dbsApi = DbsApi(url = 'https://cmsweb-test3.cern.ch/dbs/dev/global/DBSMigrate/')
dbsUrl = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader'
dataset = "my dataset name"
migrateArgs = {'migration_url': dbsUrl, 'migration_input': dataset}
dbsApi.submitMigration(migrateArgs)

I looked into the dbs3-client code and came up with the following curl standalone call:

curl -k --cert $X509_USER_CERT --key $X509_USER_KEY -vv -X POST "https://cmsweb-test3.cern.ch:8443/dbs/dev/global/DBSMigrate" -d '{"migration_url": "https://cmsweb.cern.ch/dbs/prod/global/DBSReader", "migration_input": "/DM_MonoZPrime_A_Mx50_Mv5000_gDM1gSM0p25_Zprime2p0_TuneCP5_13TeV_madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM"}' -H 'Content-Type: application/json' -H 'Accept: application/json' -H 'UserID: [email protected]' -H 'User-Agent: Alan-cURL'

but all my attempts failed with:

< Response-Status: 405 Method Not Allowed

and I cannot figure why. Perhaps @vkuznet has a working example with curl other than this:
https://github.com/dmwm/dbs2go/blob/master/docs/MigrationServer.md#examples
which relies in a different service/API, localhost and curl command which is not available in the dbs2go images(?)

@vkuznet
Copy link
Contributor

vkuznet commented Sep 15, 2023

Alan, could you please read closely documentation. It says:

# migration document
cat > m.json << EOF
{
    "migration_url": "https://.../dbs/prod/global/DBSReader",
    "migration_input": "/a/b/c#123"
}
EOF

# submit migration request
curl -H "Content-Type: application/json" -d@$PWD/m.json \
    http://localhost:9898/dbs2go-migrate/submit

while in your request your URL does not specify /submit API.

@vkuznet
Copy link
Contributor

vkuznet commented Sep 15, 2023

Moreover, each DBS server has its own set of APIs which you can easily find via /apis end-point, e.g. here is list of APIs for DBSMigrate which I took from testbed:

scurl https://cmsweb-testbed.cern.ch/dbs/int/global/DBSMigrate/apis
[{"api":"/dbs/int/global/DBSMigrate/blocks","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/submit","methods":["POST"]},{"api":"/dbs/int/global/DBSMigrate/remove","methods":["POST"]},{"api":"/dbs/int/global/DBSMigrate/blockparents","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/serverinfo","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/total","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/bulkblocks","methods":["POST"]},{"api":"/dbs/int/global/DBSMigrate/dummy","methods":["GET","POST"]},{"api":"/dbs/int/global/DBSMigrate/errors","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/help","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/status","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/datasetparents","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/healthz","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/metrics","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/apis","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/process","methods":["POST"]},{"api":"/dbs/int/global/DBSMigrate/cancel","methods":["POST"]}]

Similar /apis end-point exist for reader, writer servers and you can easily see which method each API is using.

@amaltaro
Copy link

@vkuznet Valentin, you probably missed a few important points:

  1. we are trying to test dbs2go-global-m, which is the service that receives dataset migration requests
  2. the server expected API endpoint is dbs/dev/global/DBSMigrate/submit, instead of dbs2go-migrate/submit. Unless you need to run a different service which provides a different set of APIs. If so, then you start deviating from what you actually run in production (given that you are testing a different setup)
  3. none of the dbs2go images provide curl, so we need to deploy another service in the kubernetes cluster just so we can use curl to reach a the service that we need with dbs2go-global-m.dbs.svc. Which is inconvenient.

The apis endpoint will be useful though. Thanks

@vkuznet
Copy link
Contributor

vkuznet commented Sep 15, 2023

Alan, I understand what you are doing. The provided dbs2go documentation does not list our production/test/dev URLS and refers to actual service. Of course I would expect that you know your server endpoint, e.g. /dbs/dev/global/DBSMigrate and use it instead of listed in documentation dbs2go-migrate. The curl should not be inside of dbs2go images. Instead, as I pointed many times just deploy new curl image in your k8s and use it to access any service. Here its manifest file: https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb/services/curl.yaml And, you can deploy it as easy as

kubectl apply -f curl.yaml

Then curl pod is created in default namespace and you can login to it and use curl to access your services.

@amaltaro
Copy link

Okay, it looks like Dennis has some documentation work to be done at some point, such that a developer can go as closer to the production environment as possible, to have more meaningful tests.

I didn't remember we had a manifest only for curl, that will definitely be handy. Thanks.

@d-ylee Dennis, as we discussed yesterday, I finally got many json dumps for a StepChain workflow that you could play with. Note that they only provide dataset parent information, no file and or block parentage. You can find them here:
https://amaltaro.web.cern.ch/amaltaro/forDennis/

Their parentage relationship is:

  "ChainParentageMap": {
    "Step1": {
      "ParentDset": null,
      "ChildDsets": [
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-GenSimFull_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM"
      ]
    },
    "Step2": {
      "ParentDset": "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-GenSimFull_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM",
      "ChildDsets": [
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-Digi_2021_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM-DIGI-RAW"
      ]
    },
    "Step3": {
      "ParentDset": "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-Digi_2021_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM-DIGI-RAW",
      "ChildDsets": [
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_Sept2023_Val_Alanv4-v11/DQMIO",
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_Sept2023_Val_Alanv4-v11/MINIAODSIM",
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM-RECO",
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_Sept2023_Val_Alanv4-v11/NANOAODSIM"
      ]
    }
  },

Please let me know if this enables you to move forward with this development and testing.

@d-ylee
Copy link
Contributor

d-ylee commented Sep 18, 2023

@amaltaro Thanks for the json dump.

@vkuznet I was going through the fileparent insert API and noticed that we are validating the incoming json twice:

err = r.Validate()

err = r.Validate()

Was there a reason for this?

To implement the missingFiles, I need to modify this validation step to accept 0 for PARENT_FILE_ID or to handle the second error in the Validate function:

dbs2go/dbs/fileparents.go

Lines 209 to 222 in e71c8a4

func (r *FileParents) Validate() error {
if err := RecordValidator.Struct(*r); err != nil {
return DecodeValidatorError(r, err)
}
if r.THIS_FILE_ID == 0 {
msg := "missing this_file_id"
return Error(InvalidParamErr, ParametersErrorCode, msg, "dbs.fileparents.Validate")
}
if r.PARENT_FILE_ID == 0 {
msg := "missing parent_file_id"
return Error(InvalidParamErr, ParametersErrorCode, msg, "dbs.fileparents.Validate")
}
return nil
}

@vkuznet
Copy link
Contributor

vkuznet commented Sep 18, 2023

Why do you think we are validating twice? Those are two different APIs, the former one is Insert in FileParents

// FileParents represents file parents DBS DB table
type FileParents struct {
	THIS_FILE_ID   int64 `json:"this_file_id" validate:"required,number,gt=0"`
	PARENT_FILE_ID int64 `json:"parent_file_id" validate:"required,number,gt=0"`
}

record. The corresponding json will looks like this:

{"this_file_id": 123, "parent_file_id" : 456}

While latter is from InsertFileParentsBlockTxt APIs which reads FileParentBlockRecord

type FileParentBlockRecord struct {
	BlockName         string    `json:"block_name"`
	ChildParentIDList [][]int64 `json:"child_parent_id_list"`
}

record. The corresponding JSON record will look like this:

{"block_name": "block#123", "child_parent_id_list": [123,456]}

Both records comes in different DBS requests (you may need to figure out which ones). Therefore both require their own validations.

Regarding validation procedure. The 0 is default value if it is not provided in JSON record for given data-type. Therefore, I'm not a big fan to change validation code for it. Instead, I suggested to use -1 to clearly indicate missing parent. I'll leave you to discuss this with Alan.

d-ylee added a commit to d-ylee/dbs2go that referenced this issue Sep 27, 2023
…d issue with parentage check

Kept validation check
Partial Parentage is indicated by providing -1 to ChildParentIDList
Only insert valid parentage
@d-ylee
Copy link
Contributor

d-ylee commented Sep 27, 2023

@amaltaro and I decided to use -1 in order to keep the required tag for PARENT_FILE_ID

@d-ylee
Copy link
Contributor

d-ylee commented Sep 27, 2023

@vkuznet I made a PR and would like to deploy it to the CERN image registry. My plan is to use a conditional on the tag (adding -dev to the tag name) to not have it be pushed to preprod. Is it possible for me to publish to the image registry from my fork?

@amaltaro
Copy link

@d-ylee Dennis, provided that you can authenticate, yes, you should be able because I have done that for some WMCore images.
Manual build and upload should work as well, in case you prefer that option.

@vkuznet
Copy link
Contributor

vkuznet commented Sep 28, 2023

@d-ylee I second Alan reply. Yes, you can upload new image, and yes it would be useful to have -dev suffix to distinguish it, and yes you should use your new image for testing, and yes you can build it locally and use it too. Bottom line, in your dv k8s cluster you can do whatever you want. You may build new dbs2go executable and copy it too and restart your service within existing pod too. There are so many possibilities that it is up to you to pick up the one which gives you best way to do debugging.

@d-ylee
Copy link
Contributor

d-ylee commented Oct 2, 2023

@vkuznet I am having difficulty making the image based off the commands in the GitHub build action. I looked at the DockerFile in dmwm/CMSKubernetes, and it requires the tag to be in this repository: https://github.com/dmwm/CMSKubernetes/blob/5ca68d24585051156eb08b6d1d85ff8e2188f2e9/docker/dbs2go/Dockerfile#L27-L28

Instead of merging, should I pull it into this repository as a new branch, then tag it to v00.06.43-dev to trigger the GitHub action?

@vkuznet
Copy link
Contributor

vkuznet commented Oct 2, 2023

@d-ylee why do you need CI/CD to build your custom image. Instead do the following:

  1. clone CMSKubernetes repo
  2. cd to docker/dbs2go area
  3. modify Dockerfile to include your specific tag, branch, etc.
  4. build yoiur image using docker build -t ... command
  5. upload manually your image to registry.cern.ch
  6. iterate as many times as necessary until you satisfied with your branch and merge, then CI/CD will build official image.

@d-ylee
Copy link
Contributor

d-ylee commented Oct 20, 2023

@amaltaro has tested the changes on test3, and we have agreed to move it to preprod for further testing.

@d-ylee
Copy link
Contributor

d-ylee commented Oct 24, 2023

I have pushed the changes to dbs2go-global-w and dbs2go-phys03-w on testbed.

@amaltaro
Copy link

Awesome, thanks Dennis! I hope to be able to upgrade testbed ReqMgr2 later today, otherwise it will happen tomorrow afternoon.

BTW, should we close this ticket with #101 ?

@amaltaro
Copy link

@d-ylee kind ping (please see above)

@d-ylee
Copy link
Contributor

d-ylee commented Nov 3, 2023

Closing due to changes deployed to production.

@d-ylee d-ylee closed this as completed Nov 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants