add cycle validation #111

Merged
merged 3 commits into from
Jul 28, 2022

Conversation

rjawesome
Contributor

@rjawesome rjawesome commented Jul 19, 2022

Meant to solve this issue. Uses a DFS to find cycles and assumes the direction of edges doesn't matter.

@ariutta
Collaborator

ariutta commented Jul 19, 2022

Just based on looking at the code, this appears correct. But I didn't actually run any tests.

One question: does conNode mean "connected node"? After looking more closely, it's pretty clear it means "connected node" or "connection node," but it took me a moment.

@rjawesome
Contributor Author

Yes, conNode means a node connected to the current node in the DFS (depth-first search). For anyone interested, here's a quick summary of the algorithm.

  • For each node, create an object which contains a property of whether it has been visited and a list of connected nodes.
  • Loop through all nodes
    • if this node has been visited, then continue
    • Complete a DFS starting at this node using a stack
      • If at any stage, a connected node that is not the parent node has already been visited, then we can detect a cycle
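The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name hasCycle and the exact bookkeeping are assumptions, and it expects every edge to reference a node that exists in the query graph.

```javascript
// Hypothetical helper (not the PR's implementation).
// Returns true if the query graph has a cycle when edges are treated as undirected.
function hasCycle(queryGraph) {
  // One record per node: a visited flag and a list of connected nodes.
  const nodes = {};
  for (const id of Object.keys(queryGraph.nodes)) {
    nodes[id] = { visited: false, connections: [] };
  }
  // Direction is ignored, so each edge is recorded on both of its endpoints.
  for (const edge of Object.values(queryGraph.edges)) {
    nodes[edge.subject].connections.push(edge.object);
    nodes[edge.object].connections.push(edge.subject);
  }

  for (const startId of Object.keys(nodes)) {
    if (nodes[startId].visited) continue;
    // Stack-based DFS; each entry remembers the node it was reached from.
    const stack = [{ id: startId, parent: null }];
    while (stack.length > 0) {
      const { id, parent } = stack.pop();
      nodes[id].visited = true;
      for (const conNode of nodes[id].connections) {
        if (conNode === parent) continue; // ignore the way back to the parent
        if (nodes[conNode].visited) return true; // reached by another path: cycle
        stack.push({ id: conNode, parent: id });
      }
    }
  }
  return false;
}

// Small usage example with made-up graphs.
const triangle = {
  nodes: { n0: {}, n1: {}, n2: {} },
  edges: {
    e01: { subject: 'n0', object: 'n1' },
    e02: { subject: 'n1', object: 'n2' },
    e03: { subject: 'n2', object: 'n0' },
  },
};
const line = {
  nodes: { n0: {}, n1: {}, n2: {} },
  edges: {
    e01: { subject: 'n0', object: 'n1' },
    e02: { subject: 'n1', object: 'n2' },
  },
};
console.log(hasCycle(triangle)); // true
console.log(hasCycle(line)); // false
```

Because the parent is tracked by node ID, this sketch reports a self-edge as a cycle but does not report two edges between the same pair of nodes.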

I also ran some tests with the following query graphs: queryGraph1 and queryGraph2 should be flagged as containing cycles, while queryGraph3 and queryGraph4 should not.

Query Graphs
const queryGraph1 = 
{
	"nodes": {
		"n0": {
			"ids":["PUBCHEM.COMPOUND:222284"],
			"categories":["biolink:ChemicalEntity"]
		},
		"n1": {
			"ids":[
				"MONDO:0005267",
				"MONDO:0005542",
				"MONDO:0005311",
				"MONDO:0005542",
				"MONDO:0004995"
				],
			"categories":["biolink:DiseaseOrPhenotypicFeature"]
	   },
		"n2": {
			"ids":["MONDO:0100096"],
			"categories":["biolink:DiseaseOrPhenotypicFeature"]
		},
		"n3": {
			"categories":["biolink:Gene"]
		}
	},
	"edges": {
		"e01": {
			"subject": "n0",
			"object": "n1",
			"predicates": ["biolink:related_to"]
		},
		"e02": {
			"subject": "n1",
			"object": "n2",
			"predicates": ["biolink:related_to"]
		},
		"e03": {
			"subject": "n0",
			"object": "n3",
			"predicates": ["biolink:related_to"]
		},
		"e04": {
			"subject": "n3",
			"object": "n2",
			"predicates": ["biolink:related_to"]
		}
	}
};

const queryGraph2 = 
{
	"nodes": {
		"n0": {
			"ids":["PUBCHEM.COMPOUND:222284"],
			"categories":["biolink:ChemicalEntity"]
		},
		"n1": {
			"ids":[
				"MONDO:0005267",
				"MONDO:0005542",
				"MONDO:0005311",
				"MONDO:0005542",
				"MONDO:0004995"
				],
			"categories":["biolink:DiseaseOrPhenotypicFeature"]
	   },
		"n2": {
			"ids":["MONDO:0100096"],
			"categories":["biolink:DiseaseOrPhenotypicFeature"]
		},
		"n3": {
			"categories":["biolink:Gene"]
		}
	},
	"edges": {
		"e02": {
			"subject": "n1",
			"object": "n2",
			"predicates": ["biolink:related_to"]
		},
		"e03": {
			"subject": "n2",
			"object": "n3",
			"predicates": ["biolink:related_to"]
		},
		"e04": {
			"subject": "n3",
			"object": "n1",
			"predicates": ["biolink:related_to"]
		}
	}
};

const queryGraph3 = 
{
	"nodes": {
		"n0": {
			"ids":["PUBCHEM.COMPOUND:222284"],
			"categories":["biolink:ChemicalEntity"]
		},
		"n1": {
			"ids":[
				"MONDO:0005267",
				"MONDO:0005542",
				"MONDO:0005311",
				"MONDO:0005542",
				"MONDO:0004995"
				],
			"categories":["biolink:DiseaseOrPhenotypicFeature"]
	   },
		"n2": {
			"ids":["MONDO:0100096"],
			"categories":["biolink:DiseaseOrPhenotypicFeature"]
		},
		"n3": {
			"categories":["biolink:Gene"]
		}
	},
	"edges": {
		"e02": {
			"subject": "n1",
			"object": "n2",
			"predicates": ["biolink:related_to"]
		},
		"e03": {
			"subject": "n1",
			"object": "n3",
			"predicates": ["biolink:related_to"]
		}
	}
};

const queryGraph4 = 
{
	"nodes": {
		"n0": {
			"ids":["PUBCHEM.COMPOUND:222284"],
			"categories":["biolink:ChemicalEntity"]
		},
		"n1": {
			"ids":[
				"MONDO:0005267",
				"MONDO:0005542",
				"MONDO:0005311",
				"MONDO:0005542",
				"MONDO:0004995"
				],
			"categories":["biolink:DiseaseOrPhenotypicFeature"]
	   },
		"n2": {
			"ids":["MONDO:0100096"],
			"categories":["biolink:DiseaseOrPhenotypicFeature"]
		},
		"n3": {
			"categories":["biolink:Gene"]
		}
	},
	"edges": {
		"e02": {
			"subject": "n1",
			"object": "n2",
			"predicates": ["biolink:related_to"]
		},
		"e03": {
			"subject": "n1",
			"object": "n3",
			"predicates": ["biolink:related_to"]
		},
		"e05": {
			"subject": "n2",
			"object": "n0"
		}
	}
};

@tokebe
Member

tokebe commented Jul 20, 2022

This looks great! If you could turn those test QGraphs into a couple of tests it would help with testing coverage/ongoing verification.

@rjawesome
Contributor Author

Tests have been added!

@tokebe
Member

tokebe commented Jul 20, 2022

Tagging @andrewsu @colleenXu for final approval to merge -- everything seems ready and tests pass (locally -- GitHub tests are still broken)

@andrewsu
Member

Looks good to me!

@colleenXu
Contributor

colleenXu commented Jul 27, 2022

This is not recognized as a cycle by this code; should it be? Or is it another case that we want to address (by stopping execution)?

Disease ID <-> Disease

[Image: Screen Shot 2022-07-26 at 6 05 59 PM]

This, and any larger QGraph containing something like it, continues execution (it isn't identified as a cycle and stopped).

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e02": {
                    "subject": "n1",
                    "object": "n0"
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MONDO:0033373"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                }
            }
        }
    }
}

TL;DR (for this post and the one below it): it looks pretty good otherwise (I didn't fully execute anything; I only started querying and stopped execution once it looked like it had passed the checkpoint for raising the "cycle" error...)

@colleenXu
Contributor

colleenXu commented Jul 27, 2022

Notes on other things I tried


Correctly identified as cycle (stopped execution):

self-edge
{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n0"
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MONDO:0033373"],
                    "categories": ["biolink:Disease"]
                }
            }
        }
    }
}

self-edge as part of QGraph
{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n0"
                },
                "e02": {
                    "subject": "n0",
                    "object": "n1"
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MONDO:0033373"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                }
            }
        }
    }
}

a triangle (like rjawesome's QG2)

You can switch the subject/object within an edge or swap two edges, and it's still found to be a cycle.

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n1",
                    "object": "n0"
                },
                "e02": {
                    "subject": "n2",
                    "object": "n0"
                },
                "e03": {
                    "subject": "n2",
                    "object": "n1"
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MONDO:0033373"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                },
                "n2": {
                    "categories": ["biolink:Disease"]
                }
            }
        }
    }
}

a triangle with a line sticking out

You can switch the subject/object within an edge or swap two edges, and it's still found to be a cycle.

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n1",
                    "object": "n0"
                },
                "e02": {
                    "subject": "n1",
                    "object": "n2"
                },
                "e03": {
                    "subject": "n2",
                    "object": "n0"
                },
                "e04": {
                    "subject": "n2",
                    "object": "n3"
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MONDO:0033373"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                },
                "n2": {
                    "categories": ["biolink:Disease"]
                },
                "n3": {
                    "categories": ["biolink:Disease"]
                }
            }
        }
    }
}

square (like rjawesome's QG1)

You can switch the subject/object within an edge or swap two edges, and it's still found to be a cycle.

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e02": {
                    "subject": "n1",
                    "object": "n2"
                },
                "e03": {
                    "subject": "n2",
                    "object": "n3"
                },
                "e04": {
                    "subject": "n3",
                    "object": "n0"
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MONDO:0033373"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                },
                "n2": {
                    "categories": ["biolink:Disease"]
                },
                "n3": {
                    "categories": ["biolink:Disease"]
                }
            }
        }
    }
}

one of the issue's examples: another square

From this and this, I looked at the PK in ARAX and wrote up this QGraph

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e02": {
                    "subject": "n1",
                    "object": "n2"
                },
                "e03": {
                    "subject": "n2",
                    "object": "n3"
                },
                "e04": {
                    "subject": "n3",
                    "object": "n0"
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MONDO:0033373"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                },
                "n2": {
                    "categories": ["biolink:Disease"]
                },
                "n3": {
                    "categories": ["biolink:Disease"]
                }
            }
        }
    }
}

one of andrew's examples (another square)

From internal lab slack:

n0 --┬----> n1 ----┬--> n3
     └----> n2 ----┘

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e02": {
                    "subject": "n0",
                    "object": "n2"
                },
                "e03": {
                    "subject": "n1",
                    "object": "n3"
                },
                "e04": {
                    "subject": "n2",
                    "object": "n3"
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MONDO:0033373"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                },
                "n2": {
                    "categories": ["biolink:Disease"]
                },
                "n3": {
                    "categories": ["biolink:Disease"]
                }
            }
        }
    }
}


Correctly identified as not a cycle (continued execution):

"cross" not-a-triangle (like rjawesome's QG4)

Look carefully: it's actually n0 -> n3 <- n1 <- n2 (aka linear).

[Image: Screen Shot 2022-07-26 at 6 30 34 PM]

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n3"
                },
                "e02": {
                    "subject": "n2",
                    "object": "n1"
                },
                "e03": {
                    "subject": "n1",
                    "object": "n3"
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MONDO:0033373"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                },
                "n2": {
                    "categories": ["biolink:Disease"]
                },
                "n3": {
                    "categories": ["biolink:Disease"]
                }
            }
        }
    }
}

rjawesome's QG4 looks like this. It's also linear once you trace the path.

[Image: Screen Shot 2022-07-26 at 6 39 36 PM]

C2 and C3 from the Dec 2021 demo aren't cycles (I saw that they were executing, then stopped them since they run for a while).

@rjawesome
Contributor Author

rjawesome commented Jul 27, 2022

For the Disease ID <-> Disease query, my code works on the principle that direction doesn't matter, so n0 -> n1 is considered the same as n1 -> n0. The second edge therefore just leads back to the parent node, which the DFS ignores. If this is an issue, one way to handle it would be to flag an error if there are "duplicate" edges (where n0 -> n1 and n1 -> n0 are considered equivalent). Alternatively, this could be treated as a special case.

Side Note. I fully ran that query and it seems to end in an error:

TypeError: undefined is not iterable (cannot read property Symbol(Symbol.iterator))
at TrapiResultsAssembler.update (/mnt/c/Users/User/Documents/Scripps/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/query_results.js:239:55)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
at async TRAPIQueryHandler.query (/mnt/c/Users/User/Documents/Scripps/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/index.js:609:9)
at async Object.task [as query_v1] (/mnt/c/Users/User/Documents/Scripps/bte-trapi-workspace/packages/@biothings-explorer/bte-trapi/src/routes/v1/query_v1.js:34:13)
at async runTask (/mnt/c/Users/User/Documents/Scripps/bte-trapi-workspace/packages/@biothings-explorer/bte-trapi/src/controllers/threading/taskHandler.js:12:9)

From other testing, this seems to be the error that comes up when cycles are passed into the results assembler, so I believe this case does need to be handled.

Side Note 2. The actual reason it seems to be giving an error is that it expects there to be a node with one edge. Therefore, I think the best course of action is to throw an error if there are duplicate edges (where direction doesn't matter)
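A direction-insensitive duplicate-edge check along those lines could look roughly like the sketch below. It is only an illustration: the function name validateNoDuplicateEdges and the error message are assumptions, not what was committed in this PR.

```javascript
// Hypothetical helper (not the PR's implementation).
// Throws if two edges connect the same pair of nodes, ignoring direction.
function validateNoDuplicateEdges(queryGraph) {
  const seen = new Map(); // normalized node pair -> ID of the first edge seen
  for (const [edgeId, edge] of Object.entries(queryGraph.edges)) {
    // Sort the endpoints so n0 -> n1 and n1 -> n0 produce the same key.
    const key = JSON.stringify([edge.subject, edge.object].sort());
    if (seen.has(key)) {
      throw new Error(
        `Edges ${seen.get(key)} and ${edgeId} connect the same pair of nodes`
      );
    }
    seen.set(key, edgeId);
  }
}

// Usage with the shape of the Disease ID <-> Disease query graph.
const twoNodeLoop = {
  nodes: { n0: {}, n1: {} },
  edges: {
    e01: { subject: 'n0', object: 'n1' },
    e02: { subject: 'n1', object: 'n0' },
  },
};
try {
  validateNoDuplicateEdges(twoNodeLoop);
} catch (err) {
  console.log(err.message);
}
```

A single self-edge is not a duplicate under this check; it is left to the cycle validation.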

@colleenXu
Contributor

colleenXu commented Jul 27, 2022

I agree that this is a "duplicated edge" issue that ends up with QGraph-cycle-like problems!

Do you think a fix for this can be bundled into this PR / issue, or should it be something separate?

@rjawesome
Contributor Author

rjawesome commented Jul 27, 2022

Since the code would basically be in the same place and it causes a similar error, I can probably put it into this PR and add an additional test for that case.

Also, do you think it would be preferable to bundle this into _validateCycle or to create a separate function for validating duplicate edges?

@tokebe
Member

tokebe commented Jul 27, 2022

It's probably worth making it a separate function.

@rjawesome
Contributor Author

New commit should fix that query, and includes an additional test

@colleenXu
Contributor

looks good to me now! @tokebe

(again, I didn't fully execute anything, I only started querying and stopped execution if it looked like it passed the checkpoint for raising the "cycle" error...)
