Graph representation and data structuress #2
Replies: 7 comments 6 replies
-
GraphA Graph object should be able to represent most of the different types of networks people use. Its basic components are: Nodes and Edges. Nodes can have an arbitrary number of qualitative and quantitative attributes. Some attributes of a node are mandatory:
A Graph object must contain exactly one Nodes object. We define the data structure for Nodes more in detail in further comments. Edges can be stored in multiple formats (see further comments). A Graph must have at least one Edges object, but possibly more. Each Edges object defines one type of connection. The type of an Edges object should be flexible enough to represent different things: layers in a multilayer networks, snapshots in a dynamic network, and couplings between network layers. We define the data structure for Edges more in detail in further comments. We should not allow duplicate entries in the Edges object. Multigraphs with parallel but otherwise identical edges are logically equivalent to weighted graphs. If the parallel edges are distinct somehow, data should be stored as a multilayer network, i.e. via multiple Edges objects. The Graph object should have a boolean flag indicating whether the Graph is directed or not. If set to true, then we do not need to store each edge as two distinct entries, but we store only the upper triangular part of the matrix. |
Beta Was this translation helpful? Give feedback.
-
Edges Structure 1: COO MatrixThis should be our default data structure, as it support fast conversions to other data structures. Stores edge information in three arrays:
Notes:
Example (note: everything is 0-indexed): Row: [0,0,1,3,4] Result: [0,0,1,0,5] |
Beta Was this translation helpful? Give feedback.
-
Edges Structure 2: Edge List (DOK Matrix)This format might be more suitable for algorithmic operations that need to iterate over the edges. Stores edge information in a dictionary:
Notes:
Example (note: everything is 0-indexed): { Result: [0,0,1,0,5] |
Beta Was this translation helpful? Give feedback.
-
Edges Structure 3: Adjacency List (LIL Matrix)This format might be more suitable for algorithmic operations that need to iterate over the nodes. Stores edge information in two lists:
Notes:
Example (note: everything is 0-indexed): Column: [[2,4], [1], [], [4], [4]] Result: [0,0,1,0,5] |
Beta Was this translation helpful? Give feedback.
-
Edges Structure 4: CSR MatrixThis format is suitable for efficient (row-oriented) linear algebra operations. Stores edge information in three arrays:
Notes:
Example (note: everything is 0-indexed): Column: [2,4,1,4,4] Result: [0,0,1,0,5] |
Beta Was this translation helpful? Give feedback.
-
Edges Structure 5: CSC MatrixThis format is suitable for efficient (column-oriented) linear algebra operations. Stores edge information in three arrays:
Notes:
Example (note: everything is 0-indexed): Row: [1,0,0,3,4] Result: [0,0,1,0,5] |
Beta Was this translation helpful? Give feedback.
-
Hi @mikk-c , Thanks for the list of detailed data structures. I'm sympathetic to using indices as first-class citizens for: The downside of it is that it makes removing nodes and edges expensive operations. Ie when a node is removed, every edge the subsequent nodes appear in need to be updated with the new index. As far as I can tell, graph-tool runs into the same issue where it's not straightforward to remove a node. Nevertheless, I think relying on indices is a net positive. I've implemented some toy version of the options you listed as well as some others. The main problem with sparse matrices I've checked out so far is that edge lookups are expensive. That is, in order to find all edges for an arbitrary node, one needs to iterate over all edges in the graph. This can be optimized with some acrobatics but then the worry is this will hinder development speed and make the code unmaintainable. I've been looking into graph-tool's implementation as a reference for a fast library. Tiago also treats indices as first class citizens and uses adjacency lists. Adjacency lists provide speed for node specific lookups. In fact, for directed graphs, he keeps two different lists for OutEdges and InEdges. Granted, InEdges can be derived from OutEdges, but the lookup speed is apparently worth the extra memory space. So for the core data structure I'm drawn to adjacency lists similar to what you outlined in Edges 3: Here's a simplified representation of what it would look like. It's basically a zipped version of your LIL structure with Row (OutEdges) and Column (InEdges) views. // Node is a generic type so that it can have an arbitrary amount of information as Michele suggests
// Edges is an array of arrays where the index of the inner array corresponds to the index of the corresponding node in the Nodes array.
// So the 0th array in Edges contains the edges belonging to the node with the same index.
[<Measure>] // This is a 0-cost F# feature that enhances readability
type NodeIx
type Graph<'Node> = { // 'Node is a generic type to be identified on runtime. Could be int, string, object, etc.
IdMap: Map<string, int<NodeIx>>
Nodes: 'Node []
Edges: (int<NodeIx> * float) [] []
}
// Directed
// InEdges can be derived from OutEdges but then lookup times for InEdges would suffer.
// InEdges can be an option which the user might want to opt out of.
type DiGraph<'Node> = {
IdMap: Map<string, int<NodeIx>>
Nodes: 'Node []
OutEdges: (int<NodeIx> * float) [] []
InEdges: (int<NodeIx> * float) [] []
}
// MultiGraph
// Tentative; to demonstrate similar data structures can be used without much modification.
type MultiGraph<'Node, 'Label> = {
IdMap: Map<string, int<NodeIx>>
Nodes: 'Node []
OutEdges: Dictionary<'Label, (int<NodeIx> * float) [] []>
InEdges: Dictionary<'Label, (int<NodeIx> * float) [] []>
}
// So the following array
// [0,0,1,0,5]
// [0,2,0,0,0]
// [0,0,0,0,0]
// [0,0,0,0,3]
// [0,0,0,0,4]
// would be represented as follows:
let d: DiGraph<int> = {
IdMap = [("zero", 0<NodeIx>); ("one", 1<NodeIx>); ("two", 2<NodeIx>); ("three", 3<NodeIx>); ("four", 4<NodeIx>)] |> Map
Nodes = [|0; 1; 2; 3; 4|]
OutEdges = [|
[|(2<NodeIx>, 1.); (4<NodeIx>, 5.)|]
[|(1<NodeIx>, 2.)|]
[||]
[|(4<NodeIx>, 3.)|]
[|(4<NodeIx>, 4.)|]
|]
InEdges = [|
[||]
[|(1<NodeIx>, 2.)|]
[|(0<NodeIx>, 1.)|]
[||]
[|(0<NodeIx>,5.); (3<NodeIx>, 3.); (4<NodeIx>, 4.)|]
|]
}
The IdMap serves to hide the index-level representation from the user. Needs a little bit more thinking. Looking forward to feedback and suggestions. |
Beta Was this translation helpful? Give feedback.
-
An important design choice for the library is to support multiple different data structures for graph representation. This is partly to make sure different graph types and algorithms have the most efficient structure, and also to encourage community contributions. We will define a core representation - probably a coordinate matrix - and all other representations, including those added by community contributions, must have functions to convert to and from this.
This discussion thread covers the choice of core representation and the other structures the lib will initially support. @mikk-c is defining a list of 5-10 structures while @DoganCK and @LibraChris are providing implementations. There are separate issues for these in sprint 1.
Beta Was this translation helpful? Give feedback.
All reactions