Graph representation and data structures #2

HarryMcCarney · 2023-06-07T09:10:40Z

HarryMcCarney
Jun 7, 2023
Maintainer

An important design choice for the library is to support multiple different data structures for graph representation. This is partly to make sure different graph types and algorithms have the most efficient structure, and also to encourage community contributions. We will define a core representation - probably a coordinate matrix - and all other representations, including those added by community contributions, must have functions to convert to and from this.

This discussion thread covers the choice of core representation and the other structures the lib will initially support. @mikk-c is defining a list of 5-10 structures while @DoganCK and @LibraChris are providing implementations. There are separate issues for these in sprint 1.

mikk-c · 2023-06-08T06:51:18Z

mikk-c
Jun 8, 2023
Maintainer

Graph

A Graph object should be able to represent most of the different types of networks people use.

Its basic components are: Nodes and Edges.

Nodes can have an arbitrary number of qualitative and quantitative attributes. Some attributes of a node are mandatory:

The name of a node (can be a string). If the user does not specify names, by default the name is assumed to be the index of the node.
The type of a node (can be a string). If the user does not specify types, by default all nodes have the same type.

A Graph object must contain exactly one Nodes object. We define the data structure for Nodes in more detail in further comments.

Edges can be stored in multiple formats (see further comments). A Graph must have at least one Edges object, but possibly more. Each Edges object defines one type of connection. The type of an Edges object should be flexible enough to represent different things: layers in a multilayer network, snapshots in a dynamic network, and couplings between network layers. We define the data structure for Edges in more detail in further comments.

We should not allow duplicate entries in the Edges object. Multigraphs with parallel but otherwise identical edges are logically equivalent to weighted graphs. If the parallel edges are distinct somehow, data should be stored as a multilayer network, i.e. via multiple Edges objects.

The Graph object should have a boolean flag indicating whether the Graph is directed or not. If set to false (i.e. the graph is undirected), then we do not need to store each edge as two distinct entries; we store only the upper triangular part of the matrix.

1 reply

mikk-c Jun 8, 2023
Maintainer

The user should be able to recover attributes (both for nodes and for edges) with an intelligible label. Users can provide this label as a string when creating/updating the Graph. Default should be a numeric index if the user does not specify anything.

mikk-c · 2023-06-08T07:04:12Z

mikk-c
Jun 8, 2023
Maintainer

Edges Structure 1: COO Matrix

This should be our default data structure, as it supports fast conversions to other data structures. Stores edge information in three arrays:

Row array: the i-th value in this array contains the row index of the i-th nonzero value.
Column array: the i-th value in this array contains the column index of the i-th nonzero value.
Data array: the i-th value in this array contains the i-th nonzero value.

Notes:

Entries should be sorted first by row index then by column index.
Edge attributes can be handled by having multiple Data arrays, each representing one attribute. Alternatively, Data could be a matrix with a column per attribute.

Example (note: everything is 0-indexed):

Row: [0,0,1,3,4]
Column: [2,4,1,4,4]
Data: [1,5,2,3,4]

Result:

[0,0,1,0,5]
[0,2,0,0,0]
[0,0,0,0,0]
[0,0,0,0,3]
[0,0,0,0,4]
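A minimal F# sketch of the three-array layout above, for illustration only. The `Coo` record and `cooToDense` are hypothetical names, not part of the library, and expanding to a dense matrix is only sensible for tiny examples like this one:

```fsharp
// Hypothetical record for illustration; not the library's actual type.
type Coo = { Row: int[]; Column: int[]; Data: float[] }

/// Expand a COO matrix into a dense n x n array.
let cooToDense (n: int) (m: Coo) : float[,] =
    let dense = Array2D.create n n 0.0
    // The k-th entry of the three arrays describes one nonzero cell.
    for k in 0 .. m.Data.Length - 1 do
        dense.[m.Row.[k], m.Column.[k]] <- m.Data.[k]
    dense

let coo =
    { Row = [| 0; 0; 1; 3; 4 |]
      Column = [| 2; 4; 1; 4; 4 |]
      Data = [| 1.; 5.; 2.; 3.; 4. |] }

let dense = cooToDense 5 coo
```

Here `dense.[0, 2]` holds 1.0 and `dense.[3, 4]` holds 3.0, matching the Result matrix.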

mikk-c · 2023-06-08T07:19:18Z

mikk-c
Jun 8, 2023
Maintainer

Edges Structure 2: Edge List (DOK Matrix)

This format might be more suitable for algorithmic operations that need to iterate over the edges. Stores edge information in a dictionary:

Each key of the dictionary is a tuple of node indices (row, column).
The values of the dictionary are the edge weights.

Notes:

Entries can be in an arbitrary order.
Edge attributes can be handled by having each value be a list of values. Alternatively, they can be stored in a separate dense matrix sorted consistently with the entries of the dictionary.

Example (note: everything is 0-indexed):

{
(0,2): 1,
(0,4): 5,
(1,1): 2,
(3,4): 3,
(4,4): 4,
}

Result:

[0,0,1,0,5]
[0,2,0,0,0]
[0,0,0,0,0]
[0,0,0,0,3]
[0,0,0,0,4]
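As a rough F# sketch of the same idea, assuming an immutable `Map` keyed by (row, column) tuples stands in for the dictionary (`dok`, `weight` and `totalWeight` are hypothetical names):

```fsharp
// Keys are (row, column) tuples, values are edge weights.
let dok : Map<int * int, float> =
    Map [ (0, 2), 1.
          (0, 4), 5.
          (1, 1), 2.
          (3, 4), 3.
          (4, 4), 4. ]

/// Weight of edge (i, j), or 0.0 when the edge is absent.
let weight (i: int) (j: int) : float =
    dok |> Map.tryFind (i, j) |> Option.defaultValue 0.0

/// Iterating over all edges is a plain traversal of the map.
let totalWeight : float =
    dok |> Map.toSeq |> Seq.sumBy snd
```

A single-edge lookup such as `weight 3 4` needs no scan over the other edges, which is the main appeal of this format.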

mikk-c · 2023-06-08T12:14:53Z

mikk-c
Jun 8, 2023
Maintainer

Edges Structure 3: Adjacency List (LIL Matrix)

This format might be more suitable for algorithmic operations that need to iterate over the nodes. Stores edge information in two lists:

The column list is a list of lists, where the i-th element is the list of column indices of the nonzero values in the i-th row.
The values list is a list of lists, where the i-th element is the list of nonzero values in the i-th row, in the same order as the corresponding column indices.

Notes:

The order of elements in the column and value list needs to be consistent and to respect the order of the rows.
Edge attributes can be handled by multiple values lists.

Example (note: everything is 0-indexed):

Column: [[2,4], [1], [], [4], [4]]
Values: [[1,5], [2], [], [3], [4]]

Result:

[0,0,1,0,5]
[0,2,0,0,0]
[0,0,0,0,0]
[0,0,0,0,3]
[0,0,0,0,4]
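A small F# sketch of why this layout suits node-centric access, using jagged arrays for the two lists (`columns`, `values` and `neighbours` are hypothetical names):

```fsharp
let columns = [| [| 2; 4 |]; [| 1 |]; [||]; [| 4 |]; [| 4 |] |]
let values  = [| [| 1.; 5. |]; [| 2. |]; [||]; [| 3. |]; [| 4. |] |]

/// Neighbours of node i paired with their edge weights.
/// This is one array access per list, with no scan over the other rows.
let neighbours (i: int) : (int * float)[] =
    Array.zip columns.[i] values.[i]

/// Out-degree of node i is simply the length of its column list.
let outDegree (i: int) : int = columns.[i].Length
```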

mikk-c · 2023-06-08T13:03:00Z

mikk-c
Jun 8, 2023
Maintainer

Edges Structure 4: CSR Matrix

This format is suitable for efficient (row-oriented) linear algebra operations. Stores edge information in three arrays:

Column array: the i-th value in this array contains the column index of the i-th non zero value.
Data array: the j-th value in this array contains the j-th non zero value.
Bounds array: the k-th value in this array stores the number of nonzero entries up until row k-1.

Notes:

Both Column and Data have length equal to the number of nonzero elements. Bounds, instead, has length equal to the number of rows plus one.
The first element of Bounds is always zero and the last is always the number of nonzero elements.
Edge attributes can be handled by multiple Data arrays. Alternatively, Data could be a matrix with a column per attribute.

Example (note: everything is 0-indexed):

Column: [2,4,1,4,4]
Data: [1,5,2,3,4]
Bounds: [0,2,3,3,4,5]

Result:

[0,0,1,0,5]
[0,2,0,0,0]
[0,0,0,0,0]
[0,0,0,0,3]
[0,0,0,0,4]
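To make the role of Bounds concrete, here is a hedged F# sketch (the function names are hypothetical). Row i occupies positions `bounds.[i]` to `bounds.[i + 1] - 1` of Column and Data, which is what makes row-oriented operations such as a matrix-vector product a single pass:

```fsharp
let column = [| 2; 4; 1; 4; 4 |]
let data   = [| 1.; 5.; 2.; 3.; 4. |]
let bounds = [| 0; 2; 3; 3; 4; 5 |]

/// Nonzero (column, value) pairs of row i.
let rowEntries (i: int) : (int * float)[] =
    [| for k in bounds.[i] .. bounds.[i + 1] - 1 -> column.[k], data.[k] |]

/// Matrix-vector product y = A x, visiting every nonzero exactly once.
let matVec (x: float[]) : float[] =
    Array.init (bounds.Length - 1) (fun i ->
        let mutable acc = 0.0
        for k in bounds.[i] .. bounds.[i + 1] - 1 do
            acc <- acc + data.[k] * x.[column.[k]]
        acc)
```

An empty row such as row 2 has `bounds.[2] = bounds.[3]`, so its range is empty and it contributes 0.0.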

mikk-c · 2023-06-08T13:14:16Z

mikk-c
Jun 8, 2023
Maintainer

Edges Structure 5: CSC Matrix

This format is suitable for efficient (column-oriented) linear algebra operations. Stores edge information in three arrays:

Row array: the i-th value in this array contains the row index of the i-th non zero value.
Data array: the j-th value in this array contains the j-th non zero value.
Bounds array: the k-th value in this array stores the number of nonzero entries up until column k-1.

Notes:

Both Row and Data have length equal to the number of nonzero elements. Bounds, instead, has length equal to the number of columns plus one.
The first element of Bounds is always zero and the last is always the number of nonzero elements.
Edge attributes can be handled by multiple Data arrays. Alternatively, Data could be a matrix with a column per attribute.

Example (note: everything is 0-indexed):

Row: [1,0,0,3,4]
Data: [2,1,5,3,4]
Bounds: [0,0,1,2,2,5]

Result:

[0,0,1,0,5]
[0,2,0,0,0]
[0,0,0,0,0]
[0,0,0,0,3]
[0,0,0,0,4]
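One way to see how the three CSC arrays relate to the COO arrays of Structure 1 is the following F# sketch. `cooToCsc` is a hypothetical helper and this is just one straightforward way to do the conversion: sort the entries by column, then row, and take a running sum of the per-column counts to get Bounds.

```fsharp
/// Convert COO arrays to CSC (Row, Data, Bounds).
let cooToCsc (nCols: int) (rows: int[]) (cols: int[]) (data: float[]) =
    // Positions of the entries, ordered by column first and row second.
    let order =
        Array.init data.Length id
        |> Array.sortBy (fun k -> cols.[k], rows.[k])
    // Number of nonzero entries in each column.
    let counts : int[] = Array.zeroCreate nCols
    for c in cols do
        counts.[c] <- counts.[c] + 1
    // Running sum: bounds.[k] = number of nonzeros in columns 0 .. k-1.
    let bounds = Array.scan (+) 0 counts
    let cscRows = order |> Array.map (fun k -> rows.[k])
    let cscData = order |> Array.map (fun k -> data.[k])
    cscRows, cscData, bounds

let cscRow, cscData, cscBounds =
    cooToCsc 5 [| 0; 0; 1; 3; 4 |] [| 2; 4; 1; 4; 4 |] [| 1.; 5.; 2.; 3.; 4. |]
```

Applied to the COO example, this reproduces the Row, Data and Bounds arrays shown above.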

DoganCK · 2023-06-08T16:14:17Z

DoganCK
Jun 8, 2023
Maintainer

Hi @mikk-c ,

Thanks for the list of detailed data structures.

I'm sympathetic to using indices as first-class citizens for:
a) fast lookups
b) core data structure being closer to a matrix representation.

The downside is that it makes removing nodes and edges expensive. I.e., when a node is removed, every edge in which a subsequent node appears needs to be updated with the new index. As far as I can tell, graph-tool runs into the same issue, where it's not straightforward to remove a node. Nevertheless, I think relying on indices is a net positive.
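To illustrate the cost, here is a rough F# sketch of node removal on an index-based adjacency list. `removeNode` is a hypothetical helper and plain `int` indices are used for brevity. Every remaining edge list has to be visited so that edges into the removed node are dropped and higher indices are shifted down by one:

```fsharp
/// Remove node ix from an adjacency list of (target index, weight) pairs.
let removeNode (ix: int) (edges: (int * float)[][]) : (int * float)[][] =
    edges
    |> Array.indexed
    |> Array.filter (fun (i, _) -> i <> ix)          // drop the node's own edge list
    |> Array.map (fun (_, es) ->
        es
        |> Array.filter (fun (dst, _) -> dst <> ix)  // drop edges pointing at the removed node
        |> Array.map (fun (dst, w) ->
            (if dst > ix then dst - 1 else dst), w)) // re-index the subsequent nodes

let before =
    [| [| (2, 1.); (4, 5.) |]; [| (1, 2.) |]; [||]; [| (4, 3.) |]; [| (4, 4.) |] |]

let after = removeNode 1 before
```

The work is proportional to the total number of edges, not to the degree of the removed node.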

I've implemented some toy version of the options you listed as well as some others. The main problem with sparse matrices I've checked out so far is that edge lookups are expensive. That is, in order to find all edges for an arbitrary node, one needs to iterate over all edges in the graph. This can be optimized with some acrobatics but then the worry is this will hinder development speed and make the code unmaintainable.

I've been looking into graph-tool's implementation as a reference for a fast library. Tiago also treats indices as first class citizens and uses adjacency lists. Adjacency lists provide speed for node specific lookups. In fact, for directed graphs, he keeps two different lists for OutEdges and InEdges. Granted, InEdges can be derived from OutEdges, but the lookup speed is apparently worth the extra memory space.

So for the core data structure I'm drawn to adjacency lists similar to what you outlined in Edges 3:
a) Because of fast lookups, it lends itself easily to exploratory work before getting into heavy duty linear algebraic operations.
b) It's relatively easier to parse by humans.
c) It's relatively straightforward to convert to other data types.
d) It's versatile to be used in different types of graphs (see MultiGraph in the code below)

Here's a simplified representation of what it would look like. It's basically a zipped version of your LIL structure with Row (OutEdges) and Column (InEdges) views.

// Node is a generic type so that it can carry an arbitrary amount of information, as Michele suggests.
// Edges is an array of arrays where the index of each inner array corresponds to the index of the corresponding node in the Nodes array.
// So the 0th array in Edges contains the edges belonging to the node with index 0.

open System.Collections.Generic // needed for Dictionary in MultiGraph

[<Measure>] // Units of measure are a zero-cost F# feature that enhances readability
type NodeIx

type Graph<'Node> = { // 'Node is a generic type parameter resolved at compile time. Could be int, string, object, etc.
    IdMap: Map<string, int<NodeIx>>
    Nodes: 'Node []
    Edges: (int<NodeIx> * float) [] []
}

// Directed
// InEdges can be derived from OutEdges but then lookup times for InEdges would suffer.
// InEdges can be an option which the user might want to opt out of.
type DiGraph<'Node> = {
    IdMap: Map<string, int<NodeIx>>
    Nodes: 'Node []
    OutEdges: (int<NodeIx> * float) [] [] 
    InEdges: (int<NodeIx> * float) [] [] 
}

// MultiGraph
// Tentative; to demonstrate similar data structures can be used without much modification.
type MultiGraph<'Node, 'Label> = {
    IdMap: Map<string, int<NodeIx>>
    Nodes: 'Node []
    OutEdges: Dictionary<'Label, (int<NodeIx> * float) [] []>
    InEdges: Dictionary<'Label, (int<NodeIx> * float) [] []> 
}

// So the following array
// [0,0,1,0,5]
// [0,2,0,0,0]
// [0,0,0,0,0]
// [0,0,0,0,3]
// [0,0,0,0,4]
// would be represented as follows:

let d: DiGraph<int> = {
    IdMap = [("zero", 0<NodeIx>); ("one", 1<NodeIx>); ("two", 2<NodeIx>); ("three", 3<NodeIx>); ("four", 4<NodeIx>)] |> Map
    Nodes = [|0; 1; 2; 3; 4|]
    OutEdges = [|
                    [|(2<NodeIx>, 1.); (4<NodeIx>, 5.)|]
                    [|(1<NodeIx>, 2.)|]
                    [||]
                    [|(4<NodeIx>, 3.)|]
                    [|(4<NodeIx>, 4.)|]
                |]
    InEdges = [|
                    [||]
                    [|(1<NodeIx>, 2.)|]
                    [|(0<NodeIx>, 1.)|]
                    [||]
                    [|(0<NodeIx>,5.); (3<NodeIx>, 3.); (4<NodeIx>, 4.)|]
                |]
}
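For reference, deriving InEdges from OutEdges could look roughly like the sketch below. `deriveInEdges` is a hypothetical helper, and plain `int` indices are used instead of `int<NodeIx>` to keep it short:

```fsharp
/// Build the in-edge lists by scattering every out-edge (src -> dst) into dst's bucket.
let deriveInEdges (outEdges: (int * float)[][]) : (int * float)[][] =
    let buckets = Array.init outEdges.Length (fun _ -> ResizeArray<int * float>())
    outEdges |> Array.iteri (fun src edges ->
        for (dst, w) in edges do
            buckets.[dst].Add((src, w)))
    // Sources are visited in increasing order, so each bucket ends up sorted by source index.
    buckets |> Array.map (fun b -> b.ToArray())

let outEdges =
    [| [| (2, 1.); (4, 5.) |]; [| (1, 2.) |]; [||]; [| (4, 3.) |]; [| (4, 4.) |] |]

let inEdges = deriveInEdges outEdges
```

This is a full pass over all edges, which is why keeping a precomputed InEdges can be worth the extra memory when in-neighbour lookups are frequent.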

The IdMap serves to hide the index-level representation from the user. Needs a little bit more thinking.
There is a bit more work to do on the underlying array structure.

Looking forward to feedback and suggestions.

5 replies

mikk-c Jun 9, 2023
Maintainer

Hey @DoganCK,

thanks for this. Yes, this is essentially structure 3 (LIL) and it should be fine as the basic data structure. I see one advantage of structure 1 (COO) -- see below -- but maybe it is not worth preferring COO over LIL because edge lookup is probably the most used operation.

Question: do I understand correctly that Edges in Graph still stores both directions?

What I think needs to be developed is how we store node and edge attributes. In both cases I was envisioning dense non-square matrices. Doing so should make things easier when down the line we will want to move them to the GPU as tensors to do GNNs, and I don't see a downside for normal lookup operations: you can always access all attribute values for a node by accessing its row, and the distribution of one attribute values over the network by accessing its column.

If we do it like this, it will highlight one case in which COO might be better than LIL, for some use cases. Let's suppose V is our node list, E is the list of all edges, and A is our list of node attributes, and B is our list of edge attributes.

Node attributes can be stored in a |V| X |A| matrix. The rows of the matrix are sorted consistently with V, so the first refers to node 0, the second to node 1, etc. The first column of this matrix has the value of the first attribute for node 0, and so on. This is essentially the same no matter the Edges data structure we use.

For edge attributes we could do the same thing: a |E| X |B| matrix. In this case, if we have a COO matrix for storing the edges we also have a natural and immediate order for E, which this matrix can honor, and allows for fast lookup of edge attributes. In LIL, I'm not sure there is a natural E order (am I wrong?).
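A small F# sketch of the two attribute matrices, with made-up values and hypothetical names, assuming the edges are stored in COO order so that row k of the edge attribute matrix belongs to the k-th COO entry:

```fsharp
// |V| x |A| node attribute matrix: 3 nodes, 2 attributes (values are invented).
let nodeAttrs = array2D [ [ 1.0; 10.0 ]; [ 2.0; 20.0 ]; [ 3.0; 30.0 ] ]

/// All attribute values of node i: one row.
let attrsOfNode (i: int) : float[] = nodeAttrs.[i, *]

/// The distribution of attribute j over the network: one column.
let attrOverNodes (j: int) : float[] = nodeAttrs.[*, j]

// COO edge arrays: the position k in these arrays is the natural edge order.
let row    = [| 0; 0; 1 |]
let column = [| 2; 1; 2 |]

// |E| x |B| edge attribute matrix, row k describes edge (row.[k], column.[k]).
let edgeAttrs = array2D [ [ 1.0; 0.5 ]; [ 5.0; 0.1 ]; [ 2.0; 0.9 ] ]

/// All attribute values of the k-th edge.
let attrsOfEdge (k: int) : float[] = edgeAttrs.[k, *]
```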

Of course, we don't have to store attributes like this, I see in your current work a different approach, but I think using matrices for this can have long term advantages, at the price of a little implementation pain now.

Let me know what you think.

DoganCK Jun 9, 2023
Maintainer

Thanks for the feedback @mikk-c.

For undirected graphs, I didn’t think it would be necessary to store In- and OutEdges separately because they are the same. No?

Edge Lookups
Here’s how I understand this problem.

Suppose we have the following COO matrix.
[0,0,1,3,3,4] // Row
[2,4,1,2,4,4] // Column
[1,5,2,6,3,4] // Data

And we want to get the data for Edge(3,4).
We first need to identify the range in Row where the value is 3.
[-,-,-,+,+,-]
[0,0,1,3,3,4] // Row

We then need to find the value 4 in Column within the previously identified range.
[ , , ,-,+, ]
[2,4,1,2,4,4] // Column

Now we have identified that the Edge(3,4) has the index of 4 and a value of 3 in Data.

What LIL does is that it circumvents the first step reducing that index range search to constant time.

What we could do to optimize COO regarding the first step would be to maintain an index range. For the matrix above this would look like the following, where (-1,-1) represents absence:

Ix 0     1      2      3     4
[(0,1),(2,2),(-1,-1),(3,4),(5,5)]
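A hedged F# sketch of this index-range idea, assuming Row is sorted as in the example (`rowRanges` and `tryEdge` are hypothetical names):

```fsharp
let row    = [| 0; 0; 1; 3; 3; 4 |]
let column = [| 2; 4; 1; 2; 4; 4 |]
let data   = [| 1.; 5.; 2.; 6.; 3.; 4. |]

/// For each node, the inclusive (first, last) positions of its entries in Row,
/// or (-1, -1) when the node has no outgoing edges. Assumes Row is sorted.
let rowRanges (nNodes: int) (rows: int[]) : (int * int)[] =
    let ranges = Array.create nNodes (-1, -1)
    rows |> Array.iteri (fun k r ->
        let first, _ = ranges.[r]
        ranges.[r] <- ((if first < 0 then k else first), k))
    ranges

/// Look up edge (i, j): jump to row i's range, then scan only that range in Column.
let tryEdge (ranges: (int * int)[]) (i: int) (j: int) : float option =
    match ranges.[i] with
    | (first, _) when first < 0 -> None
    | (first, last) ->
        seq { first .. last }
        |> Seq.tryFind (fun k -> column.[k] = j)
        |> Option.map (fun k -> data.[k])

let ranges = rowRanges 5 row
```

With these arrays, `tryEdge ranges 3 4` finds position 4 and returns `Some 3.0`, as in the walkthrough above.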

I will dig into this further.

Node & Edge Attributes

I see the advantages of COO when multiple node and edge attributes are involved.

This is also achievable with LIL fairly straightforwardly, where instead of having a single value per edge we can have a list of values.

Do you have an example of a canonical but simple use-case that has multiple node & edge attributes? It would help thinking about the API implications of having multiple attributes.

Natural Edge Order
Let’s take OutEdges representation in LIL. The order of the nested lists already mimics the order of nodes. So it’s first ordered by origin node. As for what goes on inside of those lists, I don’t think there is a hard & fast rule that they are ordered according to the destination node indices. But everything I’ve created so far naturally follows that format. And we can force it as a rule, if need be.

As I mentioned above I’ll look into a fix for COO.

As a final update I did a quick benchmark of converting from LIL to COO. And a 1% sparsity, 100k node, 5-attribute graph can be converted within 2 seconds. So we have options :)
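The conversion itself is short. A minimal sketch, not the benchmarked code, with a hypothetical `lilToCoo` and plain `int` indices, could be:

```fsharp
/// Flatten an adjacency list of (target, weight) pairs into COO (Row, Column, Data) arrays.
let lilToCoo (edges: (int * float)[][]) : int[] * int[] * float[] =
    let triples =
        [| for src in 0 .. edges.Length - 1 do
               for (dst, w) in edges.[src] do
                   yield (src, dst, w) |]
    Array.unzip3 triples

let rows, cols, data =
    lilToCoo [| [| (2, 1.); (4, 5.) |]; [| (1, 2.) |]; [||]; [| (4, 3.) |]; [| (4, 4.) |] |]
```

If each inner list is kept sorted by destination index, the output is already sorted by row and then by column, which is the entry order the COO description asks for.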

mikk-c Jun 12, 2023
Maintainer

> For undirected graphs, I didn’t think it would be necessary to store In- and OutEdges separately because they are the same. No?

Correct, I think there should be one Edges object, not separate InEdges and OutEdges. However, I was wondering whether the presence of an edge connecting node 0 and node 1 in the network leads to two entries in Edges (putting node 1 in node 0's list and node 0 in node 1's list) or only one.

> This is also achievable with LIL fairly straightforwardly, where instead of having a single value per edge we can have a list of values.

I know that in numpy some advantages come from storing arrays in contiguous memory. It feels like storing attributes as lists of values per edge won't make this possible. But perhaps you're thinking that for LIL it doesn't matter, because it is for data conversion and algorithmic approaches. Then, when we implement the CSR/CSC solutions, which are all about computational efficiency, we can switch to contiguous arrays?

> Do you have an example of a canonical but simple use-case that has multiple node & edge attributes?

I don't have one readily available, but I can think of many scenarios in which it would make sense. For instance, in a trade network connecting countries trading with each other, we have as node attributes the GDP of a country for a given set of years, and each edge has, as attributes, the trade volume between the two countries per year.

Should I just make up a small synthetic dataset for your tests?

> we can force it as a rule, if need be.

Yes, I think we should. If we have a fixed and guaranteed order for Edges in LIL, then this might make things easier down the line when converting data formats.

> As a final update I did a quick benchmark of converting from LIL to COO. And a 1% sparsity, 100k node, 5-attribute graph can be converted within 2 seconds. So we have options :)

1% sparsity at 100k nodes is 50M edges, right? That's pretty cool!

DoganCK Jun 12, 2023
Maintainer

Hi again,

I've looked into the implementation with index ranges I mentioned earlier. Updating the graph becomes more expensive because each time an edge is added the index ranges need to be updated. Also, currently I don't think we have the infrastructure to support a matrix that can be updated frequently in a performant manner.

At H&C, we have some pressing issues that we need to look at for which LIL will provide a neat interface. So LIL shouldn't be a terrible starting point.

The LIL structure we've been talking about is in between the "dictionaries all the way down" approach that networkx takes and a sparse matrix approach in terms of being index based rather than key-based. So, working with it on a real-world problem will provide the best insights about the up- and downsides.

Having a working example will also shed light on what kind of API we want for the library.

There's a lot of quirks to be figured out with the LIL approach as well, so it will at least help us lay the conceptual and technical groundwork for the long-term goals, even if we don't end up using it long-term.

mikk-c Jun 13, 2023
Maintainer

This is good for me, I wasn't actually objecting to LIL at all, I agree it's a perfect starting point. I'm just pointing out potential things that we need to take into account to be in good shape medium-long term.

Go for it!
