-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add solutions for clique percolation #2
base: master
Are you sure you want to change the base?
Conversation
This pull request is for igraph/python-igraph#469. |
# If the two cliques share at least min_common_vertices, | ||
# they are part of the same community | ||
if c1 != c2 and len(cliques[c1].intersection(cliques[c2])) >= min_common_vertices: | ||
cg_edges.append((c1, c2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This double for loop can be simplified to a one-liner using itertools.combinations()
and Python list comprehensions.
# Remove all connected vertices | ||
for v in vertices: | ||
to_merge.remove(v) | ||
clique_communities.append(vertices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that instead of this while loop, you can simply call clique_graph.clusters()
, which gives you a VertexClustering
object that has all that you need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically you will need to iterate over the VertexClustering
to get the connected components of the clique graph.
Thanks a lot for your work! I have identified two places where the code can be simplified; it's good that you did it your way first because it highlights that some features already provided by igraph are not as accessible / discoverable as they should be. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dummy comment to make @vtraag's spurious review request go away :)
raise ValueError("Error: min_common_vertices=%d must be greater than k=%d" % (min_common_vertices, k)) | ||
|
||
# Each clique can be identified by it's index in 'cliques' | ||
cliques = list(map(set, k_cliques(g, k))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cliques = [set(clique) for clique in k_cliques(g, k)]
is probably more readable
@vtraag @szhorvat What's the policy for these usability benchmarks in terms of efficiency? For instance, the current version enumerates all k-cliques, creates the clique graph, and merges them, sticking to the original definition of the clique percolation problem to the letter. However, one may easily speed it up by recognizing that one can search for cliques of size at least k, because larger cliques consist of highly overlapping k-cliques anyway so they will eventually be merged in the merging step. This would also alleviate the need of definiing a separate |
I think the usability benchmarks should focus on how things can be achieved most easily, not necessarily focusing on performance. Having said that, if there are things that can trivially be done in a more performant way, that should have priority. The task is simply to emulate the original clique percolation, so I think the implementation should just follow that idea. |
What I think should be improved in this case is to simplify the code as much as possible. For example, I would not include code as a |
cg_edges = [(c1, c2) for c1, c2 in itertools.combinations(range(num_cliques), 2) if len(cliques[c1].intersection(cliques[c2])) >= min_common_vertices] | ||
|
||
# Add edges for clique graph | ||
clique_graph.add_edges(cg_edges) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you would also be able to simply construct a graph directly here:
clique_graph.add_edges(cg_edges) | |
clique_graph = ig.Graph(cg_edges) |
That way, you don't first need to construct a graph, and call add_vertices
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, I believe we are almost there! We probably can't simplify it much further. There are a few minor remarks that I still have.
|
||
EXAMPLE_K = 3 | ||
|
||
def k_communities(g, k, min_common_vertices): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would really call this clique_percolation
to be consistent with the usual name.
Given a graph and size k, return a list of communities found through clique | ||
percolation with at least min_common_vertices shared. | ||
A community is the connected component of cliques where two cliques overlap | ||
in at least k-1 vertices. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in at least k-1 vertices. | |
in at least min_common_vertices. |
percolation with at least min_common_vertices shared. | ||
A community is the connected component of cliques where two cliques overlap | ||
in at least k-1 vertices. | ||
Each community is a set of vertices in that community |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think lines 13-16 can be removed, because they don't add much in terms of explanation of the function.
connect vertices if the cliques share min_common_vertices | ||
""" | ||
if (min_common_vertices > k): | ||
raise ValueError("Error: min_common_vertices=%d must be greater than k=%d" % (min_common_vertices, k)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise ValueError("Error: min_common_vertices=%d must be greater than k=%d" % (min_common_vertices, k)) | |
raise ValueError(f"Error: min_common_vertices={min_common_vertices} must be greater than k={k}") |
# Merge cliques accordingly into communities | ||
cg_clusters = clique_graph.clusters() | ||
|
||
# Generate the list of clique communities as a list of set of vertices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to define communities = []
only here, because it is used only here.
Answering the general question: It is sometimes worth having more than one solution for the same task, each with comments about pros/cons. The goal is to gauge the usability of igraph and compare it to other libraries. In practical use, I might go with the easiest-to-write solution first. Such a solution is worth writing down. But soon I might find that this solution does not scale, and need to rewrite it completely to something much faster, but also more complex. This solution is also worth writing down. There would be a couple of comments by the author explaining why it's worth keeping both solutions and what the tradeoff is. The comment is all we need—this is not about performance benchmarking so two solutions make sense only when there is an obvious tradeoff. Perhaps another library has simple solutions, but no fast ones. Or it has fast solutions, but no simple ones. Or maybe the simplest solution happens to be the fastest one too. In order to make a full comparison, we will need to have more than one solution to some tasks. For example, with the recent clique visualization task, using If the I will certainly be showing multiple solutions with Mathematica because the performance / simplicity tradeoff is common. In fact, some of the functions I provide in IGraph/M make it possible to have both fast and simple code (simple was already possible with built-in Mathematica functions, but it traded off performance). This is what really illustrates one of the new possibilities that IGraph/M brings to plain Mathematica.
I didn't look at the code in this PR, but if we are looking for |
I wasn't being precise because I forgot to add that we should actually be looking for maximal cliques of at least size k. The point of the clique percolation problem is that you will eventually merge all those k-cliques into larger cliques in order to identify overlapping communities. So, if you have a 24-clique in your graph, and you are examining k-clique percolation with k=4, you will end up enumerating and listing all the 4-subcliques of that 24-clique only to merge them later. It is much more efficient to look for maximal cliques of size at least k because then you effectively do the merging "in advance". Sorry if this wasn't clear; I forgot to mention the "maximal" part in my comment. |
Addendum to my long comment above: in case it wasn't obvious, I'm a Tim Toady guy, but I realize that Python is not that kind of language :-) |
This commit is for the clique percolation benchmark (https://github.com/igraph/usability-benchmarks/blob/master/tasks/tasks.md#clique-percolation).
I'm working with @dkhl65, @yanchenm, and @ChenyanWang on this issue.