Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add solutions for clique percolation #2

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

jamesjiang52
Copy link

This commit is for the clique percolation benchmark (https://github.com/igraph/usability-benchmarks/blob/master/tasks/tasks.md#clique-percolation).

I'm working with @dkhl65, @yanchenm, and @ChenyanWang on this issue.

@dkhl65
Copy link

dkhl65 commented Dec 15, 2021

This pull request is for igraph/python-igraph#469.

# If the two cliques share at least min_common_vertices,
# they are part of the same community
if c1 != c2 and len(cliques[c1].intersection(cliques[c2])) >= min_common_vertices:
cg_edges.append((c1, c2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This double for loop can be simplified to a one-liner using itertools.combinations() and Python list comprehensions.

# Remove all connected vertices
for v in vertices:
to_merge.remove(v)
clique_communities.append(vertices)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that instead of this while loop, you can simply call clique_graph.clusters(), which gives you a VertexClustering object that has all that you need.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically you will need to iterate over the VertexClustering to get the connected components of the clique graph.

@vtraag vtraag requested a review from ntamas December 15, 2021 22:08
@ntamas
Copy link
Member

ntamas commented Dec 15, 2021

Thanks a lot for your work! I have identified two places where the code can be simplified; it's good that you did it your way first because it highlights that some features already provided by igraph are not as accessible / discoverable as they should be.

Copy link
Member

@ntamas ntamas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dummy comment to make @vtraag's spurious review request go away :)

raise ValueError("Error: min_common_vertices=%d must be greater than k=%d" % (min_common_vertices, k))

# Each clique can be identified by it's index in 'cliques'
cliques = list(map(set, k_cliques(g, k)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cliques = [set(clique) for clique in k_cliques(g, k)] is probably more readable

@ntamas
Copy link
Member

ntamas commented Dec 16, 2021

@vtraag @szhorvat What's the policy for these usability benchmarks in terms of efficiency? For instance, the current version enumerates all k-cliques, creates the clique graph, and merges them, sticking to the original definition of the clique percolation problem to the letter. However, one may easily speed it up by recognizing that one can search for cliques of size at least k, because larger cliques consist of highly overlapping k-cliques anyway so they will eventually be merged in the merging step. This would also alleviate the need of definiing a separate k_clique function for searching for cliques of exactly size k because g.cliques(min_size=k) would do the job just fine.

@vtraag
Copy link
Member

vtraag commented Dec 16, 2021

I think the usability benchmarks should focus on how things can be achieved most easily, not necessarily focusing on performance. Having said that, if there are things that can trivially be done in a more performant way, that should have priority. The task is simply to emulate the original clique percolation, so I think the implementation should just follow that idea.

@vtraag
Copy link
Member

vtraag commented Dec 16, 2021

What I think should be improved in this case is to simplify the code as much as possible. For example, I would not include code as a __main__ function. Also the separate k_cliques function is not necessary I think.

cg_edges = [(c1, c2) for c1, c2 in itertools.combinations(range(num_cliques), 2) if len(cliques[c1].intersection(cliques[c2])) >= min_common_vertices]

# Add edges for clique graph
clique_graph.add_edges(cg_edges)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you would also be able to simply construct a graph directly here:

Suggested change
clique_graph.add_edges(cg_edges)
clique_graph = ig.Graph(cg_edges)

That way, you don't first need to construct a graph, and call add_vertices.

Copy link
Member

@vtraag vtraag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, I believe we are almost there! We probably can't simplify it much further. There are a few minor remarks that I still have.


EXAMPLE_K = 3

def k_communities(g, k, min_common_vertices):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really call this clique_percolation to be consistent with the usual name.

Given a graph and size k, return a list of communities found through clique
percolation with at least min_common_vertices shared.
A community is the connected component of cliques where two cliques overlap
in at least k-1 vertices.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
in at least k-1 vertices.
in at least min_common_vertices.

percolation with at least min_common_vertices shared.
A community is the connected component of cliques where two cliques overlap
in at least k-1 vertices.
Each community is a set of vertices in that community
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think lines 13-16 can be removed, because they don't add much in terms of explanation of the function.

connect vertices if the cliques share min_common_vertices
"""
if (min_common_vertices > k):
raise ValueError("Error: min_common_vertices=%d must be greater than k=%d" % (min_common_vertices, k))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise ValueError("Error: min_common_vertices=%d must be greater than k=%d" % (min_common_vertices, k))
raise ValueError(f"Error: min_common_vertices={min_common_vertices} must be greater than k={k}")

# Merge cliques accordingly into communities
cg_clusters = clique_graph.clusters()

# Generate the list of clique communities as a list of set of vertices
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to define communities = [] only here, because it is used only here.

@szhorvat
Copy link
Member

szhorvat commented Dec 16, 2021

@vtraag @szhorvat What's the policy for these usability benchmarks in terms of efficiency?

Answering the general question:

It is sometimes worth having more than one solution for the same task, each with comments about pros/cons.

The goal is to gauge the usability of igraph and compare it to other libraries. In practical use, I might go with the easiest-to-write solution first. Such a solution is worth writing down. But soon I might find that this solution does not scale, and need to rewrite it completely to something much faster, but also more complex. This solution is also worth writing down. There would be a couple of comments by the author explaining why it's worth keeping both solutions and what the tradeoff is. The comment is all we need—this is not about performance benchmarking so two solutions make sense only when there is an obvious tradeoff.

Perhaps another library has simple solutions, but no fast ones. Or it has fast solutions, but no simple ones. Or maybe the simplest solution happens to be the fastest one too. In order to make a full comparison, we will need to have more than one solution to some tasks.

For example, with the recent clique visualization task, using VertexCover was a nice choice, and it's wroth recording. But IMO it's not suitable for every use case because the cover, as drawn, does not make it clear which vertex is part of the clique and which isn't. Sometimes, the a single shaded region on the plot seems to have five vertices within, even though we are visualizing 4-cliques.

If the VertexCover solution is easy to use, and allows for compact code or for a quick-and-dirty visualization, it should be shown. But it's clearly not suitable for all use cases, so a more complicated solution that colours vertices that are part of the clique should also be shown.

I will certainly be showing multiple solutions with Mathematica because the performance / simplicity tradeoff is common. In fact, some of the functions I provide in IGraph/M make it possible to have both fast and simple code (simple was already possible with built-in Mathematica functions, but it traded off performance). This is what really illustrates one of the new possibilities that IGraph/M brings to plain Mathematica.

For instance, the current version enumerates all k-cliques, creates the clique graph, and merges them, sticking to the original definition of the clique percolation problem to the letter. However, one may easily speed it up by recognizing that one can search for cliques of size at least k, because larger cliques consist of highly overlapping k-cliques anyway so they will eventually be merged in the merging step. This would also alleviate the need of definiing a separate k_clique function for searching for cliques of exactly size k because g.cliques(min_size=k) would do the job just fine.

I didn't look at the code in this PR, but if we are looking for k-cliques, why don't we set both the min and max size to k to only obtain the correct sizes?

@ntamas
Copy link
Member

ntamas commented Dec 16, 2021

I didn't look at the code in this PR, but if we are looking for k-cliques, why don't we set both the min and max size to k to only obtain the correct sizes?

I wasn't being precise because I forgot to add that we should actually be looking for maximal cliques of at least size k. The point of the clique percolation problem is that you will eventually merge all those k-cliques into larger cliques in order to identify overlapping communities. So, if you have a 24-clique in your graph, and you are examining k-clique percolation with k=4, you will end up enumerating and listing all the 4-subcliques of that 24-clique only to merge them later. It is much more efficient to look for maximal cliques of size at least k because then you effectively do the merging "in advance". Sorry if this wasn't clear; I forgot to mention the "maximal" part in my comment.

@szhorvat
Copy link
Member

Addendum to my long comment above: in case it wasn't obvious, I'm a Tim Toady guy, but I realize that Python is not that kind of language :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants