Suppose you have a very large graph, for example a protein-protein interaction graph for some complicated synthesis pathway. To draw conclusions from such a wealth of data, it is typically helpful to look for certain patterns, or subgraphs.
Finding, or indeed even counting, small subgraphs in a large graph seems to be a very hard problem: the number of candidate positions against which you need to match your subgraph explodes combinatorially. The situation is even more hopeless if the data doesn’t fit into main memory anymore.
To address the problem of working with graphs that don’t fit into RAM, the streaming model was introduced. Here the algorithm has a very limited amount of storage, say $O(\log n)$ bits for a graph on $n$ vertices, and the input is presented as a stream of edges. The algorithm is only allowed to make a small number of passes over the data.
Only a few problems can be solved exactly in this model, but often it is possible to find good approximation algorithms. In this post we will have a look at how to count triangles in the streaming model. Very similar techniques can be used to count arbitrary constant-size subgraphs.[1]
Triangle counting is already a relevant problem in itself. It has applications, for example, in community detection in graphs. There the clustering coefficient, i.e. how many of a node’s neighbours are adjacent to each other (and thus form triangles), plays a central role. Indeed there are several papers [2] on counting triangles using MapReduce as a way to tackle the large networks we face today. The streaming algorithm I’ll present here is a simple alternative to this.
This algorithm is from a paper by Jowhari and Ghodsi [3]. It is extremely simple. We need a source of sufficiently independent random bits, say an explicit polynomial generator of degree 12. That is a polynomial $p$ over some sufficiently large field with random coefficients. The random numbers are then $p(0), p(1), p(2), \dots$ and we use their binary representation for a stream of random numbers $x_1, x_2, \dots$, with $x_i \in \{-1, 1\}$, one for each vertex. Note that we can encode this generator using only a logarithmic number of bits for the coefficients.
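A generator of this kind can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the construction from the paper: the field is the integers modulo the Mersenne prime $2^{61} - 1$, the name `make_sign_generator` is made up, and the sign is taken from the lowest bit of the polynomial value (which is very slightly biased, since the prime is odd).

```python
import random

# Assumption: arithmetic modulo the Mersenne prime 2^61 - 1 as the "large field".
P = (1 << 61) - 1


def make_sign_generator(degree=12, seed=None):
    """Return a function mapping a vertex id i to a sign in {-1, +1}.

    The signs come from a random polynomial of the given degree over Z_P,
    so only degree + 1 coefficients have to be stored.
    """
    rng = random.Random(seed)
    coeffs = [rng.randrange(P) for _ in range(degree + 1)]

    def sign(i):
        acc = 0
        # Horner's rule: evaluate the polynomial at i modulo P.
        for c in reversed(coeffs):
            acc = (acc * i + c) % P
        # Assumption: use the lowest bit of the value as the random bit.
        return 1 if acc & 1 else -1

    return sign
```

Calling `make_sign_generator(seed=1)` twice yields the same sign for every vertex, so the signs never have to be stored per vertex; they can be recomputed whenever an edge shows up.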
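As we see the edges $(u, v)$ of the graph $G = (V, E)$ in the stream, we sum up $x_u x_v$, that is the algorithm computes
$$Z = \sum_{(u,v) \in E} x_u x_v.$$
The output is then
$$\frac{Z^3}{6}.$$
That’s it. Algorithm over. This might seem too simple to be correct, but it actually works. Observe that we output
$$\frac{Z^3}{6} = \frac{1}{6} \sum_{(u_1,v_1) \in E} \; \sum_{(u_2,v_2) \in E} \; \sum_{(u_3,v_3) \in E} x_{u_1} x_{v_1} x_{u_2} x_{v_2} x_{u_3} x_{v_3}.$$
Since $x_i^k$ is zero in expectation for odd $k$, only terms with even powers count (in expectation) in this sum.[4] Hence only if $v_1 = u_2$, $v_2 = u_3$, and $v_3 = u_1$ (up to the orientation of the edges) do we count this term. This is exactly true if the three edges form a triangle! Of course we overcount, since there are $3! = 6$ permutations of the three edges and they all occur in the sum, but we simply divide by six and get the right thing.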
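To make this concrete, here is a minimal Python sketch of the estimator, together with a brute-force check of the expectation on a tiny graph. The function names are my own, and `sign` stands for any function that returns the random value in $\{-1, 1\}$ attached to a vertex. The check assumes a simple graph (no self-loops, every edge appears once in the stream) and fully independent signs, so it simply enumerates all sign assignments.

```python
from itertools import product


def triangle_estimate(edges, sign):
    """One pass over the edge stream; returns Z^3 / 6."""
    z = 0
    for u, v in edges:
        z += sign(u) * sign(v)
    return z ** 3 / 6


def exact_expectation(edges, vertices):
    """Average the estimate over all 2^n sign assignments (tiny graphs only)."""
    vertices = list(vertices)
    total = 0.0
    for signs in product((-1, 1), repeat=len(vertices)):
        x = dict(zip(vertices, signs))
        total += triangle_estimate(edges, x.__getitem__)
    return total / 2 ** len(vertices)


if __name__ == "__main__":
    # Complete graph on four vertices; it contains four triangles.
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(exact_expectation(k4, range(4)))
```

For the complete graph on four vertices the individual estimates are 36, 0 or -4/3 depending on the signs, yet their average over all 16 assignments is 4 (up to floating-point rounding), the number of triangles.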
Unfortunately this is only true in expectation; the actual value can differ quite a bit. But it is not too difficult to bound the variance of $Z^3$ and hence work out how often we need to run this with fresh random variables to get a good estimate. Have a look at the paper to get the actual numbers. In the streaming model we can simply run these instances in parallel.
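Running several instances in parallel during a single pass could look roughly like the sketch below. Two things here are my assumptions rather than the paper's: the instances are combined by plain averaging, and the signs are drawn lazily from Python's `random` module and cached per vertex, which is convenient for experiments but not space-efficient (a real streaming implementation would use a compact generator instead). The class name is made up.

```python
import random


class ParallelTriangleEstimator:
    """Keeps `instances` independent sums Z_k and updates all of them per edge."""

    def __init__(self, instances, seed=None):
        self.rng = random.Random(seed)
        # Stand-in for a compact generator: one cached sign table per instance.
        self.signs = [dict() for _ in range(instances)]
        self.z = [0] * instances

    def _sign(self, k, v):
        table = self.signs[k]
        if v not in table:
            table[v] = self.rng.choice((-1, 1))
        return table[v]

    def add_edge(self, u, v):
        # Each edge of the stream is seen once and fed to every instance.
        for k in range(len(self.z)):
            self.z[k] += self._sign(k, u) * self._sign(k, v)

    def estimate(self):
        # Assumption: combine the instances by averaging Z_k^3 / 6.
        return sum(z ** 3 for z in self.z) / (6 * len(self.z))


if __name__ == "__main__":
    est = ParallelTriangleEstimator(instances=5000, seed=42)
    for edge in [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]:
        est.add_edge(*edge)
    print(est.estimate())  # should land near 4, the triangle count of this graph
```

The number of instances directly controls the spread of the final estimate; the value 5000 above is arbitrary and only chosen to make the toy example stable.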
I did a few experiments and it seems like the high independence that we require in the analysis is actually not so important for random graphs. My results indicate that a normal linear congruential generator already provides sufficient randomness to get good estimates, so the algorithm is really trivial to implement.
[2] Suri, Vassilvitskii: Counting Triangles and the Curse of the Last Reducer; Pagh, Tsourakakis: Colorful Triangle Counting and a MapReduce Implementation
[3] Jowhari, Ghodsi: New Streaming Algorithms for Counting Triangles in Graphs
[4] If you ask yourself why we can take the expectation of the individual $x_i$, even though they appear in a product, this is because they are sufficiently independent. For the expectation calculation six-way independent random numbers would be sufficient (there are six factors in each product), but for the variance calculation that I don’t show here we need $E[Z^6]$, hence 12-way independence.