Monday, August 06, 2012

Finding connected components of a graph

I recently was bothered by a seemly-easy problem: how to merge lines with common number(s)?

For example, if the original file looks like:

1;2;3;4
4;5
7;5
10;13
22;34
11;13

Based on the rule, it will be merged into:
1;2;3;4;5;7
10;11;13
22;34

This is actually derived from a common problem in graph theory: how to detect the connect components?

Imagine each number in the txt file is the label of nodes, one line is a sub-graph (more extremely, it can be two numbers/nodes for just one edge). So if two lines have common number, they have common node connected, therefore should be merged together. In this case, I don't need to consider the topology of final result, which means the order of number in the final files does not matter. 

Of course, there are more sophisticated ways to do so. But can we do it using bash command?

Here is it (../data/IDs is the node label file):

sed 's/;/\n/g' ../data/IDs | sort -u > ../data/uniqIDs
cp ../data/IDs a.temp

while IFS= read -r line
do
   grep $line a.temp | tr ';' '\n' | sort -u | tr '\n' ';' | sed 's/;$/\n/' > b.temp
   grep -v $line a.temp >> b.temp
   cp b.temp a.temp
done < ../data/uniqIDs

No comments:

Post a Comment