For example, if the original file looks like:
1;2;3;4
4;5
7;5
10;13
22;34
11;13
Based on the rule, it will be merged into:
1;2;3;4;5;7
10;11;13
22;34
This is actually derived from a common problem in graph theory: how to detect the connect components?
Imagine each number in the txt file is the label of nodes, one line is a sub-graph (more extremely, it can be two numbers/nodes for just one edge). So if two lines have common number, they have common node connected, therefore should be merged together. In this case, I don't need to consider the topology of final result, which means the order of number in the final files does not matter.
Of course, there are more sophisticated ways to do so. But can we do it using bash command?
Here is it (../data/IDs is the node label file):
sed 's/;/\n/g' ../data/IDs | sort -u > ../data/uniqIDs
cp ../data/IDs a.temp
while IFS= read -r line
do
grep $line a.temp | tr ';' '\n' | sort -u | tr '\n' ';' | sed 's/;$/\n/' > b.temp
grep -v $line a.temp >> b.temp
cp b.temp a.temp
done < ../data/uniqIDs
No comments:
Post a Comment