Friday, March 20, 2015

grep -wFf list.txt input.txt

If you just want to grep a list of terms out of a file, and the list is stored in its own file (say, list.txt, one term per line), you can always do grep -wf list.txt input.txt

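When list.txt is huge, adding "-F" (treat each line of list.txt as a fixed string instead of a regular expression) will be much faster: grep -wFf list.txt input.txt. Keep "f" last in the option cluster, because it takes the file name as its argument.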
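Here is a small sketch of what the command does. The gene names, the file contents and the grep_demo directory are made up for illustration; only the grep options come from the tip above.

```shell
# Hypothetical sample data, created only for this illustration.
mkdir -p grep_demo
printf 'BRCA1\nTP53\n' > grep_demo/list.txt
printf 'gene1\tBRCA1\ngene2\tBRCA10\ngene3\tTP53\ngene4\tMYC\n' > grep_demo/input.txt

# -w  match whole words only (so BRCA1 does not hit BRCA10)
# -F  treat every pattern as a fixed string, not a regular expression
# -f  read the patterns, one per line, from the file that follows
grep -wFf grep_demo/list.txt grep_demo/input.txt > grep_demo/matches.txt

cat grep_demo/matches.txt
```

With this toy input, only the BRCA1 and TP53 lines should be kept; the BRCA10 line is dropped because of -w.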

Extracted from https://www.biostars.org/p/134753/#134871

3 comments:

  1. captainentropy 5:22 PM

    Thanks for teaching us this handy command. I've been using it for a few days now, but I became frustrated at always having to create and save the "list.txt" and "input.txt" files, since I needed to do this for files on the fly (mainly to answer "what if" questions with my boss and colleagues when examining our ChIP-seq data). I'm not a computer scientist, so with some Google-fu I learned a very useful (for me) addition: the process substitution operator. http://www.refining-linux.org/archives/24/17-Process-substitution/ http://serverfault.com/a/171120

    So now if I want to find out how many genes are in common between two lists, and those lists contain extraneous columns of data I can use awk (or whatever command is needed, even a bunch of piped ones) to extract the data for both files *in* the grep command:

    grep -cwf <(awk '{print $2}' List.txt) <(awk '{print $2}' Input.txt)

    Note: the above counts the lines of the second list that match an entry in the first (the -c part of the grep command). Drop the -c to print the matching lines instead, which can then be redirected to a file or piped onward.

    1. Thanks for your comments, captainentropy. I am not sure whether you are asking a question or just leaving a comment?

    2. captainentropy 5:49 PM

      Hi Xianjun, sorry if I wasn't clear. I didn't have a question, just a comment on how to expand the usefulness of your tip. I often find myself using bedtools, with colleagues and my advisor present, to identify overlapping regions in ChIP-seq experiments and the promoters of the genes they were binding to, and then looking for intersections with other gene lists. Your tip saved me time, but only if I already had the lists ready. Once I combined it with process substitution, I could answer more questions even more quickly.
