One Tip Per Day: general tips

Sunday, November 29, 2020

How to take an ultra-high-resolution screenshot?

This is very useful, especially when you want to use a high-resolution screenshot for a poster or printout. So far I have not found a way to save the screenshot in Safari's Responsive Mode.

See stepwise instructions below from David Augustat's blog: https://davidaugustat.com/web/take-ultra-high-resolution-screenshots-in-chrome

My 15 practical tips for a bioinformatician

The tips below are based on lessons I learned from the mistakes I made during my years of research. They are purely personal opinion, and the order doesn't mean anything. If you think I should include something else, please comment below.

Always set a seed when you run tools with a random option, e.g. bedtools shuffle, random, etc. You (or your boss, one day) will want your work to be reproducible.
Set your own temporary folder (via --tmp, $TMPDIR, etc., depending on your program). By default, many tools, e.g. sort, use the system /tmp as the temporary folder, which may have a limited quota that is not enough for your big NGS data.
Always use valid names for your variables and for the columns and rows of your data frames. Otherwise you can run into unexpected problems, e.g. a '-' in a name will be converted to '.' in R unless you specify check.names=F.
Always make a README file for your data folder. For why and how, read this: http://stackoverflow.com/questions/2304863/how-to-write-a-good-readme
Always comment your code properly, for yourself and for others, as you will very likely read your ugly code again.
Always back up your code regularly, using GitHub, svn, Time Machine, or simply copy/paste.
Always clean up intermediate or unnecessary data, as you can easily shock your boss and yourself by generating so much data (most of which is perhaps useless).
Don't save to *.sam if you can use *.bam. Always compress your fastq (and other large plain-text files) as much as you can. This applies to other file formats too, whenever a compressed version can be used. You cannot imagine how much data (aka "digital garbage") you will soon generate.
Use parallel processing as much as you can, e.g. "LC_ALL=C sort --parallel=24 --buffer-size=5G" for sorting (https://www.biostars.org/p/66927/), as multi-core CPUs/GPUs are common nowadays.
When a project is completed, remember to clean up your project folder, incl. removing unnecessary code/data/intermediate files, and burn a CD or host it somewhere in the cloud. You never know when you, your boss or your collaborators will need the data again.
Make your code as sustainable as you can. Remember the 3 major features of OOP: Inheritance, Encapsulation, Polymorphism. (URL)
When you learn a tip from others through a Google search, remember to note the URL for future reference and to give credit to others. This applies to this post, of course :)
Keep learning, otherwise you will soon be out of date. Like NGS techniques, computing skills are evolving quickly, so always keep up with new skills/tools/techniques.
When you learn tips from others, remember to share something you have learned with the community as well, as that's how the community grows healthily.
Last but not least, stand up and move around after sitting for 1-2 hours. This is especially important for us bioinformaticians, who usually sit in front of a computer for hours. Only good health can sustain a long career. More reading: https://www.washingtonpost.com/news/wonk/wp/2015/06/02/medical-researchers-have-figured-out-how-much-time-is-okay-to-spend-sitting-each-day/
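
To illustrate the tip about seeds, here is a minimal Python sketch (not tied to any particular tool; bedtools shuffle, for example, has its own -seed option). The function name shuffle_intervals, the toy intervals and the seed value 42 are made up for illustration. The point is only that the same seed gives the same "random" result every time.

```python
import random

def shuffle_intervals(intervals, seed):
    """Return a shuffled copy of `intervals`; the same seed gives the same order."""
    rng = random.Random(seed)      # dedicated generator, independent of global state
    shuffled = list(intervals)     # copy, so the input is left untouched
    rng.shuffle(shuffled)
    return shuffled

# Toy data, purely illustrative
intervals = [("chr1", 100, 200), ("chr1", 500, 900), ("chr2", 50, 75), ("chrX", 10, 20)]
SEED = 42  # arbitrary value; record it alongside your results

first = shuffle_intervals(intervals, SEED)
second = shuffle_intervals(intervals, SEED)
print(first == second)  # True: both runs used the same seed
```

Writing the seed into your README or log file is what makes the run repeatable later.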
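
For the tip about temporary folders, the sketch below builds a GNU sort command that writes its temporary files to a folder of your choice (the -T option) instead of the system /tmp. It is only a sketch under assumptions: the environment variable SCRATCH_DIR, the file names reads.bed / reads.sorted.bed and the helper sort_command are hypothetical, and the command is printed rather than executed.

```python
import os
import tempfile

def sort_command(infile, outfile, tmp_dir, threads=4, buffer_size="1G"):
    """Build a GNU sort command line that keeps its temporary files in tmp_dir."""
    return [
        "sort",
        "-T", tmp_dir,                    # temporary directory for sort
        f"--parallel={threads}",
        f"--buffer-size={buffer_size}",
        "-k1,1", "-k2,2n",                # chromosome, then numeric start
        "-o", outfile,
        infile,
    ]

# SCRATCH_DIR is a hypothetical variable; fall back to Python's default temp location
scratch_base = os.environ.get("SCRATCH_DIR") or tempfile.gettempdir()
os.makedirs(scratch_base, exist_ok=True)

# The directory is removed automatically when the block ends
with tempfile.TemporaryDirectory(prefix="ngs_sort_", dir=scratch_base) as tmp_dir:
    cmd = sort_command("reads.bed", "reads.sorted.bed", tmp_dir)
    print("LC_ALL=C " + " ".join(cmd))
    # To actually run it:
    # subprocess.run(cmd, check=True, env={**os.environ, "LC_ALL": "C"})
```

Using a context manager also takes care of the clean-up tip: the scratch folder disappears once the work is done.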

Wednesday, May 22, 2013

Basic knowledge for a bioinformatician

Very often, especially when I am interviewed for a job or talk with someone as knowledgeable as Xiaopeng, I feel that my body of knowledge is full of "holes". How awkward! I guess I am not the only one who feels this way.

As a senior-in-age-but-not-senior-in-knowledge bioinformatician, I would seriously recommend that anyone who wants to work in this field have basic knowledge of the following subjects (the ones I can think of):
1. probability and statistics (not everyone knows the difference between them)
2. machine learning (the 4-element circle: data + algorithm + model + criteria)
3. program design (knowing how to write a script does not mean you know how to program; a good programmer should learn how to write code in an inheritable manner).
4. algorithms and data structures (many people know some algorithms, but truly understanding them is not an easy task. Binindex is a good example of using the concept of a binary tree to store/query genomic coordinates in a super fast way.)
5. knowing how to appreciate a scientific work. (A paper can be good in terms of (1) data sources, (2) method and/or (3) idea. It is also important to tell good papers from junk papers. I feel it's so important to sharpen your sense for 'smelling' a paper.)
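
To make item 4 a bit more concrete, here is a minimal Python sketch of a hierarchical binning index, assuming that "Binindex" refers to the UCSC-style binning scheme (that is my assumption, not something stated above). In that scheme the smallest bins span 128 kb and each level up is 8 times wider, so it is tree-like rather than strictly binary. The function names are my own.

```python
# Offsets of the first bin at each level, from the finest (128 kb) to the coarsest (512 Mb)
BIN_OFFSETS = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
BIN_FIRST_SHIFT = 17   # 2**17 = 128 kb, the size of the smallest bins
BIN_NEXT_SHIFT = 3     # each level is 2**3 = 8 times wider than the previous one

def bin_from_range(start, end):
    """Return the smallest bin that fully contains the half-open interval [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("interval out of range (this scheme covers up to 512 Mb)")

def overlapping_bins(start, end):
    """Return every bin that could hold a feature overlapping [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    bins = []
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    return bins

print(bin_from_range(0, 1000))      # a short feature lands in a finest-level bin
print(overlapping_bins(0, 1000))    # one candidate bin per level
```

A feature is stored with the bin returned by bin_from_range. A query then only needs to look at the handful of bins returned by overlapping_bins instead of scanning every feature on the chromosome, which is where the speed comes from.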

I found this nice reading list on Hendrik's page (http://www.liacs.nl/~hoogeboo/mcb/nature_primer.html):

How to apply de Bruijn graphs to genome assembly
(Phillip E C Compeau, Pavel A Pevzner & Glenn Tesler)
November 2011, Vol 29, No 11; pp 987 - 991
doi: 10.1038/nbt.2023
Analyzing 'omics data using hierarchical models
(Hongkai Ji & X Shirley Liu)
April 2010, Vol 28, No 4; pp 337 - 340
doi: 10.1038/nbt.1619
What is flux balance analysis?
(Jeffrey D Orth, Ines Thiele & Bernhard Ø Palsson)
March 2010, Vol 28, No 3; pp 245 - 248
doi: 10.1038/nbt.1614
How does multiple testing correction work?
(William S Noble)
December 2009, Vol 27, No 12; pp 1135 - 1137
doi: 10.1038/nbt1209-1135
How to visually interpret biological data using networks
(Daniele Merico, David Gfeller & Gary D Bader)
October 2009, Vol 27, No 10; pp 921 - 924
doi: 10.1038/nbt.1567
How to map billions of short reads onto genomes
(Cole Trapnell & Steven L Salzberg)
May 2009, Vol 27, No 5; pp 455 - 457
doi: 10.1038/nbt0509-455
SNP imputation in association studies
(Eran Halperin & Dietrich A Stephan)
April 2009, Vol 27, No 4; pp 349 - 351
doi: 10.1038/nbt0409-349
Maximizing power in association studies
(Eran Halperin & Dietrich A Stephan)
March 2009, Vol 27, No 3; pp 255 - 256
doi: 10.1038/nbt0309-255
Understanding genome browsing
(Melissa S Cline & W James Kent)
February 2009, Vol 27, No 2; pp 153 - 155
doi: 10.1038/nbt0209-153
What are decision trees?
(Carl Kingsford & Steven L Salzberg)
September 2008, Vol 26, No 9; pp 1011 - 1013
doi: 10.1038/nbt0908-1011
What is the expectation maximization algorithm?
(Chuong B Do & Serafim Batzoglou)
August 2008, Vol 26, No 8; pp 897 - 899
doi: 10.1038/nbt1406
What is principal component analysis?
(Markus Ringnér)
March 2008, Vol 26, No 3; pp 303 - 304
doi: 10.1038/nbt0308-303
What are artificial neural networks?
(Anders Krogh)
February 2008, Vol 26, No 2; pp 195 - 197
doi: 10.1038/nbt1386

How does eukaryotic gene prediction work?
(Michael R Brent)
August 2007, Vol 25, No 8; pp 883 - 885
doi: 10.1038/nbt0807-883
How do shotgun proteomics algorithms identify proteins?
(Edward M Marcotte)
July 2007, Vol 25, No 7; pp 755 - 757
doi: 10.1038/nbt0707-755
What is a support vector machine?
(William S Noble)
December 2006, Vol 24, No 12; pp 1565 - 1567
doi: 10.1038/nbt1206-1565
How does DNA sequence motif discovery work?
(Patrik D'haeseleer)
August 2006, Vol 24, No 8; pp 959 - 961
doi: 10.1038/nbt0806-959
What are DNA sequence motifs?
(Patrik D'haeseleer)
April 2006, Vol 24, No 4; pp 423 - 425
doi: 10.1038/nbt0406-423
Inference in Bayesian networks
(Chris J Needham, James R Bradford, Andrew J Bulpitt & David R Westhead)
January 2006, Vol 24, No 1; pp 51 - 53
doi: 10.1038/nbt0106-51
How does gene expression clustering work?
(Patrik D'haeseleer)
December 2005, Vol 23, No 12; pp 1499 - 1501
doi: 10.1038/nbt1205-1499
How do RNA folding algorithms work?
(Sean R Eddy)
November 2004, Vol 22, No 11; pp 1457 - 1458
doi: 10.1038/nbt1104-1457
What is a hidden Markov model?
(Sean R Eddy)
October 2004, Vol 22, No 10; pp 1315 - 1316
doi: 10.1038/nbt1004-1315
What is Bayesian statistics?
(Sean R Eddy)
September 2004, Vol 22, No 9; pp 1177 - 1178
doi: 10.1038/nbt0904-1177
Where did the BLOSUM62 alignment score matrix come from?
(Sean R Eddy)
August 2004, Vol 22, No 8; pp 1035 - 1036
doi: 10.1038/nbt0804-1035
What is dynamic programming?
(Sean R Eddy)
July 2004, Vol 22, No 7; pp 909 - 910
doi: 10.1038/nbt0704-909

Getting Started in ...

Getting Started in Gene Orthology and Functional Analysis
(Fang G, Bhardwaj N, Robilotto R, Gerstein MB)
PLoS Comput Biol (2010) 6(3): e1000703;
doi: 10.1371/journal.pcbi.1000703
Getting Started in Structural Phylogenomics
(Sjölander K)
PLoS Comput Biol (2010) 6(1): e1000621;
doi: 10.1371/journal.pcbi.1000621
Getting Started in Gene Expression Microarray Analysis
(Slonim DK, Yanai I)
PLoS Comput Biol (2009) 5(10): e1000543;
doi: 10.1371/journal.pcbi.1000543
Getting Started in Text Mining: Part Two
(Rzhetsky A, Seringhaus M, Gerstein MB)
PLoS Comput Biol (2009) 5(7): e1000411;
doi: 10.1371/journal.pcbi.1000411
Getting Started in Computational Mass Spectrometry-Based Proteomics
(Vitek O)
PLoS Comput Biol (2009) 5(5): e1000366;
doi: 10.1371/journal.pcbi.1000366

Getting Started in Computational Immunology
(Kleinstein SH)
PLoS Comput Biol (2008) 4(8): e1000128;
doi: 10.1371/journal.pcbi.1000128
Getting Started in Biological Pathway Construction and Analysis
(Viswanathan GA, Seto J, Patil S, Nudelman G, Sealfon SC)
PLoS Comput Biol (2008) 4(2): e16;
doi: 10.1371/journal.pcbi.0040016
Getting Started in Text Mining
(Cohen KB, Hunter L)
PLoS Comput Biol (2008) 4(1): e20;
doi: 10.1371/journal.pcbi.0040020
Getting Started in Probabilistic Graphical Models
(Airoldi EM)
PLoS Comput Biol (2007) 3(12): e252;
doi: 10.1371/journal.pcbi.0030252
Getting Started in Tiling Microarray Analysis
(Liu XS)
PLoS Comput Biol (2007) 3(10): e183;
doi: 10.1371/journal.pcbi.0030183

Ten Simple Rules

The Ten Simple Rules series of editorials also has its own page at the PLoS journal, so a single link is now all you need to read about 'Ten Simple Rules for Getting Published' or '...for a Good Poster Presentation', etc.
On the Process of Becoming a Great Scientist
(Giddings MC)
PLoS Comput Biol (2008) 4(2): e33;
doi: 10.1371/journal.pcbi.0040033
