The aim of this program is to convert our Illumina Genome Analyzer 2 into an expensive microarray scanner. This is done by counting the occurrences of fragments in the sample and associating them with particular genes. The advantage of doing so is that we obtain more information than by probing for specific sequences. The disadvantage is, of course, that there are somewhat more metrics involved than first meets the eye. For instance, do we want the average gene expression? The maximum sequence occurrence? Do we count only the exons, or do we include the introns as well?
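To make the difference between these metrics concrete, here is a minimal Python sketch. It is not the code of eland2exp itself. The function name expression_metrics, the per-position coverage dictionary, and the assumption that start and stop positions are inclusive are all illustrative choices and do not come from the program.

```python
from statistics import mean


def expression_metrics(coverage, transcript, exons, exons_only=True):
    """Return the mean and maximum read count over a gene.

    coverage   -- dict mapping a genome position to its read count
                  (hypothetical input; positions without reads are absent)
    transcript -- (start, stop) of the whole transcript, assumed inclusive
    exons      -- list of (start, stop) exon intervals, assumed inclusive
    exons_only -- if True, ignore intronic positions
    """
    if exons_only:
        # A set removes positions shared by overlapping exons.
        positions = sorted({p for s, e in exons for p in range(s, e + 1)})
    else:
        positions = list(range(transcript[0], transcript[1] + 1))
    counts = [coverage.get(p, 0) for p in positions]
    return {"mean": mean(counts), "max": max(counts)}


# Toy gene: two exons with an intron in between that carries one large peak.
toy_coverage = {1: 2, 2: 4, 3: 6, 5: 20}
toy_transcript = (1, 10)
toy_exons = [(1, 3), (8, 10)]

exon_metrics = expression_metrics(toy_coverage, toy_transcript, toy_exons)
full_metrics = expression_metrics(toy_coverage, toy_transcript, toy_exons,
                                  exons_only=False)
```

In this toy case the intronic peak at position 5 changes both numbers, which is why the choice of metric has to be made explicitly.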
The program below makes it possible to obtain all this information in a ready-to-use format. The input consists of a gene location file that describes the genome with three columns. The first column is the gene identifier; this can be anything. The second column contains 'transcript' or 'exon'. 'Transcript' implies that the last two columns contain the start and stop position of the transcript. If the second column lists 'exon', then the last two columns list the start and stop position of the exon related to that gene. The third column contains the chromosome on which this gene occurs.
An interesting problem with this kind of program is splice variants. Many genes have several different transcripts, and these may overlap. This means that it is not necessarily easy to say whether a specific gene position belongs to an exon or an intron. We currently assume that it belongs to an exon if it is present in any of the possible exons. If one is specifically interested in transcript variants, then the location file should use some form of transcript ID in the first column instead of the gene ID.
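The "exon if present in any possible exon" rule can be sketched as taking the union of all exon intervals of a gene and testing a position against it. The following Python sketch is only an illustration of that rule, not the implementation used by eland2exp. The function names merge_intervals and in_any_exon are hypothetical, and coordinates are assumed to be inclusive integers.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, stop) intervals."""
    merged = []
    for start, stop in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def in_any_exon(merged, pos):
    """True if pos lies inside one of the merged exon intervals."""
    # Index of the last interval whose start is <= pos.
    i = bisect_right(merged, (pos, float("inf"))) - 1
    return i >= 0 and merged[i][0] <= pos <= merged[i][1]


# Exons of two hypothetical splice variants of the same gene.
variant_exons = [(100, 200), (150, 250), (400, 500)]
exon_union = merge_intervals(variant_exons)
```

With this union, position 225 counts as exonic even though it is intronic in the first variant, and position 300 counts as intronic.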
The program also does not reclaim short reads that jump from one exon to another, because these are not aligned by Eland. This means that around 18% of the reads will be missing in any case.

Usage: eland2exp <positionfile> <fragmentsize> <strand>

An example of a location file:
    gid  chrom  tid  dir  start  stop   rank
    1    2L     1    1    7529   8116   1
    1    2L     1    1    8229   8589   2
    1    2L     1    1    8668   9491   3
    2    2L     2    -1   9836   11344  9
    2    2L     2    -1   11410  11518  8
    2    2L     2    -1   11779  12221  7
    2    2L     2    -1   12286  12928  6
    2    2L     2    -1   13520  13625  5
    2    2L     2    -1   13683  14874  4

To run the program on lane 7, for instance, we can use:

    eland2exp drosmel-geneid2loc.tsv 150 0 <s_7_export.txt >b.tsv
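For readers who want to inspect a location file before running the program, the example file above can be read with a few lines of Python. This is a sketch under assumptions, not part of eland2exp. It assumes a header line followed by whitespace-separated columns, as in the example. It also assumes that 'dir' is the strand and 'rank' is the exon order, which the example suggests but does not state. The names Exon and read_location_file are hypothetical.

```python
from collections import defaultdict, namedtuple

# Field meanings are assumptions based on the example header.
Exon = namedtuple("Exon", "chrom tid strand start stop rank")


def read_location_file(lines):
    """Group the rows of a location file by gene id (column 'gid')."""
    genes = defaultdict(list)
    rows = iter(lines)
    header = next(rows).split()
    for line in rows:
        if not line.strip():
            continue  # skip blank lines
        rec = dict(zip(header, line.split()))
        genes[rec["gid"]].append(
            Exon(rec["chrom"], rec["tid"], int(rec["dir"]),
                 int(rec["start"]), int(rec["stop"]), int(rec["rank"]))
        )
    for exon_list in genes.values():
        exon_list.sort(key=lambda e: e.rank)
    return dict(genes)


EXAMPLE = """gid chrom tid dir start stop rank
1 2L 1 1 7529 8116 1
1 2L 1 1 8229 8589 2
1 2L 1 1 8668 9491 3
2 2L 2 -1 9836 11344 9
2 2L 2 -1 11410 11518 8
2 2L 2 -1 11779 12221 7
2 2L 2 -1 12286 12928 6
2 2L 2 -1 13520 13625 5
2 2L 2 -1 13683 14874 4
"""

genes = read_location_file(EXAMPLE.splitlines())
```

For a real file, the same function can be called on an open file handle, for example read_location_file(open("drosmel-geneid2loc.tsv")).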