File Formats
This page describes the file formats used by the Illumina Genome
Analyzer.
Intensity files
These are the files that can be found in the directory
s_#lane_#tile_nse.txt
These files are generated during cluster detection and are tab separated.
The content of the file is not well documented but it seems to cover the variance within each cluster.
cycle 1 cycle 2 cycle 3
lane tile x y a c g t a c g t a c g t ...
2 47 219 303 9.9 9.9 4.9 4.6 15.0 16.9 6.0 8.8 14.9 17.7 7.8 7.8 ...
s_#lane_#tile_int.txt
The intensity files contain the average intensity for each cluster on each of the 4 images. Below is an example. These
intensities are not normalized, so crosstalk is still present.
cycle 1 cycle 2 cycle 3
lane tile x y a c g t a c g t a c g t ...
2 47 219 303 64.8 69.4 1567.9 525.7 790.7 597.9 45.8 32.2 178.1 630.8 46.1 29.6
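The row layout shown above (lane, tile, x, y, then four channel values per cycle) can be read with a few lines of Python. This is only a minimal sketch: the names ClusterRecord and parse_intensity_line are made up for this page, and it assumes the channel order a, c, g, t given in the header line above and integer coordinates as in the example.

```python
from typing import List, NamedTuple, Tuple


class ClusterRecord(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int
    cycles: List[Tuple[float, ...]]  # one (a, c, g, t) tuple per cycle


def parse_intensity_line(line: str) -> ClusterRecord:
    """Parse one row: lane, tile, x, y, then four values per cycle."""
    fields = line.split()  # splits on tabs and spaces alike
    lane, tile, x, y = (int(v) for v in fields[:4])
    values = [float(v) for v in fields[4:]]
    if len(values) % 4 != 0:
        raise ValueError("expected four channel values (a, c, g, t) per cycle")
    cycles = [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]
    return ClusterRecord(lane, tile, x, y, cycles)


# The example row from this section.
example = "2 47 219 303 64.8 69.4 1567.9 525.7 790.7 597.9 45.8 32.2 178.1 630.8 46.1 29.6"
record = parse_intensity_line(example)
print(len(record.cycles), record.cycles[0])
```

Since the _nse and _sig2 files are shown with the same column layout, the same function should work for them as well.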
Base called files
s_#lane_#tile_sig2.txt
These are the files after crosstalk correction has been performed. If you are interested in the intensities, this is the file to use. The format is again tab separated, with one row per cluster.
cycle 1 cycle 2 cycle 3
lane tile x y a c g t a c g t a c g t ...
2 47 219 303 -6.5 3.5 1056.5 -24.7 661.4 12.7 17.1 13.9 -23.2 679.4 17.6 10.2
s_#lane_#tile_seq.txt
The sequence file lists, for each cluster, the bases that have been called. If a base could not be called, a '.' is used.
lane tile x y sequence
2 47 219 303 GACATTATGGGTCTGCAAGCTGCTTATGCTAATTTG
2 47 223 1924 GGTGTGGTTGATATTTTTCATGGTATTGATAAAGCT
2 47 621 348 GGAAGTAGCGACAGCTTGGTTTTTAGTGAGTTGTTC
2 47 892 162 GCTTCCATAAGCAGATGGATAACCGCATCAAGCTCT
2 47 1473 657 GCTTTATCAAGATAATTTTTCGACTCATCAGAAATA
2 47 670 345 GTCAATCCTGACGGTTATTTCCTAGACAAATTAGAG
2 47 1787 755 G...................................
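As a small illustration of how the '.' placeholder can be used, the sketch below parses a _seq row and computes the fraction of uncalled bases per cluster. The function names are hypothetical and the rows are taken from the example above.

```python
def parse_seq_line(line):
    """Split a _seq row into lane, tile, x, y and the called sequence."""
    lane, tile, x, y, sequence = line.split()
    return int(lane), int(tile), int(x), int(y), sequence


def uncalled_fraction(sequence):
    """Fraction of positions where no base could be called ('.')."""
    return sequence.count(".") / len(sequence)


rows = [
    "2 47 219 303 GACATTATGGGTCTGCAAGCTGCTTATGCTAATTTG",
    # Last row of the example: one called base followed by 35 uncalled ones.
    "2 47 1787 755 G" + "." * 35,
]

for row in rows:
    lane, tile, x, y, sequence = parse_seq_line(row)
    print(x, y, round(uncalled_fraction(sequence), 3))
```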
s_#lane_#tile_prb.txt
The probability file appears to contain, for each cluster, the
probabilities that the specific bases were called correctly. The file format is explained
in detail
here.
Each line corresponds to the cluster found at the same line in the _seq, _sig2, _int and _qhg files.
The tabs are at unexpected places: between cycles rather than between individual numbers. The numbers within a cycle are simply separated by spaces.
cycle 1 cycle 2 cycle 3 cycle 4
a c g t a c g t a c g t a c g t ...
-40 -40 40 -40 40 -40 -40 -40 -40 40 -40 -40 40 -40 -40 -40 ...
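Because of the unusual delimiters, a _prb line has to be split twice: first on tabs to get the cycles, then on whitespace to get the four scores. The sketch below does that and picks the highest scoring base per cycle. The function names are made up, the scores are assumed to be integers as in the example, and the tabs in the example string were inserted by hand because they are not visible in the listing above.

```python
def parse_prb_line(line):
    """Return one {base: score} dict per cycle for a single cluster."""
    cycles = []
    for block in line.rstrip("\n").split("\t"):  # tabs separate cycles
        scores = [int(v) for v in block.split()]  # spaces separate a, c, g, t
        if len(scores) != 4:
            raise ValueError("expected four scores per cycle, got %r" % block)
        cycles.append(dict(zip("ACGT", scores)))
    return cycles


def best_bases(cycles):
    """Pick the base with the highest score in every cycle."""
    return "".join(max(cycle, key=cycle.get) for cycle in cycles)


# First four cycles of the example row.
example = "-40 -40 40 -40\t40 -40 -40 -40\t-40 40 -40 -40\t40 -40 -40 -40"
print(best_bases(parse_prb_line(example)))
```

For the example row this gives GACA, which agrees with the first four bases of the first cluster in the _seq example.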
s_#lane_#tile_qhg.txt
The quality metrics that can afterwards be used for filtering the data.
The columns are: lane, tile, xPos, yPos, chastity, purity,
similarity, neighbour and neighbourhood.
Lane Tile x y chastity purity similarity neighbour neighbourhood
2 47 219 303 0.67 0.82 -0.43 3.00 70696:30892:61591:23697:37453:32059:85214:54989
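A possible way to use these metrics for filtering is sketched below. The dictionary keys follow the column names above. The cut-off MIN_CHASTITY is purely hypothetical and only there to show the idea, it is not a value taken from the pipeline.

```python
def parse_qhg_line(line):
    """Parse one _qhg row into a dict keyed by the column names."""
    f = line.split()
    return {
        "lane": int(f[0]),
        "tile": int(f[1]),
        "x": int(f[2]),
        "y": int(f[3]),
        "chastity": float(f[4]),
        "purity": float(f[5]),
        "similarity": float(f[6]),
        "neighbour": float(f[7]),
        # The last column is a colon separated list of numbers.
        "neighbourhood": [int(v) for v in f[8].split(":")],
    }


MIN_CHASTITY = 0.6  # hypothetical threshold, chosen for illustration only

example = "2 47 219 303 0.67 0.82 -0.43 3.00 70696:30892:61591:23697:37453:32059:85214:54989"
record = parse_qhg_line(example)
keep = record["chastity"] >= MIN_CHASTITY
print(record["chastity"], keep)
```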
Eland aligned files
s_#lane_#tile_all.png
This is a file that reports the average intensity over each of the
called bases for the specified tile and lane.
s_#lane_#tile_all.txt
Contains the statistics that are used to generate the various plots.
# Clusters: Filtered 63791 Original 100455
#
#Lane Cycle All A All C All G All T Call A Call C Call G Call T Num A Num C Num G Num T Num X
2 1 158.5 141.9 127.0 183.2 575.7 546.2 595.1 585.7 17055 14516 13525 18695 0
2 2 160.5 120.8 136.4 174.8 546.8 525.8 575.1 550.7 17442 12941 14551 18857 0
2 3 156.6 140.7 123.8 178.3 549.2 534.6 558.3 548.7 17000 14412 13514 18865 0
2 4 144.6 128.2 134.5 175.6 516.9 500.6 553.2 535.4 16491 14249 14553 18498 0
2 5 151.2 134.7 137.0 175.8 518.1 522.6 550.4 528.2 17284 12970 14755 18782 0
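The statistics file can be loaded by skipping the '#' comment lines and naming the 15 columns. The sketch below is one way to do it. The column keys are made-up identifiers derived from the header line, and the base composition per cycle is computed from the Num columns.

```python
COLUMNS = [
    "lane", "cycle",
    "all_a", "all_c", "all_g", "all_t",
    "call_a", "call_c", "call_g", "call_t",
    "num_a", "num_c", "num_g", "num_t", "num_x",
]


def parse_all_stats(text):
    """Return one dict per data row, ignoring '#' comment lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        values = line.split()
        if len(values) != len(COLUMNS):
            raise ValueError("unexpected number of columns: %r" % line)
        row = {}
        for name, value in zip(COLUMNS, values):
            # Intensities are decimals, everything else is a count or index.
            is_intensity = name.startswith(("all_", "call_"))
            row[name] = float(value) if is_intensity else int(value)
        rows.append(row)
    return rows


example = """# Clusters: Filtered 63791 Original 100455
#
#Lane Cycle All A All C All G All T Call A Call C Call G Call T Num A Num C Num G Num T Num X
2 1 158.5 141.9 127.0 183.2 575.7 546.2 595.1 585.7 17055 14516 13525 18695 0
2 2 160.5 120.8 136.4 174.8 546.8 525.8 575.1 550.7 17442 12941 14551 18857 0
"""

rows = parse_all_stats(example)
for row in rows:
    called = row["num_a"] + row["num_c"] + row["num_g"] + row["num_t"]
    print(row["cycle"], called, round(row["num_a"] / called, 3))
```

In the example the four Num columns add up to 63791 in each cycle, the same number as the filtered cluster count in the first comment line.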
s_#lane_#tile_errors.png
This image reports the number of sequences that could be aligned directly, with one mismatch, or with two mismatches after x cycles.
s_#lane_#tile_rescore.png
This image reports the percentage of errors per position in the alignment.
s_#lane_#tile_rescore.txt
This is an interesting file that reports on the data after quality filtering. It contains multiple sections. Below is an example of such a file, shortened by removing cycles 2-35.
#RUN_TIME Thu Jun 26 20:47:35 2008
#SOFTWARE_VERSION @(#) $Id: score.pl,v 1.2 2008-05-30 07:36:06 wernersa Exp $
# Lane 2 : Tile 47 : Quality Filtered
2233116 bases of sequence found
16279 were errors
20 were blanks
0.73 percent error rate
0.00 percent blank rate
unique alignments : 62031 (total score 3937852)
cycles : 36
Breakdown of errors by cycle
Cycle: Err pc: Blank pc:
1 0.19 0.00
36 3.14 0.00
Error rate relative to reference base (including blanks)
(Given a reference base, what is it sequenced as?)
Really: Read as:
N pc A pc C pc G pc T pc
A 0.00 99.46 0.28 0.12 0.14
C 0.00 0.21 99.41 0.03 0.35
G 0.00 0.24 0.13 98.37 1.26
T 0.00 0.04 0.15 0.10 99.70
Error rate relative to reference base (excluding blanks)
(Given a reference base, what is it sequenced as?)
Really: Read as:
A pc C pc G pc T pc
A 99.46 0.28 0.12 0.14
C 0.21 99.41 0.03 0.35
G 0.24 0.13 98.37 1.26
T 0.04 0.15 0.10 99.70
Error rate relative to sequenced base
(Given a sequenced base, what was it really?)
Read as: Really:
A pc C pc G pc T pc
N 30.00 25.00 10.00 35.00
A 99.57 0.18 0.21 0.05
C 0.33 99.33 0.13 0.20
G 0.14 0.03 99.69 0.14
T 0.13 0.26 0.95 98.66
Breakdown of errors by nucleotide
Read As: Really:
A pc C pc G pc T pc
N 6 5 2 7
A 581763 1042 1200 266
C 1650 494665 656 1011
G 683 165 489812 680
T 844 1744 6258 650577
Full breakdown of errors by cycle and nucleotide
Cycle: Read As: Really:
A ct C ct G ct T ct
1 N 0 0 0 0
1 A 16515 18 23 0
1 C 4 14118 3 17
1 G 12 1 13100 10
1 T 4 18 5 18183
36 N 0 0 0 0
36 A 15764 51 41 25
36 C 171 13740 90 46
36 G 43 17 12627 40
36 T 125 142 1156 17922
Error rate relative to reference by cycle/nucleotide
Cycle: Read As: Really:
A ct C ct G ct T ct
@1 N 0.000 0.000 0.000 0.000
@1 A 0.999 0.001 0.002 0.000
@1 C 0.000 0.997 0.000 0.001
@1 G 0.001 0.000 0.998 0.001
@1 T 0.000 0.001 0.000 0.999
@36 N 0.000 0.000 0.000 0.000
@36 A 0.979 0.004 0.003 0.001
@36 C 0.011 0.985 0.006 0.003
@36 G 0.003 0.001 0.908 0.002
@36 T 0.008 0.010 0.083 0.994
Information Content By Cycle
Bases this cycle Bases so far
Cycle: Equiv info: Align: Total: Equiv info: Align: Total:
~1 61363.11 62031 63791 61363.11 62031 63791
~36 55349.21 62031 63791 2161274.97 2233096 2296454
The 20 most common words with 2 blanks or less:
Ranking Occurrences Words
1 21 GTTTGATGAATGCAATGCGACAGGCTCATGCTGATG TCTCATATTGGCGCTACTGCAAAGGATATTTCTAAT
2 20 CTTGCTATTGACTCTACTGTAGACATTTTTACTTTT
3 18 ACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCAT AGAACGTTTTTTACCTTTAGACATTACATCACTCCT AGATGGATAACCGCATCAAGCTCTTGGAAGAGATTC
4 17 CGATTAGAGGCGTTTTATGATAATCCCAATGCTTTG
5 16 9 sequences
6 15 24 sequences
7 14 28 sequences
8 13 56 sequences
9 12 104 sequences
10 11 161 sequences
11 10 306 sequences
12 9 447 sequences
13 8 610 sequences
14 7 896 sequences
15 6 1246 sequences
16 5 1529 sequences
17 4 1677 sequences
18 3 1676 sequences
19 2 1730 sequences
20 1 10463 sequences
The 3 most common blank patterns (N=any nonblank character)
Ranking Occurrences Words
1 63769 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
2 20 NNNNNNNNNNNNNNNNNNNN.NNNNNNNNNNNNNNN
3 1 NNNNNNNNNNNNNNNNNNNNNNNN.NNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN.NNN
Log likelihood scores
Cycle: Read As: Really:
A C G T
>1 N -47 -47 -47 -47
>1 A 260 -296 -285 -47
>1 C -354 276 -367 -291
>1 G -303 -411 275 -311
>1 T -365 -300 -356 282
>36 N -47 -47 -47 -47
>36 A 212 -249 -258 -280
>36 C -190 165 -219 -248
>36 G -246 -287 210 -250
>36 T -218 -213 -119 110
Cumulative errors by cycle
Cycle: 1 2 3 4 5
!1 61916 62031 62031 62031 62031
!36 50087 58673 61221 61866 62010
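The summary numbers at the top of the file can be checked against each other. The sketch below pulls the three counts out of the header with regular expressions and recomputes the two percentages. The function name is made up, and the relation bases = unique alignments x cycles is an observation from this example, not something documented.

```python
import re


def summary_rates(text):
    """Recompute the error and blank percentages from the header counts."""
    bases = int(re.search(r"(\d+) bases of sequence found", text).group(1))
    errors = int(re.search(r"(\d+) were errors", text).group(1))
    blanks = int(re.search(r"(\d+) were blanks", text).group(1))
    return bases, 100.0 * errors / bases, 100.0 * blanks / bases


header = """2233116 bases of sequence found
16279 were errors
20 were blanks
0.73 percent error rate
0.00 percent blank rate
unique alignments : 62031 (total score 3937852)
cycles : 36
"""

bases, error_pc, blank_pc = summary_rates(header)
print("%.2f percent error rate" % error_pc)
print("%.2f percent blank rate" % blank_pc)
# In this example the base count equals unique alignments times cycles.
print(bases == 62031 * 36)
```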
s_#lane_#tile_score.txt
Various statistics of this tile and lane before quality filtering.
#RUN_TIME Thu Jun 26 20:47:35 2008
#SOFTWARE_VERSION @(#) $Id: score.pl,v 1.2 2008-05-30 07:36:06 wernersa Exp $
# Lane 2 : Tile 47 : Raw
2691216 bases of sequence found
33142 were errors
134 were blanks
1.23 percent error rate
0.00 percent blank rate
unique alignments : 74756 (total score 4168089)
cycles : 36
Breakdown of errors by cycle
Cycle: Err pc: Blank pc:
1 0.50 0.10
36 4.25 0.00
Error rate relative to reference base (including blanks)
(Given a reference base, what is it sequenced as?)
Really: Read as:
N pc A pc C pc G pc T pc
A 0.01 98.95 0.50 0.20 0.35
C 0.01 0.44 98.96 0.11 0.47
G 0.00 0.32 0.34 97.74 1.60
T 0.01 0.14 0.34 0.29 99.23
Error rate relative to reference base (excluding blanks)
(Given a reference base, what is it sequenced as?)
Really: Read as:
A pc C pc G pc T pc
A 98.96 0.50 0.20 0.35
C 0.44 98.97 0.11 0.47
G 0.32 0.34 97.74 1.60
T 0.14 0.34 0.29 99.24
Error rate relative to sequenced base
(Given a sequenced base, what was it really?)
Read as: Really:
A pc C pc G pc T pc
N 26.87 27.61 10.45 35.07
A 99.20 0.38 0.27 0.15
C 0.58 98.63 0.34 0.45
G 0.23 0.11 99.27 0.38
T 0.31 0.36 1.20 98.13
Breakdown of errors by nucleotide
Read As: Really:
A pc C pc G pc T pc
N 36 37 14 47
A 697767 2655 1931 1076
C 3502 592725 2048 2675
G 1377 669 586074 2245
T 2461 2838 9576 781374
Full breakdown of errors by cycle and nucleotide
Cycle: Read As: Really:
A ct C ct G ct T ct
1 N 22 19 8 25
1 A 19825 88 40 13
1 C 14 16733 12 36
1 G 21 11 15886 61
1 T 24 41 10 21867
36 N 0 0 0 0
36 A 18785 109 77 53
36 C 284 16441 184 113
36 G 84 53 14815 78
36 T 315 245 1578 21507
Error rate relative to reference by cycle/nucleotide
Cycle: Read As: Really:
A ct C ct G ct T ct
@1 N 0.001 0.001 0.001 0.001
@1 A 0.996 0.005 0.003 0.001
@1 C 0.001 0.991 0.001 0.002
@1 G 0.001 0.001 0.996 0.003
@1 T 0.001 0.002 0.001 0.994
@36 N 0.000 0.000 0.000 0.000
@36 A 0.965 0.006 0.005 0.002
@36 C 0.015 0.976 0.011 0.005
@36 G 0.004 0.003 0.890 0.004
@36 T 0.016 0.015 0.095 0.989
Information Content By Cycle
Bases this cycle Bases so far
Cycle: Equiv info: Align: Total: Equiv info: Align: Total:
~1 72766.19 74682 100268 72766.19 74682 100268
~36 64041.84 74756 98558 2548406.67 2691082 3537981
The 22 most common words with 2 blanks or less:
Ranking Occurrences Words
1 22 TCTCATATTGGCGCTACTGCAAAGGATATTTCTAAT
2 21 GTTTGATGAATGCAATGCGACAGGCTCATGCTGATG
3 20 CTTGCTATTGACTCTACTGTAGACATTTTTACTTTT
4 19 AGAACGTTTTTTACCTTTAGACATTACATCACTCCT TGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTT
5 18 ACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCAT AGATGGATAACCGCATCAAGCTCTTGGAAGAGATTC CGATTAGAGGCGTTTTATGATAATCCCAATGCTTTG GTATTCTGGCGTGAAGTCGCCGACTGAATGCCAGCA TGACTATTGACGTCCTTCCTCGTACGCCGGGCAATA
6 17 9 sequences
7 16 14 sequences
8 15 24 sequences
9 14 50 sequences
10 13 102 sequences
11 12 124 sequences
12 11 228 sequences
13 10 365 sequences
14 9 534 sequences
15 8 720 sequences
16 7 1002 sequences
17 6 1328 sequences
18 5 1531 sequences
19 4 1638 sequences
20 3 1502 sequences
21 2 1778 sequences
22 1 39095 sequences
The 30 most common blank patterns (N=any nonblank character)
Ranking Occurrences Words
1 97289 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
2 1661 N...................................
3 173 N....NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
4 134 N...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
5 128 .NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
6 79 N.....NNNNNNNNNN....NNNNNNNNNNNNNNNN
7 67 NN..................................
8 60 N.....NNNNNNNNNNN..NNNNNNNNNNNNNNNNN
9 58 ....................................
10 49 N.....NNNNNNNNNN...NNNNNNNNNNNNNNNNN
11 46 N.......................NNNN......NN
12 43 N.......NNNNN.......NNNNNNNNNNN..NNN
13 36 N.....NNNNNNNNN.....NNNNNNNNNNNNNNNN
14 34 N........NNNN.........NNNNNNNN...NNN
15 30 N.........NN...........NNNNNN.....NN
16 27 N.......................NNNN......N.
17 26 NNNNNNNNNNNNNNNNNNNN.NNNNNNNNNNNNNNN
18 22 N..........N...........NNNNNN.....NN N.......NNNNN........NNNNNNNNN...NNN N.......NNNNN.......NNNNNNNNNNN.NNNN
19 19 N......NNNNNN.......NNNNNNNNNNN.NNNN NN.........................N........
20 18 N.....NNNNNNNNNNNN.NNNNNNNNNNNNNNNNN
21 17 N......NNNNNNNN.....NNNNNNNNNNNNNNNN
22 16 N........NNN...........NNNNNN....NNN N.......NNNNN.......NNNNNNNNNN...NNN NN.NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
23 15 N.........................N.........
24 13 N........................NN......... N......................NNNNNN.....NN N.......NNNNN.........NNNNNNNN...NNN
25 12 N.......................NNNNN.....NN
26 11 N......NNNNNN.N.....NNNNNNNNNNN.NNNN N.....NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NN.....NNNNNNNN.......NNNNNNNNNNNNNN
27 10 N........................NNN........ NNN..NNNNNNNNNN..N...NNNNNNNNNNNNNNN
28 9 N........................NNN......N. N........NNN..........NNNNNNN....NNN N..NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNN.NNNNNNNNNNNNNNNN.NNNNNNNNNNNNNNN
29 5 N......NNNNNNNN.....NNNNNNNNNNN.NNNN NN........................NN........ NN.....NNNNNNNN.......NNNNNNNNN.NNNN NNN..NNNNNNNNNNNNN...NNNNNNNNNNNNNNN
30 4 NN.....NNNNNN..........NNNNNNNN.NNNN
Log likelihood scores
Cycle: Read As: Really:
A C G T
>1 N -37 -46 -91 -29
>1 A 214 -235 -269 -318
>1 C -307 243 -314 -266
>1 G -288 -316 223 -241
>1 T -296 -272 -334 246
>36 N -47 -47 -47 -47
>36 A 189 -223 -239 -255
>36 C -177 145 -196 -217
>36 G -225 -245 183 -228
>36 T -186 -198 -114 100
Cumulative errors by cycle
Cycle: 1 2 3 4 5
!1 74311 74756 74756 74756 74756
!36 54261 66703 72317 74193 74601
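The section "Error rate relative to sequenced base" seems to be nothing more than the count table "Breakdown of errors by nucleotide" normalized per row (note that the count table labels its columns "pc" although it holds counts). The sketch below redoes that calculation for the raw counts above. The function name is made up.

```python
# Counts from "Breakdown of errors by nucleotide" in the raw score file.
# Rows: base as read. Columns: real base A, C, G, T.
counts = {
    "N": [36, 37, 14, 47],
    "A": [697767, 2655, 1931, 1076],
    "C": [3502, 592725, 2048, 2675],
    "G": [1377, 669, 586074, 2245],
    "T": [2461, 2838, 9576, 781374],
}


def row_percentages(row):
    """Express each count as a percentage of the row total."""
    total = sum(row)
    return [round(100.0 * value / total, 2) for value in row]


for read_as, row in counts.items():
    print(read_as, " ".join("%.2f" % pc for pc in row_percentages(row)))
```

The printed rows agree with the percentages listed under "Error rate relative to sequenced base" in the same file.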
PhageAlign Output
The s_N_TTT_align.txt files contain the unfiltered first-pass
alignments for a given tile. The s_N_TTTT_prealign.txt files contain a
recalibration of the aligned sequences that takes into account
the error model built from the alignments (_align.txt and
_score.txt). The file s_N_realign.txt consists of the alignments in
s_N_TTT_prealign.txt that pass the filter criteria.
Lane summaries
s_#lane_cov.png
The coverage of the bases compared to the genome.
s_2_all.png
The average intensity over all tiles.
s_2_call.png
The average intensity of only the called bases.
s_2_calsaf.txt
s_2_eland_extended.txt
s_2_eland_multi.txt
s_2_eland_query.txt
s_#lane_export.txt
This file contains the following fields:
[1. Machine]
[2. Run number]
3. Lane
4. Tile
5. X coordinate cluster
6. Y coordinate cluster
[7. index string]
[8. Read number (1 or 2 for paired-end reads, blank for single
read analysis)]
9. Read
10. Quality string. The ASCII character code = quality value +64
11. Match chromosome - name of the chromosome match or code
indicating why no match resulted.
12. Match contig - gives the contig name if there is a match and
the match chromosome is split into contigs (blank if no match is found)
13. Match position - always with respect to forward strand,
numbering starts at 1 (blank if no match found)
14. Match strand - "F" for forward, "R" for reverse
15. Match descriptor. Concise description of the alignment. A
number denotes a run of matching bases. A letter denotes
substitution of a nucleotide. E.g. 32C2 denotes substitution of a C at
the 33rd position.
16. Single read alignment score - alignment score of a single-read
match, or for a paired-read, alignment score of a read if it were
treated as a single read.
[17. paired-read alignment score]
[18. partner chromosome - name of the chromosome if the read is
paired and its partner aligns to another chromosome]
[19. Partner Contig]
[20. Partner Offset]
[21. Partner Strand]
22. Filtering. Did the read pass quality filtering. Y for yes, N
for no.
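The match descriptor of field 15 can be decoded by walking over its tokens. The sketch below handles only the two token types described above (numbers for runs of matching bases, letters for substitutions). Real files may contain other tokens that it does not know about, and the function name is made up.

```python
import re


def substitutions(descriptor):
    """Return ([(position, letter), ...], read_length) with 1-based positions."""
    if not re.fullmatch(r"(?:\d+|[A-Za-z])*", descriptor):
        raise ValueError("unsupported match descriptor: %r" % descriptor)
    position = 0
    found = []
    for token in re.findall(r"\d+|[A-Za-z]", descriptor):
        if token.isdigit():
            position += int(token)  # a run of matching bases
        else:
            position += 1  # a single substituted base
            found.append((position, token))
    return found, position


print(substitutions("32C2"))
```

For 32C2 this reports a C at position 33 and a total length of 35 bases.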
An example fragment of such a file.
HWI-EAS264 303KWAAXX 1 1 638 200 GTNNNTTTTCTGCTTAGNNGTTTAATCATGTTTCAA NN???NNNNNNNNNNNN??NNNNNNNNNNNNNNNNN QC N
HWI-EAS264 303KWAAXX 1 1 1231 395 ACNNNCCAGAACGTGAANNAGCGTCCTGCGTGTAGC PN???NNNNNNNNNNNN??NNNNNNNNNNNNNNNNN QC N
HWI-EAS264 303KWAAXX 1 1 1061 436 GTNNNCCGCATGACCTTNNCCATCTTGGCTTCCTTT NN???NNNNNNNNNNNN??NNNNNNNNNNNNNNNNC QC N
HWI-EAS264 303KWAAXX 1 1 1076 412 GGNNNGTAGCGACAGCTNNGTTTTTAGTGAGTTGTT NN???NNNNNNNNNNNN??NNNNNNNNNNNNNNNNN QC N
HWI-EAS264 303KWAAXX 1 1 1785 543 GNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN N??????????????????????????????????? QC N
HWI-EAS264 303KWAAXX 1 1 552 176 GGNNNGTTATAACGCCGNNGCGGTAATAAACTCAAT NJ???NJNNNNNNJJNN??NNJNNNNBJNDDNBNEE QC N
HWI-EAS264 303KWAAXX 1 1 477 231 GTNNNGACAGCTTGGTTNNTAGTGAGTTTTTCCATT PN???NNNNNNNNNNNN??NNNNJNJNNDNNNNNNN QC N
HWI-EAS264 303KWAAXX 1 1 1598 633 GANNNTTTGACGGTTAANNGTGGTAATGGTGGTTTT PN???NNNNNNNNNNNN??HNNNNNNNNNNENENNN QC N
HWI-EAS264 303KWAAXX 1 1 653 186 GANNNTTTGCTATTCAGNNTTTGATGAATGCAATGC PN???NNNNNNNNNNNN??NNNNNNNNNNNNNNNEN QC N
HWI-EAS264 303KWAAXX 1 1 1066 418 GCNNNATGTTTACTCTTNNGCTTGTTCGTTTTCCGC NN???NNNNNNNNNNNN??NNNNJNNNNNNNNNNEN QC N
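A line of the export file can be turned into a dictionary keyed by the 22 fields listed above. The sketch below assumes that the fields are tab separated and that empty fields are kept as empty strings. This is an assumption, because the tabs are not visible in the fragment above. The key names are made up, and the example line is the first row of the fragment rebuilt by hand with the empty fields put back in.

```python
EXPORT_FIELDS = [
    "machine", "run_number", "lane", "tile", "x", "y", "index",
    "read_number", "read", "quality", "match_chromosome", "match_contig",
    "match_position", "match_strand", "match_descriptor",
    "single_read_score", "paired_read_score", "partner_chromosome",
    "partner_contig", "partner_offset", "partner_strand", "filter",
]


def parse_export_line(line):
    """Map one tab separated export row onto the 22 field names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(EXPORT_FIELDS):
        raise ValueError("expected 22 fields, got %d" % len(values))
    return dict(zip(EXPORT_FIELDS, values))


def quality_values(quality):
    """Field 10: ASCII character code = quality value + 64."""
    return [ord(char) - 64 for char in quality]


# First row of the fragment above, rebuilt with assumed tab separators.
example = "\t".join(
    ["HWI-EAS264", "303KWAAXX", "1", "1", "638", "200", "", "",
     "GTNNNTTTTCTGCTTAGNNGTTTAATCATGTTTCAA",
     "NN???NNNNNNNNNNNN??NNNNNNNNNNNNNNNNN",
     "QC"] + [""] * 10 + ["N"]
)

record = parse_export_line(example)
print(record["match_chromosome"], record["filter"])
print(quality_values(record["quality"])[:5])
```

With the +64 offset the character N decodes to 14 and ? to -1.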
s_2_filt.txt
s_2_finished.txt
A file used by make to decide whether the process has finished or not.
s_2_frag.txt
s_2_percent_all.png
s_2_percent_base.png
s_2_percent_call.png
s_2_qcalreport.txt
s_2_qcal.txt
s_2_qraw.txt
s_2_qreport.txt
s_2_qtable.txt
s_2_saf.txt
s_2_score_files.txt
s_2_seqpre.txt
s_#lane_sequence.txt
Contains the sequences, quality scores and clusters after filtering. A
simple way to extract only the sequence lines is grep
'^[ACGT]' s_1_sequence.txt. The output is in FASTQ format
but with the quality characters encoded as score+64 instead of the score+33 of standard FASTQ, supposedly to account for 'the dynamic
range'. This doesn't make much sense since a) nobody will
print these files; b) the larger offset reduces the dynamic range
and since no scaling is reported this is probably nonsense.
@HWI-EAS264_303KWAAXX:4:1:978:308
GGTTGATATTTTTCATGGTATTGATAAAGCTGTTGC
+HWI-EAS264_303KWAAXX:4:1:978:308
]]]]]]]]]]]]]]]]][]]]][]]]][U][S[[P[
@HWI-EAS264_303KWAAXX:4:1:1246:245
GAAGTTAACACTTTCGGATATTTCTGATGAGTCGAA
+HWI-EAS264_303KWAAXX:4:1:1246:245
]]]]]]]]]]]]]]]][[]]]]][]U[][[[\\X\\
@HWI-EAS264_303KWAAXX:4:1:96:397
GTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTG
+HWI-EAS264_303KWAAXX:4:1:96:397
]]]]]][]V]]]]V]]][[U[]V[]]]]R][[N[[M
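To get at the actual scores, the quality characters have to be shifted back by 64. The sketch below reads the four-line records, decodes the scores and can re-encode them with an offset of 33. All names are made up. It only moves the offset and does not rescale the values, and it assumes every record takes exactly four lines as in the example above.

```python
ILLUMINA_OFFSET = 64  # offset used in s_N_sequence.txt
STANDARD_OFFSET = 33  # offset of standard FASTQ


def read_records(lines):
    """Yield (name, sequence, quality) from four-line records."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(cleaned) - 3, 4):
        header, sequence, plus, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record starting at line %d" % (i + 1))
        yield header[1:], sequence, quality


def decode_scores(quality):
    """Quality characters to integer scores."""
    return [ord(char) - ILLUMINA_OFFSET for char in quality]


def to_offset_33(quality):
    """Re-encode a +64 quality string with the +33 offset."""
    scores = decode_scores(quality)
    if min(scores) < 0:
        raise ValueError("negative scores cannot be written with offset 33")
    return "".join(chr(score + STANDARD_OFFSET) for score in scores)


lines = [
    "@HWI-EAS264_303KWAAXX:4:1:978:308",
    "GGTTGATATTTTTCATGGTATTGATAAAGCTGTTGC",
    "+HWI-EAS264_303KWAAXX:4:1:978:308",
    "]]]]]]]]]]]]]]]]][]]]][]]]][U][S[[P[",
]

for name, sequence, quality in read_records(lines):
    print(name, decode_scores(quality)[:3], to_offset_33(quality)[:3])
```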
s_2_Signal_Means.txt
s_2_sorted.txt
More Deep Sequencing notes
- http://analysis.yellowcouch.org/