The TSNE Plot
The TSNE plot is a dimensionality reduction technique which is a way
to graphically simplify very large datasets. Within seqmonk it can be
used to cluster data stores on the basis of the current quantitation across
a large number of probes.
Conceptually the TSNE plot is similar to a PCA, but with some important
differences:
- TSNE always produces a 2D separationk, in contrast to PCA which can
produce many different components
- TSNE is non-deterministic, meaning you won't get exactly the same
output each time you run it (though the results are likely to be similar
- TSNE tends to cope better with non-linear signals in your data, so
odd outliers tend to have less of an effect, and often the visible separation
between relevant groups is improved
- TSNE offers no ability to reverse engineer the groups it identifies so
in contrast to PCA you can't make a probe list based on the separation you see
The TSNE plot will work on whichever data stores are currently displayed
in the chromosome view and will use the currently selected probe list. TSNE
tends to become very resource (both CPU and memory) hungry as the number of
probes used increases so we'd recommend limiting the plot to cases where your
number of probes is no more than a couple of thousand.
With the TSNE plot you can put your mouse over any individual
point, which will then cause the name of that point to be drawn under it so you
can tell which point is which. You can also tick the labels box to see all sample
labels (though this might get a bit messy). There is also an option to highlight any
replicate sets you've made in the project so you can see if groups of data stores
which you would expect to cluster together actually behave that way in your data.
Options
- Perplexity - this is a number which represents roughly how many samples per
cluster you expect to see. The technique is reasonably robust to the value you
use here but altering it will have some effect. The default is the number of
data stores divided by 5 (so assuming we'll see 5 groups) but with a limit of
>=2 and <=50.
- Max Iterations - TSNE is an iterative process the differences between samples
are continually refined. You can set a limit on the maximum number of iterations
to be performed. For large datasets this might speed up the time taken to get
an answer, but for the most part you should leave this set to the default of 1000