.. role:: img-inline
.. |colab_logo| image:: content/images/colab.png
   :class: img-inline
   :alt: SPIT Colab
   :target: https://colab.research.google.com/github/berilerdogdu/spit/blob/master/notebooks/SPIT.ipynb
.. raw:: html

   <style>
   .img-inline {
       height: 2em;
       vertical-align: middle;
   }
   </style>

******
SPIT
******

**SPIT** detects differential transcript usage (DTU) events exclusive to subgroups as well as DTU events shared amongst all case samples.
Downstream of DTU analysis, SPIT applies hierarchical clustering to the detected DTU events to provide insight
into potentially hierarchical subgrouping patterns present in complex disease populations.

Preparing input data
############################

**SPIT** is ready to run almost as soon as transcripts have been quantified using your favorite method.
Since various quantification methods may present their results in varying formats, **SPIT** requires that
the input is standardized, which can often be achieved using the tximport package [#diff_analysis_sonenson]_. For examples of data formats expected by **SPIT**
please consult the :ref:`File Formats <file-formats>` section.

Using SPIT to Find DTU Genes
############################################

For a complete example of how to run **SPIT** please consult the :ref:`Examples <examples>` section.

Alternatively, **SPIT** can be run with no installation or setup from the |colab_logo| `Google Colab <https://colab.research.google.com/github/berilerdogdu/spit/blob/master/notebooks/SPIT.ipynb>`__ notebook!

Output Formats
############################################

For information regarding all file formats used in **SPIT** please consult the :ref:`File Formats <file-formats>` section.

Modules and Parameters
############################################

While default parameters of **SPIT** are designed to be applicable to a wide array of tasks,
multiple optional modules and parameters have been implemented to allow fine-tuning of the analysis.
Users can run any of the included modules with ``-h/--help`` to learn more about each program.
This section provides extended descriptions of some of the more important parameters in different modules.


confounding_analysis.py
*****************************

This module helps control for confounding factors inherently present in the experimental design.
By fitting a random forest regressor on each detected DTU transcript, the module determines
which covariates might be contributing to the observed variance in isoform fraction (IF) values.
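
The idea can be illustrated with a minimal sketch, assuming scikit-learn's ``RandomForestRegressor`` and a simulated dataset.
The covariate names (``age``, ``batch``) and the IF values are hypothetical, and the module's internal implementation may differ.

.. code-block:: python

   import random

   from sklearn.ensemble import RandomForestRegressor

   random.seed(0)

   # Hypothetical covariates for 60 samples: age (years) and sequencing batch (0/1).
   covariate_names = ["age", "batch"]
   covariates = [[random.uniform(20, 80), random.randint(0, 1)] for _ in range(60)]

   # Simulated IF values for one DTU transcript, driven by batch plus small noise.
   if_values = [0.2 + 0.5 * batch + random.gauss(0, 0.02) for _, batch in covariates]

   # One regressor is fitted per transcript; a single transcript is shown here.
   forest = RandomForestRegressor(n_estimators=100, random_state=0)
   forest.fit(covariates, if_values)

   # Impurity-based importances sum to 1 across covariates.
   importances = dict(zip(covariate_names, forest.feature_importances_))
   for name, score in importances.items():
       print(f"{name}: {score:.3f}")

In this toy setting the ``batch`` covariate receives nearly all of the importance, which would flag the transcript as potentially confounded by batch.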

``-i``
""""""""""""""""""""""""

Required

File containing Isoform Fractions (IF).

``-l``
""""""""""""""""""""""""

Required

TSV file containing metadata and labels for each sample.

``--cluster_matrix``
""""""""""""""""""""""""

Required

Cluster matrix file generated by the :ref:`DTU Detection <dtu-detection>` module of SPIT.

``-n/--n_small``
""""""""""""""""""""""""

Optional

Default: 12

This value sets the smallest number of samples required within each subgroup for it to be considered
in the analysis.

``-o``
""""""""""""""""""""""""

Required

File generated by the :ref:`DTU Detection <dtu-detection>` module of SPIT that contains the full list of
candidate DTU genes.

``-M``
""""""""""""""""""""""""

Required

Output file to which the updated SPIT cluster matrix will be written. This file is a subset of the input file
provided via the ``--cluster_matrix`` option.

``-O``
""""""""""""""""""""""""

Required

Output file to which the updated DTU genes will be written. This file is a subset of the input file
provided via the ``-o`` option.

``-P``
""""""""""""""""""""""""

Optional

Output PDF file in which, if provided, visualizations of the confounding analysis will be saved.
The plots saved in this file visualize the importance of each covariate for each transcript.
You can see additional examples in the :ref:`Tutorial section <examples>`.


dtu_detection.py
*****************************

This module fits a Kernel Density Estimate (KDE) to the Isoform Fraction (IF) distributions to search for separation between case and control samples.
A p-value threshold is then used to select significant events from all candidate DTU events.

``-i``
""""""""""""""""""""""""

Required

TSV file containing Isoform Fractions (IF).

``-g``
""""""""""""""""""""""""

Required

TSV file containing Gene Counts.

``-m``
""""""""""""""""""""""""

Required

TSV file containing a mapping of transcript IDs to gene IDs.

``-l``
""""""""""""""""""""""""

Required

TSV file containing metadata and labels for each sample.

``--p_cutoff``
""""""""""""""""""""""""

Optional

Default: 0.05

This value sets the p-value cutoff obtained from the SPIT-Test permutations.

``-b``
""""""""""""""""""""""""

Optional

This parameter sets the Kernel Density Estimation (KDE) bandwidth,
which controls the smoothness of the estimated density.
Higher values yield smoother, more generalizable densities at the cost of finer features,
while lower values preserve finer features but are more prone to noise.
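
The effect of the bandwidth can be seen in a small, self-contained sketch of a Gaussian KDE.
The IF values, the helper names, and the two bandwidths below are hypothetical and chosen only for illustration; they are not SPIT defaults.

.. code-block:: python

   import math


   def gaussian_kde(samples, bandwidth):
       """Return a function evaluating a Gaussian KDE of `samples` at x."""
       norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

       def density(x):
           return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

       return density


   def count_modes(density, steps=100):
       """Count local maxima of `density` on an evenly spaced grid over [0, 1]."""
       values = [density(i / steps) for i in range(steps + 1)]
       return sum(
           1
           for i in range(1, steps)
           if values[i] > values[i - 1] and values[i] >= values[i + 1]
       )


   # Hypothetical IF values forming two groups, around 0.12 and 0.82.
   if_values = [0.10, 0.12, 0.15, 0.80, 0.82, 0.85]

   narrow = count_modes(gaussian_kde(if_values, bandwidth=0.05))
   wide = count_modes(gaussian_kde(if_values, bandwidth=0.5))
   print(narrow, wide)

With the narrow bandwidth the two groups remain separate peaks, while the wide bandwidth merges them into a single peak.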

``-n/--n_small``
""""""""""""""""""""""""

Optional

Default: 12

This value sets the smallest number of samples required within each subgroup for it to be considered
in the analysis.

``--f_cpm``
""""""""""""""""""""""""

Optional

This flag enables filtering by Counts Per Million (CPM).
If enabled, only genes with a CPM over 10 will be used in the analysis.
Filtering sets NA values in the SPIT cluster matrix for any samples that were not included in the DTU detection for a transcript.
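
As a rough sketch of what a CPM filter does, the snippet below computes CPM from gene counts and flags which samples pass the threshold for each gene.
It assumes CPM is computed per sample against that sample's total count and that the threshold is applied per gene and per sample; the function names and data are hypothetical.

.. code-block:: python

   def counts_per_million(gene_counts):
       """Convert {gene: [count per sample]} into CPM using per-sample library sizes."""
       n_samples = len(next(iter(gene_counts.values())))
       library_sizes = [
           sum(counts[j] for counts in gene_counts.values()) for j in range(n_samples)
       ]
       return {
           gene: [1e6 * c / size for c, size in zip(counts, library_sizes)]
           for gene, counts in gene_counts.items()
       }


   def mask_low_cpm(gene_counts, min_cpm=10):
       """Flag, per gene and sample, whether the gene exceeds the CPM threshold."""
       cpm = counts_per_million(gene_counts)
       return {gene: [v > min_cpm for v in values] for gene, values in cpm.items()}


   # Hypothetical gene counts for three samples (each sample totals 1,000,000 reads).
   gene_counts = {
       "geneA": [500, 0, 300],
       "geneB": [999_500, 1_000_000, 999_700],
   }
   passes = mask_low_cpm(gene_counts)
   print(passes["geneA"])

Samples flagged ``False`` for a gene correspond to the entries that would appear as NA in the cluster matrix.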

``-M``
""""""""""""""""""""""""

Required

Output file to which the SPIT cluster matrix will be written.

``-O``
""""""""""""""""""""""""

Required

Output file to which the DTU genes will be written.


filter_and_transform_tx_counts.py
**************************************

This module performs pre-filtering of the input transcripts.
The protocol implemented in this module closely follows the filtering criteria defined in DRIMSeq (Nowicka et al., 2016)
and involves filtering by Counts Per Million (CPM), sample count, Isoform Fraction (IF), and transcript count per gene.

``-i``
""""""""""""""""""""""""

Required

TSV file containing transcript counts.

``-m``
""""""""""""""""""""""""

Required

TSV file containing a mapping of transcript IDs to gene IDs.

``-l``
""""""""""""""""""""""""

Required

TSV file containing metadata and labels for each sample.

``-T``
""""""""""""""""""""""""

Required

Output file where filtered transcript counts will be written.

``-F``
""""""""""""""""""""""""

Required

Output file where filtered isoform fractions (IFs) will be written.

``-G``
""""""""""""""""""""""""

Required

Output file where filtered gene counts will be written.

``-w``
""""""""""""""""""""""""

Optional

This flag enables writing to stdout the number of transcripts and genes left after each filtering step.

``-n/--n_small``
""""""""""""""""""""""""

Optional

Default: 12

This value sets the smallest number of samples required within each subgroup for it to be considered
in the analysis.

``-p/--pr_fraction``
""""""""""""""""""""""""

Optional

Default: 0.2

This value sets the minimum fraction of samples (cases and controls combined) that must have a positive read count.
For instance, if set to 0.1, only transcripts with a positive read count in at least 10% of the samples are retained.
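
A minimal sketch of this criterion, assuming transcript counts are held as a plain dictionary of per-sample lists (the function name and data are hypothetical):

.. code-block:: python

   def filter_by_positive_fraction(tx_counts, pr_fraction=0.2):
       """Keep transcripts with a positive count in at least `pr_fraction` of samples."""
       kept = {}
       for tx_id, counts in tx_counts.items():
           positive = sum(1 for c in counts if c > 0)
           if positive >= pr_fraction * len(counts):
               kept[tx_id] = counts
       return kept


   # Hypothetical transcript counts across ten samples (cases and controls together).
   tx_counts = {
       "tx1": [5, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # positive in 10% of samples
       "tx2": [5, 3, 0, 0, 8, 0, 0, 0, 0, 0],  # positive in 30% of samples
   }
   print(sorted(filter_by_positive_fraction(tx_counts, pr_fraction=0.2)))

With the default of 0.2 only ``tx2`` is retained; lowering the value to 0.1 would retain ``tx1`` as well.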

``-f/--if_fraction``
""""""""""""""""""""""""

Optional

Default: 0.1

This value sets a threshold on the Isoform Fraction (IF) of each transcript.
Only transcripts with an IF value larger than ``--if_fraction`` in at least ``--n_small`` samples are retained.
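
The interaction between the IF threshold and ``n_small`` can be sketched as follows; the function name and the IF values are hypothetical.

.. code-block:: python

   def filter_by_isoform_fraction(ifs, if_fraction=0.1, n_small=12):
       """Keep transcripts whose IF exceeds `if_fraction` in at least `n_small` samples."""
       return {
           tx_id: values
           for tx_id, values in ifs.items()
           if sum(1 for v in values if v > if_fraction) >= n_small
       }


   # Hypothetical IF values across 20 samples.
   ifs = {
       "tx1": [0.6] * 15 + [0.05] * 5,  # above 0.1 in 15 samples
       "tx2": [0.3] * 4 + [0.02] * 16,  # above 0.1 in only 4 samples
   }
   print(sorted(filter_by_isoform_fraction(ifs)))

Here ``tx1`` passes because 15 samples exceed the threshold, while ``tx2`` is dropped because only 4 do.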

``-c/--genefilter_count``
"""""""""""""""""""""""""""

Optional

Default: 10

The value controls the minimum read count for genes considered in the analysis.
For instance, when set to 10, only genes with a read count >= 10 will be considered.

``-s/--genefilter_sample``
""""""""""""""""""""""""""""

Optional

Default: 10

The value controls the minimum sample count for genes considered in the analysis.
For instance, when set to 10, only genes that appear in >= 10 samples will be considered.


get_p_cutoff.py
**************************************

This module retrieves the p-value threshold used to detect significant DTU events.
The threshold is determined by the user-defined parameter K: it is the (K x N)-th smallest
p-value among the N minimum p-values sampled by the SPIT-Test.
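
The selection rule can be written in a few lines.
This is a sketch only: the function name and p-values are hypothetical, and rounding K x N to the nearest integer rank is an assumption.

.. code-block:: python

   def p_value_cutoff(min_p_values, k=0.6):
       """Return the (k * N)-th smallest of the N sampled minimum p-values."""
       n = len(min_p_values)
       rank = max(1, round(k * n))  # 1-based rank; the rounding rule is an assumption
       return sorted(min_p_values)[rank - 1]


   # Hypothetical minimum p-values from N = 10 SPIT-Test iterations.
   min_p_values = [0.20, 0.01, 0.07, 0.03, 0.15, 0.002, 0.09, 0.05, 0.30, 0.11]

   print(p_value_cutoff(min_p_values, k=0.6))  # 6th smallest value
   print(p_value_cutoff(min_p_values, k=0.2))  # 2nd smallest value

A smaller K selects a lower rank and therefore a smaller, more stringent threshold.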

``-k``
""""""""""""""""""""""""

Optional

Default: 0.6

The value sets the K hyperparameter used in obtaining p-value thresholds.
Smaller values of K will yield more stringent p-value thresholds.

``-p``
""""""""""""""""""""""""

Required

File containing minimum p-values computed during the SPIT-Test iterations.


transform_tx_counts_to_ifs.py
**************************************

This module directly converts transcript counts to Isoform Fractions (IF).
Transcripts with zero counts in all samples are still filtered out.
The module is useful when skipping the pre-filtering step.
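
The conversion itself can be sketched as below, taking the IF of a transcript in a sample to be its count divided by the summed counts of all transcripts of the same gene in that sample.
The function name, the data layout, and the use of ``None`` for undefined fractions are assumptions made for illustration.

.. code-block:: python

   from collections import defaultdict


   def counts_to_ifs(tx_counts, tx2gene):
       """Return (isoform fractions, gene counts) from per-sample transcript counts."""
       # Drop transcripts whose counts are zero in every sample.
       tx_counts = {tx: c for tx, c in tx_counts.items() if any(v > 0 for v in c)}

       # Gene count per sample = sum over the gene's remaining transcripts.
       n_samples = len(next(iter(tx_counts.values())))
       gene_counts = defaultdict(lambda: [0] * n_samples)
       for tx, counts in tx_counts.items():
           totals = gene_counts[tx2gene[tx]]
           for j, value in enumerate(counts):
               totals[j] += value

       # IF = transcript count / gene count; undefined (None) when the gene count is 0.
       ifs = {
           tx: [
               value / total if total > 0 else None
               for value, total in zip(counts, gene_counts[tx2gene[tx]])
           ]
           for tx, counts in tx_counts.items()
       }
       return ifs, dict(gene_counts)


   # Hypothetical counts for three transcripts of one gene across two samples.
   tx_counts = {"tx1": [30, 0], "tx2": [10, 20], "tx3": [0, 0]}
   tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneA"}
   ifs, gene_counts = counts_to_ifs(tx_counts, tx2gene)
   print(ifs)
   print(gene_counts)

In this example ``tx3`` is removed because it has no non-zero counts, and the IFs of ``tx1`` and ``tx2`` sum to 1 in each sample.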

``-i``
""""""""""""""""""""""""

Required

TSV file containing transcript counts.

``-m``
""""""""""""""""""""""""

Required

TSV file containing a mapping of transcript IDs to gene IDs.

``-F``
""""""""""""""""""""""""

Required

Output file where isoform fractions (IFs) will be written.

``-G``
""""""""""""""""""""""""

Required

Output file where gene counts will be written.

``-w``
""""""""""""""""""""""""

Optional

This flag enables writing to stdout the number of transcripts and genes left after each filtering step.


spit_test.py
**************************************

This module performs the SPIT-Test. The test randomly splits the control group in half,
and identifies the most significant difference in isoform fractions between the two halves.
The observations from this process can then be used to compare candidate DTU events in terms of their significance.
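
The splitting procedure can be sketched as follows.
This is an illustration under stated assumptions rather than the module's implementation: a Mann-Whitney U test with a normal approximation stands in for whichever test the module applies, and the function names and simulated IF values are hypothetical.

.. code-block:: python

   import math
   import random


   def mann_whitney_p(x, y):
       """Two-sided Mann-Whitney U p-value (normal approximation, no tie correction)."""
       m, n = len(x), len(y)
       # U counts pairs where x > y, with ties counted as one half.
       u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
       mean = m * n / 2
       sd = math.sqrt(m * n * (m + n + 1) / 12)
       return math.erfc(abs(u - mean) / sd / math.sqrt(2))


   def spit_test_min_p_values(control_ifs, n_iter=1000, seed=0):
       """Collect the smallest p-value over all transcripts for each random split."""
       rng = random.Random(seed)
       n_samples = len(next(iter(control_ifs.values())))
       half = n_samples // 2
       min_p_values = []
       for _ in range(n_iter):
           # Shuffle the sample indices and cut them into two equal halves.
           order = rng.sample(range(n_samples), n_samples)
           first, second = order[:half], order[half : 2 * half]
           min_p_values.append(
               min(
                   mann_whitney_p([v[i] for i in first], [v[i] for i in second])
                   for v in control_ifs.values()
               )
           )
       return min_p_values


   # Simulated IF values for 5 transcripts across 20 control samples.
   data_rng = random.Random(1)
   control_ifs = {f"tx{t}": [data_rng.random() for _ in range(20)] for t in range(5)}
   min_p = spit_test_min_p_values(control_ifs, n_iter=100)
   print(len(min_p), min(min_p))

Because both halves come from the control group, the collected minimum p-values describe how extreme a difference can arise by chance alone.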


``-i``
""""""""""""""""""""""""

Required

File containing Isoform Fractions (IF).

``-l``
""""""""""""""""""""""""

Required

TSV file containing metadata and labels for each sample.

``-g``
""""""""""""""""""""""""

Required

TSV file containing Gene Counts.

``-n``
""""""""""""""""""""""""

Optional

Default: 1000

SPIT-Test is computed by randomly splitting control samples into two sets of equal size.
This value controls the number of iterations performed.

``-d/--p_dom``
""""""""""""""""""""""""

Optional

Default: 0.75

This value sets the dominance selection threshold.

``--n_small``
""""""""""""""""""""""""

Optional

Default: 12

This value sets the smallest number of samples required within each subgroup for it to be considered
in the analysis.

``-I``
""""""""""""""""""""""""

Required

Output file where dominance-selected isoform fractions (IFs) will be written.

``-G``
""""""""""""""""""""""""

Required

Output file where dominance-selected gene counts will be written.

``-P``
""""""""""""""""""""""""

Required

Output file where minimum p-values from all iterations will be written.


References
############################################

.. [#diff_analysis_sonenson] Soneson C, Love MI and Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences [version 1; peer review: 2 approved]. F1000Research 2015, 4:1521. https://doi.org/10.12688/f1000research.7563.1