SPIT

SPIT detects differential transcript usage (DTU) events exclusive to subgroups as well as DTU events shared among all case samples. Downstream of DTU analysis, SPIT applies hierarchical clustering to the detected DTU events to provide insight into potentially hierarchical subgrouping patterns present in complex disease populations.

Preparing input data

SPIT is ready to run almost as soon as transcripts have been quantified using your favorite method. Since various quantification methods may present their results in varying formats, SPIT requires that the input is standardized, which can often be achieved using the tximport package [1]. For examples of data formats expected by SPIT please consult the File Formats section.

Using SPIT to Find DTU Genes

For a complete example of how to run SPIT please consult the Examples section.

Alternatively, SPIT can be run with no installation or setup from the SPIT Google Colab notebook.

Output Formats

For information regarding all file formats used in SPIT please consult the File Formats section.

Modules and Parameters

While default parameters of SPIT are designed to be applicable to a wide array of tasks, multiple optional modules and parameters have been implemented to allow fine-tuning of the analysis. Users can run -h/--help on any of the included modules to learn more about each program. This section provides extended descriptions of some of the more important parameters in different modules.

confounding_analysis.py

This module helps control for confounding factors inherently present in the experimental design. By fitting a random forest regressor on each detected DTU transcript, the module determines which covariates might be contributing to the observed variance in isoform fraction (IF) values.

-i

Required

File containing Isoform Fractions (IF).

-l

Required

TSV file containing metadata and labels for each sample.

--cluster_matrix

Required

Cluster matrix file generated by the DTU Detection module of SPIT.

-n/--n_small

Optional

Default: 12

The value controls the smallest number of samples required within each subgroup for it to be considered in the analysis.

-o

Required

The file generated by the DTU Detection module of SPIT, which contains the full list of candidate DTU genes.

-M

Required

Output file where the updated SPIT cluster matrix will be written. This file is a subset of the input file provided via the --cluster_matrix option.

-O

Required

Output file where the updated DTU genes will be written. This file is a subset of the input file provided via the -o option.

-P

Optional

Output PDF file where visualizations of the confounding analysis will be saved when this option is provided. The plots show the importance of each covariate for each transcript. You can see additional examples in the Tutorial section.

dtu_detection.py

This module fits a Kernel Density Estimation (KDE) on Isoform Fraction (IF) distributions to search for separation between case and control samples. A p-value threshold is then used to select significant events out of all candidate DTU events.

-i

Required

TSV file containing Isoform Fractions (IF).

-g

Required

TSV file containing Gene Counts.

-m

Required

TSV file containing a mapping of transcript IDs to gene IDs.

-l

Required

TSV file containing metadata and labels for each sample.

--p_cutoff

Optional

Default: 0.05

The value sets the p-value cutoff obtained from the SPIT-Test permutations.

-b

Optional

The parameter controls the Kernel Density Estimation (KDE) bandwidth, which determines the smoothness of the estimated density. Higher values give smoother, more generalizable densities at the cost of finer features, while lower values are more prone to noise.

-n/--n_small

Optional

Default: 12

The value controls the smallest number of samples required within each subgroup for it to be considered in the analysis.

--f_cpm

Optional

This flag enables filtering by Counts Per Million (CPM). If enabled, only genes with a CPM over 10 will be used in the analysis. Filtering will set NA values in the SPIT cluster matrix for any samples that were not included in the DTU detection for a transcript.

-M

Required

Output File where SPIT cluster matrix will be written to.

-O

Required

Output File where DTU genes will be written to.
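To illustrate how the -b bandwidth affects a density estimate, the following is a minimal sketch of a Gaussian KDE written with the Python standard library only. It is not SPIT's implementation: the function names gaussian_kde and count_modes, the evaluation grid, and the toy IF values are all assumptions made for illustration.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function that evaluates a Gaussian KDE of `values` at x."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

    return density

def count_modes(density, grid):
    """Count strict local maxima of `density` on the interior of `grid`."""
    ys = [density(x) for x in grid]
    return sum(1 for i in range(1, len(ys) - 1) if ys[i - 1] < ys[i] > ys[i + 1])

# Toy isoform fractions (made-up): one group near 0.2, another near 0.8.
ifs = [0.18, 0.20, 0.22, 0.78, 0.80, 0.82]
grid = [i / 100 for i in range(101)]  # IF values lie in [0, 1]

modes_narrow = count_modes(gaussian_kde(ifs, 0.05), grid)  # small bandwidth
modes_wide = count_modes(gaussian_kde(ifs, 0.5), grid)     # large bandwidth
print(modes_narrow, modes_wide)
```

With these toy values the narrow bandwidth resolves the two groups as separate modes, while the wide bandwidth merges them into a single mode. This is the trade-off between finer features and smoothness described for -b.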

filter_and_transform_tx_counts.py

This module performs pre-filtering of the input transcripts. The protocol implemented in this module closely follows the filtering criteria defined in DRIMSeq (Nowicka M. et al, 2016) and involves filtering by Counts Per Million (CPM), Sample Count, Isoform Fraction (IF), and Transcript Count Per Gene.

-i

Required

TSV file containing transcript counts.

-m

Required

TSV file containing a mapping of transcript IDs to gene IDs.

-l

Required

TSV file containing metadata and labels for each sample.

-T

Required

Output file where filtered transcript counts will be written to.

-F

Required

Output file where filtered isoform fractions (IFs) will be written.

-G

Required

Output file where filtered gene counts will be written.

-w

Optional

This flag enables writing to stdout the number of transcripts and genes left after each filtering step.

-n/--n_small

Optional

Default: 12

The value controls the smallest number of samples required within each subgroup for it to be considered in the analysis.

-p/--pr_fraction

Optional

Default: 0.2

The value controls the minimum fraction of samples (control AND case) with a positive read count. For instance, if set to 0.1, the only transcripts retained will be those that have a positive read count in at least 10% of the samples.

-f/--if_fraction

Optional

Default: 0.1

The value sets a threshold for transcripts based on the Isoform Fraction (IF) value. When set, only transcripts with an IF value larger than f in at least n_small samples are retained.

-c/--genefilter_count

Optional

Default: 10

The value controls the minimum read count for genes considered in the analysis. For instance, when set to 10, only genes with a read count >= 10 will be considered.

-s/--genefilter_sample

Optional

Default: 10

The value controls the minimum sample count for genes considered in the analysis. For instance, when set to 10, only genes that appear in >= 10 samples will be considered.
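The -p/--pr_fraction and -f/--if_fraction criteria of filter_and_transform_tx_counts.py can be sketched as two simple predicates. This is an illustrative sketch only, not the module's code: the function names, the dictionary-based data layout, and the toy values are assumptions, and n_small is lowered to 3 so the example fits in five samples.

```python
def passes_pr_fraction(counts, pr_fraction):
    """True if the share of samples with a positive read count is at least pr_fraction."""
    positive = sum(1 for c in counts if c > 0)
    return positive / len(counts) >= pr_fraction

def passes_if_fraction(ifs, if_fraction, n_small):
    """True if the IF is larger than if_fraction in at least n_small samples."""
    return sum(1 for v in ifs if v > if_fraction) >= n_small

# Toy data (made-up): five samples per transcript.
tx_counts = {
    "tx1": [5, 0, 3, 8, 0],
    "tx2": [0, 0, 0, 0, 0],
    "tx3": [2, 1, 9, 1, 1],
}
tx_ifs = {
    "tx1": [0.60, 0.00, 0.50, 0.70, 0.00],
    "tx2": [0.00, 0.00, 0.00, 0.00, 0.00],
    "tx3": [0.05, 0.02, 0.30, 0.04, 0.01],
}

kept = [
    tx for tx in tx_counts
    if passes_pr_fraction(tx_counts[tx], pr_fraction=0.2)
    and passes_if_fraction(tx_ifs[tx], if_fraction=0.1, n_small=3)
]
print(kept)
```

Here tx2 fails the positive-read criterion because it has no reads in any sample, and tx3 fails the IF criterion because its IF exceeds 0.1 in only one sample, so only tx1 is retained.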

get_p_cutoff.py

Retrieves a p-value threshold to be used in detecting significant DTU events. The threshold is determined by the user-defined parameter K: it is the “K x N”-th smallest p-value among the N sampled by SPIT-Test.

-k

Optional

Default: 0.6

The value sets the K hyperparameter used in obtaining p-value thresholds. Smaller values of K will yield more stringent p-value thresholds.

-p

Required

File containing minimum p-values computed during the SPIT Test iterations.
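The threshold selection performed by get_p_cutoff.py can be sketched as picking an order statistic from the sampled p-values. The sketch below is an assumption-laden illustration rather than the module's code: the function name p_cutoff is hypothetical, and the way “K x N” is turned into an integer rank (rounded down, clamped to at least 1) is a guess that the source does not specify.

```python
import math

def p_cutoff(min_p_values, k=0.6):
    """Return the (K x N)-th smallest value among N sampled p-values."""
    n = len(min_p_values)
    # Assumption: K x N is rounded down and clamped to the range [1, N].
    # round() guards against floating-point noise such as 5.999999999.
    rank = min(n, max(1, math.floor(round(k * n, 9))))
    return sorted(min_p_values)[rank - 1]

# Toy minimum p-values from N = 10 iterations (made-up values).
p_values = [0.30, 0.02, 0.11, 0.45, 0.07, 0.19, 0.003, 0.26, 0.08, 0.50]

cutoff_default = p_cutoff(p_values)          # K = 0.6, the 6th smallest value
cutoff_strict = p_cutoff(p_values, k=0.2)    # K = 0.2, the 2nd smallest value
print(cutoff_default, cutoff_strict)
```

The second call shows why smaller values of K are more stringent: a lower rank selects a smaller p-value as the threshold.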

transform_tx_counts_to_ifs.py

Directly converts transcript counts to Isoform Fractions (IF). Transcripts with no non-zero counts will still be filtered out. This module is useful when skipping the pre-filtering step.

-i

Required

TSV file containing transcript counts.

-m

Required

TSV file containing a mapping of transcript IDs to gene IDs.

-F

Required

Output file where isoform fractions (IFs) will be written.

-G

Required

Output file where gene counts will be written.

-w

Optional

This flag enables writing to stdout the number of transcripts and genes left after each filtering step.
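As a rough illustration of what transform_tx_counts_to_ifs.py computes, an isoform fraction is a transcript's count divided by the total count of all transcripts of the same gene, per sample. The sketch below assumes a dictionary-based layout and a hypothetical function name counts_to_ifs. How the real module handles samples in which a gene has zero reads is not stated in this section, so the 0.0 fallback is an assumption.

```python
from collections import defaultdict

def counts_to_ifs(tx_counts, tx_to_gene):
    """Convert per-sample transcript counts to isoform fractions and gene counts."""
    n_samples = len(next(iter(tx_counts.values())))
    gene_totals = defaultdict(lambda: [0.0] * n_samples)
    for tx, counts in tx_counts.items():
        for i, c in enumerate(counts):
            gene_totals[tx_to_gene[tx]][i] += c

    ifs = {}
    for tx, counts in tx_counts.items():
        if not any(c > 0 for c in counts):
            continue  # transcripts with no non-zero counts are dropped
        totals = gene_totals[tx_to_gene[tx]]
        # Assumption: report 0.0 where the gene has no reads in a sample.
        ifs[tx] = [c / t if t > 0 else 0.0 for c, t in zip(counts, totals)]
    return ifs, dict(gene_totals)

# Toy input (made-up): two samples, two genes.
tx_counts = {"txA1": [30, 10], "txA2": [10, 30], "txB1": [0, 0]}
tx_to_gene = {"txA1": "geneA", "txA2": "geneA", "txB1": "geneB"}

ifs, gene_counts = counts_to_ifs(tx_counts, tx_to_gene)
print(ifs)
print(gene_counts)
```

In this toy input the two geneA transcripts receive complementary fractions in each sample, and txB1 is removed because it has no reads at all.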

spit_test.py

This module performs the SPIT-Test. The test randomly splits the control group in half, and identifies the most significant difference in isoform fractions between the two halves. The observations from this process can then be used to compare candidate DTU events in terms of their significance.

-i

Required

File containing Isoform Fractions (IF).

-l

Required

TSV file containing metadata and labels for each sample.

-g

Required

TSV file containing Gene Counts.

-n

Optional

Default: 1000

SPIT-Test is computed by randomly splitting control samples into two sets of equal size. This value controls the number of iterations performed.

-d/--p_dom

Optional

Default: 0.75

The value controls the dominance selection threshold.

--n_small

Optional

Default: 12

The value controls the smallest number of samples required within each subgroup for it to be considered in the analysis.

-I

Required

Output file where dominance-selected isoform fractions (IFs) will be written.

-G

Required

Output file where dominance-selected gene counts will be written.

-P

Required

Output file where minimum p-values from all iterations will be written.
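The iteration scheme described for spit_test.py, repeatedly splitting the control group into two random halves and recording the most significant difference found, can be sketched as follows. This is only a skeleton under stated assumptions: the function names are hypothetical, the handling of an odd number of controls is a guess, and the statistical test itself is passed in as p_value_fn because this section does not specify which test SPIT applies. The placeholder used in the demo is not a real test.

```python
import random

def split_in_half(sample_ids, rng):
    """Randomly split sample IDs into two disjoint halves of equal size."""
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # Assumption: with an odd number of samples, one sample is left out.
    return shuffled[:half], shuffled[half:2 * half]

def min_p_values(control_ids, p_value_fn, n_iter=1000, seed=0):
    """Collect the smallest p-value found in each random half-split."""
    rng = random.Random(seed)
    minima = []
    for _ in range(n_iter):
        first_half, second_half = split_in_half(control_ids, rng)
        minima.append(min(p_value_fn(first_half, second_half)))
    return minima

controls = [f"ctrl{i}" for i in range(1, 9)]

def placeholder_p_values(first_half, second_half):
    # Stand-in only: a real run would return one p-value per transcript,
    # comparing IFs between the two halves.
    return [0.5, 0.2, 0.9]

minima = min_p_values(controls, placeholder_p_values, n_iter=5)
print(len(minima))
```

The list of per-iteration minima corresponds to the values written to the -P output file, which get_p_cutoff.py then uses to derive a threshold.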

References