.. _file-formats:

File Formats
======================

**SPIT** is designed to use simple and common file formats.

.. _input_formats:

Input File Formats
-------------------------

.. csv-table:: Transcript-level counts file (TAB-separated)
    :file: ../content/csvs/tx_counts.csv
    :header-rows: 1
    :stub-columns: 0
    :caption: Tab-separated file containing the transcript counts for all samples. Rows should correspond to transcripts and columns to samples. The file should include a header row with the sample IDs and a first column with the unique transcript IDs, which must be named "tx_id". We recommend generating the counts with the "tximport" [#diff_analysis_sonenson]_ package in R using the "scaledTPM" option, but raw counts can also be used.
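
The layout described above can be checked before running an analysis. The following is a minimal sketch using only the Python standard library; it is not part of SPIT, and the function name ``check_counts_table`` and the example content are hypothetical.

.. code-block:: python

    import csv
    import io

    # Hypothetical example content; a real file would be opened from disk.
    EXAMPLE_COUNTS = (
        "tx_id\tsample_1\tsample_2\n"
        "tx_A\t12.5\t30.0\n"
        "tx_B\t0.0\t4.2\n"
    )


    def check_counts_table(handle):
        """Return (sample_ids, counts_by_transcript) or raise ValueError."""
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, None)
        # The first column must be named "tx_id"; the rest are sample IDs.
        if not header or header[0] != "tx_id":
            raise ValueError('first column must be named "tx_id"')
        sample_ids = header[1:]
        if len(set(sample_ids)) != len(sample_ids):
            raise ValueError("duplicate sample IDs in header")
        counts = {}
        for row in reader:
            if not row:
                continue  # skip blank lines
            tx_id, values = row[0], row[1:]
            if tx_id in counts:
                raise ValueError(f"duplicate transcript ID: {tx_id}")
            if len(values) != len(sample_ids):
                raise ValueError(f"wrong number of columns for {tx_id}")
            counts[tx_id] = [float(v) for v in values]
        return sample_ids, counts


    sample_ids, counts = check_counts_table(io.StringIO(EXAMPLE_COUNTS))

For a file on disk, the same function could be called with ``open(path, newline="")`` in place of the ``io.StringIO`` object.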

.. csv-table:: Transcript to gene mapping file (TAB-separated)
    :file: ../content/csvs/tx2gene.csv
    :header-rows: 1
    :stub-columns: 0
    :caption: Tab-separated file that links each transcript ID in the counts file to its corresponding gene ID. The first column should contain the transcript IDs and be labeled "tx_id", while the second column should contain the gene IDs and be labeled "gene_id".
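
Because every transcript in the counts file needs a gene assignment, it can help to look for unmapped transcripts up front. The sketch below is an illustration with the Python standard library, not SPIT's own code; the helper names ``load_tx2gene`` and ``unmapped_transcripts`` and the example IDs are hypothetical.

.. code-block:: python

    import csv
    import io

    # Hypothetical example content; a real file would be opened from disk.
    EXAMPLE_TX2GENE = (
        "tx_id\tgene_id\n"
        "tx_A\tgene_1\n"
        "tx_B\tgene_1\n"
    )


    def load_tx2gene(handle):
        """Read the mapping file into a {transcript ID: gene ID} dict."""
        reader = csv.DictReader(handle, delimiter="\t")
        columns = reader.fieldnames or []
        # The first two columns must be "tx_id" and "gene_id", in that order.
        if columns[:2] != ["tx_id", "gene_id"]:
            raise ValueError('columns must be "tx_id" and "gene_id"')
        return {row["tx_id"]: row["gene_id"] for row in reader}


    def unmapped_transcripts(tx_ids, tx2gene):
        """Return the transcript IDs that have no gene assignment."""
        return [tx_id for tx_id in tx_ids if tx_id not in tx2gene]


    tx2gene = load_tx2gene(io.StringIO(EXAMPLE_TX2GENE))
    # Transcript IDs as they would appear in the counts file.
    missing = unmapped_transcripts(["tx_A", "tx_B", "tx_C"], tx2gene)

Here ``missing`` lists the transcripts that appear in the counts file but not in the mapping file.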

.. csv-table:: Phenotype data file (TAB-separated)
    :file: ../content/csvs/phenodata.csv
    :header-rows: 1
    :stub-columns: 0
    :caption: Tab-separated file containing all relevant phenotype data. The file should include two columns: "id", which holds the sample IDs, and "condition", which labels the control samples as "0" and the disease group as "1". If you are not conducting a disease analysis, you can assign the labels "0" and "1" to your two groups arbitrarily. All sample IDs in the phenotype data file must match those in the transcript-level counts file. You may provide additional covariates (age, race, etc.) in distinctly named columns for confounding control. Name the column of any categorical (non-numerical) covariate with the suffix "_cat", which signals the program to factorize that covariate. For example, if "sex" is one of your covariates and you want to run the confounding control module, name the corresponding column "sex_cat" in the phenotype data file.
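
The phenotype requirements above can be expressed as a short validation routine. This is a sketch under stated assumptions rather than SPIT's implementation: the function name ``check_phenotype_table`` and the example rows are hypothetical, and "matching" sample IDs is interpreted here as the two ID sets being identical.

.. code-block:: python

    import csv
    import io

    # Hypothetical example content; a real file would be opened from disk.
    EXAMPLE_PHENO = (
        "id\tcondition\tage\tsex_cat\n"
        "sample_1\t0\t54\tF\n"
        "sample_2\t1\t61\tM\n"
    )


    def check_phenotype_table(handle, count_sample_ids):
        """Validate the table; return (categorical, numerical) covariates."""
        reader = csv.DictReader(handle, delimiter="\t")
        columns = reader.fieldnames or []
        for required in ("id", "condition"):
            if required not in columns:
                raise ValueError(f'missing required column "{required}"')
        rows = list(reader)
        # The condition column may only contain the labels "0" and "1".
        bad = [r["id"] for r in rows if r["condition"] not in ("0", "1")]
        if bad:
            raise ValueError(f"condition must be 0 or 1 for: {bad}")
        # Assumption: sample IDs must be identical in both files.
        if {r["id"] for r in rows} != set(count_sample_ids):
            raise ValueError("sample IDs differ from the counts file")
        covariates = [c for c in columns if c not in ("id", "condition")]
        # Columns ending in "_cat" are categorical; the rest are numerical.
        categorical = [c for c in covariates if c.endswith("_cat")]
        numerical = [c for c in covariates if not c.endswith("_cat")]
        return categorical, numerical


    categorical, numerical = check_phenotype_table(
        io.StringIO(EXAMPLE_PHENO), ["sample_1", "sample_2"]
    )

In this example, ``sex_cat`` is reported as categorical and ``age`` as numerical.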


.. _output_formats:

Files Generated by the Software
-------------------------------


References
------------------------

.. [#diff_analysis_sonenson] `Soneson C, Love MI and Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences [version 1; peer review: 2 approved]. F1000Research 2015, 4:1521 <https://doi.org/10.12688/f1000research.7563.1>`__.