GSVA Python CLI

Execute Bioconductor's GSVA transformation of gene expression into pathway enrichment.

This Python package provides both a CLI and a Python module for working with GSVA on pandas DataFrames.

Find the official R package here:

https://doi.org/doi:10.18129/B9.bioc.GSVA

And if you find this useful, cite the authors' publication:

Hänzelmann S, Castelo R and Guinney J (2013). “GSVA: gene set variation analysis for microarray and RNA-Seq data.” BMC Bioinformatics, 14, pp. 7. doi: 10.1186/1471-2105-14-7, http://www.biomedcentral.com/1471-2105/14/7.

GSVA.gsva(expression_df, geneset_df=None, method='gsva', kcdf='Gaussian', abs_ranking=False, min_sz=1, max_sz=None, parallel_sz=0, parallel_type='SOCK', mx_diff=True, tau=None, ssgsea_norm=True, verbose=False, tempdir=None)

GSVA function for use with pandas DataFrame objects

Parameters:
  • expression_df (pandas.DataFrame) – REQUIRED: Expression data indexed on gene names, with column labels as sample ids
  • geneset_df (pandas.DataFrame) – REQUIRED: Genesets and their members in a dataframe
  • method (string Default: 'gsva') – Method to employ in the estimation of gene-set enrichment scores per sample. By default this is set to gsva (Hänzelmann et al., 2013); the other options are ssgsea (Barbie et al., 2009), zscore (Lee et al., 2008) and plage (Tomfohr et al., 2005). The latter two first standardize expression profiles into z-scores over the samples. In the case of zscore, the z-scores are combined as their sum divided by the square root of the size of the gene set; in the case of plage, they are used to calculate the singular value decomposition (SVD) over the genes in the gene set, and the coefficients of the first right-singular vector are used as the pathway activity profile.
  • kcdf (string Default: 'Gaussian') – Character string denoting the kernel to use during the non-parametric estimation of the cumulative distribution function of expression levels across samples when method='gsva'. By default, kcdf='Gaussian', which is suitable when input expression values are continuous, such as microarray fluorescent units in logarithmic scale, RNA-seq log-CPMs, log-RPKMs or log-TPMs. When input expression values are integer counts, such as those derived from RNA-seq experiments, this argument should be set to kcdf='Poisson'. This argument supersedes arguments rnaseq and kernel, which are deprecated and will be removed in the next release.
  • abs_ranking (bool Default: False) – Flag used only when mx_diff=True. When abs_ranking=False (default), a modified Kuiper statistic is used to calculate enrichment scores, taking the magnitude difference between the largest positive and negative random walk deviations. When abs_ranking=True, the original Kuiper statistic, which sums the largest positive and negative random walk deviations, is used. In this latter case, gene sets with genes enriched on either extreme (high or low) will be regarded as 'highly' activated.
  • min_sz (int Default: 1) – Minimum size of the resulting gene sets.
  • max_sz (int Default: None) – Maximum size of the resulting gene sets. Leave unset for no limit.
  • parallel_sz (int Default: 0) – Number of processors to use when doing the calculations in parallel. This requires that either the parallel or the snow library be loaded beforehand. If parallel is loaded and this argument is left at its default value (parallel_sz=0), all available processor cores are used; set a smaller number to use fewer. If snow is loaded, this argument must be set to a positive integer specifying the number of processors to employ in the parallel calculation.
  • parallel_type (string Default: "SOCK") – Type of cluster architecture when using snow.
  • mx_diff (bool Default: True) – Selects one of two approaches to calculate the enrichment statistic (ES) from the KS random walk statistic. mx_diff=False: ES is calculated as the maximum distance of the random walk from 0. mx_diff=True (default): ES is calculated as the magnitude difference between the largest positive and negative random walk deviations.
  • tau (float) – Exponent defining the weight of the tail in the random walk performed by both the gsva (Hänzelmann et al., 2013) and the ssgsea (Barbie et al., 2009) methods. By default, tau=1 when method='gsva' and tau=0.25 when method='ssgsea', as specified by Barbie et al. (2009), where this parameter is called alpha. Leave unset for defaults.
  • ssgsea_norm (bool Default: True) – When set to True (default) with method='ssgsea', runs the SSGSEA method from Barbie et al. (2009), normalizing the scores by the absolute difference between the minimum and the maximum, as described in their paper. When ssgsea_norm=False, this last normalization step is skipped.
  • verbose (bool Default: False) – Gives information about each calculation step.
  • tempdir (string Default: System Default) – Location to write temporary files
Returns:

pandas.DataFrame
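
To make two of the computations described in the parameter list concrete, here is a minimal standard-library sketch of the zscore combination (from the method entry) and of the enrichment statistic options (from the mx_diff and abs_ranking entries). It is not the implementation used by this package or by the R library. The function names zscore_set_scores and enrichment_statistic are hypothetical, the use of the sample standard deviation is an assumption, and returning the signed deviation when mx_diff=False is an assumption about how the "maximum distance from 0" is reported.

```python
import math
import statistics


def zscore_set_scores(expression, gene_set):
    """Combine per-gene z-scores into one score per sample.

    expression: dict mapping gene name -> list of values, one per sample.
    gene_set: iterable of gene names; genes missing from expression are skipped.
    """
    members = [g for g in gene_set if g in expression]
    if not members:
        raise ValueError("no gene in the set is present in the expression data")
    n_samples = len(expression[members[0]])
    totals = [0.0] * n_samples
    for gene in members:
        values = expression[gene]
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # assumption: sample standard deviation
        if sd == 0:
            raise ValueError(f"gene {gene!r} is constant across samples")
        # Standardize this gene over the samples and accumulate per sample.
        for i, v in enumerate(values):
            totals[i] += (v - mean) / sd
    # Sum of z-scores divided by the square root of the gene set size.
    return [t / math.sqrt(len(members)) for t in totals]


def enrichment_statistic(walk, mx_diff=True, abs_ranking=False):
    """Reduce a random walk (list of running deviations) to a single ES."""
    max_pos = max(max(walk), 0.0)  # largest positive deviation
    max_neg = min(min(walk), 0.0)  # largest negative deviation (<= 0)
    if mx_diff:
        if abs_ranking:
            # Sum of the two deviation magnitudes.
            return max_pos + abs(max_neg)
        # Magnitude difference between the two deviations.
        return max_pos - abs(max_neg)
    # Assumption: report the deviation farthest from 0, keeping its sign.
    return max_pos if max_pos > abs(max_neg) else max_neg


if __name__ == "__main__":
    toy_expression = {"A": [1, 2, 3], "B": [2, 4, 6], "C": [3, 2, 1]}
    print(zscore_set_scores(toy_expression, ["A", "B"]))
    print(enrichment_statistic([0.1, 0.4, -0.2, -0.6, 0.0]))
```

The toy inputs are made up for illustration; real input would come from the rows of expression_df and from the random walk computed internally by the chosen method.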

GSVA.gmt_to_dataframe(fname)

A function to convert gmt files to a pandas dataframe

Parameters: fname (string) – path to gmt file
Returns: pandas.DataFrame
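
For orientation, a GMT file is tab-separated text with one gene set per line: the set name, a description, then the member genes. The sketch below parses that layout into long-format records using only the standard library. It is not the code behind gmt_to_dataframe; the helper name gmt_rows and the name/description/member keys are assumptions chosen for illustration, and the actual columns of the returned DataFrame should be checked against the package.

```python
import io


def gmt_rows(lines):
    """Turn GMT-formatted lines into one record per (gene set, member) pair."""
    rows = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and gene sets without members
        name, description, members = fields[0], fields[1], fields[2:]
        for member in members:
            if member:  # ignore empty trailing fields
                rows.append(
                    {"name": name, "description": description, "member": member}
                )
    return rows


if __name__ == "__main__":
    # Hypothetical two-set GMT content, inlined so the example runs anywhere.
    sample = io.StringIO("SET_A\tdemo set\tTP53\tEGFR\nSET_B\tother set\tMYC\n")
    for row in gmt_rows(sample):
        print(row)
```

A list of records like this can be passed to pandas.DataFrame(...) to obtain a table with one row per gene set member.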