WEBSERVERS / CODE / SOFTWARE
We work hard to make our research fully reproducible and extensible by providing well-documented data and code for each project. We also strive to make our computational tools widely usable to biologists and biomedical scientists via reusable software and interactive webservers.
nleval is a Python package of reusable modules that help researchers set up machine learning (ML)-ready biological network and node-label datasets compatible with PyTorch Geometric (PyG). It is designed to lower the barrier to working with biological network data, which otherwise requires tedious and specialized (pre-)processing steps. It helps network biologists set up custom benchmarking datasets for answering specific biological questions of interest, and it helps graph representation learning researchers adapt these datasets for designing new specialized model architectures.
​
Cite:
nleval: A Python toolkit for generating benchmarking datasets for machine learning with biological networks.
Liu R, Krishnan A
bioRxiv (2023).
​
​
PyGenePlexus is a Python package that enables a user to gain insight into any gene set of interest through a supervised machine learning model informed by a molecular interaction network. PyGenePlexus predicts how strongly every gene in the network is associated with the input gene set, offers interpretability by comparing the model trained on the input gene set to models trained on thousands of known gene sets, and returns the network connectivity of the top predicted genes.
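The general idea of network-based supervised gene classification can be illustrated with a minimal sketch. This is not the PyGenePlexus API. It assumes, purely for illustration, that each gene's feature vector is its row in a network adjacency matrix (with self-connections), that the input gene set supplies the positive examples, that a set of negative genes is given explicitly, and that the model is a logistic regression with L2 regularization. The toy network, the gene names, and the function train_logistic are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, l2=0.01, n_iter=500):
    """Fit an L2-regularized logistic regression by batch gradient descent."""
    n, m = len(features), len(features[0])
    weights, bias = [0.0] * m, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * m, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += err
            for k in range(m):
                grad_w[k] += err * x[k]
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical toy network: two tight clusters joined by a single G3-G4 edge.
# Each row is one gene's connections to G1..G6 (1 on the diagonal by assumption).
genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
adjacency = {
    "G1": [1, 1, 1, 0, 0, 0],
    "G2": [1, 1, 1, 0, 0, 0],
    "G3": [1, 1, 1, 1, 0, 0],
    "G4": [0, 0, 1, 1, 1, 1],
    "G5": [0, 0, 0, 1, 1, 1],
    "G6": [0, 0, 0, 1, 1, 1],
}

positives = ["G1", "G2"]  # stands in for the user's input gene set
negatives = ["G5", "G6"]  # assumed negative examples
train_genes = positives + negatives
train_labels = [1] * len(positives) + [0] * len(negatives)

weights, bias = train_logistic([adjacency[g] for g in train_genes], train_labels)

# Score every gene in the network, including those not used for training.
scores = {
    g: sigmoid(bias + sum(w * v for w, v in zip(weights, adjacency[g])))
    for g in genes
}
ranking = sorted(genes, key=scores.get, reverse=True)
print(ranking)
```

In this toy setup the unlabeled gene G3, which sits in the same cluster as the input genes, receives a higher score than G4, which sits in the other cluster.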
​
Cite:
PyGenePlexus: A Python package for gene discovery using network-based machine learning.
Mancuso CA*, Liu R*, Krishnan A
Bioinformatics (2023) 39:btad064.
​
​
Txt2Onto is a Python utility for text-based tissue classification along with NLP-ML (natural-language-processing + machine learning) models trained to perform the tissue classification. The repo also contains demo scripts with extensive documentation. Given an input file where each line is a piece of text to be classified, the txt2onto utility will perform the necessary text preprocessing, create an embedding for each piece of text, and then run each embedding through our pre-trained tissue models.
​
Cite:
Systematic tissue annotations of genomics samples by modeling unstructured metadata.
Hawkins NT, Maldaver M, Yannakopoulos A, Guare LA, Krishnan A
Nature Communications (2022) 13:6736.
​
​
PecanPy is a parallelized, efficient, and accelerated implementation of node2vec written in Python. Learning low-dimensional representations (embeddings) of nodes in large graphs is key to applying machine learning on massive biological networks. Node2vec is the most widely used method for node embedding. PecanPy is an ultrafast implementation of node2vec that uses cache-optimized compact graph data structures and precomputing/parallelization to generate high-quality node embeddings for biological networks of all sizes and densities.
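For readers unfamiliar with node2vec, its core step is a second-order biased random walk controlled by a return parameter p and an in-out parameter q. The sketch below shows that walk on a toy graph using only the Python standard library. It does not use PecanPy and does not reflect its optimized data structures or parallelization; the function node2vec_walk and the toy graph are hypothetical. In node2vec, many such walks are then passed to a skip-gram model to learn the embeddings, a step omitted here.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Simulate one node2vec-style second-order biased random walk.

    adj maps each node to a dict of {neighbor: edge weight}.
    p is the return parameter and q is the in-out parameter.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step: no previous node, so use plain edge weights.
            weights = [adj[cur][x] for x in neighbors]
        else:
            prev = walk[-2]
            weights = []
            for x in neighbors:
                w = adj[cur][x]
                if x == prev:
                    w /= p  # returning to the previous node
                elif x not in adj[prev]:
                    w /= q  # moving farther away from the previous node
                # otherwise x is also a neighbor of prev: weight unchanged
                weights.append(w)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical toy undirected graph with unit edge weights.
adj = {
    "A": {"B": 1.0, "C": 1.0},
    "B": {"A": 1.0, "C": 1.0, "D": 1.0},
    "C": {"A": 1.0, "B": 1.0},
    "D": {"B": 1.0},
}

walk = node2vec_walk(adj, "A", length=10, p=0.5, q=2.0, rng=random.Random(0))
print(walk)
```

A small p makes the walk more likely to step back to the node it just left, and a large q keeps it close to the previous node's neighborhood.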
​
Cite:
PecanPy: a fast, efficient, and parallelized Python implementation of node2vec.
Liu R, Krishnan A
Bioinformatics (2021) 37:3377.
​
​
The Expresto repository contains data and code to generate/reproduce the results in our work on imputing the expression of unmeasured genes in gene-expression profiles. This work introduces a new method called SampleLASSO, a sparse regression-based approach that accurately imputes unmeasured genes in samples from any platform and captures context-specific, biologically relevant information to guide imputation. The code includes a function that allows users to apply SampleLASSO to fill in the unmeasured genes in their dataset of interest and get a report on which samples in the training data were the most helpful for imputation.
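The idea of sparse regression over training samples can be sketched as follows. This is not the Expresto code. It assumes, for illustration, that the training samples act as the regression features and the genes measured in both the training data and the new sample act as the examples, so that the learned coefficients indicate which training samples contribute most to the imputation. The LASSO solver is a plain coordinate-descent implementation without an intercept, and all names and numbers are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1 by coordinate descent."""
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    for _ in range(n_iter):
        for j in range(m):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Prediction for example i using every feature except j.
                partial = sum(X[i][k] * beta[k] for k in range(m) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    return beta


sample_names = ["s1", "s2", "s3"]  # hypothetical training samples

# Rows: genes measured in both datasets; columns: training samples s1..s3.
measured_train = [
    [1.0, 4.0, 0.0],
    [2.0, 1.0, 3.0],
    [3.0, 0.0, 1.0],
    [4.0, 2.0, 0.0],
]
# Expression of the same measured genes in the new sample to be completed.
new_sample_measured = [1.0, 2.0, 3.0, 4.0]

# Genes missing from the new sample, with their values in the training samples.
unmeasured_train = {"g5": [5.0, 1.0, 2.0], "g6": [0.5, 3.0, 1.0]}

beta = lasso_coordinate_descent(measured_train, new_sample_measured, alpha=0.01)

# Impute each unmeasured gene as a weighted combination of training samples.
imputed = {
    gene: sum(b * v for b, v in zip(beta, values))
    for gene, values in unmeasured_train.items()
}

# Report training samples ordered by the size of their coefficient.
report = sorted(zip(sample_names, beta), key=lambda item: abs(item[1]), reverse=True)
print(imputed)
print(report)
```

In this toy example the new sample matches training sample s1 on the measured genes, so the sparse fit places nearly all of its weight on s1 and the imputed values follow s1's expression of the unmeasured genes.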
​
Cite:
A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes.
Mancuso CA*, Canfield JL*, Singla D, Krishnan A
Nucleic Acids Research (2020) 48:e125.
​
The GenePlexus webserver enables researchers to utilize a powerful, network-based machine learning method to gain insights into their gene set of interest and predict additional functionally similar genes. Once a user uploads their set of human genes and chooses between a number of different human network representations, GenePlexus predicts how associated every gene in the network is to the input set. The webserver also provides interpretability through network visualization and comparison to other ML models trained on thousands of known process/pathway and disease gene sets.
​
Cite:
GenePlexus: A web-server for network-based machine learning for human gene classification.
Mancuso CA, Bills P, Newsted J, Krum D, Liu R, Krishnan A
Nucleic Acids Research (2022) 50:W358.
​
Supervised learning is an accurate method for network-based gene classification.
Liu R*, Mancuso CA*, Yannakopoulos A, Johnson KA, Krishnan A
Bioinformatics (2020) doi.org/10.1093/bioinformatics/btaa150.
​
​
​
ASD
The ASD webserver contains a genome-wide ranking of human candidate genes associated with Autism Spectrum Disorder (ASD), predicted based on known ASD-related genes and their functional relationships in a human brain-specific gene interaction network (from GIANT; below). Using the ASD webserver, researchers can interactively access all autism gene predictions in the context of their relationships in the human brain-specific gene network, along with the results from subsequent analyses, including spatiotemporal brain signatures, functional modules and prioritized copy-number variants (CNVs).
​
Cite:
Krishnan A*, Zhang R*, Yao V, Theesfeld CL, Wong AK, Tadych A, Volfovsky N, Packer A, Lash A, Troyanskaya OG
Nature Neuroscience (2016) 19:1454-1462.
​
The GIANT webserver contains data-driven human genome-scale functional interaction networks between ~26,000 genes in more than 280 tissues and cell types. Using GIANT, researchers can (i) look up the tissue-specific interactions of one or more genes, (ii) compare a gene's functional interactions in different tissues by selecting the relevant tissues in the dropdown menu, and (iii) reprioritize associations from a genome-wide association study (GWAS) with tissue-specific networks, using an approach named NetWAS, to identify additional candidate disease-associated genes.
​
Cite:
GIANT 2.0: genome-scale integrated analysis of gene networks in tissues.
Wong AK, Krishnan A, Troyanskaya OG
Nucleic Acids Research (2018) 46:W65–W70.
​
Greene CS*, Krishnan A*, Wong AK*, Ricciotti E, Zelaya R, Himmelstein D, Chasman D, Fitzgerald G, Dolinski K, Grosser T, Troyanskaya OG
Nature Genetics (2015) 47:569-576.
​
​