We work hard to make our research fully reproducible and extensible for other researchers by providing well-documented data and code for each project. We also strive to make our computational tools widely usable by biologists and biomedical scientists through reusable software and interactive webservers.


Txt2Onto is a Python utility for text-based tissue classification, distributed together with the NLP-ML (natural language processing + machine learning) models trained for that task. The repo also contains demo scripts with extensive documentation. Given an input file where each line is a piece of text to be classified, the txt2onto utility preprocesses the text, creates an embedding for each piece of text, and then runs each embedding through our pre-trained tissue models.
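
The three stages described above (preprocess, embed, score with per-tissue models) can be illustrated with a minimal sketch. This is not the txt2onto API: the function names, the stopword list, the toy two-dimensional word vectors, and the logistic weights are all hypothetical stand-ins chosen only to show the flow of data from one line of text to one score per tissue.

```python
import math
import re

# Everything below is hypothetical and only illustrates the pipeline shape;
# it is NOT the txt2onto API or its trained models.
STOPWORDS = {"a", "an", "the", "of", "from", "and", "in"}

# Toy 2-dimensional word vectors standing in for a real embedding model.
WORD_VECTORS = {
    "liver": [1.0, 0.0],
    "hepatocyte": [0.9, 0.1],
    "brain": [0.0, 1.0],
    "cortex": [0.1, 0.9],
}

# One toy logistic model (weights, bias) per tissue.
TISSUE_MODELS = {
    "liver": ([4.0, -4.0], 0.0),
    "brain": ([-4.0, 4.0], 0.0),
}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def embed(tokens):
    """Average the vectors of known words; zero vector if none are known."""
    vectors = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def predict(embedding):
    """Score one embedding with every tissue model (logistic output)."""
    scores = {}
    for tissue, (weights, bias) in TISSUE_MODELS.items():
        z = sum(w * x for w, x in zip(weights, embedding)) + bias
        scores[tissue] = 1.0 / (1.0 + math.exp(-z))
    return scores


def classify_lines(lines):
    """Run the full pipeline on an iterable of text lines."""
    return [predict(embed(preprocess(line))) for line in lines]


example_scores = classify_lines([
    "Hepatocyte samples from the liver",
    "Cortex of the brain",
])
```

Each entry of `example_scores` is a dictionary mapping a tissue name to a probability-like score, one dictionary per input line, which mirrors the one-line-in, one-prediction-out behavior described above.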



PecanPy is a parallelized, efficient, and accelerated implementation of node2vec written in Python. Learning low-dimensional representations (embeddings) of nodes in large graphs is key to applying machine learning on massive biological networks. Node2vec is the most widely used method for node embedding. PecanPy is an ultrafast implementation of node2vec that combines cache-optimized compact graph data structures with precomputation and parallelization to produce high-quality node embeddings for biological networks of all sizes and densities.
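
For readers unfamiliar with node2vec, its core step is a biased second-order random walk controlled by a return parameter p and an in-out parameter q; the resulting walks are then fed to a skip-gram model to learn embeddings. The sketch below shows only the walk, on an unweighted undirected graph stored as a plain dictionary. It is not PecanPy's code or API: the function name and graph representation are assumptions, and it uses none of the compact data structures, precomputation, or parallelization mentioned above.

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Generate one node2vec-style walk.

    graph: dict mapping each node to the set of its neighbors
           (unweighted, undirected). Hypothetical representation.
    p:     return parameter (small p makes backtracking more likely).
    q:     in-out parameter (small q makes moving outward more likely).
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph[current])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to the previous node
            elif candidate in graph[previous]:
                weights.append(1.0)      # stay within one hop of previous
            else:
                weights.append(1.0 / q)  # move farther from previous
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}
toy_walk = node2vec_walk(toy_graph, "A", length=10, p=0.5, q=2.0,
                         rng=random.Random(0))
```

Computing the transition weights on the fly, as done here, keeps the sketch short but repeats work at every step; avoiding that kind of repeated work is the sort of cost that an optimized implementation targets.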



The Expresto repository contains data and code to generate/reproduce the results in our work on imputing the expression of unmeasured genes in gene-expression profiles. This work introduces SampleLASSO, a sparse regression-based method that accurately imputes unmeasured genes in samples from any platform and captures context-specific, biologically relevant information to guide imputation. The code includes a function that lets users apply SampleLASSO to fill in the unmeasured genes in their dataset of interest and get a report on which samples in the training data were the most helpful for imputation.
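
One way to picture a sample-based sparse regression is sketched below. It assumes, as the per-sample report described above suggests, that the training samples act as the regression features: the new sample's measured genes are regressed on the same genes in the training samples with an L1 penalty, and the learned sample weights are then applied to the training values of the unmeasured genes. The function names, the toy data, the penalty value, and the omission of an intercept are all assumptions for illustration; this is not the Expresto code.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n_rows, n_cols = len(X), len(X[0])
    weights = [0.0] * n_cols
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n_rows)) / n_rows
              for j in range(n_cols)]
    for _ in range(n_iter):
        for j in range(n_cols):
            if col_sq[j] == 0.0:
                continue
            # Correlate column j with the residual that excludes column j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * weights[j])
                      for i in range(n_rows)) / n_rows
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - weights[j]
            if delta != 0.0:
                for i in range(n_rows):
                    residual[i] -= X[i][j] * delta
                weights[j] = new_w
    return weights


def impute_sample(train, measured, unmeasured, target_values, alpha=0.01):
    """train: rows are genes, columns are training samples (hypothetical)."""
    X = [train[g] for g in measured]
    weights = lasso_coordinate_descent(X, target_values, alpha)
    imputed = [sum(w * v for w, v in zip(weights, train[g]))
               for g in unmeasured]
    return imputed, weights


# Toy data: 5 genes x 3 training samples; gene 4 is unmeasured in the target.
train = [
    [1.0, 4.0, 2.0],
    [2.0, 3.0, 0.0],
    [3.0, 2.0, 1.0],
    [4.0, 1.0, 3.0],
    [5.0, 0.5, 2.0],
]
target_measured = [1.0, 2.0, 3.0, 4.0]  # matches training sample 0
imputed, weights = impute_sample(train, [0, 1, 2, 3], [4], target_measured)

# Rank training samples by absolute weight: a simple "most helpful" report.
ranked_samples = sorted(range(len(weights)),
                        key=lambda j: abs(weights[j]), reverse=True)
```

Because the L1 penalty drives most sample weights to exactly zero, the nonzero weights directly identify the few training samples that contributed to the imputation.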



The GenePlexus repository contains data and code to generate/reproduce the results in our work on systematically benchmarking supervised learning for network-based gene classification across diverse prediction tasks (functions, diseases, and traits) and molecular networks using meaningful validation schemes and evaluation metrics. We have designed the code so that new methods can be added easily and benchmarked alongside the existing methods in the same evaluation environment.




The ASD webserver contains a genome-wide ranking of human candidate genes associated with Autism Spectrum Disorder (ASD), predicted based on known ASD-related genes and their functional relationships in a human brain-specific gene interaction network (from GIANT; below). Using the ASD webserver, researchers can interactively access all autism gene predictions in the context of their relationships in the human brain-specific gene network, along with the results from subsequent analyses, including spatiotemporal brain signatures, functional modules and prioritized copy-number variants (CNVs).



The GIANT webserver contains data-driven, genome-scale human functional interaction networks among ~26,000 genes in more than 280 tissues and cell types. Using GIANT, researchers can (i) look up the tissue-specific interactions of one or more genes, (ii) compare a gene's functional interactions across tissues by selecting the relevant tissues in the dropdown menu, and (iii) reprioritize functional associations from a genome-wide association study (GWAS) with tissue-specific networks, using an approach named NetWAS, to identify additional candidate disease-associated genes.