glbase: a framework for combining, analyzing and displaying heterogeneous genomic and high-throughput sequencing data
© Hutchins et al.; licensee BioMed Central Ltd. 2014
Received: 25 October 2013
Accepted: 23 January 2014
Published: 24 January 2014
Genomic datasets and the tools to analyze them have proliferated at an astonishing rate. However, such tools are often poorly integrated with each other: each program typically produces its own custom output in a variety of non-standard file formats. Here we present glbase, a framework that uses a flexible set of descriptors that can quickly parse non-binary data files. glbase includes many functions to intersect two lists of data, including operations on genomic interval data and support for efficient random access to huge genomic data files. Many glbase functions can produce graphical outputs, including scatter plots, heatmaps, boxplots and other common analytical displays of high-throughput data such as RNA-seq, ChIP-seq and microarray expression data. glbase is designed to rapidly bring biological data into a Python-based analytical environment to facilitate analysis and data processing. In summary, glbase is a flexible and multifunctional toolkit that allows the combination and analysis of high-throughput data (especially next-generation sequencing and genome-wide data), and it has been instrumental in the analysis of complex data sets. glbase is freely available at http://bitbucket.org/oaxiom/glbase/.
Keywords: ChIP-seq, RNA-seq, Genomics, Microarray, Motifs, Transcription factor, Bioinformatics
Genome-scale experiments are rapidly becoming a standard addition to the scientist’s toolkit. However, the development of tools to analyze high-throughput data has lagged behind our ability to generate ever larger datasets, and despite some standardization efforts, custom file formats continue to proliferate. The tools currently used to analyze genome-wide data are very diverse, and they produce a variety of custom outputs that rarely feed directly into other bioinformatics tools without first being converted into standard file formats. A common way around this is to write ad hoc scripts in some combination of UNIX shell, awk, Perl, Python or another programming language to address the problem at hand. However, these scripts are often designed with only a single use in mind, lack a detailed methodology, may be poorly documented or not preserved at all, and are rarely tested for accuracy and consistency.
Efforts have been made to make this process more transparent. Galaxy is a comprehensive web server with a large number of functions to deal with genome-scale data [1], but it is aimed primarily at non-programming scientists and requires extensive user interaction. It is therefore difficult to automate, which loses the advantages of a programming environment or the UNIX shell. BEDTools [2] and SAMtools [3] deal efficiently with the standardized genome file formats BED and SAM, but do not deal gracefully with non-standard file inputs or with poorly or incorrectly formatted files. The Biopython [4] and Bioperl [5] projects similarly attempt to deal with these problems, but their scope is so broad across all of their subject areas that the analysis of high-throughput sequencing has been relatively neglected to date.
The Bioconductor [6] project for the R language has a massive scope, with multiple tools from multiple developers that can come together to form a potent analysis toolkit. It is well documented and has become one of the major analytical frameworks for genomic analysis. Yet it has some limitations: the R language has a steep learning curve, and deployment of a user’s own methods or functions is difficult. One of the original motivations for the development of glbase was to format files to suit the import format required by R, and it still fulfills this role. The Genomic Hyperbrowser [7] takes an interesting and novel approach to the analysis of genomic data. Built on top of the Galaxy framework, it uses the widespread concept of ‘tracks’ (i.e. collections of genomic features, genes, exons, epigenetic data, etc.): the user defines a putative relationship between two tracks together with a null model, and the Hyperbrowser then tests this relationship. In this way the Hyperbrowser brings a more statistical and mathematical approach to the analysis of genomic data. Although primarily presented as a web server, it also makes available a programmatic interface. ArrayPlex [8] provides a framework similar to glbase for the analysis of heterogeneous genomic data. In addition to providing a graphical interface, it exposes its functionality through the UNIX shell as executable commands. ArrayPlex is mainly focused on the retrieval of data from publicly accessible web servers. CruzDB [9] is the tool most similar to glbase. Also implemented in Python, it provides a convenient system to extract data, primarily from the UCSC genome browser, process the data in Python and then submit the data to other tools. It does not contain any internal drawing methods, although it should integrate well with Python plotting libraries such as matplotlib and potentially also with glbase.
Tools originally designed for DNA motif discovery, such as HOMER [10] and MEME [11], are also expanding in scope. They offer an increasing diversity of genomic analysis methods that are exposed to the user not only as a web server but also as tools that can be driven from the command line for automation.
Results and discussion
Genelists and flexible file format specifiers
Genelists can also be intersected by pairs of matching keys, made unique for any key, and manipulated with many other methods that operate on the data contained within the genelist. Finally, the resulting genelists can be saved in a variety of file formats, such as custom TSV (tab-separated value) and standard BED files.
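The key-based operations described here can be pictured in plain Python. This is only a minimal sketch, assuming that a genelist behaves like a list of dictionaries. The helper names `intersect_by_key`, `unique_by_key` and `to_bed_lines` are hypothetical and are not the glbase API, the rows are made-up values, and a single shared key is used where glbase allows a pair of matching keys.

```python
def intersect_by_key(left, right, key):
    """Merge each row of `left` with the row of `right` that shares its `key` value."""
    lookup = {}
    for row in right:
        lookup.setdefault(row[key], row)  # keep the first row seen for each key value
    merged = []
    for row in left:
        match = lookup.get(row[key])
        if match is not None:
            combined = dict(match)
            combined.update(row)  # values from `left` win when both rows share a field
            merged.append(combined)
    return merged


def unique_by_key(rows, key):
    """Keep only the first row for each distinct value of `key`."""
    seen = set()
    result = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            result.append(row)
    return result


def to_bed_lines(rows):
    """Format rows as three-column BED lines (chrom, 0-based start, end)."""
    return ["{chrom}\t{start}\t{end}".format(**row) for row in rows]


# Made-up illustrative data: binding sites and expression values keyed by "name".
peaks = [
    {"name": "geneA", "chrom": "chr1", "start": 100, "end": 200},
    {"name": "geneB", "chrom": "chr2", "start": 50, "end": 80},
    {"name": "geneA", "chrom": "chr1", "start": 100, "end": 200},
]
expression = [
    {"name": "geneA", "fold_change": 2.5},
    {"name": "geneC", "fold_change": 0.4},
]

shared = unique_by_key(intersect_by_key(peaks, expression, "name"), "name")
bed_lines = to_bed_lines(shared)
```

Only `geneA` appears in both lists, so `shared` holds a single merged row that carries both the genomic coordinates and the fold change.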
Flexible specifiers to describe any arrangement of tabular data
In the example above each value specifies the key name and the column number of the TSV file to find the data in. This flexible format specifier can be used to describe almost any TSV file for loading into glbase.
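The idea of a specifier that maps key names to column numbers can be sketched with the Python standard library alone. The function `load_tsv`, the contents of `example_format` and the sample rows below are assumptions made for illustration; this is not glbase's own loader, which may behave differently.

```python
import csv
import io

# Hypothetical specifier: key name -> zero-based column number in the TSV file.
example_format = {"gene_id": 0, "name": 1, "fold_change": 3}


def load_tsv(handle, fmt, skip_header=True):
    """Read a tab-separated file into a list of dicts, keeping only the columns named in `fmt`."""
    reader = csv.reader(handle, delimiter="\t")
    if skip_header:
        next(reader, None)  # discard the header row, if there is one
    rows = []
    for columns in reader:
        if not columns:
            continue  # skip blank lines
        rows.append({key: columns[index] for key, index in fmt.items()})
    return rows


# Made-up sample file; column 2 (description) is not named in the specifier.
sample = io.StringIO(
    "id\tsymbol\tdescription\tfc\n"
    "ID0001\tgeneA\tfirst made-up row\t2.5\n"
    "ID0002\tgeneB\tsecond made-up row\t0.4\n"
)
genes = load_tsv(sample, example_format)
```

Columns that the specifier does not name are simply ignored, which is what lets one small dictionary describe files with very different layouts. In this sketch every value is kept as a string; converting types is left to the caller.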
Analysis and graphical outputs
Conclusions
glbase is a flexible and multifunctional toolkit allowing the user to perform many common analyses on ChIP-seq, microarray and RNA-seq data. Data from distinct sources can be combined inside a unified framework within a Python programming environment for direct analysis of the data, or processed and output for further analysis. glbase has already been used extensively in the analysis of STAT3 binding in macrophages [14], the analysis of STAT3 binding in multiple cell types [18], in analyzing the changes in the transcriptome of stimulated CD4+ T cells [19], and in the analysis of how mutated Sox17 co-operates with Oct4 to specify induced pluripotent stem cells [20, 21]. Thus glbase constitutes a useful addition to the researcher’s toolkit.
Availability and requirements
glbase was developed in Python and uses the freely available Python modules NumPy, SciPy and matplotlib. All functions in glbase are documented in Python (for example, to see the documentation for the map() method of genelists, type: help(glbase.genelist.map)), and documentation is also available as part of the distribution (glbase/docs/build/html/index.html), which also includes seven tutorials, code and example raw data (glbase/examples/) directly aimed at potential users with little or no Python experience. glbase is freely available from http://bitbucket.org/oaxiom/glbase/.
Abbreviations
BED: Browser extensible data
GTF: Gene transfer format
Acknowledgements
We thank Chu Lee Thean for valuable usage reports and testing of early versions of glbase.
References
1. Goecks J, Nekrutenko A, Taylor J: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010,11(8):R86. 10.1186/gb-2010-11-8-r86
2. Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 2010,26(6):841–842. 10.1093/bioinformatics/btq033
3. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/map format and SAMtools. Bioinformatics 2009,25(16):2078–2079. 10.1093/bioinformatics/btp352
4. Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJ: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009,25(11):1422–1423. 10.1093/bioinformatics/btp163
5. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E: The Bioperl toolkit: Perl modules for the life sciences. Genome Res 2002,12(10):1611–1618. 10.1101/gr.361602
6. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004,5(10):R80. 10.1186/gb-2004-5-10-r80
7. Sandve GK, Gundersen S, Johansen M, Glad IK, Gunathasan K, Holden L, Holden M, Liestol K, Nygard S, Nygaard V, Paulsen J, Rydbeck H, Trengereid K, Clancy T, Drablos F, Ferkingstad E, Kalas M, Lien T, Rye MB, Frigessi A, Hovig E: The genomic hyperbrowser: an analysis web server for genome-scale data. Nucleic Acids Res 2013,41(Web Server issue):W133-W141.
8. Killion PJ, Iyer VR: ArrayPlex: distributed, interactive and programmatic access to genome sequence, annotation, ontology, and analytical toolsets. Genome Biol 2008,9(11):R159. 10.1186/gb-2008-9-11-r159
9. Pedersen BS, Yang IV, De S: CruzDB: software for annotation of genomic intervals with UCSC genome-browser database. Bioinformatics 2013,29(23):3003–3006. 10.1093/bioinformatics/btt534
10. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK: Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell 2010,38(4):576–589. 10.1016/j.molcel.2010.05.004
11. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS: MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 2009,37(Web Server issue):W202-W208.
12. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005,15(8):1034–1050. 10.1101/gr.3715005
13. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS: Model-based analysis of ChIP-Seq (MACS). Genome Biol 2008,9(9):R137. 10.1186/gb-2008-9-9-r137
14. Hutchins AP, Poulain S, Miranda-Saavedra D: Genome-wide analysis of STAT3 binding in vivo predicts effectors of the anti-inflammatory response in macrophages. Blood 2012,119(13):e110-e119. 10.1182/blood-2011-09-381483
15. Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 2008,133(6):1106–1117. 10.1016/j.cell.2008.04.043
16. ENCODE Project Consortium: A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol 2011,9(4):e1001046. 10.1371/journal.pbio.1001046
17. Wilson D, Charoensawan V, Kummerfeld SK, Teichmann SA: DBD–taxonomically broad transcription factor predictions: new content and functionality. Nucleic Acids Res 2008,36(Database issue):D88-D92.
18. Hutchins AP, Diez D, Takahashi Y, Ahmad S, Jauch R, Tremblay ML, Miranda-Saavedra D: Distinct transcriptional regulatory modules underlie STAT3’s cell type-independent and cell type-specific functions. Nucleic Acids Res 2013,41(4):2155–2170. 10.1093/nar/gks1300
19. Hutchins AP, Poulain S, Fujii H, Miranda-Saavedra D: Discovery and characterization of new transcripts from RNA-seq data in mouse CD4(+) T cells. Genomics 2012,100(5):303–313. 10.1016/j.ygeno.2012.07.014
20. Aksoy I, Jauch R, Chen J, Dyla M, Divakar U, Bogu GK, Teo R, Leng Ng CK, Herath W, Lili S, Hutchins AP, Robson P, Kolatkar PR, Stanton LW: Oct4 switches partnering from Sox2 to Sox17 to reinterpret the enhancer code and specify endoderm. EMBO J 2013,32(7):938–953. 10.1038/emboj.2013.31
21. Jauch R, Aksoy I, Hutchins AP, Ng CK, Tian XF, Chen J, Palasingam P, Robson P, Stanton LW, Kolatkar PR: Conversion of Sox17 into a pluripotency reprogramming factor by reengineering its association with Oct4 on DNA. Stem Cells 2011,29(6):940–951. 10.1002/stem.639
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.