Big data is pervasive in biology and can yield new insights into interconnected biological processes. This is particularly true for research in viral ecology, wherein large-scale metagenomic datasets are uncovering the extensive genetic diversity of viruses and their role in host-driven nutrient and energy cycles in aquatic systems. Yet, despite innovations in sequencing technology, bottlenecks remain in analyzing these massive and highly contextualized datasets. Specifically, ecosystem-wide analyses require the harmonization, integration, and analysis of multiple types of biological data, such as genes, protein functions, pathways, and environmental or host-related factors. Here we describe a strategy to perform massive comparative metagenomic sequence analysis using the Hadoop big data architecture, and to interconnect these data with biological annotations stored in a scalable Neo4j graph database for functional, taxonomic, and ecosystem-level analyses. We demonstrate the utility of our toolkit using a large-scale viral metagenomics dataset from the TARA Oceans Expedition. This work represents a first step in storing, comparing, and querying massive metagenomic datasets using scalable big data architectures, toward understanding viruses and their impact on host processes in the ocean.


Hurwitz, B. L., University of Arizona, USA, bhurwitz@email.arizona.edu

Choi, I., University of Arizona, USA, iychoi@email.arizona.edu

Youens-Clark, C. K., University of Arizona, USA, kyclark@email.arizona.edu

Hartman, J. H., University of Arizona, USA, jhh@cs.arizona.edu


Oral presentation

Session #: 105
Date: 2/27/2015
Time: 15:00
Location: Auditorium Manuel de Falla (Floor 1)

Presentation is given by student: No