Federation in Genomics Pipelines: Techniques and Challenges

Somali Chaterji, Jinkyu Koo, Ninghui Li, Folker Meyer, Ananth Grama, and Saurabh Bagchi

Abstract

Federation is a popular concept in building distributed cyberinfrastructures, whereby computational resources are provided by multiple organizations through a unified portal, reducing the complexity of moving data back and forth among those organizations. In bioinformatics, federation has been used only to a limited extent, namely for datastores, e.g., the SBGrid Consortium for structural biology and the Gene Expression Omnibus (GEO) for functional genomics data. Here, we posit that it is important to federate both computational resources (CPU, GPU, FPGA, etc.) and datastores to support popular bioinformatics portals that face fast-growing data volumes and processing requirements. A prime example, and the one we discuss here, is genomics and metagenomics, where it is critical that data be processed without being transported across large network distances. We illustrate our design and development through our experience with MG-RAST, the most popular metagenomics portal and analysis pipeline. Currently, it is hosted entirely at Argonne National Laboratory (ANL). However, through a recently started National Institutes of Health (NIH) project at Purdue University and ANL, we are taking steps toward federating this infrastructure. Because MG-RAST is a widely used resource, we have to move toward federation without disrupting its more than 40K daily users. In this paper, we describe the computational tools that will be useful for federating a bioinformatics infrastructure and the open research challenges that we see in doing so. We hope that, by showing the steps involved, this manuscript will spur greater federation of bioinformatics infrastructures and thus allow them to scale to support larger user bases.

Contact: Dr. Somali Chaterji; schaterji@purdue.edu

Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.

Keywords: Computational genomics, cyberinfrastructure, federation, identity management, MG-RAST, genomic privacy.