publications 2017-03-09

In this group meeting, we quickly discussed these latest papers:

  • S. Salehi, A. Steif, A. Roth, S. Aparicio, A. Bouchard-Côté, and S. P. Shah, “ddClone: joint statistical inference of clonal populations from single cell and bulk tumour sequencing data,” Genome Biology, vol. 18, iss. 1, p. 44, 2017. doi:10.1186/s13059-017-1169-3

    Next-generation sequencing (NGS) of bulk tumour tissue can identify constituent cell populations in cancers and measure their abundance. This requires computational deconvolution of allelic counts from somatic mutations, which may be incapable of fully resolving the underlying population structure. Single cell sequencing (SCS) is a more direct method, although its replacement of NGS is impeded by technical noise and sampling limitations. We propose ddClone, which analytically integrates NGS and SCS data, leveraging their complementary attributes through joint statistical inference. We show on real and simulated datasets that ddClone produces more accurate results than can be achieved by either method alone.

    @Article{Salehi2017,
    author="Salehi, Sohrab
    and Steif, Adi
    and Roth, Andrew
    and Aparicio, Samuel
    and Bouchard-C{\^o}t{\'e}, Alexandre
    and Shah, Sohrab P.",
    title="ddClone: joint statistical inference of clonal populations from single cell and bulk tumour sequencing data",
    journal="Genome Biology",
    year="2017",
    volume="18",
    number="1",
    pages="44",
    abstract="Next-generation sequencing (NGS) of bulk tumour tissue can identify constituent cell populations in cancers and measure their abundance. This requires computational deconvolution of allelic counts from somatic mutations, which may be incapable of fully resolving the underlying population structure. Single cell sequencing (SCS) is a more direct method, although its replacement of NGS is impeded by technical noise and sampling limitations. We propose ddClone, which analytically integrates NGS and SCS data, leveraging their complementary attributes through joint statistical inference. We show on real and simulated datasets that ddClone produces more accurate results than can be achieved by either method alone.",
    issn="1474-760X",
    doi="10.1186/s13059-017-1169-3",
    url="//dx.doi.org/10.1186/s13059-017-1169-3"
    }

  • I. de Santiago, W. Liu, K. Yuan, M. O’Reilly, C. S. R. Chilamakuri, B. A. J. Ponder, K. B. Meyer, and F. Markowetz, “BaalChIP: Bayesian analysis of allele-specific transcription factor binding in cancer genomes,” Genome Biology, vol. 18, iss. 1, p. 39, 2017. doi:10.1186/s13059-017-1165-7

    Allele-specific measurements of transcription factor binding from ChIP-seq data are key to dissecting the allelic effects of non-coding variants and their contribution to phenotypic diversity. However, most methods of detecting an allelic imbalance assume diploid genomes. This assumption severely limits their applicability to cancer samples with frequent DNA copy-number changes. Here we present a Bayesian statistical approach called BaalChIP to correct for the effect of background allele frequency on the observed ChIP-seq read counts. BaalChIP allows the joint analysis of multiple ChIP-seq samples across a single variant and outperforms competing approaches in simulations. Using 548 ENCODE ChIP-seq and six targeted FAIRE-seq samples, we show that BaalChIP effectively corrects allele-specific analysis for copy-number variation and increases the power to detect putative cis-acting regulatory variants in cancer genomes.

    @Article{deSantiago2017,
    author="de Santiago, Ines
    and Liu, Wei
    and Yuan, Ke
    and O'Reilly, Martin
    and Chilamakuri, Chandra Sekhar Reddy
    and Ponder, Bruce A. J.
    and Meyer, Kerstin B.
    and Markowetz, Florian",
    title="BaalChIP: Bayesian analysis of allele-specific transcription factor binding in cancer genomes",
    journal="Genome Biology",
    year="2017",
    volume="18",
    number="1",
    pages="39",
    abstract="Allele-specific measurements of transcription factor binding from ChIP-seq data are key to dissecting the allelic effects of non-coding variants and their contribution to phenotypic diversity. However, most methods of detecting an allelic imbalance assume diploid genomes. This assumption severely limits their applicability to cancer samples with frequent DNA copy-number changes. Here we present a Bayesian statistical approach called BaalChIP to correct for the effect of background allele frequency on the observed ChIP-seq read counts. BaalChIP allows the joint analysis of multiple ChIP-seq samples across a single variant and outperforms competing approaches in simulations. Using 548 ENCODE ChIP-seq and six targeted FAIRE-seq samples, we show that BaalChIP effectively corrects allele-specific analysis for copy-number variation and increases the power to detect putative cis-acting regulatory variants in cancer genomes.",
    issn="1474-760X",
    doi="10.1186/s13059-017-1165-7",
    url="//dx.doi.org/10.1186/s13059-017-1165-7"
    }

  • M. Vincent, K. Mundbjerg, J. Skou Pedersen, G. Liang, P. A. Jones, T. F. Ørntoft, K. Dalsgaard Sørensen, and C. Wiuf, “epiG: statistical inference and profiling of DNA methylation from whole-genome bisulfite sequencing data,” Genome Biology, vol. 18, iss. 1, p. 38, 2017. doi:10.1186/s13059-017-1168-4

    The study of epigenetic heterogeneity at the level of individual cells and in whole populations is the key to understanding cellular differentiation, organismal development, and the evolution of cancer. We develop a statistical method, epiG, to infer and differentiate between different epi-allelic haplotypes, annotated with CpG methylation status and DNA polymorphisms, from whole-genome bisulfite sequencing data, and nucleosome occupancy from NOMe-seq data. We demonstrate the capabilities of the method by inferring allele-specific methylation and nucleosome occupancy in cell lines, and colon and tumor samples, and by benchmarking the method against independent experimental data.

    @Article{Vincent2017,
    author="Vincent, Martin
    and Mundbjerg, Kamilla
    and Skou Pedersen, Jakob
    and Liang, Gangning
    and Jones, Peter A.
    and {\O}rntoft, Torben Falck
    and Dalsgaard S{\o}rensen, Karina
    and Wiuf, Carsten",
    title="epiG: statistical inference and profiling of DNA methylation from whole-genome bisulfite sequencing data",
    journal="Genome Biology",
    year="2017",
    volume="18",
    number="1",
    pages="38",
    abstract="The study of epigenetic heterogeneity at the level of individual cells and in whole populations is the key to understanding cellular differentiation, organismal development, and the evolution of cancer. We develop a statistical method, epiG, to infer and differentiate between different epi-allelic haplotypes, annotated with CpG methylation status and DNA polymorphisms, from whole-genome bisulfite sequencing data, and nucleosome occupancy from NOMe-seq data. We demonstrate the capabilities of the method by inferring allele-specific methylation and nucleosome occupancy in cell lines, and colon and tumor samples, and by benchmarking the method against independent experimental data.",
    issn="1474-760X",
    doi="10.1186/s13059-017-1168-4",
    url="//dx.doi.org/10.1186/s13059-017-1168-4"
    }

  • M. Li, X. Xie, J. Zhou, M. Sheng, X. Yin, E. Ko, T. Zhou, and W. Gu, “Quantifying circular RNA expression from RNA-seq data using model-based framework,” Bioinformatics, 2017.

    Motivation: Circular RNAs (circRNAs) are a class of non-coding RNAs that are widely expressed in various cell lines and tissues of many organisms. Although the exact function of many circRNAs is largely unknown, the cell type– and tissue-specific circRNA expression has implicated their crucial functions in many biological processes. Hence, the quantification of circRNA expression from high-throughput RNA-seq data is becoming important to ascertain. Although many model-based methods have been developed to quantify linear RNA expression from RNA-seq data, these methods are not applicable to circRNA quantification. Results: Here we proposed a novel strategy that transforms circular transcripts to pseudo-linear transcripts and estimates the expression values of both circular and linear transcripts using an existing model-based algorithm, Sailfish. The new strategy can accurately estimate transcript expression of both linear and circular transcripts from RNA-seq data. Several factors, such as gene length, amount of expression, and the ratio of circular to linear transcripts, had impacts on quantification performance of circular transcripts. In comparison to count-based tools, the new computational framework had superior performance in estimating the amount of circRNA expression from both simulated and real ribosomal RNA-depleted (rRNA-depleted) RNA-seq datasets. On the other hand, the consideration of circular transcripts in expression quantification from rRNA-depleted RNA-seq data showed substantial increased accuracy of linear transcript expression. Our proposed strategy was implemented in a program named Sailfish-cir. Availability: Sailfish-cir is freely available at //github.com/zerodel/Sailfish-cir.

    @article{li:quantifying2017,
    Abstract = {Motivation: Circular RNAs (circRNAs) are a class of non-coding RNAs that are widely expressed in various cell lines and tissues of many organisms. Although the exact function of many circRNAs is largely unknown, the cell type-- and tissue-specific circRNA expression has implicated their crucial functions in many biological processes. Hence, the quantification of circRNA expression from high-throughput RNA-seq data is becoming important to ascertain. Although many model-based methods have been developed to quantify linear RNA expression from RNA-seq data, these methods are not applicable to circRNA quantification.
    Results: Here we proposed a novel strategy that transforms circular transcripts to pseudo-linear transcripts and estimates the expression values of both circular and linear transcripts using an existing model-based algorithm, Sailfish. The new strategy can accurately estimate transcript expression of both linear and circular transcripts from RNA-seq data. Several factors, such as gene length, amount of expression, and the ratio of circular to linear transcripts, had impacts on quantification performance of circular transcripts. In comparison to count-based tools, the new computational framework had superior performance in estimating the amount of circRNA expression from both simulated and real ribosomal RNA-depleted (rRNA-depleted) RNA-seq datasets. On the other hand, the consideration of circular transcripts in expression quantification from rRNA-depleted RNA-seq data showed substantial increased accuracy of linear transcript expression. Our proposed strategy was implemented in a program named Sailfish-cir.
    Availability: Sailfish-cir is freely available at //github.com/zerodel/Sailfish-cir.},
    Author = {Li, Musheng and Xie, Xueying and Zhou, Jing and Sheng, Mengying and Yin, Xiaofeng and Ko, Eun-A and Zhou, Tong and Gu, Wanjun},
    Date-Added = {2017-03-09 08:39:18 +0000},
    Date-Modified = {2017-03-09 08:40:40 +0000},
    Journal = {Bioinformatics},
    Title = {Quantifying circular RNA expression from RNA-seq data using model-based framework},
    Year = {2017}}

  • M. Miladi, A. Junge, F. Costa, S. E. Seemann, J. H. Havgaard, J. Gorodkin, and R. Backofen, “RNAscClust: clustering RNA sequences using structure conservation and graph based motifs,” Bioinformatics, 2017.

    Motivation: Clustering RNA sequences with common secondary structure is an essential step towards studying RNA function. Whereas structural RNA alignment strategies typically identify common structure for orthologous structured RNAs, clustering seeks to group paralogous RNAs based on structural similarities. However, existing approaches for clustering paralogous RNAs, do not take the compensatory base pair changes obtained from structure conservation in orthologous sequences into account. Results: Here, we present RNAscClust, the implementation of a new algorithm to cluster a set of structured RNAs taking their respective structural conservation into account. For a set of multiple structural alignments of RNA sequences, each containing a paralog sequence included in a structural alignment of its orthologs, RNAscClust computes minimum free-energy structures for each sequence using conserved base pairs as prior information for the folding. The paralogs are then clustered using a graph kernel-based strategy, which identifies common structural features. We show that the clustering accuracy clearly benefits from an increasing degree of compensatory base pair changes in the alignments. Availability: RNAscClust is available at //www.bioinf.uni-freiburg.de/Software/RNAscClust

    @article{miladi:RNAscClust2017,
    Abstract = {Motivation: Clustering RNA sequences with common secondary structure is an essential step towards studying RNA function. Whereas structural RNA alignment strategies typically identify common structure for orthologous structured RNAs, clustering seeks to group paralogous RNAs based on structural similarities. However, existing approaches for clustering paralogous RNAs, do not take the compensatory base pair changes obtained from structure conservation in orthologous sequences into account.
    Results: Here, we present RNAscClust, the implementation of a new algorithm to cluster a set of structured RNAs taking their respective structural conservation into account. For a set of multiple structural alignments of RNA sequences, each containing a paralog sequence included in a structural alignment of its orthologs, RNAscClust computes minimum free-energy structures for each sequence using conserved base pairs as prior information for the folding. The paralogs are then clustered using a graph kernel-based strategy, which identifies common structural features. We show that the clustering accuracy clearly benefits from an increasing degree of compensatory base pair changes in the alignments.
    Availability: RNAscClust is available at //www.bioinf.uni-freiburg.de/Software/RNAscClust},
    Author = {Miladi, Milad and Junge, Alexander and Costa, Fabrizio and Seemann, Stefan E. and Havgaard, Jakob Hull and Gorodkin, Jan and Backofen, Rolf},
    Date-Added = {2017-03-09 08:36:32 +0000},
    Date-Modified = {2017-03-09 08:38:00 +0000},
    Journal = {Bioinformatics},
    Title = {RNAscClust: clustering RNA sequences using structure conservation and graph based motifs},
    Year = {2017}}

  • M. Michel, C. Demel, B. Zacher, B. Schwalb, S. Krebs, H. Blum, J. Gagneur, and P. Cramer, “TT-seq captures enhancer landscapes immediately after T-cell stimulation,” Mol Syst Biol, vol. 13, iss. 3, p. 920, 2017.

    To monitor transcriptional regulation in human cells, rapid changes in enhancer and promoter activity must be captured with high sensitivity and temporal resolution. Here, we show that the recently established protocol TT-seq ("transient transcriptome sequencing") can monitor rapid changes in transcription from enhancers and promoters during the immediate response of T cells to ionomycin and phorbol 12-myristate 13-acetate (PMA). TT-seq maps eRNAs and mRNAs every 5 min after T-cell stimulation with high sensitivity and identifies many new primary response genes. TT-seq reveals that the synthesis of 1,601 eRNAs and 650 mRNAs changes significantly within only 15 min after stimulation, when standard RNA-seq does not detect differentially expressed genes. Transcription of enhancers that are primed for activation by nucleosome depletion can occur immediately and simultaneously with transcription of target gene promoters. Our results indicate that enhancer transcription is a good proxy for enhancer regulatory activity in target gene activation, and establish TT-seq as a tool for monitoring the dynamics of enhancer landscapes and transcription programs during cellular responses and differentiation.

    @article{Michel:2017kx,
    Abstract = {To monitor transcriptional regulation in human cells, rapid changes in enhancer and promoter activity must be captured with high sensitivity and temporal resolution. Here, we show that the recently established protocol TT-seq ("transient transcriptome sequencing") can monitor rapid changes in transcription from enhancers and promoters during the immediate response of T cells to ionomycin and phorbol 12-myristate 13-acetate (PMA). TT-seq maps eRNAs and mRNAs every 5 min after T-cell stimulation with high sensitivity and identifies many new primary response genes. TT-seq reveals that the synthesis of 1,601 eRNAs and 650 mRNAs changes significantly within only 15 min after stimulation, when standard RNA-seq does not detect differentially expressed genes. Transcription of enhancers that are primed for activation by nucleosome depletion can occur immediately and simultaneously with transcription of target gene promoters. Our results indicate that enhancer transcription is a good proxy for enhancer regulatory activity in target gene activation, and establish TT-seq as a tool for monitoring the dynamics of enhancer landscapes and transcription programs during cellular responses and differentiation.},
    Author = {Michel, Margaux and Demel, Carina and Zacher, Benedikt and Schwalb, Bj{\"o}rn and Krebs, Stefan and Blum, Helmut and Gagneur, Julien and Cramer, Patrick},
    Date-Added = {2017-03-09 08:21:22 +0000},
    Date-Modified = {2017-03-09 08:21:22 +0000},
    Journal = {Mol Syst Biol},
    Journal-Full = {Molecular systems biology},
    Keywords = {T‐cell response; enhancers; functional genomics; promoters; transcriptome analysis},
    Month = {Mar},
    Number = {3},
    Pages = {920},
    Pmid = {28270558},
    Pst = {epublish},
    Title = {TT-seq captures enhancer landscapes immediately after T-cell stimulation},
    Volume = {13},
    Year = {2017}}

  • Y. He, D. U. Gorkin, D. E. Dickel, J. R. Nery, R. G. Castanon, A. Y. Lee, Y. Shen, A. Visel, L. A. Pennacchio, B. Ren, and J. R. Ecker, “Improved regulatory element prediction based on tissue-specific local epigenomic signatures,” Proc Natl Acad Sci U S A, vol. 114, iss. 9, p. E1633-E1640, 2017. doi:10.1073/pnas.1618353114

    Accurate enhancer identification is critical for understanding the spatiotemporal transcriptional regulation during development as well as the functional impact of disease-related noncoding genetic variants. Computational methods have been developed to predict the genomic locations of active enhancers based on histone modifications, but the accuracy and resolution of these methods remain limited. Here, we present an algorithm, regulatory element prediction based on tissue-specific local epigenetic marks (REPTILE), which integrates histone modification and whole-genome cytosine DNA methylation profiles to identify the precise location of enhancers. We tested the ability of REPTILE to identify enhancers previously validated in reporter assays. Compared with existing methods, REPTILE shows consistently superior performance across diverse cell and tissue types, and the enhancer locations are significantly more refined. We show that, by incorporating base-resolution methylation data, REPTILE greatly improves upon current methods for annotation of enhancers across a variety of cell and tissue types. REPTILE is available at //github.com/yupenghe/REPTILE/.

    @article{He:2017uq,
    Abstract = {Accurate enhancer identification is critical for understanding the spatiotemporal transcriptional regulation during development as well as the functional impact of disease-related noncoding genetic variants. Computational methods have been developed to predict the genomic locations of active enhancers based on histone modifications, but the accuracy and resolution of these methods remain limited. Here, we present an algorithm, regulatory element prediction based on tissue-specific local epigenetic marks (REPTILE), which integrates histone modification and whole-genome cytosine DNA methylation profiles to identify the precise location of enhancers. We tested the ability of REPTILE to identify enhancers previously validated in reporter assays. Compared with existing methods, REPTILE shows consistently superior performance across diverse cell and tissue types, and the enhancer locations are significantly more refined. We show that, by incorporating base-resolution methylation data, REPTILE greatly improves upon current methods for annotation of enhancers across a variety of cell and tissue types. REPTILE is available at //github.com/yupenghe/REPTILE/.},
    Author = {He, Yupeng and Gorkin, David U and Dickel, Diane E and Nery, Joseph R and Castanon, Rosa G and Lee, Ah Young and Shen, Yin and Visel, Axel and Pennacchio, Len A and Ren, Bing and Ecker, Joseph R},
    Date-Added = {2017-03-08 16:00:15 +0000},
    Date-Modified = {2017-03-08 16:00:15 +0000},
    Doi = {10.1073/pnas.1618353114},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {DNA methylation; bioinformatics; enhancer prediction; epigenetics; gene regulation},
    Month = {Feb},
    Number = {9},
    Pages = {E1633-E1640},
    Pmid = {28193886},
    Pst = {ppublish},
    Title = {Improved regulatory element prediction based on tissue-specific local epigenomic signatures},
    Volume = {114},
    Year = {2017},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1618353114}}

  • F. Neri, S. Rapelli, A. Krepelova, D. Incarnato, C. Parlato, G. Basile, M. Maldotti, F. Anselmi, and S. Oliviero, “Intragenic DNA methylation prevents spurious transcription initiation,” Nature, vol. 543, iss. 7643, pp. 72-77, 2017. doi:10.1038/nature21373

    In mammals, DNA methylation occurs mainly at CpG dinucleotides. Methylation of the promoter suppresses gene expression, but the functional role of gene-body DNA methylation in highly expressed genes has yet to be clarified. Here we show that, in mouse embryonic stem cells, Dnmt3b-dependent intragenic DNA methylation protects the gene body from spurious RNA polymerase II entry and cryptic transcription initiation. Using different genome-wide approaches, we demonstrate that this Dnmt3b function is dependent on its enzymatic activity and recruitment to the gene body by H3K36me3. Furthermore, the spurious transcripts can either be degraded by the RNA exosome complex or capped, polyadenylated, and delivered to the ribosome to produce aberrant proteins. Elongating RNA polymerase II therefore triggers an epigenetic crosstalk mechanism that involves SetD2, H3K36me3, Dnmt3b and DNA methylation to ensure the fidelity of gene transcription initiation, with implications for intragenic hypomethylation in cancer.

    @article{Neri:2017fk,
    Abstract = {In mammals, DNA methylation occurs mainly at CpG dinucleotides. Methylation of the promoter suppresses gene expression, but the functional role of gene-body DNA methylation in highly expressed genes has yet to be clarified. Here we show that, in mouse embryonic stem cells, Dnmt3b-dependent intragenic DNA methylation protects the gene body from spurious RNA polymerase II entry and cryptic transcription initiation. Using different genome-wide approaches, we demonstrate that this Dnmt3b function is dependent on its enzymatic activity and recruitment to the gene body by H3K36me3. Furthermore, the spurious transcripts can either be degraded by the RNA exosome complex or capped, polyadenylated, and delivered to the ribosome to produce aberrant proteins. Elongating RNA polymerase II therefore triggers an epigenetic crosstalk mechanism that involves SetD2, H3K36me3, Dnmt3b and DNA methylation to ensure the fidelity of gene transcription initiation, with implications for intragenic hypomethylation in cancer.},
    Author = {Neri, Francesco and Rapelli, Stefania and Krepelova, Anna and Incarnato, Danny and Parlato, Caterina and Basile, Giulia and Maldotti, Mara and Anselmi, Francesca and Oliviero, Salvatore},
    Date-Added = {2017-03-08 15:45:48 +0000},
    Date-Modified = {2017-03-08 15:45:48 +0000},
    Doi = {10.1038/nature21373},
    Journal = {Nature},
    Journal-Full = {Nature},
    Month = {Mar},
    Number = {7643},
    Pages = {72-77},
    Pmid = {28225755},
    Pst = {ppublish},
    Title = {Intragenic DNA methylation prevents spurious transcription initiation},
    Volume = {543},
    Year = {2017},
    Bdsk-Url-1 = {//dx.doi.org/10.1038/nature21373}}

publications 2017-02-02

In this group meeting, we quickly discussed these latest papers:

  • X. Qiu, A. Hill, J. Packer, D. Lin, Y. Ma, and C. Trapnell, “Single-cell mRNA quantification and differential analysis with Census,” Nature Methods, 2017.
    @article{qiu2017single,
    title={Single-cell mRNA quantification and differential analysis with Census},
    author={Qiu, X and Hill, A and Packer, J and Lin, D and Ma, YA and Trapnell, C},
    journal={Nature Methods},
    year={2017}
    }

  • K. R. Ghusinga, J. J. Dennehy, and A. Singh, “First-passage time approach to controlling noise in the timing of intracellular events,” Proc Natl Acad Sci U S A, vol. 114, iss. 4, pp. 693-698, 2017. doi:10.1073/pnas.1609012114

    In the noisy cellular environment, gene products are subject to inherent random fluctuations in copy numbers over time. How cells ensure precision in the timing of key intracellular events despite such stochasticity is an intriguing fundamental problem. We formulate event timing as a first-passage time problem, where an event is triggered when the level of a protein crosses a critical threshold for the first time. Analytical calculations are performed for the first-passage time distribution in stochastic models of gene expression. Derivation of these formulas motivates an interesting question: Is there an optimal feedback strategy to regulate the synthesis of a protein to ensure that an event will occur at a precise time, while minimizing deviations or noise about the mean? Counterintuitively, results show that for a stable long-lived protein, the optimal strategy is to express the protein at a constant rate without any feedback regulation, and any form of feedback (positive, negative, or any combination of them) will always amplify noise in event timing. In contrast, a positive feedback mechanism provides the highest precision in timing for an unstable protein. These theoretical results explain recent experimental observations of single-cell lysis times in bacteriophage [Formula: see text] Here, lysis of an infected bacterial cell is orchestrated by the expression and accumulation of a stable [Formula: see text] protein up to a threshold, and precision in timing is achieved via feedforward rather than feedback control. Our results have broad implications for diverse cellular processes that rely on precise temporal triggering of events.

    @article{Ghusinga:2017zr,
    Abstract = {In the noisy cellular environment, gene products are subject to inherent random fluctuations in copy numbers over time. How cells ensure precision in the timing of key intracellular events despite such stochasticity is an intriguing fundamental problem. We formulate event timing as a first-passage time problem, where an event is triggered when the level of a protein crosses a critical threshold for the first time. Analytical calculations are performed for the first-passage time distribution in stochastic models of gene expression. Derivation of these formulas motivates an interesting question: Is there an optimal feedback strategy to regulate the synthesis of a protein to ensure that an event will occur at a precise time, while minimizing deviations or noise about the mean? Counterintuitively, results show that for a stable long-lived protein, the optimal strategy is to express the protein at a constant rate without any feedback regulation, and any form of feedback (positive, negative, or any combination of them) will always amplify noise in event timing. In contrast, a positive feedback mechanism provides the highest precision in timing for an unstable protein. These theoretical results explain recent experimental observations of single-cell lysis times in bacteriophage [Formula: see text] Here, lysis of an infected bacterial cell is orchestrated by the expression and accumulation of a stable [Formula: see text] protein up to a threshold, and precision in timing is achieved via feedforward rather than feedback control. Our results have broad implications for diverse cellular processes that rely on precise temporal triggering of events.},
    Author = {Ghusinga, Khem Raj and Dennehy, John J and Singh, Abhyudai},
    Date-Added = {2017-02-01 21:24:54 +0000},
    Date-Modified = {2017-02-01 21:24:54 +0000},
    Doi = {10.1073/pnas.1609012114},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {event timing; feedback control; first-passage time; single cell; stochastic gene expression},
    Month = {Jan},
    Number = {4},
    Pages = {693-698},
    Pmid = {28069947},
    Pst = {ppublish},
    Title = {First-passage time approach to controlling noise in the timing of intracellular events},
    Volume = {114},
    Year = {2017},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1609012114}}

  • C. Donner, K. Obermayer, and H. Shimazaki, “Approximate inference for time-varying interactions and macroscopic dynamics of neural populations,” PLoS Comput Biol, vol. 13, iss. 1, p. e1005309, 2017. doi:10.1371/journal.pcbi.1005309

    The models in statistical physics such as an Ising model offer a convenient way to characterize stationary activity of neural populations. Such stationary activity of neurons may be expected for recordings from in vitro slices or anesthetized animals. However, modeling activity of cortical circuitries of awake animals has been more challenging because both spike-rates and interactions can change according to sensory stimulation, behavior, or an internal state of the brain. Previous approaches modeling the dynamics of neural interactions suffer from computational cost; therefore, its application was limited to only a dozen neurons. Here by introducing multiple analytic approximation methods to a state-space model of neural population activity, we make it possible to estimate dynamic pairwise interactions of up to 60 neurons. More specifically, we applied the pseudolikelihood approximation to the state-space model, and combined it with the Bethe or TAP mean-field approximation to make the sequential Bayesian estimation of the model parameters possible. The large-scale analysis allows us to investigate dynamics of macroscopic properties of neural circuitries underlying stimulus processing and behavior. We show that the model accurately estimates dynamics of network properties such as sparseness, entropy, and heat capacity by simulated data, and demonstrate utilities of these measures by analyzing activity of monkey V4 neurons as well as a simulated balanced network of spiking neurons.

    @article{Donner:2017ys,
    Abstract = {The models in statistical physics such as an Ising model offer a convenient way to characterize stationary activity of neural populations. Such stationary activity of neurons may be expected for recordings from in vitro slices or anesthetized animals. However, modeling activity of cortical circuitries of awake animals has been more challenging because both spike-rates and interactions can change according to sensory stimulation, behavior, or an internal state of the brain. Previous approaches modeling the dynamics of neural interactions suffer from computational cost; therefore, its application was limited to only a dozen neurons. Here by introducing multiple analytic approximation methods to a state-space model of neural population activity, we make it possible to estimate dynamic pairwise interactions of up to 60 neurons. More specifically, we applied the pseudolikelihood approximation to the state-space model, and combined it with the Bethe or TAP mean-field approximation to make the sequential Bayesian estimation of the model parameters possible. The large-scale analysis allows us to investigate dynamics of macroscopic properties of neural circuitries underlying stimulus processing and behavior. We show that the model accurately estimates dynamics of network properties such as sparseness, entropy, and heat capacity by simulated data, and demonstrate utilities of these measures by analyzing activity of monkey V4 neurons as well as a simulated balanced network of spiking neurons.},
    Author = {Donner, Christian and Obermayer, Klaus and Shimazaki, Hideaki},
    Date-Added = {2017-02-01 21:22:29 +0000},
    Date-Modified = {2017-02-01 21:22:29 +0000},
    Doi = {10.1371/journal.pcbi.1005309},
    Journal = {PLoS Comput Biol},
    Journal-Full = {PLoS computational biology},
    Month = {Jan},
    Number = {1},
    Pages = {e1005309},
    Pmid = {28095421},
    Pst = {epublish},
    Title = {Approximate Inference for Time-Varying Interactions and Macroscopic Dynamics of Neural Populations},
    Volume = {13},
    Year = {2017},
    Bdsk-Url-1 = {//dx.doi.org/10.1371/journal.pcbi.1005309}}

  • F. Fröhlich, B. Kaltenbacher, F. J. Theis, and J. Hasenauer, “Scalable parameter estimation for genome-scale biochemical reaction networks,” PLoS Comput Biol, vol. 13, iss. 1, p. e1005331, 2017. doi:10.1371/journal.pcbi.1005331

    Mechanistic mathematical modeling of biochemical reaction networks using ordinary differential equation (ODE) models has improved our understanding of small- and medium-scale biological processes. While the same should in principle hold for large- and genome-scale processes, the computational methods for the analysis of ODE models which describe hundreds or thousands of biochemical species and reactions are missing so far. While individual simulations are feasible, the inference of the model parameters from experimental data is computationally too intensive. In this manuscript, we evaluate adjoint sensitivity analysis for parameter estimation in large scale biochemical reaction networks. We present the approach for time-discrete measurement and compare it to state-of-the-art methods used in systems and computational biology. Our comparison reveals a significantly improved computational efficiency and a superior scalability of adjoint sensitivity analysis. The computational complexity is effectively independent of the number of parameters, enabling the analysis of large- and genome-scale models. Our study of a comprehensive kinetic model of ErbB signaling shows that parameter estimation using adjoint sensitivity analysis requires a fraction of the computation time of established methods. The proposed method will facilitate mechanistic modeling of genome-scale cellular processes, as required in the age of omics.

    @article{Frohlich:2017vn,
    Abstract = {Mechanistic mathematical modeling of biochemical reaction networks using ordinary differential equation (ODE) models has improved our understanding of small- and medium-scale biological processes. While the same should in principle hold for large- and genome-scale processes, the computational methods for the analysis of ODE models which describe hundreds or thousands of biochemical species and reactions are missing so far. While individual simulations are feasible, the inference of the model parameters from experimental data is computationally too intensive. In this manuscript, we evaluate adjoint sensitivity analysis for parameter estimation in large scale biochemical reaction networks. We present the approach for time-discrete measurement and compare it to state-of-the-art methods used in systems and computational biology. Our comparison reveals a significantly improved computational efficiency and a superior scalability of adjoint sensitivity analysis. The computational complexity is effectively independent of the number of parameters, enabling the analysis of large- and genome-scale models. Our study of a comprehensive kinetic model of ErbB signaling shows that parameter estimation using adjoint sensitivity analysis requires a fraction of the computation time of established methods. The proposed method will facilitate mechanistic modeling of genome-scale cellular processes, as required in the age of omics.},
    Author = {Fr{\"o}hlich, Fabian and Kaltenbacher, Barbara and Theis, Fabian J and Hasenauer, Jan},
    Date-Added = {2017-02-01 21:17:41 +0000},
    Date-Modified = {2017-02-01 21:17:41 +0000},
    Doi = {10.1371/journal.pcbi.1005331},
    Journal = {PLoS Comput Biol},
    Journal-Full = {PLoS computational biology},
    Month = {Jan},
    Number = {1},
    Pages = {e1005331},
    Pmid = {28114351},
    Pst = {epublish},
    Title = {Scalable Parameter Estimation for Genome-Scale Biochemical Reaction Networks},
    Volume = {13},
    Year = {2017},
    Bdsk-Url-1 = {//dx.doi.org/10.1371/journal.pcbi.1005331}}

  • D. Lamparter, D. Marbach, R. Rueedi, S. Bergmann, and Z. Kutalik, “Genome-wide association between transcription factor expression and chromatin accessibility reveals regulators of chromatin accessibility,” PLoS Comput Biol, vol. 13, iss. 1, p. e1005311, 2017. doi:10.1371/journal.pcbi.1005311

    To better understand genome regulation, it is important to uncover the role of transcription factors in the process of chromatin structure establishment and maintenance. Here we present a data-driven approach to systematically characterise transcription factors that are relevant for this process. Our method uses a linear mixed modelling approach to combine datasets of transcription factor binding motif enrichments in open chromatin and gene expression across the same set of cell lines. Applying this approach to the ENCODE dataset, we confirm already known and imply numerous novel transcription factors that play a role in the establishment or maintenance of open chromatin. In particular, our approach rediscovers many factors that have been annotated as pioneer factors.

    @article{Lamparter:2017kx,
    Abstract = {To better understand genome regulation, it is important to uncover the role of transcription factors in the process of chromatin structure establishment and maintenance. Here we present a data-driven approach to systematically characterise transcription factors that are relevant for this process. Our method uses a linear mixed modelling approach to combine datasets of transcription factor binding motif enrichments in open chromatin and gene expression across the same set of cell lines. Applying this approach to the ENCODE dataset, we confirm already known and imply numerous novel transcription factors that play a role in the establishment or maintenance of open chromatin. In particular, our approach rediscovers many factors that have been annotated as pioneer factors.},
    Author = {Lamparter, David and Marbach, Daniel and Rueedi, Rico and Bergmann, Sven and Kutalik, Zolt{\'a}n},
    Date-Added = {2017-02-01 21:16:14 +0000},
    Date-Modified = {2017-02-01 21:16:14 +0000},
    Doi = {10.1371/journal.pcbi.1005311},
    Journal = {PLoS Comput Biol},
    Journal-Full = {PLoS computational biology},
    Month = {Jan},
    Number = {1},
    Pages = {e1005311},
    Pmid = {28118358},
    Pst = {epublish},
    Title = {Genome-Wide Association between Transcription Factor Expression and Chromatin Accessibility Reveals Regulators of Chromatin Accessibility},
    Volume = {13},
    Year = {2017},
    Bdsk-Url-1 = {//dx.doi.org/10.1371/journal.pcbi.1005311}}

  • V. I. Risca, S. K. Denny, A. F. Straight, and W. J. Greenleaf, “Variable chromatin structure revealed by in situ spatially correlated DNA cleavage mapping,” Nature, vol. 541, iss. 7636, pp. 237-241, 2017. doi:10.1038/nature20781

    Chromatin structure at the length scale encompassing local nucleosome-nucleosome interactions is thought to play a crucial role in regulating transcription and access to DNA. However, this secondary structure of chromatin remains poorly understood compared with the primary structure of single nucleosomes or the tertiary structure of long-range looping interactions. Here we report the first genome-wide map of chromatin conformation in human cells at the 1-3 nucleosome (50-500 bp) scale, obtained using ionizing radiation-induced spatially correlated cleavage of DNA with sequencing (RICC-seq) to identify DNA-DNA contacts that are spatially proximal. Unbiased analysis of RICC-seq signal reveals regional enrichment of DNA fragments characteristic of alternating rather than adjacent nucleosome interactions in tri-nucleosome units, particularly in H3K9me3-marked heterochromatin. We infer differences in the likelihood of nucleosome-nucleosome contacts among open chromatin, H3K27me3-marked, and H3K9me3-marked repressed chromatin regions. After calibrating RICC-seq signal to three-dimensional distances, we show that compact two-start helical fibre structures with stacked alternating nucleosomes are consistent with RICC-seq fragmentation patterns from H3K9me3-marked chromatin, while non-compact structures and solenoid structures are consistent with open chromatin. Our data support a model of chromatin architecture in intact interphase nuclei consistent with variable longitudinal compaction of two-start helical fibres.

    @article{Risca:2017uq,
    Abstract = {Chromatin structure at the length scale encompassing local nucleosome-nucleosome interactions is thought to play a crucial role in regulating transcription and access to DNA. However, this secondary structure of chromatin remains poorly understood compared with the primary structure of single nucleosomes or the tertiary structure of long-range looping interactions. Here we report the first genome-wide map of chromatin conformation in human cells at the 1-3 nucleosome (50-500 bp) scale, obtained using ionizing radiation-induced spatially correlated cleavage of DNA with sequencing (RICC-seq) to identify DNA-DNA contacts that are spatially proximal. Unbiased analysis of RICC-seq signal reveals regional enrichment of DNA fragments characteristic of alternating rather than adjacent nucleosome interactions in tri-nucleosome units, particularly in H3K9me3-marked heterochromatin. We infer differences in the likelihood of nucleosome-nucleosome contacts among open chromatin, H3K27me3-marked, and H3K9me3-marked repressed chromatin regions. After calibrating RICC-seq signal to three-dimensional distances, we show that compact two-start helical fibre structures with stacked alternating nucleosomes are consistent with RICC-seq fragmentation patterns from H3K9me3-marked chromatin, while non-compact structures and solenoid structures are consistent with open chromatin. Our data support a model of chromatin architecture in intact interphase nuclei consistent with variable longitudinal compaction of two-start helical fibres.},
    Author = {Risca, Viviana I and Denny, Sarah K and Straight, Aaron F and Greenleaf, William J},
    Date-Added = {2017-02-01 21:02:58 +0000},
    Date-Modified = {2017-02-01 21:02:58 +0000},
    Doi = {10.1038/nature20781},
    Journal = {Nature},
    Journal-Full = {Nature},
    Month = {Jan},
    Number = {7636},
    Pages = {237-241},
    Pmid = {28024297},
    Pst = {ppublish},
    Title = {Variable chromatin structure revealed by in situ spatially correlated DNA cleavage mapping},
    Volume = {541},
    Year = {2017},
    Bdsk-Url-1 = {//dx.doi.org/10.1038/nature20781}}

  • C. T. Woods and A. Laederach, “Classification of RNA structure change by "gazing" at experimental data,” Bioinformatics, 2017. doi:10.1093/bioinformatics/btx041

    MOTIVATION: Mutations (or Single Nucleotide Variants) in folded RiboNucleic Acid structures that cause local or global conformational change are riboSNitches. Predicting riboSNitches is challenging, as it requires making two, albeit related, structure predictions. The data most often used to experimentally validate riboSNitch predictions is Selective 2′ Hydroxyl Acylation by Primer Extension, or SHAPE. Experimentally establishing a riboSNitch requires the quantitative comparison of two SHAPE traces: wild-type (WT) and mutant. Historically, SHAPE data was collected on electropherograms and change in structure was evaluated by "gel gazing." SHAPE data is now routinely collected with next generation sequencing and/or capillary sequencers. We aim to establish a classifier capable of simulating human "gazing" by identifying features of the SHAPE profile that human experts agree "looks" like a riboSNitch. RESULTS: We find strong quantitative agreement between experts when RNA scientists "gaze" at SHAPE data and identify riboSNitches. We identify dynamic time warping and seven other features predictive of the human consensus. The classSNitch classifier reported here accurately reproduces human consensus for 167 mutant/WT comparisons with an Area Under the Curve (AUC) above 0.8. When we analyze 2019 mutant traces for 17 different RNAs, we find that features of the WT SHAPE reactivity allow us to improve thermodynamic structure predictions of riboSNitches. This is significant, as accurate RNA structural analysis and prediction is likely to become an important aspect of precision medicine. AVAILABILITY: The classSNitch R package is freely available at //classsnitch.r-forge.r-project.org CONTACT: alain@email.unc.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    @article{Woods:2017fk,
    Abstract = {MOTIVATION: Mutations (or Single Nucleotide Variants) in folded RiboNucleic Acid structures that cause local or global conformational change are riboSNitches. Predicting riboSNitches is challenging, as it requires making two, albeit related, structure predictions. The data most often used to experimentally validate riboSNitch predictions is Selective 2' Hydroxyl Acylation by Primer Extension, or SHAPE. Experimentally establishing a riboSNitch requires the quantitative comparison of two SHAPE traces: wild-type (WT) and mutant. Historically, SHAPE data was collected on electropherograms and change in structure was evaluated by "gel gazing." SHAPE data is now routinely collected with next generation sequencing and/or capillary sequencers. We aim to establish a classifier capable of simulating human "gazing" by identifying features of the SHAPE profile that human experts agree "looks" like a riboSNitch.
    RESULTS: We find strong quantitative agreement between experts when RNA scientists "gaze" at SHAPE data and identify riboSNitches. We identify dynamic time warping and seven other features predictive of the human consensus. The classSNitch classifier reported here accurately reproduces human consensus for 167 mutant/WT comparisons with an Area Under the Curve (AUC) above 0.8. When we analyze 2019 mutant traces for 17 different RNAs, we find that features of the WT SHAPE reactivity allow us to improve thermodynamic structure predictions of riboSNitches. This is significant, as accurate RNA structural analysis and prediction is likely to become an important aspect of precision medicine.
    AVAILABILITY: The classSNitch R package is freely available at //classsnitch.r-forge.r-project.org CONTACT: alain@email.unc.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.},
    Author = {Woods, Chanin Tolson and Laederach, Alain},
    Date-Added = {2017-02-01 20:46:43 +0000},
    Date-Modified = {2017-02-01 20:46:43 +0000},
    Doi = {10.1093/bioinformatics/btx041},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {Jan},
    Pmid = {28130241},
    Pst = {aheadofprint},
    Title = {Classification of RNA structure change by "gazing" at experimental data},
    Year = {2017},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btx041}}

publications 2017-01-12

In this group meeting, we quickly discussed these latest papers:

  • L. S. L. Tan, V. M. H. Ong, D. J. Nott, and A. Jasra, “Variational inference for sparse spectrum Gaussian process regression,” Statistics and computing, vol. 26, iss. 6, pp. 1243-1261, 2016. doi:10.1007/s11222-015-9600-7

    We develop a fast variational approximation scheme for Gaussian process (GP) regression, where the spectrum of the covariance function is subjected to a sparse approximation. Our approach enables uncertainty in covariance function hyperparameters to be treated without using Monte Carlo methods and is robust to overfitting. Our article makes three contributions. First, we present a variational Bayes algorithm for fitting sparse spectrum GP regression models that uses nonconjugate variational message passing to derive fast and efficient updates. Second, we propose a novel adaptive neighbourhood technique for obtaining predictive inference that is effective in dealing with nonstationarity. Regression is performed locally at each point to be predicted and the neighbourhood is determined using a measure defined based on lengthscales estimated from an initial fit. Weighting dimensions according to lengthscales, this downweights variables of little relevance, leading to automatic variable selection and improved prediction. Third, we introduce a technique for accelerating convergence in nonconjugate variational message passing by adapting step sizes in the direction of the natural gradient of the lower bound. Our adaptive strategy can be easily implemented and empirical results indicate significant speedups.

    @article{Tan2016,
    author="Tan, Linda S. L. and Ong, Victor M. H. and Nott, David J. and Jasra, Ajay",
    title="Variational inference for sparse spectrum Gaussian process regression",
    journal="Statistics and Computing",
    year="2016",
    volume="26",
    number="6",
    pages="1243--1261",
    abstract="We develop a fast variational approximation scheme for Gaussian process (GP) regression, where the spectrum of the covariance function is subjected to a sparse approximation. Our approach enables uncertainty in covariance function hyperparameters to be treated without using Monte Carlo methods and is robust to overfitting. Our article makes three contributions. First, we present a variational Bayes algorithm for fitting sparse spectrum GP regression models that uses nonconjugate variational message passing to derive fast and efficient updates. Second, we propose a novel adaptive neighbourhood technique for obtaining predictive inference that is effective in dealing with nonstationarity. Regression is performed locally at each point to be predicted and the neighbourhood is determined using a measure defined based on lengthscales estimated from an initial fit. Weighting dimensions according to lengthscales, this downweights variables of little relevance, leading to automatic variable selection and improved prediction. Third, we introduce a technique for accelerating convergence in nonconjugate variational message passing by adapting step sizes in the direction of the natural gradient of the lower bound. Our adaptive strategy can be easily implemented and empirical results indicate significant speedups.",
    issn="1573-1375",
    doi="10.1007/s11222-015-9600-7",
    url="//dx.doi.org/10.1007/s11222-015-9600-7"
    }

  • J. Yao, A. Pilko, and R. Wollman, “Distinct cellular states determine calcium signaling response,” Molecular systems biology, vol. 12, iss. 12, 2016. doi:10.15252/msb.20167137

    The heterogeneity in mammalian cells signaling response is largely a result of pre-existing cell-to-cell variability. It is unknown whether cell-to-cell variability rises from biochemical stochastic fluctuations or distinct cellular states. Here, we utilize calcium response to adenosine trisphosphate as a model for investigating the structure of heterogeneity within a population of cells and analyze whether distinct cellular response states coexist. We use a functional definition of cellular state that is based on a mechanistic dynamical systems model of calcium signaling. Using Bayesian parameter inference, we obtain high confidence parameter value distributions for several hundred cells, each fitted individually. Clustering the inferred parameter distributions revealed three major distinct cellular states within the population. The existence of distinct cellular states raises the possibility that the observed variability in response is a result of structured heterogeneity between cells. The inferred parameter distribution predicts, and experiments confirm that variability in IP3R response explains the majority of calcium heterogeneity. Our work shows how mechanistic models and single-cell parameter fitting can uncover hidden population structure and demonstrate the need for parameter inference at the single-cell level. Mol Syst Biol. (2016) 12: 894

    @article{Yao894,
    author = {Yao, Jason and Pilko, Anna and Wollman, Roy},
    title = {Distinct cellular states determine calcium signaling~response},
    volume = {12},
    number = {12},
    year = {2016},
    doi = {10.15252/msb.20167137},
    publisher = {EMBO Press},
    abstract = {The heterogeneity in mammalian cells signaling response is largely a result of pre-existing cell-to-cell variability. It is unknown whether cell-to-cell variability rises from biochemical stochastic fluctuations or distinct cellular states. Here, we utilize calcium response to adenosine trisphosphate as a model for investigating the structure of heterogeneity within a population of cells and analyze whether distinct cellular response states coexist. We use a functional definition of cellular state that is based on a mechanistic dynamical systems model of calcium signaling. Using Bayesian parameter inference, we obtain high confidence parameter value distributions for several hundred cells, each fitted individually. Clustering the inferred parameter distributions revealed three major distinct cellular states within the population. The existence of distinct cellular states raises the possibility that the observed variability in response is a result of structured heterogeneity between cells. The inferred parameter distribution predicts, and experiments confirm that variability in IP3R response explains the majority of calcium heterogeneity. Our work shows how mechanistic models and single-cell parameter fitting can uncover hidden population structure and demonstrate the need for parameter inference at the single-cell level. Mol Syst Biol. (2016) 12: 894},
    URL = {//msb.embopress.org/content/12/12/894},
    eprint = {//msb.embopress.org/content/12/12/894.full.pdf},
    journal = {Molecular Systems Biology}
    }

  • C. Zechner and M. Khammash, “A molecular implementation of the least mean squares estimator,” in 2016 IEEE 55th conference on decision and control (CDC), 2016, pp. 5869-5874. doi:10.1109/CDC.2016.7799172

    In order to function reliably, synthetic molecular circuits require mechanisms that allow them to adapt to environmental disturbances. Least mean squares (LMS) schemes, such as commonly encountered in signal processing and control, provide a powerful means to accomplish that goal. In this paper we show how the traditional LMS algorithm can be implemented at the molecular level using only a few elementary biomolecular reactions. We demonstrate our approach using several simulation studies and discuss its relevance to synthetic biology.

    @INPROCEEDINGS{Zechner2016,
    author={C. Zechner and M. Khammash},
    booktitle={2016 IEEE 55th Conference on Decision and Control (CDC)},
    title={A molecular implementation of the least mean squares estimator},
    year={2016},
    pages={5869-5874},
    abstract={In order to function reliably, synthetic molecular circuits require mechanisms that allow them to adapt to environmental disturbances. Least mean squares (LMS) schemes, such as commonly encountered in signal processing and control, provide a powerful means to accomplish that goal. In this paper we show how the traditional LMS algorithm can be implemented at the molecular level using only a few elementary biomolecular reactions. We demonstrate our approach using several simulation studies and discuss its relevance to synthetic biology.},
    keywords={Biological system modeling;Convergence;Estimation;Integrated circuit modeling;Mathematical model;Signal processing algorithms;Stochastic processes},
    doi={10.1109/CDC.2016.7799172},
    month={Dec}
    }

  • L. Bronstein and H. Koeppl, “Scalable inference using PMCMC and parallel tempering for high-throughput measurements of biomolecular reaction networks,” in 2016 IEEE 55th conference on decision and control (CDC), 2016, pp. 770-775. doi:10.1109/CDC.2016.7798361

    Inferring quantities of interest from fluorescence microscopy time-lapse measurements of cells is a key step in parameterizing models of biomolecular reaction networks, and also in comparing different models. In this article, we propose a method which performs inference in continuous-time Markov chain models and thus takes into account the discrete nature of molecule counts. It targets the important situation of inference from many measured cells. Our method, a complement to a recently proposed approach, is based on particle Markov chain Monte Carlo and can be argued to have improved scaling behavior as the number of measured cells increases. We numerically demonstrate the performance of our algorithm on simulated data.

    @INPROCEEDINGS{Bronstein2016,
    author={L. Bronstein and H. Koeppl},
    booktitle={2016 IEEE 55th Conference on Decision and Control (CDC)},
    title={Scalable inference using PMCMC and parallel tempering for high-throughput measurements of biomolecular reaction networks},
    year={2016},
    pages={770-775},
    abstract={Inferring quantities of interest from fluorescence microscopy time-lapse measurements of cells is a key step in parameterizing models of biomolecular reaction networks, and also in comparing different models. In this article, we propose a method which performs inference in continuous-time Markov chain models and thus takes into account the discrete nature of molecule counts. It targets the important situation of inference from many measured cells. Our method, a complement to a recently proposed approach, is based on particle Markov chain Monte Carlo and can be argued to have improved scaling behavior as the number of measured cells increases. We numerically demonstrate the performance of our algorithm on simulated data.},
    keywords={Atmospheric measurements;Markov processes;Mathematical model;Monte Carlo methods;Particle measurements;Time measurement;Trajectory},
    doi={10.1109/CDC.2016.7798361},
    month={Dec}
    }

  • S. A. Sevier, D. A. Kessler, and H. Levine, “Mechanical bounds to transcriptional noise,” Proc Natl Acad Sci U S A, vol. 113, iss. 49, pp. 13983-13988, 2016. doi:10.1073/pnas.1612651113

    Over the past several decades it has been increasingly recognized that stochastic processes play a central role in transcription. Although many stochastic effects have been explained, the source of transcriptional bursting (one of the most well-known sources of stochasticity) has continued to evade understanding. Recent results have pointed to mechanical feedback as the source of transcriptional bursting, but a reconciliation of this perspective with preexisting views of transcriptional regulation is lacking. In this article, we present a simple phenomenological model that is able to incorporate the traditional view of gene expression within a framework with mechanical limits to transcription. By introducing a simple competition between mechanical arrest and relaxation copy number probability distributions collapse onto a shared universal curve under shifting and rescaling and a lower limit of intrinsic noise for any mean expression level is found.

    @article{Sevier:2016ve,
    Abstract = {Over the past several decades it has been increasingly recognized that stochastic processes play a central role in transcription. Although many stochastic effects have been explained, the source of transcriptional bursting (one of the most well-known sources of stochasticity) has continued to evade understanding. Recent results have pointed to mechanical feedback as the source of transcriptional bursting, but a reconciliation of this perspective with preexisting views of transcriptional regulation is lacking. In this article, we present a simple phenomenological model that is able to incorporate the traditional view of gene expression within a framework with mechanical limits to transcription. By introducing a simple competition between mechanical arrest and relaxation copy number probability distributions collapse onto a shared universal curve under shifting and rescaling and a lower limit of intrinsic noise for any mean expression level is found.},
    Author = {Sevier, Stuart A and Kessler, David A and Levine, Herbert},
    Date-Added = {2017-01-12 09:37:36 +0000},
    Date-Modified = {2017-01-12 09:37:36 +0000},
    Doi = {10.1073/pnas.1612651113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {bursting noise; supercoiling; topoisomerase; transcription},
    Month = {Dec},
    Number = {49},
    Pages = {13983-13988},
    Pmc = {PMC5150389},
    Pmid = {27911801},
    Pst = {ppublish},
    Title = {Mechanical bounds to transcriptional noise},
    Volume = {113},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1612651113}}

  • C. J. Tokheim, N. Papadopoulos, K. W. Kinzler, B. Vogelstein, and R. Karchin, “Evaluating the evaluation of cancer driver genes,” Proc Natl Acad Sci U S A, vol. 113, iss. 50, pp. 14330-14335, 2016. doi:10.1073/pnas.1616440113

    Sequencing has identified millions of somatic mutations in human cancers, but distinguishing cancer driver genes remains a major challenge. Numerous methods have been developed to identify driver genes, but evaluation of the performance of these methods is hindered by the lack of a gold standard, that is, bona fide driver gene mutations. Here, we establish an evaluation framework that can be applied to driver gene prediction methods. We used this framework to compare the performance of eight such methods. One of these methods, described here, incorporated a machine-learning-based ratiometric approach. We show that the driver genes predicted by each of the eight methods vary widely. Moreover, the P values reported by several of the methods were inconsistent with the uniform values expected, thus calling into question the assumptions that were used to generate them. Finally, we evaluated the potential effects of unexplained variability in mutation rates on false-positive driver gene predictions. Our analysis points to the strengths and weaknesses of each of the currently available methods and offers guidance for improving them in the future.

    @article{Tokheim:2016ly,
    Abstract = {Sequencing has identified millions of somatic mutations in human cancers, but distinguishing cancer driver genes remains a major challenge. Numerous methods have been developed to identify driver genes, but evaluation of the performance of these methods is hindered by the lack of a gold standard, that is, bona fide driver gene mutations. Here, we establish an evaluation framework that can be applied to driver gene prediction methods. We used this framework to compare the performance of eight such methods. One of these methods, described here, incorporated a machine-learning-based ratiometric approach. We show that the driver genes predicted by each of the eight methods vary widely. Moreover, the P values reported by several of the methods were inconsistent with the uniform values expected, thus calling into question the assumptions that were used to generate them. Finally, we evaluated the potential effects of unexplained variability in mutation rates on false-positive driver gene predictions. Our analysis points to the strengths and weaknesses of each of the currently available methods and offers guidance for improving them in the future.},
    Author = {Tokheim, Collin J and Papadopoulos, Nickolas and Kinzler, Kenneth W and Vogelstein, Bert and Karchin, Rachel},
    Date-Added = {2017-01-12 09:31:51 +0000},
    Date-Modified = {2017-01-12 09:31:51 +0000},
    Doi = {10.1073/pnas.1616440113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {DNA sequencing; cancer genomics; cancer mutations; computational method evaluation; driver genes},
    Month = {Dec},
    Number = {50},
    Pages = {14330-14335},
    Pmc = {PMC5167163},
    Pmid = {27911828},
    Pst = {ppublish},
    Title = {Evaluating the evaluation of cancer driver genes},
    Volume = {113},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1616440113}}

  • Z. Lin, C. Yang, Y. Zhu, J. Duchi, Y. Fu, Y. Wang, B. Jiang, M. Zamanighomi, X. Xu, M. Li, N. Sestan, H. Zhao, and W. H. Wong, “Simultaneous dimension reduction and adjustment for confounding variation,” Proc Natl Acad Sci U S A, vol. 113, iss. 51, pp. 14662-14667, 2016. doi:10.1073/pnas.1617317113

    Dimension reduction methods are commonly applied to high-throughput biological datasets. However, the results can be hindered by confounding factors, either biological or technical in origin. In this study, we extend principal component analysis (PCA) to propose AC-PCA for simultaneous dimension reduction and adjustment for confounding (AC) variation. We show that AC-PCA can adjust for (i) variations across individual donors present in a human brain exon array dataset and (ii) variations of different species in a model organism ENCODE RNA sequencing dataset. Our approach is able to recover the anatomical structure of neocortical regions and to capture the shared variation among species during embryonic development. For gene selection purposes, we extend AC-PCA with sparsity constraints and propose and implement an efficient algorithm. The methods developed in this paper can also be applied to more general settings. The R package and MATLAB source code are available at //github.com/linzx06/AC-PCA.

    @article{Lin:2016zr,
    Abstract = {Dimension reduction methods are commonly applied to high-throughput biological datasets. However, the results can be hindered by confounding factors, either biological or technical in origin. In this study, we extend principal component analysis (PCA) to propose AC-PCA for simultaneous dimension reduction and adjustment for confounding (AC) variation. We show that AC-PCA can adjust for (i) variations across individual donors present in a human brain exon array dataset and (ii) variations of different species in a model organism ENCODE RNA sequencing dataset. Our approach is able to recover the anatomical structure of neocortical regions and to capture the shared variation among species during embryonic development. For gene selection purposes, we extend AC-PCA with sparsity constraints and propose and implement an efficient algorithm. The methods developed in this paper can also be applied to more general settings. The R package and MATLAB source code are available at //github.com/linzx06/AC-PCA.},
    Author = {Lin, Zhixiang and Yang, Can and Zhu, Ying and Duchi, John and Fu, Yao and Wang, Yong and Jiang, Bai and Zamanighomi, Mahdi and Xu, Xuming and Li, Mingfeng and Sestan, Nenad and Zhao, Hongyu and Wong, Wing Hung},
    Date-Added = {2017-01-12 09:29:21 +0000},
    Date-Modified = {2017-01-12 09:29:21 +0000},
    Doi = {10.1073/pnas.1617317113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {confounding variation; dimension reduction; transcriptome},
    Month = {Dec},
    Number = {51},
    Pages = {14662-14667},
    Pmc = {PMC5187682},
    Pmid = {27930330},
    Pst = {ppublish},
    Title = {Simultaneous dimension reduction and adjustment for confounding variation},
    Volume = {113},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1617317113}}

  • S. Raguideau, S. Plancade, N. Pons, M. Leclerc, and B. Laroche, “Inferring aggregated functional traits from metagenomic data using constrained non-negative matrix factorization: application to fiber degradation in the human gut microbiota,” PLoS Comput Biol, vol. 12, iss. 12, p. e1005252, 2016. doi:10.1371/journal.pcbi.1005252

    Whole Genome Shotgun (WGS) metagenomics is increasingly used to study the structure and functions of complex microbial ecosystems, both from the taxonomic and functional point of view. Gene inventories of otherwise uncultured microbial communities make the direct functional profiling of microbial communities possible. The concept of community aggregated trait has been adapted from environmental and plant functional ecology to the framework of microbial ecology. Community aggregated traits are quantified from WGS data by computing the abundance of relevant marker genes. They can be used to study key processes at the ecosystem level and correlate environmental factors and ecosystem functions. In this paper we propose a novel model based approach to infer combinations of aggregated traits characterizing specific ecosystemic metabolic processes. We formulate a model of these Combined Aggregated Functional Traits (CAFTs) accounting for a hierarchical structure of genes, which are associated on microbial genomes, further linked at the ecosystem level by complex co-occurrences or interactions. The model is completed with constraints specifically designed to exploit available genomic information, in order to favor biologically relevant CAFTs. The CAFTs structure, as well as their intensity in the ecosystem, is obtained by solving a constrained Non-negative Matrix Factorization (NMF) problem. We developed a multicriteria selection procedure for the number of CAFTs. We illustrated our method on the modelling of ecosystemic functional traits of fiber degradation by the human gut microbiota. We used 1408 samples of gene abundances from several high-throughput sequencing projects and found that four CAFTs only were needed to represent the fiber degradation potential. This data reduction highlighted biologically consistent functional patterns while providing a high quality preservation of the original data. 
    Our method is generic and can be applied to other metabolic processes in the gut or in other ecosystems.

    @article{Raguideau:2016ys,
    Abstract = {Whole Genome Shotgun (WGS) metagenomics is increasingly used to study the structure and functions of complex microbial ecosystems, both from the taxonomic and functional point of view. Gene inventories of otherwise uncultured microbial communities make the direct functional profiling of microbial communities possible. The concept of community aggregated trait has been adapted from environmental and plant functional ecology to the framework of microbial ecology. Community aggregated traits are quantified from WGS data by computing the abundance of relevant marker genes. They can be used to study key processes at the ecosystem level and correlate environmental factors and ecosystem functions. In this paper we propose a novel model based approach to infer combinations of aggregated traits characterizing specific ecosystemic metabolic processes. We formulate a model of these Combined Aggregated Functional Traits (CAFTs) accounting for a hierarchical structure of genes, which are associated on microbial genomes, further linked at the ecosystem level by complex co-occurrences or interactions. The model is completed with constraints specifically designed to exploit available genomic information, in order to favor biologically relevant CAFTs. The CAFTs structure, as well as their intensity in the ecosystem, is obtained by solving a constrained Non-negative Matrix Factorization (NMF) problem. We developed a multicriteria selection procedure for the number of CAFTs. We illustrated our method on the modelling of ecosystemic functional traits of fiber degradation by the human gut microbiota. We used 1408 samples of gene abundances from several high-throughput sequencing projects and found that four CAFTs only were needed to represent the fiber degradation potential. This data reduction highlighted biologically consistent functional patterns while providing a high quality preservation of the original data. 
    Our method is generic and can be applied to other metabolic processes in the gut or in other ecosystems.},
    Author = {Raguideau, S{\'e}bastien and Plancade, Sandra and Pons, Nicolas and Leclerc, Marion and Laroche, B{\'e}atrice},
    Date-Added = {2017-01-12 09:19:55 +0000},
    Date-Modified = {2017-01-12 09:19:55 +0000},
    Doi = {10.1371/journal.pcbi.1005252},
    Journal = {PLoS Comput Biol},
    Journal-Full = {PLoS computational biology},
    Month = {Dec},
    Number = {12},
    Pages = {e1005252},
    Pmc = {PMC5161307},
    Pmid = {27984592},
    Pst = {epublish},
    Title = {Inferring Aggregated Functional Traits from Metagenomic Data Using Constrained Non-negative Matrix Factorization: Application to Fiber Degradation in the Human Gut Microbiota},
    Volume = {12},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1371/journal.pcbi.1005252}}

  • L. Cardelli, R. D. Hernansaiz-Ballesteros, N. Dalchau, and A. Csikász-Nagy, “Efficient switches in biology and computer science,” PLoS Comput Biol, vol. 13, iss. 1, p. e1005100, 2017. doi:10.1371/journal.pcbi.1005100
    @article{Cardelli:2017vn,
    Author = {Cardelli, Luca and Hernansaiz-Ballesteros, Rosa D and Dalchau, Neil and Csik{\'a}sz-Nagy, Attila},
    Date-Added = {2017-01-12 09:15:36 +0000},
    Date-Modified = {2017-01-12 09:15:36 +0000},
    Doi = {10.1371/journal.pcbi.1005100},
    Journal = {PLoS Comput Biol},
    Journal-Full = {PLoS computational biology},
    Month = {Jan},
    Number = {1},
    Pages = {e1005100},
    Pmid = {28056093},
    Pst = {epublish},
    Title = {Efficient Switches in Biology and Computer Science},
    Volume = {13},
    Year = {2017},
    Bdsk-Url-1 = {//dx.doi.org/10.1371/journal.pcbi.1005100}}

  • K. R. Campbell and C. Yau, “switchde: inference of switch-like differential expression along single-cell trajectories,” Bioinformatics, 2016. doi:10.1093/bioinformatics/btw798

    MOTIVATION: Pseudotime analyses of single-cell RNA-seq data have become increasingly common. Typically, a latent trajectory corresponding to a biological process of interest – such as differentiation or cell cycle – is discovered. However, relatively little attention has been paid to modelling the differential expression of genes along such trajectories. RESULTS: We present switchde, a statistical framework and accompanying R package for identifying switch-like differential expression of genes along pseudotemporal trajectories. Our method includes fast model fitting that provides interpretable parameter estimates corresponding to how quickly a gene is up or down regulated as well as where in the trajectory such regulation occurs. It also reports a p-value in favour of rejecting a constant-expression model for switch-like differential expression and optionally models the zero-inflation prevalent in single-cell data. AVAILABILITY: The R package switchde is available at //www.github.com/kieranrcampbell/switchde CONTACT: kieran.campbell@sjc.ox.ac.uk SUPPLEMENTARY INFORMATION: Supplementary text is available at Bioinformatics online.

    @article{Campbell:2016kx,
    Abstract = {MOTIVATION: Pseudotime analyses of single-cell RNA-seq data have become increasingly common. Typically, a latent trajectory corresponding to a biological process of interest - such as differentiation or cell cycle - is discovered. However, relatively little attention has been paid to modelling the differential expression of genes along such trajectories.
    RESULTS: We present switchde, a statistical framework and accompanying R package for identifying switch-like differential expression of genes along pseudotemporal trajectories. Our method includes fast model fitting that provides interpretable parameter estimates corresponding to how quickly a gene is up or down regulated as well as where in the trajectory such regulation occurs. It also reports a p-value in favour of rejecting a constant-expression model for switch-like differential expression and optionally models the zero-inflation prevalent in single-cell data.
    AVAILABILITY: The R package switchde is available at //www.github.com/kieranrcampbell/switchde CONTACT: kieran.campbell@sjc.ox.ac.uk SUPPLEMENTARY INFORMATION: Supplementary text is available at Bioinformatics online.},
    Author = {Campbell, Kieran R and Yau, Christopher},
    Date-Added = {2017-01-11 21:31:57 +0000},
    Date-Modified = {2017-01-11 21:31:57 +0000},
    Doi = {10.1093/bioinformatics/btw798},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {Dec},
    Pmid = {28011787},
    Pst = {aheadofprint},
    Title = {switchde: Inference of switch-like differential expression along single-cell trajectories},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw798}}
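
    As a reading aid for the abstract above, here is a minimal sketch of what a "switch-like" expression curve along pseudotime can look like. It assumes a sigmoid mean function with three illustrative parameters: `mu0` (half of the peak expression), `k` (how fast the gene switches, with the sign giving up or down regulation) and `t0` (where in the trajectory the switch occurs). The function name and parameter names are hypothetical and are not taken from the switchde package API.

```python
import math


def sigmoid_mean(t, mu0, k, t0):
    """Assumed switch-like mean expression at pseudotime t.

    mu0: half of the peak expression (illustrative name)
    k:   switch strength; k > 0 means up-regulation, k < 0 down-regulation
    t0:  pseudotime at which expression reaches mu0 (the switch point)
    """
    return 2.0 * mu0 / (1.0 + math.exp(-k * (t - t0)))


if __name__ == "__main__":
    # Evaluate an up-regulated gene on a small pseudotime grid in [0, 1].
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(t, round(sigmoid_mean(t, mu0=3.0, k=10.0, t0=0.5), 3))
```

    A larger absolute value of `k` gives a sharper switch, and `k = 0` reduces the curve to constant expression, which is the kind of null model the abstract says is tested against.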

  • C. Maier, C. Loos, and J. Hasenauer, “Robust parameter estimation for dynamical systems from outlier-corrupted data,” Bioinformatics, 2017. doi:10.1093/bioinformatics/btw703

    MOTIVATION: Dynamics of cellular processes are often studied using mechanistic mathematical models. These models possess unknown parameters which are generally estimated from experimental data assuming normally distributed measurement noise. Outlier corruption of datasets often cannot be avoided. These outliers may distort the parameter estimates, resulting in incorrect model predictions. Robust parameter estimation methods are required which provide reliable parameter estimates in the presence of outliers. RESULTS: In this manuscript, we propose and evaluate methods for estimating the parameters of ordinary differential equation models from outlier-corrupted data. As alternatives to the normal distribution as noise distribution, we consider the Laplace, the Huber, the Cauchy and the Student’s t distribution. We assess accuracy, robustness and computational efficiency of estimators using these different distribution assumptions. To this end, we consider artificial data of a conversion process, as well as published experimental data for Epo-induced JAK/STAT signaling. We study how well the methods can compensate and discover artificially introduced outliers. Our evaluation reveals that using alternative distributions improves the robustness of parameter estimates. AVAILABILITY AND IMPLEMENTATION: The MATLAB implementation of the likelihood functions using the distribution assumptions is available at Bioinformatics online. CONTACT: jan.hasenauer@helmholtz-muenchen.de SUPPLEMENTARY INFORMATION: Supplementary material are available at Bioinformatics online.

    @article{Maier:2017uq,
    Abstract = {MOTIVATION: Dynamics of cellular processes are often studied using mechanistic mathematical models. These models possess unknown parameters which are generally estimated from experimental data assuming normally distributed measurement noise. Outlier corruption of datasets often cannot be avoided. These outliers may distort the parameter estimates, resulting in incorrect model predictions. Robust parameter estimation methods are required which provide reliable parameter estimates in the presence of outliers.
    RESULTS: In this manuscript, we propose and evaluate methods for estimating the parameters of ordinary differential equation models from outlier-corrupted data. As alternatives to the normal distribution as noise distribution, we consider the Laplace, the Huber, the Cauchy and the Student's t distribution. We assess accuracy, robustness and computational efficiency of estimators using these different distribution assumptions. To this end, we consider artificial data of a conversion process, as well as published experimental data for Epo-induced JAK/STAT signaling. We study how well the methods can compensate and discover artificially introduced outliers. Our evaluation reveals that using alternative distributions improves the robustness of parameter estimates.
    AVAILABILITY AND IMPLEMENTATION: The MATLAB implementation of the likelihood functions using the distribution assumptions is available at Bioinformatics online.
    CONTACT: jan.hasenauer@helmholtz-muenchen.de SUPPLEMENTARY INFORMATION: Supplementary material are available at Bioinformatics online.},
    Author = {Maier, Corinna and Loos, Carolin and Hasenauer, Jan},
    Date-Added = {2017-01-11 21:28:42 +0000},
    Date-Modified = {2017-01-11 21:28:42 +0000},
    Doi = {10.1093/bioinformatics/btw703},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {Jan},
    Pmid = {28062444},
    Pst = {aheadofprint},
    Title = {Robust parameter estimation for dynamical systems from outlier-corrupted data},
    Year = {2017},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw703}}
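
    To illustrate why the choice of noise distribution matters for outliers, the sketch below compares negative log-likelihoods under normal, Laplace and Cauchy noise, and estimates a single location parameter from a toy dataset with one outlier. This is a simplified stand-in, not the authors' MATLAB implementation: the toy data, the fixed scale parameters and all function names are assumptions made for illustration, and the ODE model of the paper is replaced by a constant.

```python
import math
import statistics


def normal_nll(residuals, sigma=1.0):
    """Negative log-likelihood of residuals under N(0, sigma^2) noise."""
    return sum(0.5 * math.log(2 * math.pi * sigma ** 2) + r ** 2 / (2 * sigma ** 2)
               for r in residuals)


def laplace_nll(residuals, b=1.0):
    """Negative log-likelihood under Laplace(0, b) noise (linear penalty)."""
    return sum(math.log(2 * b) + abs(r) / b for r in residuals)


def cauchy_nll(residuals, gamma=1.0):
    """Negative log-likelihood under Cauchy(0, gamma) noise (logarithmic penalty)."""
    return sum(math.log(math.pi * gamma) + math.log(1 + (r / gamma) ** 2)
               for r in residuals)


# Toy measurements of a constant whose true value is 1.0; the last point is an outlier.
data = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0]

# For a constant model, the maximum-likelihood estimate is the mean under
# normal noise and the median under Laplace noise.
normal_estimate = statistics.mean(data)
laplace_estimate = statistics.median(data)

if __name__ == "__main__":
    print("normal-noise estimate:", normal_estimate)
    print("Laplace-noise estimate:", laplace_estimate)
    print("penalty for a residual of 9:",
          normal_nll([9.0]), laplace_nll([9.0]), cauchy_nll([9.0]))
```

    The quadratic penalty of the normal likelihood lets a single outlier pull the estimate far from the bulk of the data, while the heavier-tailed alternatives penalize the same residual much less.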

  • X. Zhang and S. Liu, “RBPPred: predicting RNA-binding proteins from sequence using SVM,” Bioinformatics, 2016. doi:10.1093/bioinformatics/btw730

    MOTIVATION: Detection of RNA-binding proteins (RBPs) is essential since the RNA-binding proteins play critical roles in post-transcriptional regulation and have diverse roles in various biological processes. Moreover, identifying RBPs by computational prediction is much more efficient than experimental methods and may have guiding significance on the experiment design. RESULTS: In this study, we present the RBPPred (an RNA-binding protein predictor), a new method based on the support vector machine, to predict whether a protein binds RNAs, based on a comprehensive feature representation. By integrating the physicochemical properties with the evolutionary information of protein sequences, the new approach RBPPred performed much better than state-of-the-art methods. The results show that RBPPred correctly predicted 83% of 2780 RBPs and 96% out of 7093 non-RBPs with MCC of 0.808 using the 10-fold cross validation. Furthermore, we achieved a sensitivity of 84%, specificity of 97% and MCC of 0.788 on the testing set of human proteome. In addition we tested the capability of RBPPred to identify new RBPs, which further confirmed the practicability and predictability of the method. AVAILABILITY AND IMPLEMENTATION: RBPPred program can be accessed at: //rnabinding.com/RBPPred.html CONTACT: liushiyong@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    @article{Zhang:2016fk,
    Abstract = {MOTIVATION: Detection of RNA-binding proteins (RBPs) is essential since the RNA-binding proteins play critical roles in post-transcriptional regulation and have diverse roles in various biological processes. Moreover, identifying RBPs by computational prediction is much more efficient than experimental methods and may have guiding significance on the experiment design.
    RESULTS: In this study, we present the RBPPred (an RNA-binding protein predictor), a new method based on the support vector machine, to predict whether a protein binds RNAs, based on a comprehensive feature representation. By integrating the physicochemical properties with the evolutionary information of protein sequences, the new approach RBPPred performed much better than state-of-the-art methods. The results show that RBPPred correctly predicted 83% of 2780 RBPs and 96% out of 7093 non-RBPs with MCC of 0.808 using the 10-fold cross validation. Furthermore, we achieved a sensitivity of 84%, specificity of 97% and MCC of 0.788 on the testing set of human proteome. In addition we tested the capability of RBPPred to identify new RBPs, which further confirmed the practicability and predictability of the method.
    AVAILABILITY AND IMPLEMENTATION: RBPPred program can be accessed at: //rnabinding.com/RBPPred.html CONTACT: liushiyong@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.},
    Author = {Zhang, Xiaoli and Liu, Shiyong},
    Date-Added = {2017-01-11 21:27:06 +0000},
    Date-Modified = {2017-01-11 21:27:06 +0000},
    Doi = {10.1093/bioinformatics/btw730},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {Dec},
    Pmid = {27993780},
    Pst = {aheadofprint},
    Title = {RBPPred: predicting RNA-binding proteins from sequence using SVM},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw730}}
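
    The cross-validation figures quoted in the abstract can be related to each other through the Matthews correlation coefficient (MCC). The sketch below rebuilds an approximate confusion matrix from the reported percentages (83% of 2780 RBPs and 96% of 7093 non-RBPs) and computes the MCC from it. Because the percentages are rounded, the counts are approximations rather than the authors' actual numbers, and the function name is illustrative.

```python
import math


def matthews_corrcoef(tp, fn, tn, fp):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Approximate counts reconstructed from the rounded percentages in the abstract.
n_rbp, n_non_rbp = 2780, 7093
tp = 0.83 * n_rbp        # RBPs predicted as RBPs
fn = n_rbp - tp          # RBPs missed
tn = 0.96 * n_non_rbp    # non-RBPs predicted as non-RBPs
fp = n_non_rbp - tn      # non-RBPs predicted as RBPs

mcc = matthews_corrcoef(tp, fn, tn, fp)

if __name__ == "__main__":
    print("approximate MCC:", round(mcc, 3))
```

    The reconstructed value should land close to the reported 0.808, with any small difference attributable to rounding of the percentages.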

publications 2016-11-23

In this group meeting, we quickly discussed these latest papers:

  • D. Kleftogiannis, P. Kalnis, E. Arner, and V. B. Bajic, “Discriminative identification of transcriptional responses of promoters and enhancers after stimulus,” Nucleic Acids Research, p. gkw1015, 2016.
    @article{kleftogiannis2016discriminative,
    title={Discriminative identification of transcriptional responses of promoters and enhancers after stimulus},
    author={Kleftogiannis, Dimitrios and Kalnis, Panos and Arner, Erik and Bajic, Vladimir B},
    journal={Nucleic Acids Research},
    pages={gkw1015},
    year={2016},
    publisher={Oxford Univ Press}
    }

  • A. F. Siahpirani and S. Roy, “A prior-based integrative framework for functional transcriptional regulatory network inference,” Nucleic Acids Research, p. gkw963, 2016.
    @article{siahpirani2016prior,
    title={A prior-based integrative framework for functional transcriptional regulatory network inference},
    author={Siahpirani, Alireza F and Roy, Sushmita},
    journal={Nucleic Acids Research},
    pages={gkw963},
    year={2016},
    publisher={Oxford Univ Press}
    }

  • A. Wibisono, A. C. Wilson, and M. I. Jordan, “A variational perspective on accelerated methods in optimization,” Proc Natl Acad Sci U S A, 2016. doi:10.1073/pnas.1614734113

    Accelerated gradient methods play a central role in optimization, achieving optimal rates in many settings. Although many generalizations and extensions of Nesterov’s original acceleration method have been proposed, it is not yet clear what is the natural scope of the acceleration concept. In this paper, we study accelerated methods from a continuous-time perspective. We show that there is a Lagrangian functional that we call the Bregman Lagrangian, which generates a large class of accelerated methods in continuous time, including (but not limited to) accelerated gradient descent, its non-Euclidean extension, and accelerated higher-order gradient methods. We show that the continuous-time limit of all of these methods corresponds to traveling the same curve in spacetime at different speeds. From this perspective, Nesterov’s technique and many of its generalizations can be viewed as a systematic way to go from the continuous-time curves generated by the Bregman Lagrangian to a family of discrete-time accelerated algorithms.

    @article{Wibisono:2016kl,
    Abstract = {Accelerated gradient methods play a central role in optimization, achieving optimal rates in many settings. Although many generalizations and extensions of Nesterov's original acceleration method have been proposed, it is not yet clear what is the natural scope of the acceleration concept. In this paper, we study accelerated methods from a continuous-time perspective. We show that there is a Lagrangian functional that we call the Bregman Lagrangian, which generates a large class of accelerated methods in continuous time, including (but not limited to) accelerated gradient descent, its non-Euclidean extension, and accelerated higher-order gradient methods. We show that the continuous-time limit of all of these methods corresponds to traveling the same curve in spacetime at different speeds. From this perspective, Nesterov's technique and many of its generalizations can be viewed as a systematic way to go from the continuous-time curves generated by the Bregman Lagrangian to a family of discrete-time accelerated algorithms.},
    Author = {Wibisono, Andre and Wilson, Ashia C and Jordan, Michael I},
    Date-Added = {2016-11-23 20:48:30 +0000},
    Date-Modified = {2016-11-23 20:48:30 +0000},
    Doi = {10.1073/pnas.1614734113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {Bregman divergence; Lagrangian framework; accelerated methods; convex optimization; mirror descent},
    Month = {Nov},
    Pmid = {27834219},
    Pst = {aheadofprint},
    Title = {A variational perspective on accelerated methods in optimization},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1614734113}}
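
    For readers unfamiliar with the discrete-time algorithms the abstract refers to, the sketch below contrasts plain gradient descent with a standard form of Nesterov's accelerated gradient method on a small ill-conditioned quadratic. It is a textbook illustration, not the Bregman Lagrangian construction of the paper: the test function, the step size and the momentum schedule (k - 1) / (k + 2) are assumptions chosen for the example.

```python
def objective(x, scales):
    """Quadratic f(x) = 0.5 * sum(a_i * x_i^2), minimized at x = 0."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(scales, x))


def grad(x, scales):
    """Gradient of the quadratic objective."""
    return [a * xi for a, xi in zip(scales, x)]


def gradient_descent(x0, scales, step, n_steps):
    """Plain gradient descent with a fixed step size."""
    x = list(x0)
    for _ in range(n_steps):
        g = grad(x, scales)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def nesterov(x0, scales, step, n_steps):
    """Nesterov's accelerated gradient with momentum (k - 1) / (k + 2)."""
    x_prev = list(x0)
    x = list(x0)
    for k in range(1, n_steps + 1):
        beta = (k - 1) / (k + 2)
        # Extrapolate along the previous displacement, then take a gradient step.
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        g = grad(y, scales)
        x_prev, x = x, [yi - step * gi for yi, gi in zip(y, g)]
    return x


if __name__ == "__main__":
    scales = [1.0, 0.01]   # curvatures differ by a factor of 100
    x0 = [1.0, 1.0]
    step = 1.0             # 1 / L, where L = 1.0 is the largest curvature
    print("gradient descent:", objective(gradient_descent(x0, scales, step, 50), scales))
    print("Nesterov:        ", objective(nesterov(x0, scales, step, 50), scales))
```

    On this example the accelerated iterate makes much faster progress along the low-curvature direction, which is the behaviour the continuous-time analysis in the paper sets out to explain.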

  • K. R. Campbell and C. Yau, “Order under uncertainty: robust differential expression analysis using probabilistic models for pseudotime inference,” PLoS Comput Biol, vol. 12, iss. 11, p. e1005212, 2016. doi:10.1371/journal.pcbi.1005212

    Single cell gene expression profiling can be used to quantify transcriptional dynamics in temporal processes, such as cell differentiation, using computational methods to label each cell with a ‘pseudotime’ where true time series experimentation is too difficult to perform. However, owing to the high variability in gene expression between individual cells, there is an inherent uncertainty in the precise temporal ordering of the cells. Pre-existing methods for pseudotime estimation have predominantly given point estimates precluding a rigorous analysis of the implications of uncertainty. We use probabilistic modelling techniques to quantify pseudotime uncertainty and propagate this into downstream differential expression analysis. We demonstrate that reliance on a point estimate of pseudotime can lead to inflated false discovery rates and that probabilistic approaches provide greater robustness and measures of the temporal resolution that can be obtained from pseudotime inference.

    @article{Campbell:2016oq,
    Abstract = {Single cell gene expression profiling can be used to quantify transcriptional dynamics in temporal processes, such as cell differentiation, using computational methods to label each cell with a 'pseudotime' where true time series experimentation is too difficult to perform. However, owing to the high variability in gene expression between individual cells, there is an inherent uncertainty in the precise temporal ordering of the cells. Pre-existing methods for pseudotime estimation have predominantly given point estimates precluding a rigorous analysis of the implications of uncertainty. We use probabilistic modelling techniques to quantify pseudotime uncertainty and propagate this into downstream differential expression analysis. We demonstrate that reliance on a point estimate of pseudotime can lead to inflated false discovery rates and that probabilistic approaches provide greater robustness and measures of the temporal resolution that can be obtained from pseudotime inference.},
    Author = {Campbell, Kieran R and Yau, Christopher},
    Date-Added = {2016-11-23 20:43:44 +0000},
    Date-Modified = {2016-11-23 20:43:44 +0000},
    Doi = {10.1371/journal.pcbi.1005212},
    Journal = {PLoS Comput Biol},
    Journal-Full = {PLoS computational biology},
    Month = {Nov},
    Number = {11},
    Pages = {e1005212},
    Pmid = {27870852},
    Pst = {epublish},
    Title = {Order Under Uncertainty: Robust Differential Expression Analysis Using Probabilistic Models for Pseudotime Inference},
    Volume = {12},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1371/journal.pcbi.1005212}}

  • I. Tirosh, A. S. Venteicher, C. Hebert, L. E. Escalante, A. P. Patel, K. Yizhak, J. M. Fisher, C. Rodman, C. Mount, M. G. Filbin, C. Neftel, N. Desai, J. Nyman, B. Izar, C. C. Luo, J. M. Francis, A. A. Patel, M. L. Onozato, N. Riggi, K. J. Livak, D. Gennert, R. Satija, B. V. Nahed, W. T. Curry, R. L. Martuza, R. Mylvaganam, J. A. Iafrate, M. P. Frosch, T. R. Golub, M. N. Rivera, G. Getz, O. Rozenblatt-Rosen, D. P. Cahill, M. Monje, B. E. Bernstein, D. N. Louis, A. Regev, and M. L. Suvà, “Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma,” Nature, vol. 539, iss. 7628, pp. 309-313, 2016. doi:10.1038/nature20123

    Although human tumours are shaped by the genetic evolution of cancer cells, evidence also suggests that they display hierarchies related to developmental pathways and epigenetic programs in which cancer stem cells (CSCs) can drive tumour growth and give rise to differentiated progeny. Yet, unbiased evidence for CSCs in solid human malignancies remains elusive. Here we profile 4,347 single cells from six IDH1 or IDH2 mutant human oligodendrogliomas by RNA sequencing (RNA-seq) and reconstruct their developmental programs from genome-wide expression signatures. We infer that most cancer cells are differentiated along two specialized glial programs, whereas a rare subpopulation of cells is undifferentiated and associated with a neural stem cell expression program. Cells with expression signatures for proliferation are highly enriched in this rare subpopulation, consistent with a model in which CSCs are primarily responsible for fuelling the growth of oligodendroglioma in humans. Analysis of copy number variation (CNV) shows that distinct CNV sub-clones within tumours display similar cellular hierarchies, suggesting that the architecture of oligodendroglioma is primarily dictated by developmental programs. Subclonal point mutation analysis supports a similar model, although a full phylogenetic tree would be required to definitively determine the effect of genetic evolution on the inferred hierarchies. Our single-cell analyses provide insight into the cellular architecture of oligodendrogliomas at single-cell resolution and support the cancer stem cell model, with substantial implications for disease management.

    @article{Tirosh:2016nx,
    Abstract = {Although human tumours are shaped by the genetic evolution of cancer cells, evidence also suggests that they display hierarchies related to developmental pathways and epigenetic programs in which cancer stem cells (CSCs) can drive tumour growth and give rise to differentiated progeny. Yet, unbiased evidence for CSCs in solid human malignancies remains elusive. Here we profile 4,347 single cells from six IDH1 or IDH2 mutant human oligodendrogliomas by RNA sequencing (RNA-seq) and reconstruct their developmental programs from genome-wide expression signatures. We infer that most cancer cells are differentiated along two specialized glial programs, whereas a rare subpopulation of cells is undifferentiated and associated with a neural stem cell expression program. Cells with expression signatures for proliferation are highly enriched in this rare subpopulation, consistent with a model in which CSCs are primarily responsible for fuelling the growth of oligodendroglioma in humans. Analysis of copy number variation (CNV) shows that distinct CNV sub-clones within tumours display similar cellular hierarchies, suggesting that the architecture of oligodendroglioma is primarily dictated by developmental programs. Subclonal point mutation analysis supports a similar model, although a full phylogenetic tree would be required to definitively determine the effect of genetic evolution on the inferred hierarchies. Our single-cell analyses provide insight into the cellular architecture of oligodendrogliomas at single-cell resolution and support the cancer stem cell model, with substantial implications for disease management.},
    Author = {Tirosh, Itay and Venteicher, Andrew S and Hebert, Christine and Escalante, Leah E and Patel, Anoop P and Yizhak, Keren and Fisher, Jonathan M and Rodman, Christopher and Mount, Christopher and Filbin, Mariella G and Neftel, Cyril and Desai, Niyati and Nyman, Jackson and Izar, Benjamin and Luo, Christina C and Francis, Joshua M and Patel, Aanand A and Onozato, Maristela L and Riggi, Nicolo and Livak, Kenneth J and Gennert, Dave and Satija, Rahul and Nahed, Brian V and Curry, William T and Martuza, Robert L and Mylvaganam, Ravindra and Iafrate, A John and Frosch, Matthew P and Golub, Todd R and Rivera, Miguel N and Getz, Gad and Rozenblatt-Rosen, Orit and Cahill, Daniel P and Monje, Michelle and Bernstein, Bradley E and Louis, David N and Regev, Aviv and Suv{\`a}, Mario L},
    Date-Added = {2016-11-23 20:39:53 +0000},
    Date-Modified = {2016-11-23 20:39:53 +0000},
    Doi = {10.1038/nature20123},
    Journal = {Nature},
    Journal-Full = {Nature},
    Month = {Nov},
    Number = {7628},
    Pages = {309-313},
    Pmid = {27806376},
    Pst = {aheadofprint},
    Title = {Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma},
    Volume = {539},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1038/nature20123}}

  • D. P. Dever, R. O. Bak, A. Reinisch, J. Camarena, G. Washington, C. E. Nicolas, M. Pavel-Dinu, N. Saxena, A. B. Wilkens, S. Mantri, N. Uchida, A. Hendel, A. Narla, R. Majeti, K. I. Weinberg, and M. H. Porteus, “CRISPR/Cas9 β-globin gene targeting in human haematopoietic stem cells,” Nature, vol. 539, iss. 7629, pp. 384-389, 2016. doi:10.1038/nature20134
    [BibTeX] [Abstract]

    The β-haemoglobinopathies, such as sickle cell disease and β-thalassaemia, are caused by mutations in the β-globin (HBB) gene and affect millions of people worldwide. Ex vivo gene correction in patient-derived haematopoietic stem cells followed by autologous transplantation could be used to cure β-haemoglobinopathies. Here we present a CRISPR/Cas9 gene-editing system that combines Cas9 ribonucleoproteins and adeno-associated viral vector delivery of a homologous donor to achieve homologous recombination at the HBB gene in haematopoietic stem cells. Notably, we devise an enrichment model to purify a population of haematopoietic stem and progenitor cells with more than 90% targeted integration. We also show efficient correction of the Glu6Val mutation responsible for sickle cell disease by using patient-derived stem and progenitor cells that, after differentiation into erythrocytes, express adult β-globin (HbA) messenger RNA, which confirms intact transcriptional regulation of edited HBB alleles. Collectively, these preclinical studies outline a CRISPR-based methodology for targeting haematopoietic stem cells by homologous recombination at the HBB locus to advance the development of next-generation therapies for β-haemoglobinopathies.

    @article{Dever:2016cr,
    Abstract = {The β-haemoglobinopathies, such as sickle cell disease and β-thalassaemia, are caused by mutations in the β-globin (HBB) gene and affect millions of people worldwide. Ex vivo gene correction in patient-derived haematopoietic stem cells followed by autologous transplantation could be used to cure β-haemoglobinopathies. Here we present a CRISPR/Cas9 gene-editing system that combines Cas9 ribonucleoproteins and adeno-associated viral vector delivery of a homologous donor to achieve homologous recombination at the HBB gene in haematopoietic stem cells. Notably, we devise an enrichment model to purify a population of haematopoietic stem and progenitor cells with more than 90% targeted integration. We also show efficient correction of the Glu6Val mutation responsible for sickle cell disease by using patient-derived stem and progenitor cells that, after differentiation into erythrocytes, express adult β-globin (HbA) messenger RNA, which confirms intact transcriptional regulation of edited HBB alleles. Collectively, these preclinical studies outline a CRISPR-based methodology for targeting haematopoietic stem cells by homologous recombination at the HBB locus to advance the development of next-generation therapies for β-haemoglobinopathies.},
    Author = {Dever, Daniel P and Bak, Rasmus O and Reinisch, Andreas and Camarena, Joab and Washington, Gabriel and Nicolas, Carmencita E and Pavel-Dinu, Mara and Saxena, Nivi and Wilkens, Alec B and Mantri, Sruthi and Uchida, Nobuko and Hendel, Ayal and Narla, Anupama and Majeti, Ravindra and Weinberg, Kenneth I and Porteus, Matthew H},
    Date-Added = {2016-11-23 20:36:13 +0000},
    Date-Modified = {2016-11-23 20:36:13 +0000},
    Doi = {10.1038/nature20134},
    Journal = {Nature},
    Journal-Full = {Nature},
    Month = {Nov},
    Number = {7629},
    Pages = {384-389},
    Pmid = {27820943},
    Pst = {aheadofprint},
    Title = {CRISPR/Cas9 β-globin gene targeting in human haematopoietic stem cells},
    Volume = {539},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1038/nature20134}}

  • M. Norris, C. K. Kwok, J. Cheema, M. Hartley, R. J. Morris, S. Aviran, and Y. Ding, “FoldAtlas: a repository for genome-wide RNA structure probing data,” Bioinformatics, 2016. doi:10.1093/bioinformatics/btw611
    [BibTeX] [Abstract]

    Most RNA molecules form internal base pairs, leading to a folded secondary structure. Some of these structures have been demonstrated to be functionally significant. High-throughput RNA structure chemical probing methods generate millions of sequencing reads to provide structural constraints for RNA secondary structure prediction. At present, processed data from these experiments are difficult to access without computational expertise. Here we present FoldAtlas, a web interface for accessing raw and processed structural data across thousands of transcripts. FoldAtlas allows a researcher to easily locate, view, and retrieve probing data for a given RNA molecule. We also provide in silico and in vivo secondary structure predictions for comparison, visualized in the browser as circle plots and topology diagrams. Data currently integrated into FoldAtlas are from a new high-depth Structure-seq data analysis in Arabidopsis thaliana, released with this work. AVAILABILITY AND IMPLEMENTATION: The FoldAtlas website can be accessed at www.foldatlas.com. Source code is freely available at github.com/mnori/foldatlas under the MIT license. Raw reads data are available under the NCBI SRA accession SRP066985. CONTACT: yiliang.ding@jic.ac.uk or matthew.norris@jic.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    @article{Norris:2016dq,
    Abstract = {Most RNA molecules form internal base pairs, leading to a folded secondary structure. Some of these structures have been demonstrated to be functionally significant. High-throughput RNA structure chemical probing methods generate millions of sequencing reads to provide structural constraints for RNA secondary structure prediction. At present, processed data from these experiments are difficult to access without computational expertise. Here we present FoldAtlas, a web interface for accessing raw and processed structural data across thousands of transcripts. FoldAtlas allows a researcher to easily locate, view, and retrieve probing data for a given RNA molecule. We also provide in silico and in vivo secondary structure predictions for comparison, visualized in the browser as circle plots and topology diagrams. Data currently integrated into FoldAtlas are from a new high-depth Structure-seq data analysis in Arabidopsis thaliana, released with this work.
    AVAILABILITY AND IMPLEMENTATION: The FoldAtlas website can be accessed at www.foldatlas.com. Source code is freely available at github.com/mnori/foldatlas under the MIT license. Raw reads data are available under the NCBI SRA accession SRP066985.
    CONTACT: yiliang.ding@jic.ac.uk or matthew.norris@jic.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.},
    Author = {Norris, Matthew and Kwok, Chun Kit and Cheema, Jitender and Hartley, Matthew and Morris, Richard J and Aviran, Sharon and Ding, Yiliang},
    Date-Added = {2016-11-23 20:27:39 +0000},
    Date-Modified = {2016-11-23 20:27:39 +0000},
    Doi = {10.1093/bioinformatics/btw611},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {Sep},
    Pmid = {27663500},
    Pst = {aheadofprint},
    Title = {FoldAtlas: a repository for genome-wide RNA structure probing data},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw611}}

  • Y. Kato, T. Mori, K. Sato, S. Maegawa, H. Hosokawa, and T. Akutsu, “An accessibility-incorporated method for accurate prediction of RNA-RNA interactions from sequence data,” Bioinformatics, 2016. doi:10.1093/bioinformatics/btw603
    [BibTeX] [Abstract]

    MOTIVATION: RNA-RNA interactions via base pairing play a vital role in the post-transcriptional regulation of gene expression. Efficient identification of targets for such regulatory RNAs needs not only discriminative power for positive and negative RNA-RNA interacting sequence data but also accurate prediction of interaction sites from positive data. Recently, a few studies have incorporated interaction site accessibility into their prediction methods, indicating the enhancement of predictive performance on limited positive data. RESULTS: Here we show the efficacy of our accessibility-based prediction model RactIPAce on newly compiled datasets. The first experiment in interaction site prediction shows that RactIPAce achieves the best predictive performance on the newly compiled dataset of experimentally verified interactions in the literature as compared with the state-of-the-art methods. In addition, the second experiment in discrimination between positive and negative interacting pairs reveals that the combination of accessibility-based methods including our approach can be effective to discern real interacting RNAs. Taking these into account, our prediction model can be effective to predict interaction sites after screening for real interacting RNAs, which will boost the functional analysis of regulatory RNAs. AVAILABILITY: The program RactIPAce along with data used in this work is available at //github.com/satoken/ractip/releases/tag/v1.0.1 CONTACT: ykato@rna.med.osaka-u.ac.jp; shingo@i.kyoto-u.ac.jp SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    @article{Kato:2016bh,
    Abstract = {MOTIVATION: RNA-RNA interactions via base pairing play a vital role in the post-transcriptional regulation of gene expression. Efficient identification of targets for such regulatory RNAs needs not only discriminative power for positive and negative RNA-RNA interacting sequence data but also accurate prediction of interaction sites from positive data. Recently, a few studies have incorporated interaction site accessibility into their prediction methods, indicating the enhancement of predictive performance on limited positive data.
    RESULTS: Here we show the efficacy of our accessibility-based prediction model RactIPAce on newly compiled datasets. The first experiment in interaction site prediction shows that RactIPAce achieves the best predictive performance on the newly compiled dataset of experimentally verified interactions in the literature as compared with the state-of-the-art methods. In addition, the second experiment in discrimination between positive and negative interacting pairs reveals that the combination of accessibility-based methods including our approach can be effective to discern real interacting RNAs. Taking these into account, our prediction model can be effective to predict interaction sites after screening for real interacting RNAs, which will boost the functional analysis of regulatory RNAs.
    AVAILABILITY: The program RactIPAce along with data used in this work is available at //github.com/satoken/ractip/releases/tag/v1.0.1 CONTACT: ykato@rna.med.osaka-u.ac.jp; shingo@i.kyoto-u.ac.jp SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.},
    Author = {Kato, Yuki and Mori, Tomoya and Sato, Kengo and Maegawa, Shingo and Hosokawa, Hiroshi and Akutsu, Tatsuya},
    Date-Added = {2016-11-23 20:26:50 +0000},
    Date-Modified = {2016-11-23 20:26:50 +0000},
    Doi = {10.1093/bioinformatics/btw603},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {Sep},
    Pmid = {27663495},
    Pst = {aheadofprint},
    Title = {An accessibility-incorporated method for accurate prediction of RNA-RNA interactions from sequence data},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw603}}

publications 2016-11-10

In this group meeting, we quickly discussed these latest papers:

  • Y. Benjamini, J. Taylor, and R. A. Irizarry, “Selection corrected statistical inference for region detection with high-throughput assays,” bioRxiv, p. 082321, 2016.
    [BibTeX]
    @article{benjamini2016selection,
    title={Selection Corrected Statistical Inference for Region Detection with High-throughput Assays},
    author={Benjamini, Yuval and Taylor, Jonathan and Irizarry, Rafael A},
    journal={bioRxiv},
    pages={082321},
    year={2016},
    publisher={Cold Spring Harbor Labs Journals}
    }

  • S. C. Hicks, K. Okrah, J. N. Paulson, J. Quackenbush, R. A. Irizarry, and H. C. Bravo, “Smooth quantile normalization,” bioRxiv, p. 085175, 2016.
    [BibTeX]
    @article{hicks2016smooth,
    title={Smooth Quantile Normalization},
    author={Hicks, Stephanie C and Okrah, Kwame and Paulson, Joseph N and Quackenbush, John and Irizarry, Rafael A and Bravo, Hector Corrada},
    journal={bioRxiv},
    pages={085175},
    year={2016},
    publisher={Cold Spring Harbor Labs Journals}
    }

  • M. Capogrosso, T. Milekovic, D. Borton, F. Wagner, E. M. Moraud, J. Mignardot, N. Buse, J. Gandar, Q. Barraud, D. Xing, and others, “A brain–spine interface alleviating gait deficits after spinal cord injury in primates,” Nature, vol. 539, iss. 7628, pp. 284-288, 2016.
    [BibTeX]
    @article{capogrosso2016brain,
    title={A brain--spine interface alleviating gait deficits after spinal cord injury in primates},
    author={Capogrosso, Marco and Milekovic, Tomislav and Borton, David and Wagner, Fabien and Moraud, Eduardo Martin and Mignardot, Jean-Baptiste and Buse, Nicolas and Gandar, Jerome and Barraud, Quentin and Xing, David and others},
    journal={Nature},
    volume={539},
    number={7628},
    pages={284--288},
    year={2016},
    publisher={Nature Research}
    }

  • E. Rivas, J. Clements, and S. R. Eddy, “A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs,” Nature Methods, 2016.
    [BibTeX]
    @article{rivas2016statistical,
    title={A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs},
    author={Rivas, Elena and Clements, Jody and Eddy, Sean R},
    journal={Nature Methods},
    year={2016},
    publisher={Nature Research}
    }

  • M. Zubradt, P. Gupta, S. Persad, A. M. Lambowitz, J. S. Weissman, and S. Rouskin, “DMS-MaPseq for genome-wide or targeted RNA structure probing in vivo,” Nature Methods, 2016.
    [BibTeX]
    @article{zubradt2016dms,
    title={DMS-MaPseq for genome-wide or targeted RNA structure probing in vivo},
    author={Zubradt, Meghan and Gupta, Paromita and Persad, Sitara and Lambowitz, Alan M and Weissman, Jonathan S and Rouskin, Silvi},
    journal={Nature Methods},
    year={2016},
    publisher={Nature Research}
    }

  • S. M. Oliveira, A. Häkkinen, J. Lloyd-Price, H. Tran, V. Kandavalli, and A. S. Ribeiro, “Temperature-dependent model of multi-step transcription initiation in Escherichia coli based on live single-cell measurements,” PLoS Comput Biol, vol. 12, iss. 10, p. e1005174, 2016.
    [BibTeX]
    @article{oliveira2016temperature,
    title={Temperature-Dependent Model of Multi-step Transcription Initiation in Escherichia coli Based on Live Single-Cell Measurements},
    author={Oliveira, Samuel MD and H{\"a}kkinen, Antti and Lloyd-Price, Jason and Tran, Huy and Kandavalli, Vinodh and Ribeiro, Andre S},
    journal={PLoS Comput Biol},
    volume={12},
    number={10},
    pages={e1005174},
    year={2016},
    publisher={Public Library of Science}
    }

  • A. Cheung, “Probabilistic learning by rodent grid cells,” PLoS Comput Biol, vol. 12, iss. 10, p. e1005165, 2016.
    [BibTeX]
    @article{cheung2016probabilistic,
    title={Probabilistic Learning by Rodent Grid Cells},
    author={Cheung, Allen},
    journal={PLoS Comput Biol},
    volume={12},
    number={10},
    pages={e1005165},
    year={2016},
    publisher={Public Library of Science}
    }

  • T. D. Hocking, P. Goerner-Potvin, A. Morin, X. Shao, T. Pastinen, and G. Bourque, “Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning,” Bioinformatics, p. btw672, 2016.
    [BibTeX]
    @article{hocking2016optimizing,
    title={Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning},
    author={Hocking, Toby Dylan and Goerner-Potvin, Patricia and Morin, Andreanne and Shao, Xiaojian and Pastinen, Tomi and Bourque, Guillaume},
    journal={Bioinformatics},
    pages={btw672},
    year={2016},
    publisher={Oxford Univ Press}
    }

  • B. Chang, M. Croson, L. Straker, S. Gart, C. Dove, J. Gerwin, and S. Jung, “How seabirds plunge-dive without injuries,” Proc Natl Acad Sci U S A, vol. 113, iss. 43, pp. 12006-12011, 2016. doi:10.1073/pnas.1608628113
    [BibTeX] [Abstract]

    In nature, several seabirds (e.g., gannets and boobies) dive into water at up to 24 m/s as a hunting mechanism; furthermore, gannets and boobies have a slender neck, which is potentially the weakest part of the body under compression during high-speed impact. In this work, we investigate the stability of the bird’s neck during plunge-diving by understanding the interaction between the fluid forces acting on the head and the flexibility of the neck. First, we use a salvaged bird to identify plunge-diving phases. Anatomical features of the skull and neck were acquired to quantify the effect of beak geometry and neck musculature on the stability during a plunge-dive. Second, physical experiments using an elastic beam as a model for the neck attached to a skull-like cone revealed the limits for the stability of the neck during the bird’s dive as a function of impact velocity and geometric factors. We find that the neck length, neck muscles, and diving speed of the bird predominantly reduce the likelihood of injury during the plunge-dive. Finally, we use our results to discuss maximum diving speeds for humans to avoid injury.

    @article{Chang:2016qf,
    Abstract = {In nature, several seabirds (e.g., gannets and boobies) dive into water at up to 24 m/s as a hunting mechanism; furthermore, gannets and boobies have a slender neck, which is potentially the weakest part of the body under compression during high-speed impact. In this work, we investigate the stability of the bird's neck during plunge-diving by understanding the interaction between the fluid forces acting on the head and the flexibility of the neck. First, we use a salvaged bird to identify plunge-diving phases. Anatomical features of the skull and neck were acquired to quantify the effect of beak geometry and neck musculature on the stability during a plunge-dive. Second, physical experiments using an elastic beam as a model for the neck attached to a skull-like cone revealed the limits for the stability of the neck during the bird's dive as a function of impact velocity and geometric factors. We find that the neck length, neck muscles, and diving speed of the bird predominantly reduce the likelihood of injury during the plunge-dive. Finally, we use our results to discuss maximum diving speeds for humans to avoid injury.},
    Author = {Chang, Brian and Croson, Matthew and Straker, Lorian and Gart, Sean and Dove, Carla and Gerwin, John and Jung, Sunghwan},
    Date-Added = {2016-11-10 11:51:34 +0000},
    Date-Modified = {2016-11-10 11:51:34 +0000},
    Doi = {10.1073/pnas.1608628113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {buckling; diving; injury; seabirds; water entry},
    Month = {Oct},
    Number = {43},
    Pages = {12006-12011},
    Pmc = {PMC5087068},
    Pmid = {27702905},
    Pst = {ppublish},
    Title = {How seabirds plunge-dive without injuries},
    Volume = {113},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1608628113}}

  • V. Fitz, J. Shin, C. Ehrlich, L. Farnung, P. Cramer, V. Zaburdaev, and S. W. Grill, “Nucleosomal arrangement affects single-molecule transcription dynamics,” Proc Natl Acad Sci U S A, 2016. doi:10.1073/pnas.1602764113
    [BibTeX] [Abstract]

    In eukaryotes, gene expression depends on chromatin organization. However, how chromatin affects the transcription dynamics of individual RNA polymerases has remained elusive. Here, we use dual trap optical tweezers to study single yeast RNA polymerase II (Pol II) molecules transcribing along a DNA template with two nucleosomes. The slowdown and the changes in pausing behavior within the nucleosomal region allow us to determine a drift coefficient, χ, which characterizes the ability of the enzyme to recover from a nucleosomal backtrack. Notably, χ can be used to predict the probability to pass the first nucleosome. Importantly, the presence of a second nucleosome changes χ in a manner that depends on the spacing between the two nucleosomes, as well as on their rotational arrangement on the helical DNA molecule. Our results indicate that the ability of Pol II to pass the first nucleosome is increased when the next nucleosome is turned away from the first one to face the opposite side of the DNA template. These findings help to rationalize how chromatin arrangement affects Pol II transcription dynamics.

    @article{Fitz:2016ve,
    Abstract = {In eukaryotes, gene expression depends on chromatin organization. However, how chromatin affects the transcription dynamics of individual RNA polymerases has remained elusive. Here, we use dual trap optical tweezers to study single yeast RNA polymerase II (Pol II) molecules transcribing along a DNA template with two nucleosomes. The slowdown and the changes in pausing behavior within the nucleosomal region allow us to determine a drift coefficient, χ, which characterizes the ability of the enzyme to recover from a nucleosomal backtrack. Notably, χ can be used to predict the probability to pass the first nucleosome. Importantly, the presence of a second nucleosome changes χ in a manner that depends on the spacing between the two nucleosomes, as well as on their rotational arrangement on the helical DNA molecule. Our results indicate that the ability of Pol II to pass the first nucleosome is increased when the next nucleosome is turned away from the first one to face the opposite side of the DNA template. These findings help to rationalize how chromatin arrangement affects Pol II transcription dynamics.},
    Author = {Fitz, Veronika and Shin, Jaeoh and Ehrlich, Christoph and Farnung, Lucas and Cramer, Patrick and Zaburdaev, Vasily and Grill, Stephan W},
    Date-Added = {2016-11-10 11:48:26 +0000},
    Date-Modified = {2016-11-10 11:48:26 +0000},
    Doi = {10.1073/pnas.1602764113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {Pol II; internucleosomal distance; optical tweezers; single-molecule; transcription},
    Month = {Oct},
    Pmid = {27791062},
    Pst = {aheadofprint},
    Title = {Nucleosomal arrangement affects single-molecule transcription dynamics},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1602764113}}

  • S. Wager, W. Du, J. Taylor, and R. J. Tibshirani, “High-dimensional regression adjustments in randomized experiments,” Proc Natl Acad Sci U S A, 2016. doi:10.1073/pnas.1614732113
    [BibTeX] [Abstract]

    We study the problem of treatment effect estimation in randomized experiments with high-dimensional covariate information and show that essentially any risk-consistent regression adjustment can be used to obtain efficient estimates of the average treatment effect. Our results considerably extend the range of settings where high-dimensional regression adjustments are guaranteed to provide valid inference about the population average treatment effect. We then propose cross-estimation, a simple method for obtaining finite-sample-unbiased treatment effect estimates that leverages high-dimensional regression adjustments. Our method can be used when the regression model is estimated using the lasso, the elastic net, subset selection, etc. Finally, we extend our analysis to allow for adaptive specification search via cross-validation and flexible nonparametric regression adjustments with machine-learning methods such as random forests or neural networks.

    @article{Wager:2016ly,
    Abstract = {We study the problem of treatment effect estimation in randomized experiments with high-dimensional covariate information and show that essentially any risk-consistent regression adjustment can be used to obtain efficient estimates of the average treatment effect. Our results considerably extend the range of settings where high-dimensional regression adjustments are guaranteed to provide valid inference about the population average treatment effect. We then propose cross-estimation, a simple method for obtaining finite-sample-unbiased treatment effect estimates that leverages high-dimensional regression adjustments. Our method can be used when the regression model is estimated using the lasso, the elastic net, subset selection, etc. Finally, we extend our analysis to allow for adaptive specification search via cross-validation and flexible nonparametric regression adjustments with machine-learning methods such as random forests or neural networks.},
    Author = {Wager, Stefan and Du, Wenfei and Taylor, Jonathan and Tibshirani, Robert J},
    Date-Added = {2016-11-10 11:46:22 +0000},
    Date-Modified = {2016-11-10 11:46:22 +0000},
    Doi = {10.1073/pnas.1614732113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {high-dimensional confounders; randomized trials; regression adjustment},
    Month = {Oct},
    Pmid = {27791165},
    Pst = {aheadofprint},
    Title = {High-dimensional regression adjustments in randomized experiments},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1614732113}}

  • A. Onken, J. K. Liu, C. P. P. R. Karunasekara, I. Delis, T. Gollisch, and S. Panzeri, “Using matrix and tensor factorizations for the single-trial analysis of population spike trains,” PLoS Comput Biol, vol. 12, iss. 11, p. e1005189, 2016. doi:10.1371/journal.pcbi.1005189
    [BibTeX] [Abstract]

    Advances in neuronal recording techniques are leading to ever larger numbers of simultaneously monitored neurons. This poses the important analytical challenge of how to capture compactly all sensory information that neural population codes carry in their spatial dimension (differences in stimulus tuning across neurons at different locations), in their temporal dimension (temporal neural response variations), or in their combination (temporally coordinated neural population firing). Here we investigate the utility of tensor factorizations of population spike trains along space and time. These factorizations decompose a dataset of single-trial population spike trains into spatial firing patterns (combinations of neurons firing together), temporal firing patterns (temporal activation of these groups of neurons) and trial-dependent activation coefficients (strength of recruitment of such neural patterns on each trial). We validated various factorization methods on simulated data and on populations of ganglion cells simultaneously recorded in the salamander retina. We found that single-trial tensor space-by-time decompositions provided low-dimensional data-robust representations of spike trains that capture efficiently both their spatial and temporal information about sensory stimuli. Tensor decompositions with orthogonality constraints were the most efficient in extracting sensory information, whereas non-negative tensor decompositions worked well even on non-independent and overlapping spike patterns, and retrieved informative firing patterns expressed by the same population in response to novel stimuli. Our method showed that populations of retinal ganglion cells carried information in their spike timing on the ten-milliseconds-scale about spatial details of natural images. This information could not be recovered from the spike counts of these cells. 
    First-spike latencies carried the majority of information provided by the whole spike train about fine-scale image features, and supplied almost as much information about coarse natural image features as firing rates. Together, these results highlight the importance of spike timing, and particularly of first-spike latencies, in retinal coding.

    @article{Onken:2016zr,
    Abstract = {Advances in neuronal recording techniques are leading to ever larger numbers of simultaneously monitored neurons. This poses the important analytical challenge of how to capture compactly all sensory information that neural population codes carry in their spatial dimension (differences in stimulus tuning across neurons at different locations), in their temporal dimension (temporal neural response variations), or in their combination (temporally coordinated neural population firing). Here we investigate the utility of tensor factorizations of population spike trains along space and time. These factorizations decompose a dataset of single-trial population spike trains into spatial firing patterns (combinations of neurons firing together), temporal firing patterns (temporal activation of these groups of neurons) and trial-dependent activation coefficients (strength of recruitment of such neural patterns on each trial). We validated various factorization methods on simulated data and on populations of ganglion cells simultaneously recorded in the salamander retina. We found that single-trial tensor space-by-time decompositions provided low-dimensional data-robust representations of spike trains that capture efficiently both their spatial and temporal information about sensory stimuli. Tensor decompositions with orthogonality constraints were the most efficient in extracting sensory information, whereas non-negative tensor decompositions worked well even on non-independent and overlapping spike patterns, and retrieved informative firing patterns expressed by the same population in response to novel stimuli. Our method showed that populations of retinal ganglion cells carried information in their spike timing on the ten-milliseconds-scale about spatial details of natural images. This information could not be recovered from the spike counts of these cells. 
    First-spike latencies carried the majority of information provided by the whole spike train about fine-scale image features, and supplied almost as much information about coarse natural image features as firing rates. Together, these results highlight the importance of spike timing, and particularly of first-spike latencies, in retinal coding.},
    Author = {Onken, Arno and Liu, Jian K and Karunasekara, P P Chamanthi R and Delis, Ioannis and Gollisch, Tim and Panzeri, Stefano},
    Date-Added = {2016-11-10 11:36:27 +0000},
    Date-Modified = {2016-11-10 11:36:27 +0000},
    Doi = {10.1371/journal.pcbi.1005189},
    Journal = {PLoS Comput Biol},
    Journal-Full = {PLoS computational biology},
    Month = {Nov},
    Number = {11},
    Pages = {e1005189},
    Pmid = {27814363},
    Pst = {epublish},
    Title = {Using Matrix and Tensor Factorizations for the Single-Trial Analysis of Population Spike Trains},
    Volume = {12},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1371/journal.pcbi.1005189}}

  • R. Petegrosso, S. Park, T. H. Hwang, and R. Kuang, “Transfer learning across ontologies for phenome-genome association prediction,” Bioinformatics, 2016. doi:10.1093/bioinformatics/btw649
    [BibTeX] [Abstract]

    MOTIVATION: To better predict and analyze gene associations with the collection of phenotypes organized in a phenotype ontology, it is crucial to effectively model the hierarchical structure among the phenotypes in the ontology and leverage the sparse known associations with additional training information. In this paper, we first introduce Dual Label Propagation (DLP) to impose consistent associations with the entire phenotype paths in predicting phenotype-gene associations in Human Phenotype Ontology (HPO). DLP is then used as the base model in a transfer learning framework (tlDLP) to incorporate functional annotations in Gene Ontology (GO). By simultaneously reconstructing GO term-gene associations and HPO phenotype-gene associations for all the genes in a protein-protein interaction network, tlDLP benefits from the enriched training associations indirectly through relation with GO terms. RESULTS: In the experiments to predict the associations between human genes and phenotypes in HPO based on human protein-protein interaction network, both DLP and tlDLP improved the prediction of gene associations with phenotype paths in HPO in cross-validation and the prediction of the most recent associations added after the snapshot of the training data. Moreover, the transfer learning through GO term-gene associations significantly improved association predictions for the phenotypes with no more specific known associations by a large margin. Examples are also shown to demonstrate how phenotype paths in phenotype ontology and transfer learning with gene ontology can improve the predictions. AVAILABILITY: Source code is available at //compbio.cs.umn.edu/ontophenome CONTACT: kuang@cs.umn.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    @article{Petegrosso:2016ys,
    Abstract = {MOTIVATION: To better predict and analyze gene associations with the collection of phenotypes organized in a phenotype ontology, it is crucial to effectively model the hierarchical structure among the phenotypes in the ontology and leverage the sparse known associations with additional training information. In this paper, we first introduce Dual Label Propagation (DLP) to impose consistent associations with the entire phenotype paths in predicting phenotype-gene associations in Human Phenotype Ontology (HPO). DLP is then used as the base model in a transfer learning framework (tlDLP) to incorporate functional annotations in Gene Ontology (GO). By simultaneously reconstructing GO term-gene associations and HPO phenotype-gene associations for all the genes in a protein-protein interaction network, tlDLP benefits from the enriched training associations indirectly through relation with GO terms.
    RESULTS: In the experiments to predict the associations between human genes and phenotypes in HPO based on human protein-protein interaction network, both DLP and tlDLP improved the prediction of gene associations with phenotype paths in HPO in cross-validation and the prediction of the most recent associations added after the snapshot of the training data. Moreover, the transfer learning through GO term-gene associations significantly improved association predictions for the phenotypes with no more specific known associations by a large margin. Examples are also shown to demonstrate how phenotype paths in phenotype ontology and transfer learning with gene ontology can improve the predictions.
    AVAILABILITY: Source code is available at //compbio.cs.umn.edu/ontophenome CONTACT: kuang@cs.umn.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.},
    Author = {Petegrosso, Raphael and Park, Sunho and Hwang, Tae Hyun and Kuang, Rui},
    Date-Added = {2016-11-10 11:33:27 +0000},
    Date-Modified = {2016-11-10 11:33:27 +0000},
    Doi = {10.1093/bioinformatics/btw649},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {Oct},
    Pmid = {27797759},
    Pst = {aheadofprint},
    Title = {Transfer Learning across Ontologies for Phenome-Genome Association Prediction},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw649}}

  • T. D. Hocking, P. Goerner-Potvin, A. Morin, X. Shao, T. Pastinen, and G. Bourque, “Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning,” Bioinformatics, 2016. doi:10.1093/bioinformatics/btw672

    MOTIVATION: Many peak detection algorithms have been proposed for ChIP-seq data analysis, but it is not obvious which algorithm and what parameters are optimal for any given data set. In contrast, regions with and without obvious peaks can be easily labeled by visual inspection of aligned read counts in a genome browser. We propose a supervised machine learning approach for ChIP-seq data analysis, using labels that encode qualitative judgments about which genomic regions contain or do not contain peaks. The main idea is to manually label a small subset of the genome, and then learn a model that makes consistent peak predictions on the rest of the genome. RESULTS: We created 7 new histone mark data sets with 12,826 visually determined labels, and analyzed 3 existing transcription factor data sets. We observed that default peak detection parameters yield high false positive rates, which can be reduced by learning parameters using a relatively small training set of labeled data from the same experiment type. We also observed that labels from different people are highly consistent. Overall, these data indicate that our supervised labeling method is useful for quantitatively training and testing peak detection algorithms. AVAILABILITY: Labeled histone mark data //cbio.ensmp.fr/thocking/chip-seq-chunk-db/, R package to compute the label error of predicted peaks //github.com/tdhock/PeakError CONTACT: toby.hocking@mail.mcgill.ca, guil.bourque@mcgill.ca.

    @article{Hocking:2016vn,
    Abstract = {MOTIVATION: Many peak detection algorithms have been proposed for ChIP-seq data analysis, but it is not obvious which algorithm and what parameters are optimal for any given data set. In contrast, regions with and without obvious peaks can be easily labeled by visual inspection of aligned read counts in a genome browser. We propose a supervised machine learning approach for ChIP-seq data analysis, using labels that encode qualitative judgments about which genomic regions contain or do not contain peaks. The main idea is to manually label a small subset of the genome, and then learn a model that makes consistent peak predictions on the rest of the genome.
    RESULTS: We created 7 new histone mark data sets with 12,826 visually determined labels, and analyzed 3 existing transcription factor data sets. We observed that default peak detection parameters yield high false positive rates, which can be reduced by learning parameters using a relatively small training set of labeled data from the same experiment type. We also observed that labels from different people are highly consistent. Overall, these data indicate that our supervised labeling method is useful for quantitatively training and testing peak detection algorithms.
    AVAILABILITY: Labeled histone mark data //cbio.ensmp.fr/thocking/chip-seq-chunk-db/, R package to compute the label error of predicted peaks //github.com/tdhock/PeakError CONTACT: toby.hocking@mail.mcgill.ca, guil.bourque@mcgill.ca.},
    Author = {Hocking, Toby Dylan and Goerner-Potvin, Patricia and Morin, Andreanne and Shao, Xiaojian and Pastinen, Tomi and Bourque, Guillaume},
    Date-Added = {2016-11-10 11:31:51 +0000},
    Date-Modified = {2016-11-10 11:31:51 +0000},
    Doi = {10.1093/bioinformatics/btw672},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {Oct},
    Pmid = {27797775},
    Pst = {aheadofprint},
    Title = {Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw672}}

  • L. Potvin-Trottier, N. D. Lord, G. Vinnicombe, and J. Paulsson, “Synchronous long-term oscillations in a synthetic gene circuit,” Nature, vol. 538, iss. 7626, pp. 514-517, 2016. doi:10.1038/nature19841

    Synthetically engineered genetic circuits can perform a wide variety of tasks but are generally less accurate than natural systems. Here we revisit the first synthetic genetic oscillator, the repressilator, and modify it using principles from stochastic chemistry in single cells. Specifically, we sought to reduce error propagation and information losses, not by adding control loops, but by simply removing existing features. We show that this modification created highly regular and robust oscillations. Furthermore, some streamlined circuits kept 14 generation periods over a range of growth conditions and kept phase for hundreds of generations in single cells, allowing cells in flasks and colonies to oscillate synchronously without any coupling between them. Our results suggest that even the simplest synthetic genetic networks can achieve a precision that rivals natural systems, and emphasize the importance of noise analyses for circuit design in synthetic biology.

    @article{Potvin-Trottier:2016kx,
    Abstract = {Synthetically engineered genetic circuits can perform a wide variety of tasks but are generally less accurate than natural systems. Here we revisit the first synthetic genetic oscillator, the repressilator, and modify it using principles from stochastic chemistry in single cells. Specifically, we sought to reduce error propagation and information losses, not by adding control loops, but by simply removing existing features. We show that this modification created highly regular and robust oscillations. Furthermore, some streamlined circuits kept 14 generation periods over a range of growth conditions and kept phase for hundreds of generations in single cells, allowing cells in flasks and colonies to oscillate synchronously without any coupling between them. Our results suggest that even the simplest synthetic genetic networks can achieve a precision that rivals natural systems, and emphasize the importance of noise analyses for circuit design in synthetic biology.},
    Author = {Potvin-Trottier, Laurent and Lord, Nathan D and Vinnicombe, Glenn and Paulsson, Johan},
    Date-Added = {2016-11-10 11:26:55 +0000},
    Date-Modified = {2016-11-10 11:26:55 +0000},
    Doi = {10.1038/nature19841},
    Journal = {Nature},
    Journal-Full = {Nature},
    Month = {Oct},
    Number = {7626},
    Pages = {514-517},
    Pmid = {27732583},
    Pst = {aheadofprint},
    Title = {Synchronous long-term oscillations in a synthetic gene circuit},
    Volume = {538},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1038/nature19841}}

  • A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis, “Hybrid computing using a neural network with dynamic external memory,” Nature, vol. 538, iss. 7626, pp. 471-476, 2016. doi:10.1038/nature20101

    Artificial neural networks are remarkably adept at sensory processing, sequence learning and reinforcement learning, but are limited in their ability to represent variables and data structures and to store data over long timescales, owing to the lack of an external memory. Here we introduce a machine learning model called a differentiable neural computer (DNC), which consists of a neural network that can read from and write to an external memory matrix, analogous to the random-access memory in a conventional computer. Like a conventional computer, it can use its memory to represent and manipulate complex data structures, but, like a neural network, it can learn to do so from data. When trained with supervised learning, we demonstrate that a DNC can successfully answer synthetic questions designed to emulate reasoning and inference problems in natural language. We show that it can learn tasks such as finding the shortest path between specified points and inferring the missing links in randomly generated graphs, and then generalize these tasks to specific graphs such as transport networks and family trees. When trained with reinforcement learning, a DNC can complete a moving blocks puzzle in which changing goals are specified by sequences of symbols. Taken together, our results demonstrate that DNCs have the capacity to solve complex, structured tasks that are inaccessible to neural networks without external read-write memory.

    @article{Graves:2016uq,
    Abstract = {Artificial neural networks are remarkably adept at sensory processing, sequence learning and reinforcement learning, but are limited in their ability to represent variables and data structures and to store data over long timescales, owing to the lack of an external memory. Here we introduce a machine learning model called a differentiable neural computer (DNC), which consists of a neural network that can read from and write to an external memory matrix, analogous to the random-access memory in a conventional computer. Like a conventional computer, it can use its memory to represent and manipulate complex data structures, but, like a neural network, it can learn to do so from data. When trained with supervised learning, we demonstrate that a DNC can successfully answer synthetic questions designed to emulate reasoning and inference problems in natural language. We show that it can learn tasks such as finding the shortest path between specified points and inferring the missing links in randomly generated graphs, and then generalize these tasks to specific graphs such as transport networks and family trees. When trained with reinforcement learning, a DNC can complete a moving blocks puzzle in which changing goals are specified by sequences of symbols. Taken together, our results demonstrate that DNCs have the capacity to solve complex, structured tasks that are inaccessible to neural networks without external read-write memory.},
    Author = {Graves, Alex and Wayne, Greg and Reynolds, Malcolm and Harley, Tim and Danihelka, Ivo and Grabska-Barwi{\'n}ska, Agnieszka and Colmenarejo, Sergio G{\'o}mez and Grefenstette, Edward and Ramalho, Tiago and Agapiou, John and Badia, Adri{\`a} Puigdom{\`e}nech and Hermann, Karl Moritz and Zwols, Yori and Ostrovski, Georg and Cain, Adam and King, Helen and Summerfield, Christopher and Blunsom, Phil and Kavukcuoglu, Koray and Hassabis, Demis},
    Date-Added = {2016-11-10 11:26:08 +0000},
    Date-Modified = {2016-11-10 11:26:08 +0000},
    Doi = {10.1038/nature20101},
    Journal = {Nature},
    Journal-Full = {Nature},
    Month = {Oct},
    Number = {7626},
    Pages = {471-476},
    Pmid = {27732574},
    Pst = {aheadofprint},
    Title = {Hybrid computing using a neural network with dynamic external memory},
    Volume = {538},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1038/nature20101}}

  • R. L. Dilley, P. Verma, N. W. Cho, H. D. Winters, A. R. Wondisford, and R. A. Greenberg, “Break-induced telomere synthesis underlies alternative telomere maintenance,” Nature, vol. 539, iss. 7627, pp. 54-58, 2016. doi:10.1038/nature20099

    Homology-directed DNA repair is essential for genome maintenance through templated DNA synthesis. Alternative lengthening of telomeres (ALT) necessitates homology-directed DNA repair to maintain telomeres in about 10-15% of human cancers. How DNA damage induces assembly and execution of a DNA replication complex (break-induced replisome) at telomeres or elsewhere in the mammalian genome is poorly understood. Here we define break-induced telomere synthesis and demonstrate that it utilizes a specialized replisome, which underlies ALT telomere maintenance. DNA double-strand breaks enact nascent telomere synthesis by long-tract unidirectional replication. Proliferating cell nuclear antigen (PCNA) loading by replication factor C (RFC) acts as the initial sensor of telomere damage to establish predominance of DNA polymerase δ (Pol δ) through its POLD3 subunit. Break-induced telomere synthesis requires the RFC-PCNA-Pol δ axis, but is independent of other canonical replisome components, ATM and ATR, or the homologous recombination protein Rad51. Thus, the inception of telomere damage recognition by the break-induced replisome orchestrates homology-directed telomere maintenance.

    @article{Dilley:2016fk,
    Abstract = {Homology-directed DNA repair is essential for genome maintenance through templated DNA synthesis. Alternative lengthening of telomeres (ALT) necessitates homology-directed DNA repair to maintain telomeres in about 10-15% of human cancers. How DNA damage induces assembly and execution of a DNA replication complex (break-induced replisome) at telomeres or elsewhere in the mammalian genome is poorly understood. Here we define break-induced telomere synthesis and demonstrate that it utilizes a specialized replisome, which underlies ALT telomere maintenance. DNA double-strand breaks enact nascent telomere synthesis by long-tract unidirectional replication. Proliferating cell nuclear antigen (PCNA) loading by replication factor C (RFC) acts as the initial sensor of telomere damage to establish predominance of DNA polymerase δ (Pol δ) through its POLD3 subunit. Break-induced telomere synthesis requires the RFC-PCNA-Pol δ axis, but is independent of other canonical replisome components, ATM and ATR, or the homologous recombination protein Rad51. Thus, the inception of telomere damage recognition by the break-induced replisome orchestrates homology-directed telomere maintenance.},
    Author = {Dilley, Robert L and Verma, Priyanka and Cho, Nam Woo and Winters, Harrison D and Wondisford, Anne R and Greenberg, Roger A},
    Date-Added = {2016-11-10 11:24:31 +0000},
    Date-Modified = {2016-11-10 11:24:31 +0000},
    Doi = {10.1038/nature20099},
    Journal = {Nature},
    Journal-Full = {Nature},
    Month = {Oct},
    Number = {7627},
    Pages = {54-58},
    Pmid = {27760120},
    Pst = {aheadofprint},
    Title = {Break-induced telomere synthesis underlies alternative telomere maintenance},
    Volume = {539},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1038/nature20099}}

publications 2016-10-13

In this group meeting, we quickly discussed these latest papers:

  • Y. Wan, G. I. Allen, Y. Baker, E. Yang, P. Ravikumar, M. Anderson, and Z. Liu, “XMRF: an R package to fit Markov networks to high-throughput genetics data,” BMC systems biology, vol. 10, iss. 3, p. 69, 2016. doi:10.1186/s12918-016-0313-0

    Technological advances in medicine have led to a rapid proliferation of high-throughput “omics” data. Tools to mine this data and discover disrupted disease networks are needed as they hold the key to understanding complicated interactions between genes, mutations and aberrations, and epi-genetic markers.

    @Article{Wan2016,
    author="Wan, Ying-Wooi
    and Allen, Genevera I.
    and Baker, Yulia
    and Yang, Eunho
    and Ravikumar, Pradeep
    and Anderson, Matthew
    and Liu, Zhandong",
    title="XMRF: an R package to fit Markov Networks to high-throughput genetics data",
    journal="BMC Systems Biology",
    year="2016",
    volume="10",
    number="3",
    pages="69",
    abstract="Technological advances in medicine have led to a rapid proliferation of high-throughput ``omics'' data. Tools to mine this data and discover disrupted disease networks are needed as they hold the key to understanding complicated interactions between genes, mutations and aberrations, and epi-genetic markers.",
    issn="1752-0509",
    doi="10.1186/s12918-016-0313-0",
    url="//dx.doi.org/10.1186/s12918-016-0313-0"
    }

  • O. Lenive, P. D. W. Kirk, and M. P. H. Stumpf, “Inferring extrinsic noise from single-cell gene expression data using approximate Bayesian computation,” BMC systems biology, vol. 10, iss. 1, p. 81, 2016. doi:10.1186/s12918-016-0324-x

    Gene expression is known to be an intrinsically stochastic process which can involve single-digit numbers of mRNA molecules in a cell at any given time. The modelling of such processes calls for the use of exact stochastic simulation methods, most notably the Gillespie algorithm. However, this stochasticity, also termed “intrinsic noise”, does not account for all the variability between genetically identical cells growing in a homogeneous environment.

    @Article{Lenive2016,
    author="Lenive, Oleg
    and Kirk, Paul D. W.
    and Stumpf, Michael P. H.",
    title="Inferring extrinsic noise from single-cell gene expression data using approximate Bayesian computation",
    journal="BMC Systems Biology",
    year="2016",
    volume="10",
    number="1",
    pages="81",
    abstract="Gene expression is known to be an intrinsically stochastic process which can involve single-digit numbers of mRNA molecules in a cell at any given time. The modelling of such processes calls for the use of exact stochastic simulation methods, most notably the Gillespie algorithm. However, this stochasticity, also termed ``intrinsic noise'', does not account for all the variability between genetically identical cells growing in a homogeneous environment.",
    issn="1752-0509",
    doi="10.1186/s12918-016-0324-x",
    url="//dx.doi.org/10.1186/s12918-016-0324-x"
    }
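
    The Gillespie algorithm named in the abstract above can be sketched on a hypothetical birth-death model of mRNA (constant production, first-order degradation). The model, the function names and the rate values are illustrative assumptions and are not taken from the paper; the sketch simulates intrinsic noise only.

```python
import numpy as np


def gillespie_birth_death(k_prod, k_deg, t_end, rng, m0=0):
    """Exact stochastic simulation of mRNA copy number m (hypothetical model).

    Reactions: production at rate k_prod, degradation at rate k_deg * m.
    """
    t, m = 0.0, m0
    times, counts = [t], [m]
    while True:
        rates = np.array([k_prod, k_deg * m])
        total = rates.sum()
        # Waiting time to the next reaction is exponential with rate `total`.
        t += rng.exponential(1.0 / total)
        if t > t_end:
            break
        # Pick the reaction with probability proportional to its rate.
        if rng.random() * total < rates[0]:
            m += 1
        else:
            m -= 1
        times.append(t)
        counts.append(m)
    return np.array(times), np.array(counts)


def time_average(times, counts, t_end):
    """Average copy number, weighting each state by the time spent in it."""
    durations = np.diff(np.append(times, t_end))
    return float(np.sum(counts * durations) / t_end)


rng = np.random.default_rng(1)
times, counts = gillespie_birth_death(k_prod=5.0, k_deg=1.0, t_end=500.0, rng=rng)
# The stationary mean of this model is k_prod / k_deg.
print(time_average(times, counts, t_end=500.0))
```

    Weighting by holding times matters here: averaging `counts` directly would over-represent short-lived states.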

  • S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh, “Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics,” Journal of machine learning research, vol. 17, iss. 159, pp. 1-48, 2016.

    Applying standard Markov chain Monte Carlo (MCMC) algorithms to large data sets is computationally infeasible. The recently proposed stochastic gradient Langevin dynamics (SGLD) method circumvents this problem in three ways: it generates proposed moves using only a subset of the data, it skips the Metropolis-Hastings accept-reject step, and it uses sequences of decreasing step sizes. In Teh et al. (2014), we provided the mathematical foundations for the decreasing step size SGLD, including consistency and a central limit theorem. However, in practice the SGLD is run for a relatively small number of iterations, and its step size is not decreased to zero. The present article investigates the behaviour of the SGLD with fixed step size. In particular we characterise the asymptotic bias explicitly, along with its dependence on the step size and the variance of the stochastic gradient. On that basis a modified SGLD which removes the asymptotic bias due to the variance of the stochastic gradients up to first order in the step size is derived. Moreover, we are able to obtain bounds on the finite-time bias, variance and mean squared error (MSE). The theory is illustrated with a Gaussian toy model for which the bias and the MSE for the estimation of moments can be obtained explicitly. For this toy model we study the gain of the SGLD over the standard Euler method in the limit of large data sets.

    @article{JMLR:v17:15-494,
    author = {Sebastian J. Vollmer and Konstantinos C. Zygalakis and Yee Whye Teh},
    title = {Exploration of the (Non-)Asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics},
    journal = {Journal of Machine Learning Research},
    year = {2016},
    volume = {17},
    number = {159},
    pages = {1-48},
    url = {//jmlr.org/papers/v17/15-494.html},
    abstract = {Applying standard Markov chain Monte Carlo (MCMC) algorithms to large data sets is computationally infeasible. The recently proposed stochastic gradient Langevin dynamics (SGLD) method circumvents this problem in three ways: it generates proposed moves using only a subset of the data, it skips the Metropolis-Hastings accept-reject step, and it uses sequences of decreasing step sizes. In Teh et al. (2014), we provided the mathematical foundations for the decreasing step size SGLD, including consistency and a central limit theorem. However, in practice the SGLD is run for a relatively small number of iterations, and its step size is not decreased to zero. The present article investigates the behaviour of the SGLD with fixed step size. In particular we characterise the asymptotic bias explicitly, along with its dependence on the step size and the variance of the stochastic gradient. On that basis a modified SGLD which removes the asymptotic bias due to the variance of the stochastic gradients up to first order in the step size is derived. Moreover, we are able to obtain bounds on the finite-time bias, variance and mean squared error (MSE). The theory is illustrated with a Gaussian toy model for which the bias and the MSE for the estimation of moments can be obtained explicitly. For this toy model we study the gain of the SGLD over the standard Euler method in the limit of large data sets.}
    }
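
    The fixed-step SGLD described in the abstract above can be sketched on a Gaussian toy model. The specific model (unit-variance Gaussian likelihood with a N(0, 1) prior on the mean), the function name `sgld_gaussian_mean` and all parameter values are assumptions chosen for illustration; this is plain SGLD, not the modified, bias-corrected variant the paper derives.

```python
import numpy as np


def sgld_gaussian_mean(data, step_size, batch_size, n_iters, rng):
    """Fixed-step SGLD for the mean theta of a unit-variance Gaussian.

    Assumed model (illustrative): x_i ~ N(theta, 1), prior theta ~ N(0, 1).
    """
    n = len(data)
    theta = 0.0
    samples = np.empty(n_iters)
    for k in range(n_iters):
        # Use only a random subset of the data at each iteration.
        batch = rng.choice(data, size=batch_size, replace=False)
        # Unbiased minibatch estimate of the gradient of the log posterior.
        grad = -theta + (n / batch_size) * np.sum(batch - theta)
        # Langevin move; no Metropolis-Hastings accept-reject step.
        noise = np.sqrt(step_size) * rng.standard_normal()
        theta = theta + 0.5 * step_size * grad + noise
        samples[k] = theta
    return samples


rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)
samples = sgld_gaussian_mean(data, step_size=1e-4, batch_size=50, n_iters=5000, rng=rng)

# Exact posterior mean for this conjugate model, for comparison.
posterior_mean = data.sum() / (len(data) + 1)
print(posterior_mean, samples[1000:].mean())
```

    Because the model is conjugate, the exact posterior is available in closed form, so the sampler's moments can be checked directly against it.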

  • P. Guarniero, A. M. Johansen, and A. Lee, “The iterated auxiliary particle filter,” Journal of the American statistical association, iss. ja. doi:10.1080/01621459.2016.1222291

    We present an offline, iterated particle filter to facilitate statistical inference in general state space hidden Markov models. Given a model and a sequence of observations, the associated marginal likelihood L is central to likelihood-based inference for unknown statistical parameters. We define a class of “twisted” models: each member is specified by a sequence of positive functions ψ and has an associated ψ-auxiliary particle filter that provides unbiased estimates of L. We identify a sequence ψ* that is optimal in the sense that the ψ*-auxiliary particle filter’s estimate of L has zero variance. In practical applications, ψ* is unknown so the ψ*-auxiliary particle filter cannot straightforwardly be implemented. We use an iterative scheme to approximate ψ*, and demonstrate empirically that the resulting iterated auxiliary particle filter significantly outperforms the bootstrap particle filter in challenging settings. Applications include parameter estimation using a particle Markov chain Monte Carlo algorithm.

    @article{doi:10.1080/01621459.2016.1222291,
    author = {Pieralberto Guarniero and Adam M. Johansen and Anthony Lee},
    title = {The iterated auxiliary particle filter},
    journal = {Journal of the American Statistical Association},
    volume = {0},
    number = {ja},
    pages = {0-0},
    year = {0},
    doi = {10.1080/01621459.2016.1222291},
    URL = {//dx.doi.org/10.1080/01621459.2016.1222291},
    eprint = {//dx.doi.org/10.1080/01621459.2016.1222291},
    abstract = { We present an offline, iterated particle filter to facilitate statistical inference in general state space hidden Markov models. Given a model and a sequence of observations, the associated marginal likelihood L is central to likelihood-based inference for unknown statistical parameters. We define a class of “twisted” models: each member is specified by a sequence of positive functions ψ and has an associated ψ-auxiliary particle filter that provides unbiased estimates of L. We identify a sequence ψ* that is optimal in the sense that the ψ*-auxiliary particle filter's estimate of L has zero variance. In practical applications, ψ* is unknown so the ψ*-auxiliary particle filter cannot straightforwardly be implemented. We use an iterative scheme to approximate ψ*, and demonstrate empirically that the resulting iterated auxiliary particle filter significantly outperforms the bootstrap particle filter in challenging settings. Applications include parameter estimation using a particle Markov chain Monte Carlo algorithm. }
    }
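
    For context, the bootstrap particle filter that the abstract above uses as its baseline can be sketched as follows. This is not the iterated auxiliary particle filter itself. The linear-Gaussian state space model, the function name `bootstrap_pf_loglik` and the parameter values are assumptions made for illustration.

```python
import numpy as np


def bootstrap_pf_loglik(y, n_particles, a, sigma_x, sigma_y, rng):
    """Log of the bootstrap particle filter estimate of the marginal likelihood L.

    Assumed model (illustrative): x_0 ~ N(0, sigma_x^2),
    x_t = a * x_{t-1} + sigma_x * noise, y_t = x_t + sigma_y * noise.
    """
    x = sigma_x * rng.standard_normal(n_particles)
    loglik = 0.0
    for t, y_t in enumerate(y):
        if t > 0:
            # Propagate particles through the state transition.
            x = a * x + sigma_x * rng.standard_normal(n_particles)
        # Weight particles by the observation density, in log space.
        logw = -0.5 * np.log(2 * np.pi * sigma_y**2) - 0.5 * ((y_t - x) / sigma_y) ** 2
        shift = logw.max()
        w = np.exp(logw - shift)
        # The mean weight estimates p(y_t | y_{0:t-1}).
        loglik += shift + np.log(w.mean())
        # Multinomial resampling.
        x = rng.choice(x, size=n_particles, p=w / w.sum())
    return loglik


rng = np.random.default_rng(2)
a, sigma_x, sigma_y, T = 0.9, 1.0, 1.0, 50
x_true = np.empty(T)
x_true[0] = sigma_x * rng.standard_normal()
for t in range(1, T):
    x_true[t] = a * x_true[t - 1] + sigma_x * rng.standard_normal()
y = x_true + sigma_y * rng.standard_normal(T)

loglik_hat = bootstrap_pf_loglik(y, n_particles=2000, a=a, sigma_x=sigma_x, sigma_y=sigma_y, rng=rng)
print(loglik_hat)
```

    A linear-Gaussian model is used here only because a Kalman filter gives its exact likelihood, which makes the estimate easy to check.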

  • A. Sarkar and D. B. Dunson, “Bayesian nonparametric modeling of higher order Markov chains,” Journal of the American statistical association, iss. ja, pp. 1-36. doi:10.1080/01621459.2015.1115763

    We consider the problem of flexible modeling of higher order Markov chains when an upper bound on the order of the chain is known but the true order and nature of the serial dependence are unknown. We propose Bayesian nonparametric methodology based on conditional tensor factorizations, which can characterize any transition probability with a specified maximal order. The methodology selects the important lags and captures higher order interactions among the lags, while also facilitating calculation of Bayes factors for a variety of hypotheses of interest. We design efficient Markov chain Monte Carlo algorithms for posterior computation, allowing for uncertainty in the set of important lags to be included and in the nature and order of the serial dependence. The methods are illustrated using simulation experiments and real world applications.

    @article{doi:10.1080/01621459.2015.1115763,
    author = {Abhra Sarkar and David B. Dunson},
    title = {Bayesian Nonparametric Modeling of Higher Order Markov Chains},
    journal = {Journal of the American Statistical Association},
    volume = {0},
    number = {ja},
    pages = {1-36},
    year = {0},
    doi = {10.1080/01621459.2015.1115763},
    URL = {//dx.doi.org/10.1080/01621459.2015.1115763},
    eprint = {//dx.doi.org/10.1080/01621459.2015.1115763},
    abstract = { We consider the problem of flexible modeling of higher order Markov chains when an upper bound on the order of the chain is known but the true order and nature of the serial dependence are unknown. We propose Bayesian nonparametric methodology based on conditional tensor factorizations, which can characterize any transition probability with a specified maximal order. The methodology selects the important lags and captures higher order interactions among the lags, while also facilitating calculation of Bayes factors for a variety of hypotheses of interest. We design efficient Markov chain Monte Carlo algorithms for posterior computation, allowing for uncertainty in the set of important lags to be included and in the nature and order of the serial dependence. The methods are illustrated using simulation experiments and real world applications. }
    }

  • X. S. Liu, H. Wu, X. Ji, Y. Stelzer, X. Wu, S. Czauderna, J. Shu, D. Dadon, R. A. Young, and R. Jaenisch, “Editing DNA methylation in the mammalian genome,” Cell, vol. 167, iss. 1, pp. 233-247, 2016.
    @article{liu2016editing,
    title={Editing DNA Methylation in the Mammalian Genome},
    author={Liu, X Shawn and Wu, Hao and Ji, Xiong and Stelzer, Yonatan and Wu, Xuebing and Czauderna, Szymon and Shu, Jian and Dadon, Daniel and Young, Richard A and Jaenisch, Rudolf},
    journal={Cell},
    volume={167},
    number={1},
    pages={233--247},
    year={2016},
    publisher={Elsevier}
    }

  • A. D. King, K. Huang, L. Rubbi, S. Liu, C. Wang, Y. Wang, M. Pellegrini, and G. Fan, “Reversible regulation of promoter and enhancer histone landscape by DNA methylation in mouse embryonic stem cells,” Cell reports, vol. 17, iss. 1, pp. 289-302, 2016.
    @article{king2016reversible,
    title={Reversible Regulation of Promoter and Enhancer Histone Landscape by DNA Methylation in Mouse Embryonic Stem Cells},
    author={King, Andrew D and Huang, Kevin and Rubbi, Liudmilla and Liu, Shuo and Wang, Cun-Yu and Wang, Yinsheng and Pellegrini, Matteo and Fan, Guoping},
    journal={Cell Reports},
    volume={17},
    number={1},
    pages={289--302},
    year={2016},
    publisher={Elsevier}
    }

  • Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: representing model uncertainty in deep learning,” arXiv preprint arXiv:1506.02142, 2015.
    @article{gal2015dropout,
    title={Dropout as a Bayesian approximation: Representing model uncertainty in deep learning},
    author={Gal, Yarin and Ghahramani, Zoubin},
    journal={arXiv preprint arXiv:1506.02142},
    year={2015}
    }

  • C. Shao and T. Höfer, “Robust classification of single-cell transcriptome data by nonnegative matrix factorization,” Bioinformatics, p. btw607, 2016.
    @article{shao2016robust,
    title={Robust classification of single-cell transcriptome data by nonnegative matrix factorization},
    author={Shao, Chunxuan and H{\"o}fer, Thomas},
    journal={Bioinformatics},
    pages={btw607},
    year={2016},
    publisher={Oxford Univ Press}
    }

  • G. K. Ocker, K. Josić, E. Shea-Brown, and M. A. Buice, “Linking structure and activity in nonlinear spiking networks,” arXiv preprint arXiv:1610.03828, pp. 1-47, 2016.

    Recent experimental advances are producing an avalanche of data on both neural connectivity and neural activity. To take full advantage of these two emerging datasets we need a framework that links them, revealing how collective neural activity arises from the structure of neural connectivity and intrinsic neural dynamics. This problem of structure-driven activity has drawn major interest in computational neuroscience. Existing methods for relating activity and architecture in spiking networks rely on linearizing activity around a central operating point and thus fail to capture the nonlinear responses of individual neurons that are the hallmark of neural information processing. Here, we overcome this limitation and present a new relationship between connectivity and activity in networks of nonlinear spiking neurons. We explicitly show how recurrent network structure produces pairwise and higher-order correlated activity, and how nonlinearities impact the networks’ spiking activity. Finally, we demonstrate how correlations due to recurrent connectivity impact the fidelity with which populations encode simple signals and how nonlinear dynamics impose a new effect of correlations on coding: to translate response distributions in addition to stretching them. Our findings open new avenues to investigating how neural nonlinearities, including those expressed across multiple cell types, combine with connectivity to shape population activity and function.

    @article{Ocker2016,
    abstract = {Recent experimental advances are producing an avalanche of data on both neural connectivity and neural activity. To take full advantage of these two emerging datasets we need a framework that links them, revealing how collective neural activity arises from the structure of neural connectivity and intrinsic neural dynamics. This problem of structure-driven activity has drawn major interest in computational neuroscience. Existing methods for relating activity and architecture in spiking networks rely on linearizing activity around a central operating point and thus fail to capture the nonlinear responses of individual neurons that are the hallmark of neural information processing. Here, we overcome this limitation and present a new relationship between connectivity and activity in networks of nonlinear spiking neurons. We explicitly show how recurrent network structure produces pairwise and higher-order correlated activity, and how nonlinearities impact the networks' spiking activity. Finally, we demonstrate how correlations due to recurrent connectivity impact the fidelity with which populations encode simple signals and how nonlinear dynamics impose a new effect of correlations on coding: to translate response distributions in addition to stretching them. Our findings open new avenues to investigating how neural nonlinearities, including those expressed across multiple cell types, combine with connectivity to shape population activity and function.},
    archivePrefix = {arXiv},
    arxivId = {1610.03828},
    author = {Ocker, Gabriel Koch and Josi{\'{c}}, Kre{\v{s}}imir and Shea-Brown, Eric and Buice, Michael A.},
    eprint = {1610.03828},
    file = {:Users/david/PhD/literature/paper/Ocker2016.pdf:pdf},
    pages = {1--47},
    title = {{Linking structure and activity in nonlinear spiking networks}},
    url = {//arxiv.org/abs/1610.03828},
    year = {2016}
    }

  • D. Moyer, B. A. Gutman, J. Faskowitz, N. Jahanshad, and P. M. Thompson, “A Continuous Model of Cortical Connectivity,” arXiv:1610.03809, 2016.
    [BibTeX] [Abstract] [Download PDF]

    We present a continuous model for structural brain connectivity based on the Poisson point process. The model treats each streamline curve in a tractography as an observed event in connectome space, here a product space of cortical white matter boundaries. We approximate the model parameter via kernel density estimation. To deal with the heavy computational burden, we develop a fast parameter estimation method by pre-computing associated Legendre products of the data, leveraging properties of the spherical heat kernel. We show how our approach can be used to assess the quality of cortical parcellations with respect to connectivity. We further present empirical results that suggest the discrete connectomes derived from our model have substantially higher test-retest reliability compared to standard methods.

    @article{Moyer2016,
    abstract = {We present a continuous model for structural brain connectivity based on the Poisson point process. The model treats each streamline curve in a tractography as an observed event in connectome space, here a product space of cortical white matter boundaries. We approximate the model parameter via kernel density estimation. To deal with the heavy computational burden, we develop a fast parameter estimation method by pre-computing associated Legendre products of the data, leveraging properties of the spherical heat kernel. We show how our approach can be used to assess the quality of cortical parcellations with respect to connectivity. We further present empirical results that suggest the discrete connectomes derived from our model have substantially higher test-retest reliability compared to standard methods.},
    archivePrefix = {arXiv},
    arxivId = {1610.03809},
    author = {Moyer, Daniel and Gutman, Boris A. and Faskowitz, Joshua and Jahanshad, Neda and Thompson, Paul M.},
    eprint = {1610.03809},
    file = {:Users/david/PhD/literature/paper/Moyer2016.pdf:pdf},
    keywords = {diffusion mri,human connectome,non-parametric esti-},
    title = {{A Continuous Model of Cortical Connectivity}},
    url = {//arxiv.org/abs/1610.03809},
    year = {2016}
    }

  • M. J. Taliaferro, N. J. Lambert, P. H. Sudmant, D. Dominguez, J. J. Merkin, M. S. Alexis, C. A. Bazile, and C. B. Burge, “RNA sequence context effects measured in vitro predict in vivo protein binding and regulation,” Molecular cell, 2016.
    [BibTeX]
    @article{taliaferro2016rna,
    title={RNA Sequence Context Effects Measured In Vitro Predict In Vivo Protein Binding and Regulation},
    author={Taliaferro, J Matthew and Lambert, Nicole J and Sudmant, Peter H and Dominguez, Daniel and Merkin, Jason J and Alexis, Maria S and Bazile, Cassandra A and Burge, Christopher B},
    journal={Molecular Cell},
    year={2016},
    publisher={Elsevier}
    }

  • K. W. Brannan, W. Jin, S. C. Huelga, C. A. Banks, J. M. Gilmore, L. Florens, M. P. Washburn, E. L. Van Nostrand, G. A. Pratt, M. K. Schwinn, and others, “SONAR discovers RNA-binding proteins from analysis of large-scale protein-protein interactomes,” Molecular cell, 2016.
    [BibTeX]
    @article{brannan2016sonar,
    title={SONAR Discovers RNA-Binding Proteins from Analysis of Large-Scale Protein-Protein Interactomes},
    author={Brannan, Kristopher W and Jin, Wenhao and Huelga, Stephanie C and Banks, Charles AS and Gilmore, Joshua M and Florens, Laurence and Washburn, Michael P and Van Nostrand, Eric L and Pratt, Gabriel A and Schwinn, Marie K and others},
    journal={Molecular Cell},
    year={2016},
    publisher={Elsevier}
    }

  • G. M. Gould, J. M. Paggi, Y. Guo, D. V. Phizicky, B. Zinshteyn, E. T. Wang, W. V. Gilbert, D. K. Gifford, and C. B. Burge, “Identification of new branch points and unconventional introns in Saccharomyces cerevisiae,” RNA, vol. 22, iss. 10, pp. 1522-1534, 2016.
    [BibTeX]
    @article{gould2016identification,
    title={Identification of new branch points and unconventional introns in Saccharomyces cerevisiae},
    author={Gould, Genevieve M and Paggi, Joseph M and Guo, Yuchun and Phizicky, David V and Zinshteyn, Boris and Wang, Eric T and Gilbert, Wendy V and Gifford, David K and Burge, Christopher B},
    journal={RNA},
    volume={22},
    number={10},
    pages={1522--1534},
    year={2016},
    publisher={Cold Spring Harbor Lab}
    }

  • A. D. Washburne, J. W. Burby, and D. Lacker, “Novel covariance-based neutrality test of time-series data reveals asymmetries in ecological and economic systems,” PLoS Comput Biol, vol. 12, iss. 9, p. e1005124, 2016. doi:10.1371/journal.pcbi.1005124
    [BibTeX] [Abstract]

    Systems as diverse as the interacting species in a community, alleles at a genetic locus, and companies in a market are characterized by competition (over resources, space, capital, etc) and adaptation. Neutral theory, built around the hypothesis that individual performance is independent of group membership, has found utility across the disciplines of ecology, population genetics, and economics, both because of the success of the neutral hypothesis in predicting system properties and because deviations from these predictions provide information about the underlying dynamics. However, most tests of neutrality are weak, based on static system properties such as species-abundance distributions or the number of singletons in a sample. Time-series data provide a window onto a system’s dynamics, and should furnish tests of the neutral hypothesis that are more powerful to detect deviations from neutrality and more informative about the type of competitive asymmetry that drives the deviation. Here, we present a neutrality test for time-series data. We apply this test to several microbial time-series and financial time-series and find that most of these systems are not neutral. Our test isolates the covariance structure of neutral competition, thus facilitating further exploration of the nature of asymmetry in the covariance structure of competitive systems. Much like neutrality tests from population genetics that use relative abundance distributions have enabled researchers to scan entire genomes for genes under selection, we anticipate our time-series test will be useful for quick significance tests of neutrality across a range of ecological, economic, and sociological systems for which time-series data are available. Future work can use our test to categorize and compare the dynamic fingerprints of particular competitive asymmetries (frequency dependence, volatility smiles, etc) to improve forecasting and management of complex adaptive systems.

    @article{Washburne:2016kx,
    Abstract = {Systems as diverse as the interacting species in a community, alleles at a genetic locus, and companies in a market are characterized by competition (over resources, space, capital, etc) and adaptation. Neutral theory, built around the hypothesis that individual performance is independent of group membership, has found utility across the disciplines of ecology, population genetics, and economics, both because of the success of the neutral hypothesis in predicting system properties and because deviations from these predictions provide information about the underlying dynamics. However, most tests of neutrality are weak, based on static system properties such as species-abundance distributions or the number of singletons in a sample. Time-series data provide a window onto a system's dynamics, and should furnish tests of the neutral hypothesis that are more powerful to detect deviations from neutrality and more informative about the type of competitive asymmetry that drives the deviation. Here, we present a neutrality test for time-series data. We apply this test to several microbial time-series and financial time-series and find that most of these systems are not neutral. Our test isolates the covariance structure of neutral competition, thus facilitating further exploration of the nature of asymmetry in the covariance structure of competitive systems. Much like neutrality tests from population genetics that use relative abundance distributions have enabled researchers to scan entire genomes for genes under selection, we anticipate our time-series test will be useful for quick significance tests of neutrality across a range of ecological, economic, and sociological systems for which time-series data are available. Future work can use our test to categorize and compare the dynamic fingerprints of particular competitive asymmetries (frequency dependence, volatility smiles, etc) to improve forecasting and management of complex adaptive systems.},
    Author = {Washburne, Alex D and Burby, Joshua W and Lacker, Daniel},
    Date-Added = {2016-10-13 12:33:11 +0000},
    Date-Modified = {2016-10-13 12:33:11 +0000},
    Doi = {10.1371/journal.pcbi.1005124},
    Journal = {PLoS Comput Biol},
    Journal-Full = {PLoS computational biology},
    Month = {Sep},
    Number = {9},
    Pages = {e1005124},
    Pmid = {27689714},
    Pst = {epublish},
    Title = {Novel Covariance-Based Neutrality Test of Time-Series Data Reveals Asymmetries in Ecological and Economic Systems},
    Volume = {12},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1371/journal.pcbi.1005124}}

  • S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, and D. S. Modha, “Convolutional networks for fast, energy-efficient neuromorphic computing,” Proc Natl Acad Sci U S A, vol. 113, iss. 41, pp. 11441-11446, 2016. doi:10.1073/pnas.1604850113
    [BibTeX] [Abstract]

    Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that (i) approach state-of-the-art classification accuracy across eight standard datasets encompassing vision and speech, (ii) perform inference while preserving the hardware’s underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1,200 and 2,600 frames/s and using between 25 and 275 mW (effectively >6,000 frames/s per Watt), and (iii) can be specified and trained using backpropagation with the same ease-of-use as contemporary deep learning. This approach allows the algorithmic power of deep learning to be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer.

    @article{Esser:2016uq,
    Abstract = {Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that (i) approach state-of-the-art classification accuracy across eight standard datasets encompassing vision and speech, (ii) perform inference while preserving the hardware's underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1,200 and 2,600 frames/s and using between 25 and 275 mW (effectively >6,000 frames/s per Watt), and (iii) can be specified and trained using backpropagation with the same ease-of-use as contemporary deep learning. This approach allows the algorithmic power of deep learning to be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer.},
    Author = {Esser, Steven K and Merolla, Paul A and Arthur, John V and Cassidy, Andrew S and Appuswamy, Rathinakumar and Andreopoulos, Alexander and Berg, David J and McKinstry, Jeffrey L and Melano, Timothy and Barch, Davis R and di Nolfo, Carmelo and Datta, Pallab and Amir, Arnon and Taba, Brian and Flickner, Myron D and Modha, Dharmendra S},
    Date-Added = {2016-10-13 12:24:35 +0000},
    Date-Modified = {2016-10-13 12:24:35 +0000},
    Doi = {10.1073/pnas.1604850113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {TrueNorth; convolutional network; neural network; neuromorphic},
    Month = {Oct},
    Number = {41},
    Pages = {11441-11446},
    Pmid = {27651489},
    Pst = {ppublish},
    Title = {Convolutional networks for fast, energy-efficient neuromorphic computing},
    Volume = {113},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1604850113}}

  • Z. Zhou, Y. Dang, M. Zhou, L. Li, C. Yu, J. Fu, S. Chen, and Y. Liu, “Codon usage is an important determinant of gene expression levels largely through its effects on transcription,” Proc Natl Acad Sci U S A, vol. 113, iss. 41, p. E6117-E6125, 2016. doi:10.1073/pnas.1606724113
    [BibTeX] [Abstract]

    Codon usage biases are found in all eukaryotic and prokaryotic genomes, and preferred codons are more frequently used in highly expressed genes. The effects of codon usage on gene expression were previously thought to be mainly mediated by its impacts on translation. Here, we show that codon usage strongly correlates with both protein and mRNA levels genome-wide in the filamentous fungus Neurospora. Gene codon optimization also results in strong up-regulation of protein and RNA levels, suggesting that codon usage is an important determinant of gene expression. Surprisingly, we found that the impact of codon usage on gene expression results mainly from effects on transcription and is largely independent of mRNA translation and mRNA stability. Furthermore, we show that histone H3 lysine 9 trimethylation is one of the mechanisms responsible for the codon usage-mediated transcriptional silencing of some genes with nonoptimal codons. Together, these results uncovered an unexpected important role of codon usage in ORF sequences in determining transcription levels and suggest that codon biases are an adaptation of protein coding sequences to both transcription and translation machineries. Therefore, synonymous codons not only specify protein sequences and translation dynamics, but also help determine gene expression levels.

    @article{Zhou:2016fk,
    Abstract = {Codon usage biases are found in all eukaryotic and prokaryotic genomes, and preferred codons are more frequently used in highly expressed genes. The effects of codon usage on gene expression were previously thought to be mainly mediated by its impacts on translation. Here, we show that codon usage strongly correlates with both protein and mRNA levels genome-wide in the filamentous fungus Neurospora. Gene codon optimization also results in strong up-regulation of protein and RNA levels, suggesting that codon usage is an important determinant of gene expression. Surprisingly, we found that the impact of codon usage on gene expression results mainly from effects on transcription and is largely independent of mRNA translation and mRNA stability. Furthermore, we show that histone H3 lysine 9 trimethylation is one of the mechanisms responsible for the codon usage-mediated transcriptional silencing of some genes with nonoptimal codons. Together, these results uncovered an unexpected important role of codon usage in ORF sequences in determining transcription levels and suggest that codon biases are an adaptation of protein coding sequences to both transcription and translation machineries. Therefore, synonymous codons not only specify protein sequences and translation dynamics, but also help determine gene expression levels.},
    Author = {Zhou, Zhipeng and Dang, Yunkun and Zhou, Mian and Li, Lin and Yu, Chien-Hung and Fu, Jingjing and Chen, She and Liu, Yi},
    Date-Added = {2016-10-13 11:26:27 +0000},
    Date-Modified = {2016-10-13 11:26:27 +0000},
    Doi = {10.1073/pnas.1606724113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {Neurospora; codon usage; transcription},
    Month = {Oct},
    Number = {41},
    Pages = {E6117-E6125},
    Pmid = {27671647},
    Pst = {ppublish},
    Title = {Codon usage is an important determinant of gene expression levels largely through its effects on transcription},
    Volume = {113},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1606724113}}

publications 2016-09-27

In this group meeting, we quickly discussed these latest papers:

  • A. Gitter, F. Huang, R. Valluvan, and E. Fraenkel, “Unsupervised learning of transcriptional regulatory networks via latent tree graphical models,” arXiv:1609.06335v1, pp. 1-37.
    [BibTeX]
    @article{Gitter,
    archivePrefix = {arXiv},
    arxivId = {arXiv:1609.06335v1},
    author = {Gitter, Anthony and Huang, Furong and Valluvan, Ragupathyraj and Fraenkel, Ernest},
    eprint = {arXiv:1609.06335v1},
    file = {:Users/david/PhD/literature/paper/Gitter2016.pdf:pdf},
    pages = {1--37},
    title = {{Unsupervised learning of transcriptional regulatory networks via latent tree graphical models}}
    }

  • A. S. Hansen, L. Huang, L. Pauleve, C. Zechner, M. Unger, A. S. Hansen, and H. Koeppl, “Reconstructing dynamic molecular states from single-cell time series,” , iss. September, 2016.
    [BibTeX]
    @article{Hansen2016,
    author = {Hansen, Anders Sejr and Huang, Lirong and Pauleve, Loic and Zechner, Christoph and Unger, Michael and Hansen, Anders S and Koeppl, Heinz},
    file = {:Users/david/PhD/literature/paper/Huang2016.pdf:pdf},
    isbn = {0000000264},
    number = {September},
    title = {{Reconstructing dynamic molecular states from single-cell time series}},
    year = {2016}
    }

  • S. J. Lam, N. M. O’Brien-Simpson, N. Pantarat, A. Sulistio, E. H. Wong, Y. Chen, J. C. Lenzo, J. A. Holden, A. Blencowe, E. C. Reynolds, and others, “Combating multidrug-resistant Gram-negative bacteria with structurally nanoengineered antimicrobial peptide polymers,” Nature microbiology, vol. 1, p. 16162, 2016.
    [BibTeX]
    @article{lam2016combating,
    title={Combating multidrug-resistant Gram-negative bacteria with structurally nanoengineered antimicrobial peptide polymers},
    author={Lam, Shu J and O'Brien-Simpson, Neil M and Pantarat, Namfon and Sulistio, Adrian and Wong, Edgar HH and Chen, Yu-Yen and Lenzo, Jason C and Holden, James A and Blencowe, Anton and Reynolds, Eric C and others},
    journal={Nature Microbiology},
    volume={1},
    pages={16162},
    year={2016},
    publisher={Nature Publishing Group}
    }

  • J. Sztuba-Solinska, L. Diaz, M. R. Kumar, G. Kolb, M. R. Wiley, L. Jozwick, J. H. Kuhn, G. Palacios, S. R. Radoshitzky, S. F. Le Grice, and others, “A small stem-loop structure of the Ebola virus trailer is essential for replication and interacts with heat-shock protein A8,” Nucleic acids research, p. gkw825, 2016.
    [BibTeX]
    @article{sztuba2016small,
    title={A small stem-loop structure of the Ebola virus trailer is essential for replication and interacts with heat-shock protein A8},
    author={Sztuba-Solinska, Joanna and Diaz, Larissa and Kumar, Mia R and Kolb, Ga{\"e}lle and Wiley, Michael R and Jozwick, Lucas and Kuhn, Jens H and Palacios, Gustavo and Radoshitzky, Sheli R and Le Grice, Stuart FJ and others},
    journal={Nucleic Acids Research},
    pages={gkw825},
    year={2016},
    publisher={Oxford Univ Press}
    }

  • T. Sutthibutpong, C. Matek, C. Benham, G. G. Slade, A. Noy, C. Laughton, J. P. Doye, A. A. Louis, and S. A. Harris, “Long-range correlations in the mechanics of small DNA circles under topological stress revealed by multi-scale simulation,” Nucleic acids research, p. gkw815, 2016.
    [BibTeX]
    @article{sutthibutpong2016long,
    title={Long-range correlations in the mechanics of small DNA circles under topological stress revealed by multi-scale simulation},
    author={Sutthibutpong, Thana and Matek, Christian and Benham, Craig and Slade, Gabriel G and Noy, Agnes and Laughton, Charles and Doye, Jonathan PK and Louis, Ard A and Harris, Sarah A},
    journal={Nucleic Acids Research},
    pages={gkw815},
    year={2016},
    publisher={Oxford Univ Press}
    }

  • M. I. Love, J. B. Hogenesch, and R. A. Irizarry, “Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation,” Nat Biotech, 2016.
    [BibTeX] [Abstract]

    We find that current computational methods for estimating transcript abundance from RNA-seq data can lead to hundreds of false-positive results. We show that these systematic errors stem largely from a failure to model fragment GC content bias. Sample-specific biases associated with fragment sequence features lead to misidentification of transcript isoforms. We introduce alpine, a method for estimating sample-specific bias-corrected transcript abundance. By incorporating fragment sequence features, alpine greatly increases the accuracy of transcript abundance estimates, enabling a fourfold reduction in the number of false positives for reported changes in expression compared with Cufflinks. Using simulated data, we also show that alpine retains the ability to discover true positives, similar to other approaches. The method is available as an R/Bioconductor package that includes data visualization tools useful for bias discovery.

    @article{love2015modeling,
    title={Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation},
    author={Love, Michael I and Hogenesch, John B and Irizarry, Rafael A},
    journal={Nat Biotech},
    year={2016},
    abstract={We find that current computational methods for estimating transcript abundance from RNA-seq data can lead to hundreds of false-positive results. We show that these systematic errors stem largely from a failure to model fragment GC content bias. Sample-specific biases associated with fragment sequence features lead to misidentification of transcript isoforms. We introduce alpine, a method for estimating sample-specific bias-corrected transcript abundance. By incorporating fragment sequence features, alpine greatly increases the accuracy of transcript abundance estimates, enabling a fourfold reduction in the number of false positives for reported changes in expression compared with Cufflinks. Using simulated data, we also show that alpine retains the ability to discover true positives, similar to other approaches. The method is available as an R/Bioconductor package that includes data visualization tools useful for bias discovery.}
    }

  • B. T. Moghadam, M. Dabrowski, B. Kaminska, M. G. Grabherr, and J. Komorowski, “Combinatorial identification of DNA methylation patterns over age in the human brain,” BMC bioinformatics, vol. 17, iss. 1, p. 393, 2016.
    [BibTeX] [Abstract]

    Background DNA methylation plays a key role in developmental processes, which is reflected in changing methylation patterns at specific CpG sites over the lifetime of an individual. The underlying mechanisms are complex and possibly affect multiple genes or entire pathways. Results We applied a multivariate approach to identify combinations of CpG sites that undergo modifications when transitioning between developmental stages. Monte Carlo feature selection produced a list of ranked and statistically significant CpG sites, while rule-based models allowed for identifying particular methylation changes in these sites. Our rule-based classifier reports combinations of CpG sites, together with changes in their methylation status in the form of easy-to-read IF-THEN rules, which allows for identification of the genes associated with the underlying sites. Conclusion We utilized machine learning and statistical methods to discretize decision class (age) values to get a general pattern of methylation changes over the lifespan. The CpG sites present in the significant rules were annotated to genes involved in brain formation, general development, as well as genes linked to cancer and Alzheimer’s disease.

    @article{moghadam2016combinatorial,
    title={Combinatorial identification of DNA methylation patterns over age in the human brain},
    author={Moghadam, Behrooz Torabi and Dabrowski, Michal and Kaminska, Bozena and Grabherr, Manfred G and Komorowski, Jan},
    journal={BMC Bioinformatics},
    volume={17},
    number={1},
    pages={393},
    year={2016},
    publisher={Springer},
    abstract={Background
    DNA methylation plays a key role in developmental processes, which is reflected in changing methylation patterns at specific CpG sites over the lifetime of an individual. The underlying mechanisms are complex and possibly affect multiple genes or entire pathways.
    Results
    We applied a multivariate approach to identify combinations of CpG sites that undergo modifications when transitioning between developmental stages. Monte Carlo feature selection produced a list of ranked and statistically significant CpG sites, while rule-based models allowed for identifying particular methylation changes in these sites.
    Our rule-based classifier reports combinations of CpG sites, together with changes in their methylation status in the form of easy-to-read IF-THEN rules, which allows for identification of the genes associated with the underlying sites.
    Conclusion
    We utilized machine learning and statistical methods to discretize decision class (age) values to get a general pattern of methylation changes over the lifespan. The CpG sites present in the significant rules were annotated to genes involved in brain formation, general development, as well as genes linked to cancer and Alzheimer’s disease.}
    }

  • P. Wulfridge, B. Langmead, A. P. Feinberg, and K. Hansen, “Choice of reference genome can introduce massive bias in bisulfite sequencing data,” bioRxiv, p. 076844, 2016.
    [BibTeX] [Abstract]

    Mapping bias can be introduced in analysis of short read sequencing data, if sequence reads are aligned to a different genome than the sample genome. Here we study mapping bias in whole-genome bisulfite sequencing using data from inbred mice. We show that the choice of reference genome used for alignment can profoundly impact the inferred methylation state, both for high and low resolution analyses. This bias can result in wrongly identifying thousands of differentially methylated regions and hundreds of megabases of large-scale methylation differences. We show that the direction of these biased methylation differences can be reversed by changing the reference genome, clearly establishing mapping bias as a primary cause. We develop a strategy termed personalize-then-smooth for removing the bias by coupling alignment to personal genomes, with post-alignment smoothing. The smoothing step can be viewed as imputation, and allows a differential analysis to include methylation sites which are only present in some samples. Our results have important implications for analysis of bisulfite converted DNA.

    @article{wulfridge2016choice,
    title={Choice of reference genome can introduce massive bias in bisulfite sequencing data},
    author={Wulfridge, Phillip and Langmead, Ben and Feinberg, Andrew P and Hansen, Kasper},
    journal={bioRxiv},
    pages={076844},
    year={2016},
    publisher={Cold Spring Harbor Labs Journals},
    abstract={Mapping bias can be introduced in analysis of short read sequencing data, if sequence reads are aligned to a different genome than the sample genome. Here we study mapping bias in whole-genome bisulfite sequencing using data from inbred mice. We show that the choice of reference genome used for alignment can profoundly impact the inferred methylation state, both for high and low resolution analyses. This bias can result in wrongly identifying thousands of differentially methylated regions and hundreds of megabases of large-scale methylation differences. We show that the direction of these biased methylation differences can be reversed by changing the reference genome, clearly establishing mapping bias as a primary cause. We develop a strategy termed personalize-then-smooth for removing the bias by coupling alignment to personal genomes, with post-alignment smoothing. The smoothing step can be viewed as imputation, and allows a differential analysis to include methylation sites which are only present in some samples. Our results have important implications for analysis of bisulfite converted DNA.}
    }

  • J. Platig, P. Castaldi, D. DeMeo, and J. Quackenbush, “Bipartite community structure of eQTLs,” PLoS computational biology, 2016.
    [BibTeX] [Abstract]

    Genome Wide Association Studies (GWAS) and eQTL analyses have produced a large and growing number of genetic associations linked to a wide range of human phenotypes. As of 2013, there were more than 11,000 SNPs associated with a trait as reported in the NHGRI GWAS Catalog. However, interpreting the functional roles played by these SNPs remains a challenge. Here we describe an approach that uses the inherent bipartite structure of eQTL networks to place SNPs into a functional context. Using genotyping and gene expression data from 163 lung tissue samples in a study of Chronic Obstructive Pulmonary Disease (COPD) we calculated eQTL associations between SNPs and genes and cast significant associations (FDR <0.1) as links in a bipartite network. To our surprise, we discovered that the highly-connected "hub" SNPs within the network were devoid of disease-associations. However, within the network we identified 35 highly modular communities, which comprise groups of SNPs associated with groups of genes; 13 of these communities were significantly enriched for distinct biological functions (P <5×10−4) including COPD-related functions. Further, we found that GWAS-significant SNPs were enriched at the cores of these communities, including previously identified GWAS associations for COPD, asthma, and pulmonary function, among others. These results speak to our intuition: rather than single SNPs influencing single genes, we see groups of SNPs associated with the expression of families of functionally related genes and that disease SNPs are associated with the perturbation of those functions. These methods are not limited in their application to COPD and can be used in the analysis of a wide variety of disease processes and other phenotypic traits.

    @article{platig2015bipartite,
    title={Bipartite Community Structure of eQTLs},
    author={Platig, John and Castaldi, Peter and DeMeo, Dawn and Quackenbush, John},
    journal={PLOS Computational Biology},
    year={2016},
    abstract={ Genome Wide Association Studies (GWAS) and eQTL analyses have produced a large and growing number of genetic associations linked to a wide range of human phenotypes. As of 2013, there were more than 11,000 SNPs associated with a trait as reported in the NHGRI GWAS Catalog. However, interpreting the functional roles played by these SNPs remains a challenge. Here we describe an approach that uses the inherent bipartite structure of eQTL networks to place SNPs into a functional context.
    Using genotyping and gene expression data from 163 lung tissue samples in a study of Chronic Obstructive Pulmonary Disease (COPD) we calculated eQTL associations between SNPs and genes and cast significant associations (FDR $<$ 0.1) as links in a bipartite network. To our surprise, we discovered that the highly-connected "hub" SNPs within the network were devoid of disease-associations. However, within the network we identified 35 highly modular communities, which comprise groups of SNPs associated with groups of genes; 13 of these communities were significantly enriched for distinct biological functions ($P < 5 \times 10^{-4}$) including COPD-related functions. Further, we found that GWAS-significant SNPs were enriched at the cores of these communities, including previously identified GWAS associations for COPD, asthma, and pulmonary function, among others. These results speak to our intuition: rather than single SNPs influencing single genes, we see groups of SNPs associated with the expression of families of functionally related genes and that disease SNPs are associated with the perturbation of those functions. These methods are not limited in their application to COPD and can be used in the analysis of a wide variety of disease processes and other phenotypic traits. }
    }

  • Y. Jiang, Y. Qiu, A. J. Minn, and N. R. Zhang, “Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing,” Proc natl acad sci u s a, vol. 113, iss. 37, p. E5528-37, 2016. doi:10.1073/pnas.1522203113

    Cancer is a disease driven by evolutionary selection on somatic genetic and epigenetic alterations. Here, we propose Canopy, a method for inferring the evolutionary phylogeny of a tumor using both somatic copy number alterations and single-nucleotide alterations from one or more samples derived from a single patient. Canopy is applied to bulk sequencing datasets of both longitudinal and spatial experimental designs and to a transplantable metastasis model derived from human cancer cell line MDA-MB-231. Canopy successfully identifies cell populations and infers phylogenies that are in concordance with existing knowledge and ground truth. Through simulations, we explore the effects of key parameters on deconvolution accuracy and compare against existing methods. Canopy is an open-source R package available at //cran.r-project.org/web/packages/Canopy/.

    @article{Jiang:2016nx,
    Abstract = {Cancer is a disease driven by evolutionary selection on somatic genetic and epigenetic alterations. Here, we propose Canopy, a method for inferring the evolutionary phylogeny of a tumor using both somatic copy number alterations and single-nucleotide alterations from one or more samples derived from a single patient. Canopy is applied to bulk sequencing datasets of both longitudinal and spatial experimental designs and to a transplantable metastasis model derived from human cancer cell line MDA-MB-231. Canopy successfully identifies cell populations and infers phylogenies that are in concordance with existing knowledge and ground truth. Through simulations, we explore the effects of key parameters on deconvolution accuracy and compare against existing methods. Canopy is an open-source R package available at //cran.r-project.org/web/packages/Canopy/.},
    Author = {Jiang, Yuchao and Qiu, Yu and Minn, Andy J and Zhang, Nancy R},
    Date-Added = {2016-09-26 10:26:33 +0000},
    Date-Modified = {2016-09-26 10:26:33 +0000},
    Doi = {10.1073/pnas.1522203113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {cancer evolution; cancer genomics; clonal deconvolution; intratumor heterogeneity; phylogeny inference},
    Month = {Sep},
    Number = {37},
    Pages = {E5528-37},
    Pmid = {27573852},
    Pst = {ppublish},
    Title = {Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing},
    Volume = {113},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1522203113}}

  • G. D. Potter, T. A. Byrd, A. Mugler, and B. Sun, “Communication shapes sensory response in multicellular networks,” Proc natl acad sci u s a, vol. 113, iss. 37, pp. 10334-9, 2016. doi:10.1073/pnas.1605559113

    Collective sensing by interacting cells is observed in a variety of biological systems, and yet, a quantitative understanding of how sensory information is collectively encoded is lacking. Here, we investigate the ATP-induced calcium dynamics of monolayers of fibroblast cells that communicate via gap junctions. Combining experiments and stochastic modeling, we find that increasing the ATP stimulus increases the propensity for calcium oscillations, despite large cell-to-cell variability. The model further predicts that the oscillation propensity increases with not only the stimulus, but also the cell density due to increased communication. Experiments confirm this prediction, showing that cell density modulates the collective sensory response. We further implicate cell-cell communication by coculturing the fibroblasts with cancer cells, which we show act as "defects" in the communication network, thereby reducing the oscillation propensity. These results suggest that multicellular networks sit at a point in parameter space where cell-cell communication has a significant effect on the sensory response, allowing cells to simultaneously respond to a sensory input and the presence of neighbors.

    @article{Potter:2016cr,
    Abstract = {Collective sensing by interacting cells is observed in a variety of biological systems, and yet, a quantitative understanding of how sensory information is collectively encoded is lacking. Here, we investigate the ATP-induced calcium dynamics of monolayers of fibroblast cells that communicate via gap junctions. Combining experiments and stochastic modeling, we find that increasing the ATP stimulus increases the propensity for calcium oscillations, despite large cell-to-cell variability. The model further predicts that the oscillation propensity increases with not only the stimulus, but also the cell density due to increased communication. Experiments confirm this prediction, showing that cell density modulates the collective sensory response. We further implicate cell-cell communication by coculturing the fibroblasts with cancer cells, which we show act as "defects" in the communication network, thereby reducing the oscillation propensity. These results suggest that multicellular networks sit at a point in parameter space where cell-cell communication has a significant effect on the sensory response, allowing cells to simultaneously respond to a sensory input and the presence of neighbors.},
    Author = {Potter, Garrett D and Byrd, Tommy A and Mugler, Andrew and Sun, Bo},
    Date-Added = {2016-09-26 10:25:01 +0000},
    Date-Modified = {2016-09-26 10:25:01 +0000},
    Doi = {10.1073/pnas.1605559113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {calcium oscillations; cellular sensing; cell--cell communication; collective behavior; gap junctions},
    Month = {Sep},
    Number = {37},
    Pages = {10334-9},
    Pmid = {27573834},
    Pst = {ppublish},
    Title = {Communication shapes sensory response in multicellular networks},
    Volume = {113},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1605559113}}

  • J. M. Vaquerizas and M. Torres-Padilla, “Developmental biology: panoramic views of the early epigenome,” Nature, vol. 537, iss. 7621, pp. 494-496, 2016. doi:10.1038/nature19468
    @article{Vaquerizas:2016dq,
    Author = {Vaquerizas, Juan M and Torres-Padilla, Maria-Elena},
    Date-Added = {2016-09-26 08:53:52 +0000},
    Date-Modified = {2016-09-26 08:53:52 +0000},
    Doi = {10.1038/nature19468},
    Journal = {Nature},
    Journal-Full = {Nature},
    Month = {Sep},
    Number = {7621},
    Pages = {494-496},
    Pmid = {27626372},
    Pst = {aheadofprint},
    Title = {Developmental biology: Panoramic views of the early epigenome},
    Volume = {537},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1038/nature19468}}

  • C. Shao and T. Höfer, “Robust classification of single-cell transcriptome data by nonnegative matrix factorization,” Bioinformatics, 2016. doi:10.1093/bioinformatics/btw607

    MOTIVATION: Single-cell transcriptome data provide unprecedented resolution to study heterogeneity in cell populations and present a challenge for unsupervised classification. Popular methods, like principal component analysis (PCA), often suffer from the high level of noise in the data. RESULTS: Here we adapt Nonnegative Matrix Factorization (NMF) to study the problem of identifying subpopulations in single-cell transcriptome data. In contrast to the conventional gene-centered view of NMF, identifying metagenes, we used NMF in a cell-centered direction, identifying cell subtypes ("metacells"). Using three different data sets (based on RT-qPCR and single cell RNA-seq data, respectively), we show that NMF outperforms PCA in identifying subpopulations in an accurate and robust way, without the need for prior feature selection; moreover, NMF successfully recovered the broad classes on a large data set (thousands of single-cell transcriptomes), as identified by a computationally sophisticated method. NMF allows to identify feature genes in a direct, unbiased manner. We propose novel approaches for determining a biologically meaningful number of subpopulations based on minimizing the ambiguity of classification. In conclusion, our study shows that NMF is a robust, informative and simple method for the unsupervised learning of cell subtypes from single-cell gene expression data. AVAILABILITY: //github.com/ccshao/nimfa CONTACT: c.shao@Dkfz-Heidelberg.de, t.hoefer@Dkfz-Heidelberg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    @article{Shao:2016bh,
    Abstract = {MOTIVATION: Single-cell transcriptome data provide unprecedented resolution to study heterogeneity in cell populations and present a challenge for unsupervised classification. Popular methods, like principal component analysis (PCA), often suffer from the high level of noise in the data.
    RESULTS: Here we adapt Nonnegative Matrix Factorization (NMF) to study the problem of identifying subpopulations in single-cell transcriptome data. In contrast to the conventional gene-centered view of NMF, identifying metagenes, we used NMF in a cell-centered direction, identifying cell subtypes ("metacells"). Using three different data sets (based on RT-qPCR and single cell RNA-seq data, respectively), we show that NMF outperforms PCA in identifying subpopulations in an accurate and robust way, without the need for prior feature selection; moreover, NMF successfully recovered the broad classes on a large data set (thousands of single-cell transcriptomes), as identified by a computationally sophisticated method. NMF allows to identify feature genes in a direct, unbiased manner. We propose novel approaches for determining a biologically meaningful number of subpopulations based on minimizing the ambiguity of classification. In conclusion, our study shows that NMF is a robust, informative and simple method for the unsupervised learning of cell subtypes from single-cell gene expression data.
    AVAILABILITY: //github.com/ccshao/nimfa CONTACT: c.shao@Dkfz-Heidelberg.de, t.hoefer@Dkfz-Heidelberg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.},
    Author = {Shao, Chunxuan and H{\"o}fer, Thomas},
    Date-Added = {2016-09-26 08:44:00 +0000},
    Date-Modified = {2016-09-26 08:44:00 +0000},
    Doi = {10.1093/bioinformatics/btw607},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {Sep},
    Pmid = {27663498},
    Pst = {aheadofprint},
    Title = {Robust classification of single-cell transcriptome data by nonnegative matrix factorization},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw607}}

  • J. G. Azofeifa and R. D. Dowell, “A generative model for the behavior of rna polymerase,” Bioinformatics, 2016. doi:10.1093/bioinformatics/btw599

    MOTIVATION: Transcription by RNA polymerases is a highly dynamic process involving multiple distinct points of regulation. Nascent transcription assays are a relatively new set of high throughput techniques that measure the location of actively engaged RNA polymerase genome wide. Hence, nascent transcription is a rich source of information on the regulation of RNA polymerase activity. To fully dissect this data requires the development of stochastic models that can both deconvolve the stages of polymerase activity and identify significant changes in activity between experiments. RESULTS: We present a generative, probabilistic model of RNA polymerase that fully describes loading, initiation, elongation and termination. We fit this model genome wide and profile the enzymatic activity of RNA polymerase across various loci and following experimental perturbation. We observe striking correlation of predicted loading events and regulatory chromatin marks. We provide principled statistics that compute probabilities reminiscent of traveler’s and divergent ratios. We finish with a systematic comparison of RNA Polymerase activity at promoter vs non-promoter associated loci. AVAILABILITY: Transcription Fit (Tfit) is a freely available, open source software package written in C/C++ that requires GNU compilers 4.7.3 or greater. Tfit is available from GitHub (//github.com/azofeifa/Tfit). CONTACT: robin.dowell@colorado.edu.

    @article{Azofeifa:2016qf,
    Abstract = {MOTIVATION: Transcription by RNA polymerases is a highly dynamic process involving multiple distinct points of regulation. Nascent transcription assays are a relatively new set of high throughput techniques that measure the location of actively engaged RNA polymerase genome wide. Hence, nascent transcription is a rich source of information on the regulation of RNA polymerase activity. To fully dissect this data requires the development of stochastic models that can both deconvolve the stages of polymerase activity and identify significant changes in activity between experiments.
    RESULTS: We present a generative, probabilistic model of RNA polymerase that fully describes loading, initiation, elongation and termination. We fit this model genome wide and profile the enzymatic activity of RNA polymerase across various loci and following experimental perturbation. We observe striking correlation of predicted loading events and regulatory chromatin marks. We provide principled statistics that compute probabilities reminiscent of traveler's and divergent ratios. We finish with a systematic comparison of RNA Polymerase activity at promoter vs non-promoter associated loci.
    AVAILABILITY: Transcription Fit (Tfit) is a freely available, open source software package written in C/C++ that requires GNU compilers 4.7.3 or greater. Tfit is available from GitHub (//github.com/azofeifa/Tfit).
    CONTACT: robin.dowell@colorado.edu.},
    Author = {Azofeifa, Joseph G and Dowell, Robin D},
    Date-Added = {2016-09-26 08:41:51 +0000},
    Date-Modified = {2016-09-26 08:41:51 +0000},
    Doi = {10.1093/bioinformatics/btw599},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {Sep},
    Pmid = {27663494},
    Pst = {aheadofprint},
    Title = {A generative model for the behavior of RNA polymerase},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw599}}

publications 2016-09-13

In this group meeting, we quickly discussed these latest papers:

  • B. Thienpont, J. Steinbacher, H. Zhao, F. D’Anna, A. Kuchnio, A. Ploumakis, B. Ghesquière, L. Van Dyck, B. Boeckx, L. Schoonjans, and others, “Tumour hypoxia causes dna hypermethylation by reducing tet activity,” Nature, 2016.
    @article{thienpont2016tumour,
    title={Tumour hypoxia causes DNA hypermethylation by reducing TET activity},
    author={Thienpont, Bernard and Steinbacher, Jessica and Zhao, Hui and D'Anna, Flora and Kuchnio, Anna and Ploumakis, Athanasios and Ghesqui{\`e}re, Bart and Van Dyck, Laurien and Boeckx, Bram and Schoonjans, Luc and others},
    journal={Nature},
    year={2016},
    publisher={Nature Research}
    }

  • Q. Zhao, J. Zhang, R. Chen, L. Wang, B. Li, H. Cheng, X. Duan, H. Zhu, W. Wei, J. Li, and others, “Dissecting the precise role of h3k9 methylation in crosstalk with dna maintenance methylation in mammals,” Nature communications, vol. 7, 2016.
    @article{zhao2016dissecting,
    title={Dissecting the precise role of H3K9 methylation in crosstalk with DNA maintenance methylation in mammals},
    author={Zhao, Qian and Zhang, Jiqin and Chen, Ruoyu and Wang, Lina and Li, Bo and Cheng, Hao and Duan, Xiaoya and Zhu, Haijun and Wei, Wei and Li, Jiwen and others},
    journal={Nature Communications},
    volume={7},
    year={2016},
    publisher={Nature Research}
    }

  • H. Zhu, G. Wang, and J. Qian, “Transcription factors as readers and effectors of dna methylation,” Nature reviews genetics, vol. 17, iss. 9, pp. 551-565, 2016.
    @article{zhu2016transcription,
    title={Transcription factors as readers and effectors of DNA methylation},
    author={Zhu, Heng and Wang, Guohua and Qian, Jiang},
    journal={Nature Reviews Genetics},
    volume={17},
    number={9},
    pages={551--565},
    year={2016},
    publisher={Nature Research}
    }

  • S. Kim, S. Oesterreich, S. Kim, Y. Park, and G. C. Tseng, “Integrative clustering of multi-level omics data for disease subtype discovery using sequential double regularization,” Biostatistics, p. kxw039, 2016.
    @article{kim2016integrative,
    title={Integrative clustering of multi-level omics data for disease subtype discovery using sequential double regularization},
    author={Kim, Sunghwan and Oesterreich, Steffi and Kim, Seyoung and Park, Yongseok and Tseng, George C},
    journal={Biostatistics},
    pages={kxw039},
    year={2016},
    publisher={Biometrika Trust}
    }

  • T. R. Pisanic, P. Athamanolap, and T. Wang, “Defining, distinguishing and detecting the contribution of heterogeneous methylation to cancer heterogeneity,” Seminars in cell & developmental biology, 2016.
    @article{pisanic2016defining,
    title={Defining, distinguishing and detecting the contribution of heterogeneous methylation to cancer heterogeneity},
    author={Pisanic, Thomas R and Athamanolap, Pornpat and Wang, Tza-Huei},
    journal={Seminars in Cell \& Developmental Biology},
    year={2016},
    publisher={Elsevier}
    }

  • L. Haghverdi, M. Buettner, A. F. Wolf, F. Buettner, and F. J. Theis, “Diffusion pseudotime robustly reconstructs lineage branching,” Biorxiv, p. 041384, 2016.
    @article{haghverdi2016diffusion,
    title={Diffusion pseudotime robustly reconstructs lineage branching},
    author={Haghverdi, Laleh and Buettner, Maren and Wolf, F Alexander and Buettner, Florian and Theis, Fabian J},
    journal={bioRxiv},
    pages={041384},
    year={2016},
    publisher={Cold Spring Harbor Labs Journals}
    }

  • C. Lester, C. A. Yates, and R. E. Baker, “Efficient parameter sensitivity computation for spatially-extended reaction networks,” arXiv:1608.08174, 2016.

    Reaction-diffusion models are widely used to study spatially-extended chemical reaction systems. In order to understand how the dynamics of a reaction-diffusion model are affected by changes in its input parameters, efficient methods for computing parametric sensitivities are required. In this work, we focus on stochastic models of spatially-extended chemical reaction systems that involve partitioning the computational domain into voxels. Parametric sensitivities are often calculated using Monte Carlo techniques that are typically computationally expensive; however, variance reduction techniques can decrease the number of Monte Carlo simulations required. By exploiting the characteristic dynamics of spatially-extended reaction networks, we are able to adapt existing finite difference schemes to robustly estimate parametric sensitivities in a spatially-extended network. We show that algorithmic performance depends on the dynamics of the given network and the choice of summary statistics. We then describe a hybrid technique that dynamically chooses the most appropriate simulation method for the network of interest. Our method is tested for functionality and accuracy in a range of different scenarios.

    @article{Lester2016,
    abstract = {Reaction-diffusion models are widely used to study spatially-extended chemical reaction systems. In order to understand how the dynamics of a reaction-diffusion model are affected by changes in its input parameters, efficient methods for computing parametric sensitivities are required. In this work, we focus on stochastic models of spatially-extended chemical reaction systems that involve partitioning the computational domain into voxels. Parametric sensitivities are often calculated using Monte Carlo techniques that are typically computationally expensive; however, variance reduction techniques can decrease the number of Monte Carlo simulations required. By exploiting the characteristic dynamics of spatially-extended reaction networks, we are able to adapt existing finite difference schemes to robustly estimate parametric sensitivities in a spatially-extended network. We show that algorithmic performance depends on the dynamics of the given network and the choice of summary statistics. We then describe a hybrid technique that dynamically chooses the most appropriate simulation method for the network of interest. Our method is tested for functionality and accuracy in a range of different scenarios.},
    archivePrefix = {arXiv},
    arxivId = {1608.08174},
    author = {Lester, Christopher and Yates, Christian A. and Baker, Ruth E.},
    eprint = {1608.08174},
    title = {{Efficient parameter sensitivity computation for spatially-extended reaction networks}},
    url = {//arxiv.org/abs/1608.08174},
    year = {2016}
    }

  • C. Ferwerda and O. Lipan, “Splitting Nodes and Linking Channels: A Method for Assembling Biocircuits from Stochastic Elementary Units,” arXiv:1608.04287, 2016.

    Akin to electric circuits, we construct biocircuits that are manipulated by cutting and assembling channels through which stochastic information flows. This diagrammatic manipulation allows us to create a method which constructs networks by joining building blocks selected so that (a) they cover only basic processes; (b) it is scalable to large networks; (c) the mean and variance-covariance from the Pauli master equation form a closed system and; (d) given the initial probability distribution, no special boundary conditions are necessary to solve the master equation. The method aims to help with both designing new synthetic signalling pathways and quantifying naturally existing regulatory networks.

    @article{Ferwerda2016,
    abstract = {Akin to electric circuits, we construct biocircuits that are manipulated by cutting and assembling channels through which stochastic information flows. This diagrammatic manipulation allows us to create a method which constructs networks by joining building blocks selected so that (a) they cover only basic processes; (b) it is scalable to large networks; (c) the mean and variance-covariance from the Pauli master equation form a closed system and; (d) given the initial probability distribution, no special boundary conditions are necessary to solve the master equation. The method aims to help with both designing new synthetic signalling pathways and quantifying naturally existing regulatory networks.},
    archivePrefix = {arXiv},
    arxivId = {1608.04287},
    author = {Ferwerda, Cameron and Lipan, Ovidiu},
    eprint = {1608.04287},
    title = {{Splitting Nodes and Linking Channels: A Method for Assembling Biocircuits from Stochastic Elementary Units}},
    url = {//arxiv.org/abs/1608.04287},
    year = {2016}
    }

  • D. Yang, L. McKenzie-Sell, A. Karanjai, and P. A. Robinson, “Wake-sleep transition as a noisy bifurcation,” Physical review e, vol. 94, iss. 2, p. 022412, 2016. doi:10.1103/PhysRevE.94.022412
    @article{Yang2016,
    author = {Yang, Dong-Ping and McKenzie-Sell, Lauren and Karanjai, Angela and Robinson, P. A.},
    doi = {10.1103/PhysRevE.94.022412},
    issn = {2470-0045},
    journal = {Physical Review E},
    number = {2},
    pages = {022412},
    title = {{Wake-sleep transition as a noisy bifurcation}},
    url = {//link.aps.org/doi/10.1103/PhysRevE.94.022412},
    volume = {94},
    year = {2016}
    }

  • P. K. Koo and S. G. J. Mochrie, “Systems-level approach to uncovering diffusive states and their transitions from single particle trajectories,” arXiv:1608.01419, pp. 1-17, 2016.

    The stochastic motions of a diffusing particle contain information concerning the particle’s interactions with binding partners and with its local environment. However, accurate determination of the underlying diffusive properties, beyond normal diffusion, has remained challenging when analyzing particle trajectories on an individual basis. Here, we introduce the maximum likelihood estimator (MLE) for confined diffusion and fractional Brownian motion. We demonstrate that this MLE yields improved estimation over traditional mean square displacement analyses. We also introduce a model selection scheme (that we call mleBIC) that classifies individual trajectories to a given diffusion mode. We demonstrate the statistical limitations of classification via mleBIC using simulated data. To overcome these limitations, we introduce a new version of perturbation expectation-maximization (pEMv2), which simultaneously analyzes a collection of particle trajectories to uncover the system of interactions which give rise to unique normal and/or non-normal diffusive states within the population. We test and evaluate the performance of pEMv2 on various sets of simulated particle trajectories, which transition among several modes of normal and non-normal diffusion, highlighting the key considerations for employing this analysis methodology.

    @article{Koo2016,
    abstract = {The stochastic motions of a diffusing particle contain information concerning the particle's interactions with binding partners and with its local environment. However, accurate determination of the underlying diffusive properties, beyond normal diffusion, has remained challenging when analyzing particle trajectories on an individual basis. Here, we introduce the maximum likelihood estimator (MLE) for confined diffusion and fractional Brownian motion. We demonstrate that this MLE yields improved estimation over traditional mean square displacement analyses. We also introduce a model selection scheme (that we call mleBIC) that classifies individual trajectories to a given diffusion mode. We demonstrate the statistical limitations of classification via mleBIC using simulated data. To overcome these limitations, we introduce a new version of perturbation expectation-maximization (pEMv2), which simultaneously analyzes a collection of particle trajectories to uncover the system of interactions which give rise to unique normal and/or non-normal diffusive states within the population. We test and evaluate the performance of pEMv2 on various sets of simulated particle trajectories, which transition among several modes of normal and non-normal diffusion, highlighting the key considerations for employing this analysis methodology.},
    archivePrefix = {arXiv},
    arxivId = {1608.01419},
    author = {Koo, Peter K. and Mochrie, Simon G. J.},
    eprint = {1608.01419},
    pages = {1--17},
    title = {{Systems-level approach to uncovering diffusive states and their transitions from single particle trajectories}},
    url = {//arxiv.org/abs/1608.01419},
    year = {2016}
    }

  • Z. Fox, G. Neuert, and B. Munsky, “Finite state projection based bounds to compare chemical master equation models using single-cell data,” The journal of chemical physics, vol. 145, iss. 7, p. 074101, 2016. doi:10.1063/1.4960505

    Emerging techniques now allow for precise quantification of distributions of biological molecules in single cells. These rapidly advancing experimental methods have created a need for more rigorous and efficient modeling tools. Here, we derive new bounds on the likelihood that observations of single-cell, single-molecule responses come from a discrete stochastic model, posed in the form of the chemical master equation. These strict upper and lower bounds are based on a finite state projection approach, and they converge monotonically to the exact likelihood value. These bounds allow one to discriminate rigorously between models and with a minimum level of computational effort. In practice, these bounds can be incorporated into stochastic model identification and parameter inference routines, which improve the accuracy and efficiency of endeavors to analyze and predict single-cell behavior. We demonstrate the applicability of our approach using simulated data for three example models as well as for experimental measurements of a time-varying stochastic transcriptional response in yeast.

    @article{Fox2016,
    abstract = {Emerging techniques now allow for precise quantification of distributions of biological molecules in single cells. These rapidly advancing experimental methods have created a need for more rigorous and efficient modeling tools. Here, we derive new bounds on the likelihood that observations of single-cell, single-molecule responses come from a discrete stochastic model, posed in the form of the chemical master equation. These strict upper and lower bounds are based on a finite state projection approach, and they converge monotonically to the exact likelihood value. These bounds allow one to discriminate rigorously between models and with a minimum level of computational effort. In practice, these bounds can be incorporated into stochastic model identification and parameter inference routines, which improve the accuracy and efficiency of endeavors to analyze and predict single-cell behavior. We demonstrate the applicability of our approach using simulated data for three example models as well as for experimental measurements of a time-varying stochastic transcriptional response in yeast.},
    author = {Fox, Zachary and Neuert, Gregor and Munsky, Brian},
    doi = {10.1063/1.4960505},
    file = {:Users/david/PhD/literature/paper/Fox2016.pdf:pdf},
    issn = {0021-9606},
    journal = {The Journal of Chemical Physics},
    number = {7},
    pages = {074101},
    title = {{Finite state projection based bounds to compare chemical master equation models using single-cell data}},
    url = {//scitation.aip.org/content/aip/journal/jcp/145/7/10.1063/1.4960505},
    volume = {145},
    year = {2016}
    }

  • C. K. Fisher and P. Mehta, “Bayesian feature selection for high-dimensional linear regression via the Ising approximation with applications to genomics,” Bioinformatics, vol. 31, iss. 11, pp. 1754-1761, 2015. doi:10.1093/bioinformatics/btv037
    [BibTeX] [Abstract]

    Motivation: Feature selection, identifying a subset of variables that are relevant for predicting a response, is an important and challenging component of many methods in statistics and machine learning. Feature selection is especially difficult and computationally intensive when the number of variables approaches or exceeds the number of samples, as is often the case for many genomic datasets. Results: Here, we introduce a new approach—the Bayesian Ising Approximation (BIA)—to rapidly calculate posterior probabilities for feature relevance in L2 penalized linear regression. In the regime where the regression problem is strongly regularized by the prior, we show that computing the marginal posterior probabilities for features is equivalent to computing the magnetizations of an Ising model with weak couplings. Using a mean field approximation, we show it is possible to rapidly compute the feature selection path described by the posterior probabilities as a function of the L2 penalty. We present simulations and analytical results illustrating the accuracy of the BIA on some simple regression problems. Finally, we demonstrate the applicability of the BIA to high-dimensional regression by analyzing a gene expression dataset with nearly 30 000 features. These results also highlight the impact of correlations between features on Bayesian feature selection. Availability and implementation: An implementation of the BIA in C++, along with data for reproducing our gene expression analyses, are freely available at

    @article{Fisher2015,
    abstract = {Motivation: Feature selection, identifying a subset of variables that are relevant for predicting a response, is an important and challenging component of many methods in statistics and machine learning. Feature selection is especially difficult and computationally intensive when the number of variables approaches or exceeds the number of samples, as is often the case for many genomic datasets. Results: Here, we introduce a new approach—the Bayesian Ising Approximation (BIA)—to rapidly calculate posterior probabilities for feature relevance in L2 penalized linear regression. In the regime where the regression problem is strongly regularized by the prior, we show that computing the marginal posterior probabilities for features is equivalent to computing the magnetizations of an Ising model with weak couplings. Using a mean field approximation, we show it is possible to rapidly compute the feature selection path described by the posterior probabilities as a function of the L2 penalty. We present simulations and analytical results illustrating the accuracy of the BIA on some simple regression problems. Finally, we demonstrate the applicability of the BIA to high-dimensional regression by analyzing a gene expression dataset with nearly 30 000 features. These results also highlight the impact of correlations between features on Bayesian feature selection. Availability and implementation: An implementation of the BIA in C++, along with data for reproducing our gene expression analyses, are freely available at},
    archivePrefix = {arXiv},
    arxivId = {1407.8187},
    author = {Fisher, Charles K. and Mehta, Pankaj},
    doi = {10.1093/bioinformatics/btv037},
    eprint = {1407.8187},
    file = {:Users/david/PhD/literature/paper/Fisher2015c.pdf:pdf},
    issn = {14602059},
    journal = {Bioinformatics},
    number = {11},
    pages = {1754--1761},
    pmid = {25619995},
    title = {{Bayesian feature selection for high-dimensional linear regression via the Ising approximation with applications to genomics}},
    volume = {31},
    year = {2015}
    }

  • M. Allhoff, K. Seré, J. F. Pires, M. Zenke, and I. G. Costa, “Differential peak calling of chip-seq signals with replicates with thor,” Nucleic acids research, 2016. doi:10.1093/nar/gkw680
    [BibTeX] [Abstract] [Download PDF]

    The study of changes in protein–DNA interactions measured by ChIP-seq on dynamic systems, such as cell differentiation, response to treatments or the comparison of healthy and diseased individuals, is still an open challenge. There are few computational methods comparing changes in ChIP-seq signals with replicates. Moreover, none of these previous approaches addresses ChIP-seq specific experimental artefacts arising from studies with biological replicates. We propose THOR, a Hidden Markov Model based approach, to detect differential peaks between pairs of biological conditions with replicates. THOR provides all pre- and post-processing steps required in ChIP-seq analyses. Moreover, we propose a novel normalization approach based on housekeeping genes to deal with cases where replicates have distinct signal-to-noise ratios. To evaluate differential peak calling methods, we delineate a methodology using both biological and simulated data. This includes an evaluation procedure that associates differential peaks with changes in gene expression as well as histone modifications close to these peaks. We evaluate THOR and seven competing methods on data sets with distinct characteristics from in vitro studies with technical replicates to clinical studies of cancer patients. Our evaluation analysis comprises of 13 comparisons between pairs of biological conditions. We show that THOR performs best in all scenarios.

    @article{Allhoff02082016,
    author = {Allhoff, Manuel and Seré, Kristin and Pires, Juliana F. and Zenke, Martin and Costa, Ivan G.},
    title = {Differential peak calling of ChIP-seq signals with replicates with THOR},
    year = {2016},
    doi = {10.1093/nar/gkw680},
    abstract ={The study of changes in protein–DNA interactions measured by ChIP-seq on dynamic systems, such as cell differentiation, response to treatments or the comparison of healthy and diseased individuals, is still an open challenge. There are few computational methods comparing changes in ChIP-seq signals with replicates. Moreover, none of these previous approaches addresses ChIP-seq specific experimental artefacts arising from studies with biological replicates. We propose THOR, a Hidden Markov Model based approach, to detect differential peaks between pairs of biological conditions with replicates. THOR provides all pre- and post-processing steps required in ChIP-seq analyses. Moreover, we propose a novel normalization approach based on housekeeping genes to deal with cases where replicates have distinct signal-to-noise ratios. To evaluate differential peak calling methods, we delineate a methodology using both biological and simulated data. This includes an evaluation procedure that associates differential peaks with changes in gene expression as well as histone modifications close to these peaks. We evaluate THOR and seven competing methods on data sets with distinct characteristics from in vitro studies with technical replicates to clinical studies of cancer patients. Our evaluation analysis comprises of 13 comparisons between pairs of biological conditions. We show that THOR performs best in all scenarios.},
    URL = {//nar.oxfordjournals.org/content/early/2016/08/01/nar.gkw680.abstract},
    eprint = {//nar.oxfordjournals.org/content/early/2016/08/01/nar.gkw680.full.pdf+html},
    journal = {Nucleic Acids Research}
    }

  • Y. Li and M. Kellis, “Joint bayesian inference of risk variants and tissue-specific epigenomic enrichments across multiple complex human diseases,” Nucleic acids research, 2016. doi:10.1093/nar/gkw627
    [BibTeX] [Abstract] [Download PDF]

    Genome wide association studies (GWAS) provide a powerful approach for uncovering disease-associated variants in human, but fine-mapping the causal variants remains a challenge. This is partly remedied by prioritization of disease-associated variants that overlap GWAS-enriched epigenomic annotations. Here, we introduce a new Bayesian model RiVIERA (Risk Variant Inference using Epigenomic Reference Annotations) for inference of driver variants from summary statistics across multiple traits using hundreds of epigenomic annotations. In simulation, RiVIERA showed promising power in detecting causal variants and causal annotations, the multi-trait joint inference further improved the detection power. We applied RiVIERA to model the existing GWAS summary statistics of 9 autoimmune diseases and Schizophrenia by jointly harnessing the potential causal enrichments among 848 tissue-specific epigenomics annotations from ENCODE/Roadmap consortium covering 127 cell/tissue types and 8 major epigenomic marks. RiVIERA identified meaningful tissue-specific enrichments for enhancer regions defined by H3K4me1 and H3K27ac for Blood T-Cell specifically in the nine autoimmune diseases and Brain-specific enhancer activities exclusively in Schizophrenia. Moreover, the variants from the 95% credible sets exhibited high conservation and enrichments for GTEx whole-blood eQTLs located within transcription-factor-binding-sites and DNA-hypersensitive-sites. Furthermore, joint modeling the nine immune traits by simultaneously inferring and exploiting the underlying epigenomic correlation between traits further improved the functional enrichments compared to single-trait models.

    @article{Li12072016,
    author = {Li, Yue and Kellis, Manolis},
    title = {Joint Bayesian inference of risk variants and tissue-specific epigenomic enrichments across multiple complex human diseases},
    year = {2016},
    doi = {10.1093/nar/gkw627},
    abstract ={Genome wide association studies (GWAS) provide a powerful approach for uncovering disease-associated variants in human, but fine-mapping the causal variants remains a challenge. This is partly remedied by prioritization of disease-associated variants that overlap GWAS-enriched epigenomic annotations. Here, we introduce a new Bayesian model RiVIERA (Risk Variant Inference using Epigenomic Reference Annotations) for inference of driver variants from summary statistics across multiple traits using hundreds of epigenomic annotations. In simulation, RiVIERA showed promising power in detecting causal variants and causal annotations, the multi-trait joint inference further improved the detection power. We applied RiVIERA to model the existing GWAS summary statistics of 9 autoimmune diseases and Schizophrenia by jointly harnessing the potential causal enrichments among 848 tissue-specific epigenomics annotations from ENCODE/Roadmap consortium covering 127 cell/tissue types and 8 major epigenomic marks. RiVIERA identified meaningful tissue-specific enrichments for enhancer regions defined by H3K4me1 and H3K27ac for Blood T-Cell specifically in the nine autoimmune diseases and Brain-specific enhancer activities exclusively in Schizophrenia. Moreover, the variants from the 95% credible sets exhibited high conservation and enrichments for GTEx whole-blood eQTLs located within transcription-factor-binding-sites and DNA-hypersensitive-sites. Furthermore, joint modeling the nine immune traits by simultaneously inferring and exploiting the underlying epigenomic correlation between traits further improved the functional enrichments compared to single-trait models.},
    URL = {//nar.oxfordjournals.org/content/early/2016/07/12/nar.gkw627.abstract},
    eprint = {//nar.oxfordjournals.org/content/early/2016/07/12/nar.gkw627.full.pdf+html},
    journal = {Nucleic Acids Research}
    }

  • X. Liu, Y. Wang, H. Ji, K. Aihara, and L. Chen, “Personalized characterization of diseases using sample-specific networks,” Nucleic acids research, 2016. doi:10.1093/nar/gkw772
    [BibTeX] [Abstract] [Download PDF]

    A complex disease generally results not from malfunction of individual molecules but from dysfunction of the relevant system or network, which dynamically changes with time and conditions. Thus, estimating a condition-specific network from a single sample is crucial to elucidating the molecular mechanisms of complex diseases at the system level. However, there is currently no effective way to construct such an individual-specific network by expression profiling of a single sample because of the requirement of multiple samples for computing correlations. We developed here with a statistical method, i.e. a sample-specific network (SSN) method, which allows us to construct individual-specific networks based on molecular expressions of a single sample. Using this method, we can characterize various human diseases at a network level. In particular, such SSNs can lead to the identification of individual-specific disease modules as well as driver genes, even without gene sequencing information. Extensive analysis by using the Cancer Genome Atlas data not only demonstrated the effectiveness of the method, but also found new individual-specific driver genes and network patterns for various types of cancer. Biological experiments on drug resistance further validated one important advantage of our method over the traditional methods, i.e. we can even identify such drug resistance genes that actually have no clear differential expression between samples with and without the resistance, due to the additional network information.

    @article{Liu04092016,
    author = {Liu, Xiaoping and Wang, Yuetong and Ji, Hongbin and Aihara, Kazuyuki and Chen, Luonan},
    title = {Personalized characterization of diseases using sample-specific networks},
    year = {2016},
    doi = {10.1093/nar/gkw772},
    abstract ={A complex disease generally results not from malfunction of individual molecules but from dysfunction of the relevant system or network, which dynamically changes with time and conditions. Thus, estimating a condition-specific network from a single sample is crucial to elucidating the molecular mechanisms of complex diseases at the system level. However, there is currently no effective way to construct such an individual-specific network by expression profiling of a single sample because of the requirement of multiple samples for computing correlations. We developed here with a statistical method, i.e. a sample-specific network (SSN) method, which allows us to construct individual-specific networks based on molecular expressions of a single sample. Using this method, we can characterize various human diseases at a network level. In particular, such SSNs can lead to the identification of individual-specific disease modules as well as driver genes, even without gene sequencing information. Extensive analysis by using the Cancer Genome Atlas data not only demonstrated the effectiveness of the method, but also found new individual-specific driver genes and network patterns for various types of cancer. Biological experiments on drug resistance further validated one important advantage of our method over the traditional methods, i.e. we can even identify such drug resistance genes that actually have no clear differential expression between samples with and without the resistance, due to the additional network information.},
    URL = {//nar.oxfordjournals.org/content/early/2016/09/04/nar.gkw772.abstract},
    eprint = {//nar.oxfordjournals.org/content/early/2016/09/04/nar.gkw772.full.pdf+html},
    journal = {Nucleic Acids Research}
    }

  • R. C. O’Malley, S. C. Huang, L. Song, M. G. Lewsey, A. Bartlett, J. R. Nery, M. Galli, A. Gallavotti, and J. R. Ecker, “Cistrome and epicistrome features shape the regulatory dna landscape,” Cell, vol. 165, iss. 5, pp. 1280-1292, 2016.
    [BibTeX] [Abstract]

    The cistrome is the complete set of transcription factor (TF) binding sites (cis-elements) in an organism, while an epicistrome incorporates tissue-specific DNA chemical modifications and TF-specific chemical sensitivities into these binding profiles. Robust methods to construct comprehensive cistrome and epicistrome maps are critical for elucidating complex transcriptional networks that underlie growth, behavior, and disease. Here, we describe DNA affinity purification sequencing (DAP-seq), a high-throughput TF binding site discovery method that interrogates genomic DNA with in-vitro-expressed TFs. Using DAP-seq, we defined the Arabidopsis cistrome by resolving motifs and peaks for 529 TFs. Because genomic DNA used in DAP-seq retains 5-methylcytosines, we determined that >75% (248/327) of Arabidopsis TFs surveyed were methylation sensitive, a property that strongly impacts the epicistrome landscape. DAP-seq datasets also yielded insight into the biology and binding site architecture of numerous TFs, demonstrating the value of DAP-seq for cost-effective cistromic and epicistromic annotation in any organism.

    @article{o2016cistrome,
    title={Cistrome and Epicistrome Features Shape the Regulatory DNA Landscape},
    author={O’Malley, Ronan C and Huang, Shao-shan Carol and Song, Liang and Lewsey, Mathew G and Bartlett, Anna and Nery, Joseph R and Galli, Mary and Gallavotti, Andrea and Ecker, Joseph R},
    journal={Cell},
    volume={165},
    number={5},
    pages={1280--1292},
    year={2016},
    publisher={Elsevier},
    abstract={The cistrome is the complete set of transcription factor (TF) binding sites (cis-elements) in an organism, while an epicistrome incorporates tissue-specific DNA chemical modifications and TF-specific chemical sensitivities into these binding profiles. Robust methods to construct comprehensive cistrome and epicistrome maps are critical for elucidating complex transcriptional networks that underlie growth, behavior, and disease. Here, we describe DNA affinity purification sequencing (DAP-seq), a high-throughput TF binding site discovery method that interrogates genomic DNA with in-vitro-expressed TFs. Using DAP-seq, we defined the Arabidopsis cistrome by resolving motifs and peaks for 529 TFs. Because genomic DNA used in DAP-seq retains 5-methylcytosines, we determined that >75% (248/327) of Arabidopsis TFs surveyed were methylation sensitive, a property that strongly impacts the epicistrome landscape. DAP-seq datasets also yielded insight into the biology and binding site architecture of numerous TFs, demonstrating the value of DAP-seq for cost-effective cistromic and epicistromic annotation in any organism.}
    }

  • P. Cuscó and G. Filion, “Zerone: a chip-seq discretizer for multiple replicates with built-in quality control,” Bioinformatics, p. btw336, 2016.
    [BibTeX] [Abstract]

    Motivation: Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is the standard method to investigate chromatin protein composition. As the number of community-available ChIP-seq profiles increases, it becomes more common to use data from different sources, which makes joint analysis challenging. Issues such as lack of reproducibility, heterogeneous quality and conflicts between replicates become evident when comparing datasets, especially when they are produced by different laboratories. Results: Here, we present Zerone, a ChIP-seq discretizer with built-in quality control. Zerone is powered by a Hidden Markov Model with zero-inflated negative multinomial emissions, which allows it to merge several replicates into a single discretized profile. To identify low quality or irreproducible data, we trained a Support Vector Machine and integrated it as part of the discretization process. The result is a classifier reaching 95% accuracy in detecting low quality profiles. We also introduce a graphical representation to compare discretization quality and we show that Zerone achieves outstanding accuracy. Finally, on current hardware, Zerone discretizes a ChIP-seq experiment on mammalian genomes in about 5 min using less than 700 MB of memory.

    @article{cusco2016zerone,
    title={Zerone: a ChIP-seq discretizer for multiple replicates with built-in quality control},
    author={Cusc{\'o}, Pol and Filion, Guillaume},
    journal={Bioinformatics},
    pages={btw336},
    year={2016},
    publisher={Oxford Univ Press},
    abstract={Motivation: Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is the standard method to investigate chromatin protein composition. As the number of community-available ChIP-seq profiles increases, it becomes more common to use data from different sources, which makes joint analysis challenging. Issues such as lack of reproducibility, heterogeneous quality and conflicts between replicates become evident when comparing datasets, especially when they are produced by different laboratories.
    Results: Here, we present Zerone, a ChIP-seq discretizer with built-in quality control. Zerone is powered by a Hidden Markov Model with zero-inflated negative multinomial emissions, which allows it to merge several replicates into a single discretized profile. To identify low quality or irreproducible data, we trained a Support Vector Machine and integrated it as part of the discretization process. The result is a classifier reaching 95% accuracy in detecting low quality profiles. We also introduce a graphical representation to compare discretization quality and we show that Zerone achieves outstanding accuracy. Finally, on current hardware, Zerone discretizes a ChIP-seq experiment on mammalian genomes in about 5 min using less than 700 MB of memory.}
    }

  • W. L. Chew, M. Tabebordbar, J. K. Cheng, P. Mali, E. Y. Wu, A. H. Ng, K. Zhu, A. J. Wagers, and G. M. Church, “A multifunctional aav-crispr-cas9 and its host response,” Nature methods, 2016.
    [BibTeX]
    @article{chew2016multifunctional,
    title={A multifunctional AAV-CRISPR-Cas9 and its host response},
    author={Chew, Wei Leong and Tabebordbar, Mohammadsharif and Cheng, Jason KW and Mali, Prashant and Wu, Elizabeth Y and Ng, Alex HM and Zhu, Kexian and Wagers, Amy J and Church, George M},
    journal={Nature Methods},
    year={2016},
    publisher={Nature Research}
    }

  • Y. Liu, Y. Zhan, Z. Chen, A. He, J. Li, H. Wu, L. Liu, C. Zhuang, J. Lin, X. Guo, and others, “Directing cellular information flow via crispr signal conductors,” Nature methods, 2016.
    [BibTeX]
    @article{liu2016directing,
    title={Directing cellular information flow via CRISPR signal conductors},
    author={Liu, Yuchen and Zhan, Yonghao and Chen, Zhicong and He, Anbang and Li, Jianfa and Wu, Hanwei and Liu, Li and Zhuang, Chengle and Lin, Junhao and Guo, Xiaoqiang and others},
    journal={Nature Methods},
    year={2016},
    publisher={Nature Research}
    }

  • T. Fu, G. Hong, T. Zhou, T. G. Schuhmann, R. D. Viveros, and C. M. Lieber, “Stable long-term chronic brain mapping at the single-neuron level,” Nature methods, 2016.
    [BibTeX]
    @article{fu2016stable,
    title={Stable long-term chronic brain mapping at the single-neuron level},
    author={Fu, Tian-Ming and Hong, Guosong and Zhou, Tao and Schuhmann, Thomas G and Viveros, Robert D and Lieber, Charles M},
    journal={Nature Methods},
    year={2016},
    publisher={Nature Research}
    }

  • L. C. Xia, S. Sakshuwong, E. S. Hopmans, J. M. Bell, S. M. Grimes, D. O. Siegmund, H. P. Ji, and N. R. Zhang, “A genome-wide approach for detecting novel insertion-deletion variants of mid-range size,” Nucleic acids research, vol. 44, iss. 15, p. e126–e126, 2016.
    [BibTeX]
    @article{xia2016genome,
    title={A genome-wide approach for detecting novel insertion-deletion variants of mid-range size},
    author={Xia, Li C and Sakshuwong, Sukolsak and Hopmans, Erik S and Bell, John M and Grimes, Susan M and Siegmund, David O and Ji, Hanlee P and Zhang, Nancy R},
    journal={Nucleic Acids Research},
    volume={44},
    number={15},
    pages={e126--e126},
    year={2016},
    publisher={Oxford Univ Press}
    }

  • S. W. Hartley and J. C. Mullikin, “Detection and visualization of differential splicing in rna-seq data with junctionseq,” Nucleic acids research, p. gkw501, 2016.
    [BibTeX]
    @article{hartley2016detection,
    title={Detection and visualization of differential splicing in RNA-Seq data with JunctionSeq},
    author={Hartley, Stephen W and Mullikin, James C},
    journal={Nucleic Acids Research},
    pages={gkw501},
    year={2016},
    publisher={Oxford Univ Press}
    }

  • D. Incarnato, F. Anselmi, E. Morandi, F. Neri, M. Maldotti, S. Rapelli, C. Parlato, G. Basile, and S. Oliviero, “High-throughput single-base resolution mapping of rna 2′-o-methylated residues,” Nucleic acids research, p. gkw810, 2016.
    [BibTeX]
    @article{incarnato2016high,
    title={High-throughput single-base resolution mapping of RNA 2′-O-methylated residues},
    author={Incarnato, Danny and Anselmi, Francesca and Morandi, Edoardo and Neri, Francesco and Maldotti, Mara and Rapelli, Stefania and Parlato, Caterina and Basile, Giulia and Oliviero, Salvatore},
    journal={Nucleic Acids Research},
    pages={gkw810},
    year={2016},
    publisher={Oxford Univ Press}
    }

  • O. M. Din, T. Danino, A. Prindle, M. Skalak, J. Selimkhanov, K. Allen, E. Julio, E. Atolia, L. S. Tsimring, S. N. Bhatia, and J. Hasty, “Synchronized cycles of bacterial lysis for in vivo delivery,” Nature, vol. 536, iss. 7614, pp. 81-5, 2016. doi:10.1038/nature18930
    [BibTeX] [Abstract]

    The widespread view of bacteria as strictly pathogenic has given way to an appreciation of the prevalence of some beneficial microbes within the human body. It is perhaps inevitable that some bacteria would evolve to preferentially grow in environments that harbor disease and thus provide a natural platform for the development of engineered therapies. Such therapies could benefit from bacteria that are programmed to limit bacterial growth while continually producing and releasing cytotoxic agents in situ. Here we engineer a clinically relevant bacterium to lyse synchronously at a threshold population density and to release genetically encoded cargo. Following quorum lysis, a small number of surviving bacteria reseed the growing population, thus leading to pulsatile delivery cycles. We used microfluidic devices to characterize the engineered lysis strain and we demonstrate its potential as a drug delivery platform via co-culture with human cancer cells in vitro. As a proof of principle, we tracked the bacterial population dynamics in ectopic syngeneic colorectal tumours in mice via a luminescent reporter. The lysis strain exhibits pulsatile population dynamics in vivo, with mean bacterial luminescence that remained two orders of magnitude lower than an unmodified strain. Finally, guided by previous findings that certain bacteria can enhance the efficacy of standard therapies, we orally administered the lysis strain alone or in combination with a clinical chemotherapeutic to a syngeneic mouse transplantation model of hepatic colorectal metastases. We found that the combination of both circuit-engineered bacteria and chemotherapy leads to a notable reduction of tumour activity along with a marked survival benefit over either therapy alone. Our approach establishes a methodology for leveraging the tools of synthetic biology to exploit the natural propensity for certain bacteria to colonize disease sites.

    @article{Din:2016ve,
    Abstract = {The widespread view of bacteria as strictly pathogenic has given way to an appreciation of the prevalence of some beneficial microbes within the human body. It is perhaps inevitable that some bacteria would evolve to preferentially grow in environments that harbor disease and thus provide a natural platform for the development of engineered therapies. Such therapies could benefit from bacteria that are programmed to limit bacterial growth while continually producing and releasing cytotoxic agents in situ. Here we engineer a clinically relevant bacterium to lyse synchronously at a threshold population density and to release genetically encoded cargo. Following quorum lysis, a small number of surviving bacteria reseed the growing population, thus leading to pulsatile delivery cycles. We used microfluidic devices to characterize the engineered lysis strain and we demonstrate its potential as a drug delivery platform via co-culture with human cancer cells in vitro. As a proof of principle, we tracked the bacterial population dynamics in ectopic syngeneic colorectal tumours in mice via a luminescent reporter. The lysis strain exhibits pulsatile population dynamics in vivo, with mean bacterial luminescence that remained two orders of magnitude lower than an unmodified strain. Finally, guided by previous findings that certain bacteria can enhance the efficacy of standard therapies, we orally administered the lysis strain alone or in combination with a clinical chemotherapeutic to a syngeneic mouse transplantation model of hepatic colorectal metastases. We found that the combination of both circuit-engineered bacteria and chemotherapy leads to a notable reduction of tumour activity along with a marked survival benefit over either therapy alone. Our approach establishes a methodology for leveraging the tools of synthetic biology to exploit the natural propensity for certain bacteria to colonize disease sites.},
    Author = {Din, M Omar and Danino, Tal and Prindle, Arthur and Skalak, Matt and Selimkhanov, Jangir and Allen, Kaitlin and Julio, Ellixis and Atolia, Eta and Tsimring, Lev S and Bhatia, Sangeeta N and Hasty, Jeff},
    Date-Added = {2016-09-12 19:51:15 +0000},
    Date-Modified = {2016-09-12 19:51:15 +0000},
    Doi = {10.1038/nature18930},
    Journal = {Nature},
    Journal-Full = {Nature},
    Mesh = {Administration, Oral; Animals; Bacteriolysis; Coculture Techniques; Colorectal Neoplasms; Computer Simulation; Drug Delivery Systems; Female; Liver Neoplasms; Luminescence; Mice; Neoplasm Metastasis; Neoplasm Transplantation; Quorum Sensing; Salmonella; Synthetic Biology; Transplantation, Isogeneic},
    Month = {Aug},
    Number = {7614},
    Pages = {81-5},
    Pmid = {27437587},
    Pst = {ppublish},
    Title = {Synchronized cycles of bacterial lysis for in vivo delivery},
    Volume = {536},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1038/nature18930}}

  • N. Olsman and L. Goentoro, “Allosteric proteins as logarithmic sensors,” Proc natl acad sci u s a, vol. 113, iss. 30, p. E4423-30, 2016. doi:10.1073/pnas.1601791113
    [BibTeX] [Abstract]

    Many sensory systems, from vision and hearing in animals to signal transduction in cells, respond to fold changes in signal relative to background. Responding to fold change requires that the system senses signal on a logarithmic scale, responding identically to a change in signal level from 1 to 3, or from 10 to 30. It is an ongoing search in the field to understand the ways in which a logarithmic sensor can be implemented at the molecular level. In this work, we present evidence that logarithmic sensing can be implemented with a single protein, by means of allosteric regulation. Specifically, we find that mathematical models show that allosteric proteins can respond to stimuli on a logarithmic scale. Next, we present evidence from measurements in the literature that some allosteric proteins do operate in a parameter regime that permits logarithmic sensing. Finally, we present examples suggesting that allosteric proteins are indeed used in this capacity: allosteric proteins play a prominent role in systems where fold-change detection has been proposed. This finding suggests a role as logarithmic sensors for the many allosteric proteins across diverse biological processes.

    @article{Olsman:2016ly,
    Abstract = {Many sensory systems, from vision and hearing in animals to signal transduction in cells, respond to fold changes in signal relative to background. Responding to fold change requires that the system senses signal on a logarithmic scale, responding identically to a change in signal level from 1 to 3, or from 10 to 30. It is an ongoing search in the field to understand the ways in which a logarithmic sensor can be implemented at the molecular level. In this work, we present evidence that logarithmic sensing can be implemented with a single protein, by means of allosteric regulation. Specifically, we find that mathematical models show that allosteric proteins can respond to stimuli on a logarithmic scale. Next, we present evidence from measurements in the literature that some allosteric proteins do operate in a parameter regime that permits logarithmic sensing. Finally, we present examples suggesting that allosteric proteins are indeed used in this capacity: allosteric proteins play a prominent role in systems where fold-change detection has been proposed. This finding suggests a role as logarithmic sensors for the many allosteric proteins across diverse biological processes.},
    Author = {Olsman, Noah and Goentoro, Lea},
    Date-Added = {2016-09-12 19:34:23 +0000},
    Date-Modified = {2016-09-12 19:34:23 +0000},
    Doi = {10.1073/pnas.1601791113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {allosteric regulation; fold-change detection; logarithmic sensing},
    Month = {Jul},
    Number = {30},
    Pages = {E4423-30},
    Pmc = {PMC4968753},
    Pmid = {27410043},
    Pst = {ppublish},
    Title = {Allosteric proteins as logarithmic sensors},
    Volume = {113},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1601791113}}

  • M. Castellana, S. Hsin-Jung Li, and N. S. Wingreen, “Spatial organization of bacterial transcription and translation,” Proc Natl Acad Sci U S A, vol. 113, iss. 33, pp. 9286-91, 2016. doi:10.1073/pnas.1604995113

    In bacteria such as Escherichia coli, DNA is compacted into a nucleoid near the cell center, whereas ribosomes (molecular complexes that translate mRNAs into proteins) are mainly localized to the poles. We study the impact of this spatial organization using a minimal reaction-diffusion model for the cellular transcriptional-translational machinery. Although genome-wide mRNA-nucleoid segregation still lacks experimental validation, our model predicts that [Formula: see text] of mRNAs are segregated to the poles. In addition, our analysis reveals a "circulation" of ribosomes driven by the flux of mRNAs, from synthesis in the nucleoid to degradation at the poles. We show that our results are robust with respect to multiple, biologically relevant factors, such as mRNA degradation by RNase enzymes, different phases of the cell division cycle and growth rates, and the existence of nonspecific, transient interactions between ribosomes and mRNAs. Finally, we confirm that the observed nucleoid size stems from a balance between the forces that the chromosome and mRNAs exert on each other. This suggests a potential global feedback circuit in which gene expression feeds back on itself via nucleoid compaction.

    @article{Castellana:2016zr,
    Abstract = {In bacteria such as Escherichia coli, DNA is compacted into a nucleoid near the cell center, whereas ribosomes (molecular complexes that translate mRNAs into proteins) are mainly localized to the poles. We study the impact of this spatial organization using a minimal reaction-diffusion model for the cellular transcriptional-translational machinery. Although genome-wide mRNA-nucleoid segregation still lacks experimental validation, our model predicts that [Formula: see text] of mRNAs are segregated to the poles. In addition, our analysis reveals a "circulation" of ribosomes driven by the flux of mRNAs, from synthesis in the nucleoid to degradation at the poles. We show that our results are robust with respect to multiple, biologically relevant factors, such as mRNA degradation by RNase enzymes, different phases of the cell division cycle and growth rates, and the existence of nonspecific, transient interactions between ribosomes and mRNAs. Finally, we confirm that the observed nucleoid size stems from a balance between the forces that the chromosome and mRNAs exert on each other. This suggests a potential global feedback circuit in which gene expression feeds back on itself via nucleoid compaction.},
    Author = {Castellana, Michele and Hsin-Jung Li, Sophia and Wingreen, Ned S},
    Date-Added = {2016-09-12 19:29:46 +0000},
    Date-Modified = {2016-09-12 19:29:46 +0000},
    Doi = {10.1073/pnas.1604995113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {bacteria; experiments; localization; modeling; translation},
    Month = {Aug},
    Number = {33},
    Pages = {9286-91},
    Pmc = {PMC4995950},
    Pmid = {27486246},
    Pst = {ppublish},
    Title = {Spatial organization of bacterial transcription and translation},
    Volume = {113},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1604995113}}

  • C. Angermueller, T. Pärnamaa, L. Parts, and O. Stegle, “Deep learning for computational biology,” Mol Syst Biol, vol. 12, iss. 7, p. 878, 2016.

    Technological advances in genomics and imaging have led to an explosion of molecular and cellular profiling data from large numbers of samples. This rapid increase in biological data dimension and acquisition rate is challenging conventional analysis strategies. Modern machine learning methods, such as deep learning, promise to leverage very large data sets for finding hidden structure within them, and for making accurate predictions. In this review, we discuss applications of this new breed of analysis approaches in regulatory genomics and cellular imaging. We provide background of what deep learning is, and the settings in which it can be successfully applied to derive biological insights. In addition to presenting specific applications and providing tips for practical use, we also highlight possible pitfalls and limitations to guide computational biologists when and how to make the most use of this new technology.

    @article{Angermueller:2016ys,
    Abstract = {Technological advances in genomics and imaging have led to an explosion of molecular and cellular profiling data from large numbers of samples. This rapid increase in biological data dimension and acquisition rate is challenging conventional analysis strategies. Modern machine learning methods, such as deep learning, promise to leverage very large data sets for finding hidden structure within them, and for making accurate predictions. In this review, we discuss applications of this new breed of analysis approaches in regulatory genomics and cellular imaging. We provide background of what deep learning is, and the settings in which it can be successfully applied to derive biological insights. In addition to presenting specific applications and providing tips for practical use, we also highlight possible pitfalls and limitations to guide computational biologists when and how to make the most use of this new technology.},
    Author = {Angermueller, Christof and P{\"a}rnamaa, Tanel and Parts, Leopold and Stegle, Oliver},
    Date-Added = {2016-09-12 19:24:04 +0000},
    Date-Modified = {2016-09-12 19:24:04 +0000},
    Journal = {Mol Syst Biol},
    Journal-Full = {Molecular systems biology},
    Keywords = {cellular imaging; computational biology; deep learning; machine learning; regulatory genomics},
    Number = {7},
    Pages = {878},
    Pmc = {PMC4965871},
    Pmid = {27474269},
    Pst = {epublish},
    Title = {Deep learning for computational biology},
    Volume = {12},
    Year = {2016}}

  • D. Healey, K. Axelrod, and J. Gore, “Negative frequency-dependent interactions can underlie phenotypic heterogeneity in a clonal microbial population,” Mol Syst Biol, vol. 12, iss. 8, p. 877, 2016.

    Genetically identical cells in microbial populations often exhibit a remarkable degree of phenotypic heterogeneity even in homogenous environments. Such heterogeneity is commonly thought to represent a bet-hedging strategy against environmental uncertainty. However, evolutionary game theory predicts that phenotypic heterogeneity may also be a response to negative frequency-dependent interactions that favor rare phenotypes over common ones. Here we provide experimental evidence for this alternative explanation in the context of the well-studied yeast GAL network. In an environment containing the two sugars glucose and galactose, the yeast GAL network displays stochastic bimodal activation. We show that in this mixed sugar environment, GAL-ON and GAL-OFF phenotypes can each invade the opposite phenotype when rare and that there exists a resulting stable mix of phenotypes. Consistent with theoretical predictions, the resulting stable mix of phenotypes is not necessarily optimal for population growth. We find that the wild-type mixed strategist GAL network can invade populations of both pure strategists while remaining uninvasible by either. Lastly, using laboratory evolution we show that this mixed resource environment can directly drive the de novo evolution of clonal phenotypic heterogeneity from a pure strategist population. Taken together, our results provide experimental evidence that negative frequency-dependent interactions can underlie the phenotypic heterogeneity found in clonal microbial populations.

    @article{Healey:2016vn,
    Abstract = {Genetically identical cells in microbial populations often exhibit a remarkable degree of phenotypic heterogeneity even in homogenous environments. Such heterogeneity is commonly thought to represent a bet-hedging strategy against environmental uncertainty. However, evolutionary game theory predicts that phenotypic heterogeneity may also be a response to negative frequency-dependent interactions that favor rare phenotypes over common ones. Here we provide experimental evidence for this alternative explanation in the context of the well-studied yeast GAL network. In an environment containing the two sugars glucose and galactose, the yeast GAL network displays stochastic bimodal activation. We show that in this mixed sugar environment, GAL-ON and GAL-OFF phenotypes can each invade the opposite phenotype when rare and that there exists a resulting stable mix of phenotypes. Consistent with theoretical predictions, the resulting stable mix of phenotypes is not necessarily optimal for population growth. We find that the wild-type mixed strategist GAL network can invade populations of both pure strategists while remaining uninvasible by either. Lastly, using laboratory evolution we show that this mixed resource environment can directly drive the de novo evolution of clonal phenotypic heterogeneity from a pure strategist population. Taken together, our results provide experimental evidence that negative frequency-dependent interactions can underlie the phenotypic heterogeneity found in clonal microbial populations.},
    Author = {Healey, David and Axelrod, Kevin and Gore, Jeff},
    Date-Added = {2016-09-12 19:22:40 +0000},
    Date-Modified = {2016-09-12 19:22:40 +0000},
    Journal = {Mol Syst Biol},
    Journal-Full = {Molecular systems biology},
    Keywords = {ecology; evolution; frequency dependence; phenotypic heterogeneity; stochastic gene expression},
    Number = {8},
    Pages = {877},
    Pmid = {27487817},
    Pst = {epublish},
    Title = {Negative frequency-dependent interactions can underlie phenotypic heterogeneity in a clonal microbial population},
    Volume = {12},
    Year = {2016}}

  • G. Corrado, T. Tebaldi, F. Costa, P. Frasconi, and A. Passerini, “RNAcommender: genome-wide recommendation of RNA-protein interactions,” Bioinformatics, 2016. doi:10.1093/bioinformatics/btw517

    MOTIVATION: Information about RNA-protein interactions is a vital pre-requisite to tackle the dissection of RNA regulatory processes. Despite the recent advances of the experimental techniques, the currently available RNA interactome involves a small portion of the known RNA binding proteins. The importance of determining RNA-protein interactions, coupled with the scarcity of the available information, calls for in silico prediction of such interactions. RESULTS: We present RNAcommender, a recommender system capable of suggesting RNA targets to unexplored RNA binding proteins, by propagating the available interaction information taking into account the protein domain composition and the RNA predicted secondary structure. Our results show that RNAcommender is able to successfully suggest RNA interactors for RNA binding proteins using little or no interaction evidence. RNAcommender was tested on a large dataset of human RBP-RNA interactions, showing a good ranking performance (average AUC ROC of 0.75) and significant enrichment of correct recommendations for 75% of the tested RBPs. RNAcommender can be a valid tool to assist researchers in identifying potential interacting candidates for the majority of RBPs with uncharacterized binding preferences. AVAILABILITY AND IMPLEMENTATION: The software is freely available at //rnacommender.disi.unitn.it CONTACT: gianluca.corrado@unitn.it or andrea.passerini@unitn.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    @article{Corrado:2016kx,
    Abstract = {MOTIVATION: Information about RNA-protein interactions is a vital pre-requisite to tackle the dissection of RNA regulatory processes. Despite the recent advances of the experimental techniques, the currently available RNA interactome involves a small portion of the known RNA binding proteins. The importance of determining RNA-protein interactions, coupled with the scarcity of the available information, calls for in silico prediction of such interactions.
    RESULTS: We present RNAcommender, a recommender system capable of suggesting RNA targets to unexplored RNA binding proteins, by propagating the available interaction information taking into account the protein domain composition and the RNA predicted secondary structure. Our results show that RNAcommender is able to successfully suggest RNA interactors for RNA binding proteins using little or no interaction evidence. RNAcommender was tested on a large dataset of human RBP-RNA interactions, showing a good ranking performance (average AUC ROC of 0.75) and significant enrichment of correct recommendations for 75% of the tested RBPs. RNAcommender can be a valid tool to assist researchers in identifying potential interacting candidates for the majority of RBPs with uncharacterized binding preferences.
    AVAILABILITY AND IMPLEMENTATION: The software is freely available at //rnacommender.disi.unitn.it CONTACT: gianluca.corrado@unitn.it or andrea.passerini@unitn.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.},
    Author = {Corrado, Gianluca and Tebaldi, Toma and Costa, Fabrizio and Frasconi, Paolo and Passerini, Andrea},
    Date-Added = {2016-09-12 11:32:50 +0000},
    Date-Modified = {2016-09-12 11:32:50 +0000},
    Doi = {10.1093/bioinformatics/btw517},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {Aug},
    Pmid = {27503225},
    Pst = {aheadofprint},
    Title = {RNAcommender: genome-wide recommendation of RNA-protein interactions},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw517}}

  • E. Yang and T. Jiang, “SDEAP: a splice graph based differential transcript expression analysis tool for population data,” Bioinformatics, 2016. doi:10.1093/bioinformatics/btw513

    MOTIVATION: Differential transcript expression (DTE) analysis without predefined conditions is critical to biological studies. For example, it can be used to discover biomarkers to classify cancer samples into previously unknown subtypes such that better diagnosis and therapy methods can be developed for the subtypes. Although several DTE tools for population data, i.e. data without known biological conditions, have been published, these tools either assume binary conditions in the input population or require the number of conditions as a part of the input. Fixing the number of conditions to binary is unrealistic and may distort the results of a DTE analysis. Estimating the correct number of conditions in a population could also be challenging for a routine user. Moreover, the existing tools only provide differential usages of exons, which may be insufficient to interpret the patterns of alternative splicing across samples and restrains the applications of the tools from many biology studies. RESULTS: We propose a novel DTE analysis algorithm, called SDEAP, that estimates the number of conditions directly from the input samples using a Dirichlet mixture model and discovers alternative splicing events using a new graph modular decomposition algorithm. By taking advantage of the above technical improvement, SDEAP was able to outperform the other DTE analysis methods in our extensive experiments on simulated data and real data with qPCR validation. The prediction of SDEAP also allowed us to classify the samples of cancer subtypes and cell-cycle phases more accurately. AVAILABILITY AND IMPLEMENTATION: SDEAP is publicly available for free at //github.com/ewyang089/SDEAP/wiki CONTACT: yyang027@cs.ucr.edu; jiang@cs.ucr.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    @article{Yang:2016uq,
    Abstract = {MOTIVATION: Differential transcript expression (DTE) analysis without predefined conditions is critical to biological studies. For example, it can be used to discover biomarkers to classify cancer samples into previously unknown subtypes such that better diagnosis and therapy methods can be developed for the subtypes. Although several DTE tools for population data, i.e. data without known biological conditions, have been published, these tools either assume binary conditions in the input population or require the number of conditions as a part of the input. Fixing the number of conditions to binary is unrealistic and may distort the results of a DTE analysis. Estimating the correct number of conditions in a population could also be challenging for a routine user. Moreover, the existing tools only provide differential usages of exons, which may be insufficient to interpret the patterns of alternative splicing across samples and restrains the applications of the tools from many biology studies.
    RESULTS: We propose a novel DTE analysis algorithm, called SDEAP, that estimates the number of conditions directly from the input samples using a Dirichlet mixture model and discovers alternative splicing events using a new graph modular decomposition algorithm. By taking advantage of the above technical improvement, SDEAP was able to outperform the other DTE analysis methods in our extensive experiments on simulated data and real data with qPCR validation. The prediction of SDEAP also allowed us to classify the samples of cancer subtypes and cell-cycle phases more accurately.
    AVAILABILITY AND IMPLEMENTATION: SDEAP is publicly available for free at //github.com/ewyang089/SDEAP/wiki CONTACT: yyang027@cs.ucr.edu; jiang@cs.ucr.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.},
    Author = {Yang, Ei-Wen and Jiang, Tao},
    Date-Added = {2016-09-12 11:31:17 +0000},
    Date-Modified = {2016-09-12 11:31:17 +0000},
    Doi = {10.1093/bioinformatics/btw513},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {Aug},
    Pmid = {27522083},
    Pst = {aheadofprint},
    Title = {SDEAP: a splice graph based differential transcript expression analysis tool for population data},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw513}}

  • A. Gupta, I. K. Jordan, and L. Rishishwar, “stringMLST: a fast k-mer based tool for multi locus sequence typing,” Bioinformatics, 2016. doi:10.1093/bioinformatics/btw586

    Rapid and accurate identification of the sequence type (ST) of bacterial pathogens is critical for epidemiological surveillance and outbreak control. Cheaper and faster next-generation sequencing (NGS) technologies have taken preference over the traditional method of amplicon sequencing for multi locus sequence typing (MLST). But data generated by NGS platforms necessitate quality control, genome assembly and sequence similarity searching before an isolate’s ST can be determined. These are computationally intensive and time consuming steps, which are not ideally suited for real-time molecular epidemiology. Here, we present stringMLST, an assembly- and alignment-free, lightweight, platform-independent program capable of rapidly typing bacterial isolates directly from raw sequence reads. The program implements a simple hash table data structure to find exact matches between short sequence strings (k-mers) and an MLST allele library. We show that stringMLST is more accurate, and order of magnitude faster, than its contemporary genome-based ST detection tools. AVAILABILITY: The source code and documentations are available at //jordan.biology.gatech.edu/page/software/stringMLST CONTACT: lavanya.rishishwar@gatech.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    @article{Gupta:2016fk,
    Abstract = {Rapid and accurate identification of the sequence type (ST) of bacterial pathogens is critical for epidemiological surveillance and outbreak control. Cheaper and faster next-generation sequencing (NGS) technologies have taken preference over the traditional method of amplicon sequencing for multi locus sequence typing (MLST). But data generated by NGS platforms necessitate quality control, genome assembly and sequence similarity searching before an isolate's ST can be determined. These are computationally intensive and time consuming steps, which are not ideally suited for real-time molecular epidemiology. Here, we present stringMLST, an assembly- and alignment-free, lightweight, platform-independent program capable of rapidly typing bacterial isolates directly from raw sequence reads. The program implements a simple hash table data structure to find exact matches between short sequence strings (k-mers) and an MLST allele library. We show that stringMLST is more accurate, and order of magnitude faster, than its contemporary genome-based ST detection tools.
    AVAILABILITY: The source code and documentations are available at //jordan.biology.gatech.edu/page/software/stringMLST CONTACT: lavanya.rishishwar@gatech.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.},
    Author = {Gupta, Anuj and Jordan, I King and Rishishwar, Lavanya},
    Date-Added = {2016-09-12 11:26:10 +0000},
    Date-Modified = {2016-09-12 11:26:10 +0000},
    Doi = {10.1093/bioinformatics/btw586},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {Sep},
    Pmid = {27605103},
    Pst = {aheadofprint},
    Title = {stringMLST: a fast k-mer based tool for multi locus sequence typing},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw586}}

publications 2016-05-31

In this group meeting, we quickly discussed these latest papers:

  • Y. Zhou and P. X.-K. Song, “Regression analysis of networked data,” Biometrika, vol. 103, iss. 2, pp. 287-301, 2016. doi:10.1093/biomet/asw003

    This paper concerns regression methodology for assessing relationships between multi-dimensional response variables and covariates that are correlated within a network. To address analytical challenges associated with the integration of network topology into the regression analysis, we propose a hybrid quadratic inference method that uses both prior and data-driven correlations among network nodes. A Godambe information-based tuning strategy is developed to allocate weights between the prior and data-driven network structures, so the estimator is efficient. The proposed method is conceptually simple and computationally fast, and has appealing large-sample properties. It is evaluated by simulation, and its application is illustrated using neuroimaging data from an association study of the effects of iron deficiency on auditory recognition memory in infants.

    @article{Zhou01062016,
    author = {Zhou, Yan and Song, Peter X.-K.},
    title = {Regression analysis of networked data},
    volume = {103},
    number = {2},
    pages = {287-301},
    year = {2016},
    doi = {10.1093/biomet/asw003},
    abstract ={This paper concerns regression methodology for assessing relationships between multi-dimensional response variables and covariates that are correlated within a network. To address analytical challenges associated with the integration of network topology into the regression analysis, we propose a hybrid quadratic inference method that uses both prior and data-driven correlations among network nodes. A Godambe information-based tuning strategy is developed to allocate weights between the prior and data-driven network structures, so the estimator is efficient. The proposed method is conceptually simple and computationally fast, and has appealing large-sample properties. It is evaluated by simulation, and its application is illustrated using neuroimaging data from an association study of the effects of iron deficiency on auditory recognition memory in infants.},
    URL = {//biomet.oxfordjournals.org/content/103/2/287.abstract},
    journal = {Biometrika}
    }

  • C. Dombry, S. Engelke, and M. Oesting, “Exact simulation of max-stable processes,” Biometrika, vol. 103, iss. 2, pp. 303-317, 2016. doi:10.1093/biomet/asw008

    Max-stable processes play an important role as models for spatial extreme events. Their complex structure as the pointwise maximum over an infinite number of random functions makes their simulation difficult. Algorithms based on finite approximations are often inexact and computationally inefficient. We present a new algorithm for exact simulation of a max-stable process at a finite number of locations. It relies on the idea of simulating only the extremal functions, that is, those functions in the construction of a max-stable process that effectively contribute to the pointwise maximum. We further generalize the algorithm by Dieker & Mikosch (2015) for Brown–Resnick processes and use it for exact simulation via the spectral measure. We study the complexity of both algorithms, prove that our new approach via extremal functions is always more efficient, and provide closed-form expressions for their implementation that cover most popular models for max-stable processes and multivariate extreme value distributions. For simulation on dense grids, an adaptive design of the extremal function algorithm is proposed.

    @article{Dombry01062016,
    author = {Dombry, Clément and Engelke, Sebastian and Oesting, Marco},
    title = {Exact simulation of max-stable processes},
    volume = {103},
    number = {2},
    pages = {303-317},
    year = {2016},
    doi = {10.1093/biomet/asw008},
    abstract ={Max-stable processes play an important role as models for spatial extreme events. Their complex structure as the pointwise maximum over an infinite number of random functions makes their simulation difficult. Algorithms based on finite approximations are often inexact and computationally inefficient. We present a new algorithm for exact simulation of a max-stable process at a finite number of locations. It relies on the idea of simulating only the extremal functions, that is, those functions in the construction of a max-stable process that effectively contribute to the pointwise maximum. We further generalize the algorithm by Dieker & Mikosch (2015) for Brown–Resnick processes and use it for exact simulation via the spectral measure. We study the complexity of both algorithms, prove that our new approach via extremal functions is always more efficient, and provide closed-form expressions for their implementation that cover most popular models for max-stable processes and multivariate extreme value distributions. For simulation on dense grids, an adaptive design of the extremal function algorithm is proposed.},
    URL = {//biomet.oxfordjournals.org/content/103/2/303.abstract},
    eprint = {//biomet.oxfordjournals.org/content/103/2/303.full.pdf+html},
    journal = {Biometrika}
    }

  • D. Turek, P. de Valpine, C. J. Paciorek, and C. Anderson-Bergman, “Automated parameter blocking for efficient Markov-chain Monte Carlo sampling,” Bayesian Analysis, 2016. doi:10.1214/16-BA1008

    Markov chain Monte Carlo (MCMC) sampling is an important and commonly used tool for the analysis of hierarchical models. Nevertheless, practitioners generally have two options for MCMC: utilize existing software that generates a black-box “one size fits all” algorithm, or the challenging (and time consuming) task of implementing a problem-specific MCMC algorithm. Either choice may result in inefficient sampling, and hence researchers have become accustomed to MCMC runtimes on the order of days (or longer) for large models. We propose an automated procedure to determine an efficient MCMC block-sampling algorithm for a given model and computing platform. Our procedure dynamically determines blocks of parameters for joint sampling that result in efficient MCMC sampling of the entire model. We test this procedure using a diverse suite of example models, and observe non-trivial improvements in MCMC efficiency for many models. Our procedure is the first attempt at such, and may be generalized to a broader space of MCMC algorithms. Our results suggest that substantive improvements in MCMC efficiency may be practically realized using our automated blocking procedure, or variants thereof, which warrants additional study and application.

    @article{turek2015automated,
    title={Automated Parameter Blocking for Efficient Markov-Chain Monte Carlo Sampling},
    author={Turek, Daniel and de Valpine, Perry and Paciorek, Christopher J and Anderson-Bergman, Clifford},
    journal={Bayesian Analysis},
    year={2016},
    publisher={International Society for Bayesian Analysis},
    url={//projecteuclid.org/euclid.ba/1464266500},
    doi={10.1214/16-BA1008},
    abstract={Markov chain Monte Carlo (MCMC) sampling is an important and commonly used tool for the analysis of hierarchical models. Nevertheless, practitioners generally have two options for MCMC: utilize existing software that generates a black-box “one size fits all” algorithm, or the challenging (and time consuming) task of implementing a problem-specific MCMC algorithm. Either choice may result in inefficient sampling, and hence researchers have become accustomed to MCMC runtimes on the order of days (or longer) for large models. We propose an automated procedure to determine an efficient MCMC block-sampling algorithm for a given model and computing platform. Our procedure dynamically determines blocks of parameters for joint sampling that result in efficient MCMC sampling of the entire model. We test this procedure using a diverse suite of example models, and observe non-trivial improvements in MCMC efficiency for many models. Our procedure is the first attempt at such, and may be generalized to a broader space of MCMC algorithms. Our results suggest that substantive improvements in MCMC efficiency may be practically realized using our automated blocking procedure, or variants thereof, which warrants additional study and application.}
    }

  • R. Bar-Ziv, Y. Voichek, and N. Barkai, “Chromatin dynamics during DNA replication,” Genome Research, 2016. doi:10.1101/gr.201244.115

    Chromatin is composed of DNA and histones, which provide a unified platform for regulating DNA-related processes, mostly through their post translational modification. During DNA replication, histone arrangement is perturbed, to first allow progression of DNA polymerase and then during repackaging of the replicated DNA. To study how DNA replication influences the pattern of histone modification, we followed the cell-cycle dynamics of ten histone marks in budding yeast. We find that histones deposited on newly replicated DNA are modified at different rates; while some marks appear immediately upon replication (e.g. H4K16ac, H3K4me1), others increase with transcription-dependent delays (e.g. H3K4me3, H3K36me3). Notably, H3K9ac was deposited as a wave preceding the replication fork by ~5-6 kb. This replication-guided H3K9ac was fully dependent on the acetyltransferase Rtt109, while expression-guided H3K9ac was deposited by Gcn5. Further, topoisomerase depletion intensified H3K9ac in front of the replication fork, and in sites where RNA polymerase II was trapped, suggesting supercoiling stresses trigger H3K9 acetylation. Our results assign complementary roles for DNA replication and gene expression in defining the pattern of histone modification.

    @article{Bar-Ziv25052016,
    author = {Bar-Ziv, Raz and Voichek, Yoav and Barkai, Naama},
    title = {Chromatin dynamics during {DNA} replication},
    year = {2016},
    doi = {10.1101/gr.201244.115},
    abstract ={Chromatin is composed of DNA and histones, which provide a unified platform for regulating DNA-related processes, mostly through their post translational modification. During DNA replication, histone arrangement is perturbed, to first allow progression of DNA polymerase and then during repackaging of the replicated DNA. To study how DNA replication influences the pattern of histone modification, we followed the cell-cycle dynamics of ten histone marks in budding yeast. We find that histones deposited on newly replicated DNA are modified at different rates; while some marks appear immediately upon replication (e.g. H4K16ac, H3K4me1), others increase with transcription-dependent delays (e.g. H3K4me3, H3K36me3). Notably, H3K9ac was deposited as a wave preceding the replication fork by ~5-6 kb. This replication-guided H3K9ac was fully dependent on the acetyltransferase Rtt109, while expression-guided H3K9ac was deposited by Gcn5. Further, topoisomerase depletion intensified H3K9ac in front of the replication fork, and in sites where RNA polymerase II was trapped, suggesting supercoiling stresses trigger H3K9 acetylation. Our results assign complementary roles for DNA replication and gene expression in defining the pattern of histone modification.},
    URL = {//genome.cshlp.org/content/early/2016/05/25/gr.201244.115.abstract},
    eprint = {//genome.cshlp.org/content/early/2016/05/25/gr.201244.115.full.pdf+html},
    journal = {Genome Research}
    }

  • B. J. Callahan, P. J. McMurdie, M. J. Rosen, A. W. Han, A. J. A. Johnson, and S. P. Holmes, “DADA2: high-resolution sample inference from Illumina amplicon data,” Nature Methods, 2016.

    We present the open-source software package DADA2 for modeling and correcting Illumina-sequenced amplicon errors (//github.com/benjjneb/dada2). DADA2 infers sample sequences exactly and resolves differences of as little as 1 nucleotide. In several mock communities, DADA2 identified more real variants and output fewer spurious sequences than other methods. We applied DADA2 to vaginal samples from a cohort of pregnant women, revealing a diversity of previously undetected Lactobacillus crispatus variants.

    @article{callahan2016dada2,
    title={{DADA2}: High-resolution sample inference from {Illumina} amplicon data},
    author={Callahan, Benjamin J and McMurdie, Paul J and Rosen, Michael J and Han, Andrew W and Johnson, Amy Jo A and Holmes, Susan P},
    abstract ={We present the open-source software package DADA2 for modeling and correcting Illumina-sequenced amplicon errors (//github.com/benjjneb/dada2). DADA2 infers sample sequences exactly and resolves differences of as little as 1 nucleotide. In several mock communities, DADA2 identified more real variants and output fewer spurious sequences than other methods. We applied DADA2 to vaginal samples from a cohort of pregnant women, revealing a diversity of previously undetected Lactobacillus crispatus variants.},
    journal={Nature Methods},
    year={2016},
    publisher={Nature Publishing Group}
    }

  • V. Hsiao, Y. Hori, P. W. K. Rothemund, and R. M. Murray, “A population-based temporal logic gate for timing and recording chemical events,” Mol Syst Biol, vol. 12, iss. 5, p. 869, 2016.

    Engineered bacterial sensors have potential applications in human health monitoring, environmental chemical detection, and materials biosynthesis. While such bacterial devices have long been engineered to differentiate between combinations of inputs, their potential to process signal timing and duration has been overlooked. In this work, we present a two-input temporal logic gate that can sense and record the order of the inputs, the timing between inputs, and the duration of input pulses. Our temporal logic gate design relies on unidirectional DNA recombination mediated by bacteriophage integrases to detect and encode sequences of input events. For an E. coli strain engineered to contain our temporal logic gate, we compare predictions of Markov model simulations with laboratory measurements of final population distributions for both step and pulse inputs. Although single cells were engineered to have digital outputs, stochastic noise created heterogeneous single-cell responses that translated into analog population responses. Furthermore, when single-cell genetic states were aggregated into population-level distributions, these distributions contained unique information not encoded in individual cells. Thus, final differentiated sub-populations could be used to deduce order, timing, and duration of transient chemical events.

    @article{Hsiao:2016ys,
    Abstract = {Engineered bacterial sensors have potential applications in human health monitoring, environmental chemical detection, and materials biosynthesis. While such bacterial devices have long been engineered to differentiate between combinations of inputs, their potential to process signal timing and duration has been overlooked. In this work, we present a two-input temporal logic gate that can sense and record the order of the inputs, the timing between inputs, and the duration of input pulses. Our temporal logic gate design relies on unidirectional DNA recombination mediated by bacteriophage integrases to detect and encode sequences of input events. For an E. coli strain engineered to contain our temporal logic gate, we compare predictions of Markov model simulations with laboratory measurements of final population distributions for both step and pulse inputs. Although single cells were engineered to have digital outputs, stochastic noise created heterogeneous single-cell responses that translated into analog population responses. Furthermore, when single-cell genetic states were aggregated into population-level distributions, these distributions contained unique information not encoded in individual cells. Thus, final differentiated sub-populations could be used to deduce order, timing, and duration of transient chemical events.},
    Author = {Hsiao, Victoria and Hori, Yutaka and Rothemund, Paul W. K. and Murray, Richard M},
    Date-Added = {2016-05-31 08:25:53 +0000},
    Date-Modified = {2016-05-31 08:25:53 +0000},
    Journal = {Mol Syst Biol},
    Journal-Full = {Molecular systems biology},
    Keywords = {DNA memory; event detectors; integrases; population analysis; stochastic biomolecular models},
    Number = {5},
    Pages = {869},
    Pmid = {27193783},
    Pst = {epublish},
    Title = {A population-based temporal logic gate for timing and recording chemical events},
    Volume = {12},
    Year = {2016}}

  • P. Giehr, C. Kyriakopoulos, G. Ficz, V. Wolf, and J. Walter, “The influence of hydroxylation on maintaining CpG methylation patterns: a hidden Markov model approach,” PLoS Comput Biol, vol. 12, iss. 5, p. e1004905, 2016. doi:10.1371/journal.pcbi.1004905

    DNA methylation and demethylation are opposing processes that when in balance create stable patterns of epigenetic memory. The control of DNA methylation pattern formation by replication dependent and independent demethylation processes has been suggested to be influenced by Tet mediated oxidation of 5mC. Several alternative mechanisms have been proposed suggesting that 5hmC influences either replication dependent maintenance of DNA methylation or replication independent processes of active demethylation. Using high resolution hairpin oxidative bisulfite sequencing data, we precisely determine the amount of 5mC and 5hmC and model the contribution of 5hmC to processes of demethylation in mouse ESCs. We develop an extended hidden Markov model capable of accurately describing the regional contribution of 5hmC to demethylation dynamics. Our analysis shows that 5hmC has a strong impact on replication dependent demethylation, mainly by impairing methylation maintenance.

    @article{Giehr:2016vn,
    Abstract = {DNA methylation and demethylation are opposing processes that when in balance create stable patterns of epigenetic memory. The control of DNA methylation pattern formation by replication dependent and independent demethylation processes has been suggested to be influenced by Tet mediated oxidation of 5mC. Several alternative mechanisms have been proposed suggesting that 5hmC influences either replication dependent maintenance of DNA methylation or replication independent processes of active demethylation. Using high resolution hairpin oxidative bisulfite sequencing data, we precisely determine the amount of 5mC and 5hmC and model the contribution of 5hmC to processes of demethylation in mouse ESCs. We develop an extended hidden Markov model capable of accurately describing the regional contribution of 5hmC to demethylation dynamics. Our analysis shows that 5hmC has a strong impact on replication dependent demethylation, mainly by impairing methylation maintenance.},
    Author = {Giehr, Pascal and Kyriakopoulos, Charalampos and Ficz, Gabriella and Wolf, Verena and Walter, J{\"o}rn},
    Date-Added = {2016-05-31 08:21:14 +0000},
    Date-Modified = {2016-05-31 08:21:14 +0000},
    Doi = {10.1371/journal.pcbi.1004905},
    Journal = {PLoS Comput Biol},
    Journal-Full = {PLoS computational biology},
    Month = {May},
    Number = {5},
    Pages = {e1004905},
    Pmid = {27224554},
    Pst = {epublish},
    Title = {The Influence of Hydroxylation on Maintaining {CpG} Methylation Patterns: A Hidden {Markov} Model Approach},
    Volume = {12},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1371/journal.pcbi.1004905}}

  • E. Z. Chen and H. Li, “A two-part mixed-effects model for analyzing longitudinal microbiome compositional data,” Bioinformatics, 2016. doi:10.1093/bioinformatics/btw308

    MOTIVATION: The human microbial communities are associated with many human diseases such as obesity, diabetes and inflammatory bowel disease. High-throughput sequencing technology has been widely used to quantify the microbial composition in order to understand its impacts on human health. Longitudinal measurements of microbial communities are commonly obtained in many microbiome studies. A key question in such microbiome studies is to identify the microbes that are associated with clinical outcomes or environmental factors. However, microbiome compositional data are highly skewed, bounded in [0,1), and often sparse with many zeros. In addition, the observations from repeated measures in longitudinal studies are correlated. A method that takes into account these features is needed for association analysis in longitudinal microbiome data. RESULTS: In this paper, we propose a two-part zero-inflated Beta regression model with random effects (ZIBR) for testing the association between microbial abundance and clinical covariates for longitudinal microbiome data. The model includes a logistic regression component to model presence/absence of a microbe in the samples and a Beta regression component to model non-zero microbial abundance, where each component includes a random effect to account for the correlations among the repeated measurements on the same subject. Both simulation studies and the application to real microbiome data have shown that ZIBR model outperformed the previously used methods. The method provides a useful tool for identifying the relevant taxa based on longitudinal or repeated measures in microbiome research. AVAILABILITY: //github.com/chvlyl/ZIBR CONTACT: hongzhe@upenn.edu.

    @article{Chen:2016kx,
    Abstract = {MOTIVATION: The human microbial communities are associated with many human diseases such as obesity, diabetes and inflammatory bowel disease. High-throughput sequencing technology has been widely used to quantify the microbial composition in order to understand its impacts on human health. Longitudinal measurements of microbial communities are commonly obtained in many microbiome studies. A key question in such microbiome studies is to identify the microbes that are associated with clinical outcomes or environmental factors. However, microbiome compositional data are highly skewed, bounded in [0,1), and often sparse with many zeros. In addition, the observations from repeated measures in longitudinal studies are correlated. A method that takes into account these features is needed for association analysis in longitudinal microbiome data.
    RESULTS: In this paper, we propose a two-part zero-inflated Beta regression model with random effects (ZIBR) for testing the association between microbial abundance and clinical covariates for longitudinal microbiome data. The model includes a logistic regression component to model presence/absence of a microbe in the samples and a Beta regression component to model non-zero microbial abundance, where each component includes a random effect to account for the correlations among the repeated measurements on the same subject. Both simulation studies and the application to real microbiome data have shown that ZIBR model outperformed the previously used methods. The method provides a useful tool for identifying the relevant taxa based on longitudinal or repeated measures in microbiome research.
    AVAILABILITY: //github.com/chvlyl/ZIBR CONTACT: hongzhe@upenn.edu.},
    Author = {Chen, Eric Z and Li, Hongzhe},
    Date-Added = {2016-05-31 08:19:08 +0000},
    Date-Modified = {2016-05-31 08:19:08 +0000},
    Doi = {10.1093/bioinformatics/btw308},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {May},
    Pmid = {27187200},
    Pst = {aheadofprint},
    Title = {A two-part mixed-effects model for analyzing longitudinal microbiome compositional data},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw308}}
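
    The two-part structure described in the Chen and Li abstract above (a logistic component for presence/absence and a Beta regression component for non-zero abundance) can be illustrated with a minimal Python sketch. This is not the authors' ZIBR implementation: the random effects are left out for brevity, the single covariate, the logit links, the mean/precision parameterisation of the Beta distribution, and all function and parameter names are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def beta_logpdf(y, a, b):
    """Log density of a Beta(a, b) distribution at y in (0, 1)."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))


def zib_loglik(y, x, alpha, beta, phi):
    """Log-likelihood of one observation under a zero-inflated Beta model.

    y     : relative abundance in [0, 1)
    x     : a single covariate (hypothetical; real models may use several)
    alpha : (intercept, slope) of the logistic part, P(y > 0)
    beta  : (intercept, slope) of the Beta part, mean of y given y > 0
    phi   : precision of the Beta part (assumed parameterisation)
    Random effects are omitted in this sketch.
    """
    p = sigmoid(alpha[0] + alpha[1] * x)   # probability the taxon is present
    mu = sigmoid(beta[0] + beta[1] * x)    # mean abundance when present
    if y == 0.0:
        return math.log(1.0 - p)
    return math.log(p) + beta_logpdf(y, mu * phi, (1.0 - mu) * phi)


def total_loglik(ys, xs, alpha, beta, phi):
    """Sum of per-observation log-likelihoods (observations treated as independent)."""
    return sum(zib_loglik(y, x, alpha, beta, phi) for y, x in zip(ys, xs))
```

    Keeping the two components separate means a covariate can affect whether a taxon is present, how abundant it is when present, or both. Treating observations as independent is a simplification; the model in the abstract adds a random effect to each component to account for repeated measurements on the same subject.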

  • A. C. Komor, Y. B. Kim, M. S. Packer, J. A. Zuris, and D. R. Liu, “Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage,” Nature, vol. 533, iss. 7603, pp. 420-424, 2016. doi:10.1038/nature17946

    Current genome-editing technologies introduce double-stranded (ds) DNA breaks at a target locus as the first step to gene correction. Although most genetic diseases arise from point mutations, current approaches to point mutation correction are inefficient and typically induce an abundance of random insertions and deletions (indels) at the target locus resulting from the cellular response to dsDNA breaks. Here we report the development of ‘base editing’, a new approach to genome editing that enables the direct, irreversible conversion of one target DNA base into another in a programmable manner, without requiring dsDNA backbone cleavage or a donor template. We engineered fusions of CRISPR/Cas9 and a cytidine deaminase enzyme that retain the ability to be programmed with a guide RNA, do not induce dsDNA breaks, and mediate the direct conversion of cytidine to uridine, thereby effecting a C→T (or G→A) substitution. The resulting ‘base editors’ convert cytidines within a window of approximately five nucleotides, and can efficiently correct a variety of point mutations relevant to human disease. In four transformed human and murine cell lines, second- and third-generation base editors that fuse uracil glycosylase inhibitor, and that use a Cas9 nickase targeting the non-edited strand, manipulate the cellular DNA repair response to favour desired base-editing outcomes, resulting in permanent correction of ~15-75% of total cellular DNA with minimal (typically ≤1%) indel formation. Base editing expands the scope and efficiency of genome editing of point mutations.

    @article{Komor:2016uq,
    Abstract = {Current genome-editing technologies introduce double-stranded (ds) DNA breaks at a target locus as the first step to gene correction. Although most genetic diseases arise from point mutations, current approaches to point mutation correction are inefficient and typically induce an abundance of random insertions and deletions (indels) at the target locus resulting from the cellular response to dsDNA breaks. Here we report the development of 'base editing', a new approach to genome editing that enables the direct, irreversible conversion of one target DNA base into another in a programmable manner, without requiring dsDNA backbone cleavage or a donor template. We engineered fusions of CRISPR/Cas9 and a cytidine deaminase enzyme that retain the ability to be programmed with a guide RNA, do not induce dsDNA breaks, and mediate the direct conversion of cytidine to uridine, thereby effecting a C→T (or G→A) substitution. The resulting 'base editors' convert cytidines within a window of approximately five nucleotides, and can efficiently correct a variety of point mutations relevant to human disease. In four transformed human and murine cell lines, second- and third-generation base editors that fuse uracil glycosylase inhibitor, and that use a Cas9 nickase targeting the non-edited strand, manipulate the cellular DNA repair response to favour desired base-editing outcomes, resulting in permanent correction of ~15-75% of total cellular DNA with minimal (typically ≤1%) indel formation. Base editing expands the scope and efficiency of genome editing of point mutations.},
    Author = {Komor, Alexis C and Kim, Yongjoo B and Packer, Michael S and Zuris, John A and Liu, David R},
    Date-Added = {2016-05-31 08:16:43 +0000},
    Date-Modified = {2016-05-31 08:16:43 +0000},
    Doi = {10.1038/nature17946},
    Journal = {Nature},
    Journal-Full = {Nature},
    Month = {May},
    Number = {7603},
    Pages = {420--424},
    Pmc = {PMC4873371},
    Pmid = {27096365},
    Pst = {epublish},
    Title = {Programmable editing of a target base in genomic {DNA} without double-stranded {DNA} cleavage},
    Volume = {533},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1038/nature17946}}

  • K. S. Sarkisyan, D. A. Bolotin, M. V. Meer, D. R. Usmanova, A. S. Mishin, G. V. Sharonov, D. N. Ivankov, N. G. Bozhanova, M. S. Baranov, O. Soylemez, N. S. Bogatyreva, P. K. Vlasov, E. S. Egorov, M. D. Logacheva, A. S. Kondrashov, D. M. Chudakov, E. V. Putintseva, I. Z. Mamedov, D. S. Tawfik, K. A. Lukyanov, and F. A. Kondrashov, “Local fitness landscape of the green fluorescent protein,” Nature, vol. 533, iss. 7603, pp. 397-401, 2016. doi:10.1038/nature17995

    Fitness landscapes depict how genotypes manifest at the phenotypic level and form the basis of our understanding of many areas of biology, yet their properties remain elusive. Previous studies have analysed specific genes, often using their function as a proxy for fitness, experimentally assessing the effect on function of single mutations and their combinations in a specific sequence or in different sequences. However, systematic high-throughput studies of the local fitness landscape of an entire protein have not yet been reported. Here we visualize an extensive region of the local fitness landscape of the green fluorescent protein from Aequorea victoria (avGFP) by measuring the native function (fluorescence) of tens of thousands of derivative genotypes of avGFP. We show that the fitness landscape of avGFP is narrow, with 3/4 of the derivatives with a single mutation showing reduced fluorescence and half of the derivatives with four mutations being completely non-fluorescent. The narrowness is enhanced by epistasis, which was detected in up to 30% of genotypes with multiple mutations and mostly occurred through the cumulative effect of slightly deleterious mutations causing a threshold-like decrease in protein stability and a concomitant loss of fluorescence. A model of orthologous sequence divergence spanning hundreds of millions of years predicted the extent of epistasis in our data, indicating congruence between the fitness landscape properties at the local and global scales. The characterization of the local fitness landscape of avGFP has important implications for several fields including molecular evolution, population genetics and protein design.

    @article{Sarkisyan:2016fk,
    Abstract = {Fitness landscapes depict how genotypes manifest at the phenotypic level and form the basis of our understanding of many areas of biology, yet their properties remain elusive. Previous studies have analysed specific genes, often using their function as a proxy for fitness, experimentally assessing the effect on function of single mutations and their combinations in a specific sequence or in different sequences. However, systematic high-throughput studies of the local fitness landscape of an entire protein have not yet been reported. Here we visualize an extensive region of the local fitness landscape of the green fluorescent protein from Aequorea victoria (avGFP) by measuring the native function (fluorescence) of tens of thousands of derivative genotypes of avGFP. We show that the fitness landscape of avGFP is narrow, with 3/4 of the derivatives with a single mutation showing reduced fluorescence and half of the derivatives with four mutations being completely non-fluorescent. The narrowness is enhanced by epistasis, which was detected in up to 30% of genotypes with multiple mutations and mostly occurred through the cumulative effect of slightly deleterious mutations causing a threshold-like decrease in protein stability and a concomitant loss of fluorescence. A model of orthologous sequence divergence spanning hundreds of millions of years predicted the extent of epistasis in our data, indicating congruence between the fitness landscape properties at the local and global scales. The characterization of the local fitness landscape of avGFP has important implications for several fields including molecular evolution, population genetics and protein design.},
    Author = {Sarkisyan, Karen S and Bolotin, Dmitry A and Meer, Margarita V and Usmanova, Dinara R and Mishin, Alexander S and Sharonov, George V and Ivankov, Dmitry N and Bozhanova, Nina G and Baranov, Mikhail S and Soylemez, Onuralp and Bogatyreva, Natalya S and Vlasov, Peter K and Egorov, Evgeny S and Logacheva, Maria D and Kondrashov, Alexey S and Chudakov, Dmitry M and Putintseva, Ekaterina V and Mamedov, Ilgar Z and Tawfik, Dan S and Lukyanov, Konstantin A and Kondrashov, Fyodor A},
    Date-Added = {2016-05-31 08:14:18 +0000},
    Date-Modified = {2016-05-31 08:14:18 +0000},
    Doi = {10.1038/nature17995},
    Journal = {Nature},
    Journal-Full = {Nature},
    Month = {May},
    Number = {7603},
    Pages = {397-401},
    Pmid = {27193686},
    Pst = {epublish},
    Title = {Local fitness landscape of the green fluorescent protein},
    Volume = {533},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1038/nature17995}}

publications 2016-05-17

In this group meeting, we quickly discussed these latest papers:

  • P. Raccuglia, K. C. Elbert, P. D. F. Adler, C. Falk, M. B. Wenny, A. Mollo, M. Zeller, S. A. Friedler, J. Schrier, and A. J. Norquist, “Machine-learning-assisted materials discovery using failed experiments,” Nature, vol. 533, iss. 7601, pp. 73-76, 2016. doi:10.1038/nature17439

    Inorganic-organic hybrid materials such as organically templated metal oxides, metal-organic frameworks (MOFs) and organohalide perovskites have been studied for decades, and hydrothermal and (non-aqueous) solvothermal syntheses have produced thousands of new materials that collectively contain nearly all the metals in the periodic table. Nevertheless, the formation of these compounds is not fully understood, and development of new compounds relies primarily on exploratory syntheses. Simulation- and data-driven approaches (promoted by efforts such as the Materials Genome Initiative) provide an alternative to experimental trial-and-error. Three major strategies are: simulation-based predictions of physical properties (for example, charge mobility, photovoltaic properties, gas adsorption capacity or lithium-ion intercalation) to identify promising target candidates for synthetic efforts; determination of the structure-property relationship from large bodies of experimental data, enabled by integration with high-throughput synthesis and measurement tools; and clustering on the basis of similar crystallographic structure (for example, zeolite structure classification or gas adsorption properties). Here we demonstrate an alternative approach that uses machine-learning algorithms trained on reaction data to predict reaction outcomes for the crystallization of templated vanadium selenites. We used information on ‘dark’ reactions–failed or unsuccessful hydrothermal syntheses–collected from archived laboratory notebooks from our laboratory, and added physicochemical property descriptions to the raw notebook information using cheminformatics techniques. We used the resulting data to train a machine-learning model to predict reaction success. When carrying out hydrothermal synthesis experiments using previously untested, commercially available organic building blocks, our machine-learning model outperformed traditional human strategies, and successfully predicted conditions for new organically templated inorganic product formation with a success rate of 89 per cent. Inverting the machine-learning model reveals new hypotheses regarding the conditions for successful product formation.

    @article{Raccuglia:2016kl,
    Abstract = {Inorganic-organic hybrid materials such as organically templated metal oxides, metal-organic frameworks (MOFs) and organohalide perovskites have been studied for decades, and hydrothermal and (non-aqueous) solvothermal syntheses have produced thousands of new materials that collectively contain nearly all the metals in the periodic table. Nevertheless, the formation of these compounds is not fully understood, and development of new compounds relies primarily on exploratory syntheses. Simulation- and data-driven approaches (promoted by efforts such as the Materials Genome Initiative) provide an alternative to experimental trial-and-error. Three major strategies are: simulation-based predictions of physical properties (for example, charge mobility, photovoltaic properties, gas adsorption capacity or lithium-ion intercalation) to identify promising target candidates for synthetic efforts; determination of the structure-property relationship from large bodies of experimental data, enabled by integration with high-throughput synthesis and measurement tools; and clustering on the basis of similar crystallographic structure (for example, zeolite structure classification or gas adsorption properties). Here we demonstrate an alternative approach that uses machine-learning algorithms trained on reaction data to predict reaction outcomes for the crystallization of templated vanadium selenites. We used information on 'dark' reactions--failed or unsuccessful hydrothermal syntheses--collected from archived laboratory notebooks from our laboratory, and added physicochemical property descriptions to the raw notebook information using cheminformatics techniques. We used the resulting data to train a machine-learning model to predict reaction success. 
    When carrying out hydrothermal synthesis experiments using previously untested, commercially available organic building blocks, our machine-learning model outperformed traditional human strategies, and successfully predicted conditions for new organically templated inorganic product formation with a success rate of 89 per cent. Inverting the machine-learning model reveals new hypotheses regarding the conditions for successful product formation.},
    Author = {Raccuglia, Paul and Elbert, Katherine C and Adler, Philip D F and Falk, Casey and Wenny, Malia B and Mollo, Aurelio and Zeller, Matthias and Friedler, Sorelle A and Schrier, Joshua and Norquist, Alexander J},
    Date-Added = {2016-05-17 08:26:20 +0000},
    Date-Modified = {2016-05-17 08:26:20 +0000},
    Doi = {10.1038/nature17439},
    Journal = {Nature},
    Journal-Full = {Nature},
    Month = {May},
    Number = {7601},
    Pages = {73--76},
    Pmid = {27147027},
    Pst = {ppublish},
    Title = {Machine-learning-assisted materials discovery using failed experiments},
    Volume = {533},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1038/nature17439}}

  • J. Zhao, Y. Zhou, X. Zhang, and L. Chen, “Part mutual information for quantifying direct associations in networks,” Proc Natl Acad Sci U S A, vol. 113, iss. 18, pp. 5130-5135, 2016. doi:10.1073/pnas.1522586113

    Quantitatively identifying direct dependencies between variables is an important task in data analysis, in particular for reconstructing various types of networks and causal relations in science and engineering. One of the most widely used criteria is partial correlation, but it can only measure linearly direct association and miss nonlinear associations. However, based on conditional independence, conditional mutual information (CMI) is able to quantify nonlinearly direct relationships among variables from the observed data, superior to linear measures, but suffers from a serious problem of underestimation, in particular for those variables with tight associations in a network, which severely limits its applications. In this work, we propose a new concept, "partial independence," with a new measure, "part mutual information" (PMI), which not only can overcome the problem of CMI but also retains the quantification properties of both mutual information (MI) and CMI. Specifically, we first defined PMI to measure nonlinearly direct dependencies between variables and then derived its relations with MI and CMI. Finally, we used a number of simulated data as benchmark examples to numerically demonstrate PMI features and further real gene expression data from Escherichia coli and yeast to reconstruct gene regulatory networks, which all validated the advantages of PMI for accurately quantifying nonlinearly direct associations in networks.

    @article{Zhao:2016oq,
    Abstract = {Quantitatively identifying direct dependencies between variables is an important task in data analysis, in particular for reconstructing various types of networks and causal relations in science and engineering. One of the most widely used criteria is partial correlation, but it can only measure linearly direct association and miss nonlinear associations. However, based on conditional independence, conditional mutual information (CMI) is able to quantify nonlinearly direct relationships among variables from the observed data, superior to linear measures, but suffers from a serious problem of underestimation, in particular for those variables with tight associations in a network, which severely limits its applications. In this work, we propose a new concept, "partial independence," with a new measure, "part mutual information" (PMI), which not only can overcome the problem of CMI but also retains the quantification properties of both mutual information (MI) and CMI. Specifically, we first defined PMI to measure nonlinearly direct dependencies between variables and then derived its relations with MI and CMI. Finally, we used a number of simulated data as benchmark examples to numerically demonstrate PMI features and further real gene expression data from Escherichia coli and yeast to reconstruct gene regulatory networks, which all validated the advantages of PMI for accurately quantifying nonlinearly direct associations in networks.},
    Author = {Zhao, Juan and Zhou, Yiwei and Zhang, Xiujun and Chen, Luonan},
    Date-Added = {2016-05-17 07:45:05 +0000},
    Date-Modified = {2016-05-17 07:45:05 +0000},
    Doi = {10.1073/pnas.1522586113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {conditional independence; conditional mutual information; network inference; systems biology},
    Month = {May},
    Number = {18},
    Pages = {5130-5},
    Pmid = {27092000},
    Pst = {ppublish},
    Title = {Part mutual information for quantifying direct associations in networks},
    Volume = {113},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1522586113}}

  • G. Qian, C. R. Rao, X. Sun, and Y. Wu, “Boosting association rule mining in large datasets via Gibbs sampling,” Proc natl acad sci u s a, vol. 113, iss. 18, pp. 4958-63, 2016. doi:10.1073/pnas.1604553113
    [BibTeX] [Abstract]

    Current algorithms for association rule mining from transaction data are mostly deterministic and enumerative. They can be computationally intractable even for mining a dataset containing just a few hundred transaction items, if no action is taken to constrain the search space. In this paper, we develop a Gibbs-sampling-induced stochastic search procedure to randomly sample association rules from the itemset space, and perform rule mining from the reduced transaction dataset generated by the sample. Also a general rule importance measure is proposed to direct the stochastic search so that, as a result of the randomly generated association rules constituting an ergodic Markov chain, the overall most important rules in the itemset space can be uncovered from the reduced dataset with probability 1 in the limit. In the simulation study and a real genomic data example, we show how to boost association rule mining by an integrated use of the stochastic search and the Apriori algorithm.

    @article{Qian:2016nx,
    Abstract = {Current algorithms for association rule mining from transaction data are mostly deterministic and enumerative. They can be computationally intractable even for mining a dataset containing just a few hundred transaction items, if no action is taken to constrain the search space. In this paper, we develop a Gibbs-sampling-induced stochastic search procedure to randomly sample association rules from the itemset space, and perform rule mining from the reduced transaction dataset generated by the sample. Also a general rule importance measure is proposed to direct the stochastic search so that, as a result of the randomly generated association rules constituting an ergodic Markov chain, the overall most important rules in the itemset space can be uncovered from the reduced dataset with probability 1 in the limit. In the simulation study and a real genomic data example, we show how to boost association rule mining by an integrated use of the stochastic search and the Apriori algorithm.},
    Author = {Qian, Guoqi and Rao, Calyampudi Radhakrishna and Sun, Xiaoying and Wu, Yuehua},
    Date-Added = {2016-05-17 07:43:25 +0000},
    Date-Modified = {2016-05-17 07:43:25 +0000},
    Doi = {10.1073/pnas.1604553113},
    Journal = {Proc Natl Acad Sci U S A},
    Journal-Full = {Proceedings of the National Academy of Sciences of the United States of America},
    Keywords = {Gibbs sampling; association rule; genomic data; transaction data},
    Month = {May},
    Number = {18},
    Pages = {4958-63},
    Pmid = {27091963},
    Pst = {ppublish},
    Title = {Boosting association rule mining in large datasets via Gibbs sampling},
    Volume = {113},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1073/pnas.1604553113}}

  • R. Kolde, K. Märtens, K. Lokk, S. Laur, and J. Vilo, “Seqlm: an MDL based method for identifying differentially methylated regions in high density methylation array data,” Bioinformatics, 2016.
    [BibTeX] [Abstract]

    Motivation: One of the main goals of large scale methylation studies is to detect differentially methylated loci. One way is to approach this problem sitewise, i.e. to find differentially methylated positions (DMPs). However, it has been shown that methylation is regulated in longer genomic regions. So it is more desirable to identify differentially methylated regions (DMRs) instead of DMPs. The new high coverage arrays, like Illumina's 450k platform, make it possible at a reasonable cost. Few tools exist for DMR identification from this type of data, but there is no standard approach. Results: We propose a novel method for DMR identification that detects the region boundaries according to the minimum description length (MDL) principle, essentially solving the problem of model selection. The significance of the regions is established using linear mixed models. Using both simulated and large publicly available methylation datasets, we compare seqlm performance to alternative approaches. We demonstrate that it is both more sensitive and specific than competing methods. This is achieved with minimal parameter tuning and, surprisingly, quickest running time of all the tried methods. Finally, we show that the regional differential methylation patterns identified on sparse array data are confirmed by higher resolution sequencing approaches. Availability: The methods have been implemented in R package seqlm that is available through Github: //github.com/raivokolde/seqlm Contact: rkolde@gmail.com

    @article{kolde:seqlm16,
    Abstract = {Motivation: One of the main goals of large scale methylation studies is to detect differentially methylated loci. One way is to approach this problem sitewise, i.e. to find differentially methylated positions (DMPs). However, it has been shown that methylation is regulated in longer genomic regions. So it is more desirable to identify differentially methylated regions (DMRs) instead of DMPs. The new high coverage arrays, like Illumina's 450k platform, make it possible at a reasonable cost. Few tools exist for DMR identification from this type of data, but there is no standard approach.
    Results: We propose a novel method for DMR identification that detects the region boundaries according to the minimum description length (MDL) principle, essentially solving the problem of model selection. The significance of the regions is established using linear mixed models. Using both simulated and large publicly available methylation datasets, we compare seqlm performance to alternative approaches. We demonstrate that it is both more sensitive and specific than competing methods. This is achieved with minimal parameter tuning and, surprisingly, quickest running time of all the tried methods. Finally, we show that the regional differential methylation patterns identified on sparse array data are confirmed by higher resolution sequencing approaches.
    Availability: The methods have been implemented in R package seqlm that is available through Github: //github.com/raivokolde/seqlm
    Contact: rkolde@gmail.com},
    Author = {Kolde, Raivo and M{\"a}rtens, Kaspar and Lokk, Kaie and Laur, Sven and Vilo, Jaak},
    Date-Added = {2016-05-17 07:31:22 +0000},
    Date-Modified = {2016-05-17 07:32:53 +0000},
    Journal = {Bioinformatics},
    Title = {seqlm: an MDL based method for identifying differentially methylated regions in high density methylation array data},
    Year = {2016}}

  • Y. Kim, S. Madan, and T. M. Przytycka, “WeSME: uncovering mutual exclusivity of cancer drivers and beyond,” Bioinformatics, 2016. doi:10.1093/bioinformatics/btw242
    [BibTeX] [Abstract]

    MOTIVATION: Mutual exclusivity is a widely recognized property of many cancer drivers. Knowledge about these relationships can provide important insights into cancer drivers, cancer-driving pathways, and cancer subtypes. It can also be used to predict new functional interactions between cancer driving genes and uncover novel cancer drivers. Currently, most mutual exclusivity analyses are performed focusing on a limited set of genes in part due to the computational cost required to rigorously compute p-values. RESULTS: To reduce the computing cost and perform less restricted mutual exclusivity analysis, we developed an efficient method to estimate p-values while controlling the mutation rates of individual patients and genes similar to the permutation test. A comprehensive mutual exclusivity analysis allowed us to uncover mutually exclusive pairs, some of which may have relatively low mutation rates. These pairs often included likely cancer drivers that have been missed in previous analyses. More importantly, our results demonstrated that mutual exclusivity can also provide information that goes beyond the interactions between cancer drivers and can, for example, elucidate different mutagenic processes in different cancer groups. In particular, including frequently mutated, long genes such as TTN in our analysis allowed us to observe interesting patterns of APOBEC activity in breast cancer and identify a set of related driver genes that are highly predictive of patient survival. In addition, we utilized our mutual exclusivity analysis in support of a previously proposed model where APOBEC activity is the underlying process that causes TP53 mutations in a subset of breast cancer cases. AVAILABILITY: //www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/index.cgi#wesme CONTACT: przytyck@ncbi.nlm.nih.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    @article{Kim:2016cr,
    Abstract = {MOTIVATION: Mutual exclusivity is a widely recognized property of many cancer drivers. Knowledge about these relationships can provide important insights into cancer drivers, cancer-driving pathways, and cancer subtypes. It can also be used to predict new functional interactions between cancer driving genes and uncover novel cancer drivers. Currently, most mutual exclusivity analyses are performed focusing on a limited set of genes in part due to the computational cost required to rigorously compute p-values.
    RESULTS: To reduce the computing cost and perform less restricted mutual exclusivity analysis, we developed an efficient method to estimate p-values while controlling the mutation rates of individual patients and genes similar to the permutation test. A comprehensive mutual exclusivity analysis allowed us to uncover mutually exclusive pairs, some of which may have relatively low mutation rates. These pairs often included likely cancer drivers that have been missed in previous analyses. More importantly, our results demonstrated that mutual exclusivity can also provide information that goes beyond the interactions between cancer drivers and can, for example, elucidate different mutagenic processes in different cancer groups. In particular, including frequently mutated, long genes such as TTN in our analysis allowed us to observe interesting patterns of APOBEC activity in breast cancer and identify a set of related driver genes that are highly predictive of patient survival. In addition, we utilized our mutual exclusivity analysis in support of a previously proposed model where APOBEC activity is the underlying process that causes TP53 mutations in a subset of breast cancer cases.
    AVAILABILITY: //www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/index.cgi#wesme CONTACT: przytyck@ncbi.nlm.nih.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.},
    Author = {Kim, Yoo-Ah and Madan, Sanna and Przytycka, Teresa M},
    Date-Added = {2016-05-17 07:30:14 +0000},
    Date-Modified = {2016-05-17 07:30:14 +0000},
    Doi = {10.1093/bioinformatics/btw242},
    Journal = {Bioinformatics},
    Journal-Full = {Bioinformatics (Oxford, England)},
    Month = {May},
    Pmid = {27153670},
    Pst = {aheadofprint},
    Title = {WeSME: Uncovering Mutual Exclusivity of Cancer Drivers and Beyond},
    Year = {2016},
    Bdsk-Url-1 = {//dx.doi.org/10.1093/bioinformatics/btw242}}

  • V. Rao, L. Lin, and D. Dunson, “Data augmentation for models based on rejection sampling,” arXiv preprint arXiv:1406.6652, 2014.
    [BibTeX]
    @article{rao2014data,
    title={Data augmentation for models based on rejection sampling},
    author={Rao, Vinayak and Lin, Lizhen and Dunson, David},
    journal={arXiv preprint arXiv:1406.6652},
    year={2014}
    }

  • Y. Hu, K. Huang, Q. An, G. Du, G. Hu, J. Xue, X. Zhu, C. Wang, Z. Xue, and G. Fan, “Simultaneous profiling of transcriptome and DNA methylome from a single cell,” Genome biology, vol. 17, iss. 1, p. 1, 2016.
    [BibTeX]
    @article{hu2016simultaneous,
    title={Simultaneous profiling of transcriptome and DNA methylome from a single cell},
    author={Hu, Youjin and Huang, Kevin and An, Qin and Du, Guizhen and Hu, Ganlu and Xue, Jinfeng and Zhu, Xianmin and Wang, Cun-Yu and Xue, Zhigang and Fan, Guoping},
    journal={Genome Biology},
    volume={17},
    number={1},
    pages={1},
    year={2016},
    publisher={BioMed Central}
    }