Person

Yoshua Bengio

Computer Scientist

Computer scientist known for foundational work in deep learning and for leading Mila and research at Universite de Montreal.

Website

Papers

openalex-author · ArXiv.org

Sliding Window Recurrences for Sequence Models

Multi-hybrid architectures are poised to take over language modeling due to better quality and performance. We introduce a hierarchical decomposition framework for linear recurrences that allows us to develop algorithms aligned with GPU memory hierarchies, yielding Sliding Window Recurrences. We focus specifically on truncating recurrences to hardware-aligned windows which are naturally jagged, limiting costly inter-warp communication. Using SWR, we develop Phalanx layers that serve as drop-in replacements for windowed attention or linear recurrences. In 1B parameter multi-hybrid models, Phalanx achieves over 10-40% speedup across 4K to 32K context length over optimized Transformers while matching perplexity.

openalex-author · SuperIntelligence - Robotics - Safety & Alignment

International AI Safety Report 2025: Second Key Update: Technical Safeguards and Risk Management

This is the Second Key Update to the 2025 International AI Safety Report. The First Key Update (1) discussed developments in the capabilities of general-purpose AI models and systems and associated risks. This Key Update covers how various actors, including researchers, companies, and governments, are approaching risk management and technical mitigations for AI. The past year has seen important developments in AI risk management, including better techniques for training safer models and monitoring their outputs. While this represents tangible progress, significant gaps remain. It is often uncertain how effective current measures are at preventing harms, and effectiveness varies across time and applications. There are many opportunities to further strengthen existing safeguard techniques and to develop new ones. This Key Update provides a concise overview of critical developments in risk management practices and technical risk mitigation since the publication of the 2025 AI Safety Report in January. It highlights where progress is being made and where gaps remain. Above all, it aims to support policymakers, researchers, and the public in navigating a rapidly changing environment, helping them to make informed and timely decisions about the governance of general-purpose AI. Professor Yoshua BengioUniversité de Montréal / LawZero /Mila – Quebec AI Institute & Chair

openalex-author · ArXiv.org

Adsorption energies are necessary but not sufficient to identify good catalysts

As a core technology for green chemical synthesis and electrochemical energy storage, electrocatalysis is central to decarbonization strategies aimed at combating climate change. In this context, computational and machine learning driven catalyst discovery has emerged as a major research focus. These approaches frequently use the thermodynamic overpotential, calculated from adsorption free energies of reaction intermediates, as a key parameter in their analysis. In this paper, we explore the large-scale applicability of such overpotential estimates for identifying good catalyst candidates by using datasets from the Open Catalyst Project (OC20 and OC22). We start by quantifying the uncertainty in predicting adsorption energies using \textit{ab initio} methods and find that $\sim$0.3-0.5 eV is a conservative estimate for a single adsorption energy prediction. We then compute the overpotential of all materials in the OC20 and OC22 datasets for the hydrogen and oxygen evolution reactions. We find that while the overpotential allows the identification of known good catalysts such as platinum and iridium oxides, the uncertainty is large enough to misclassify a broad fraction of the datasets as ``good'', which limits its value as a screening criterion. These results question the reliance on overpotential estimation as a primary evaluation metric to sort through catalyst candidates and calls for a shift in focus in the computational catalysis and machine learning communities towards other metrics such as synthesizability, stability, lifetime or affordability.

openalex-author · Proceedings of the 12th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation

A HOT Dataset: 150,000 Buildings for HVAC Operations Transfer Research

About 12% of global energy consumption is attributable to heating, ventilation, and air conditioning (HVAC) systems in buildings [11]. Machine learning-based intelligent HVAC control offers significant energy efficiency potential, but progress is constrained by limited data for training and evaluating performance across different kinds of buildings. Existing datasets primarily target energy prediction rather than control applications, forcing studies to rely on limited building sets or single-variable perturbations that fail to capture real-world complexity. We present HOT (HVAC Operations Transfer), the first large-scale open-source dataset purpose-built for research into transfer learning in building control. HOT contains 159,744 unique building-weather combinations with systematic variations across envelope properties, occupancy patterns, and climate conditions spanning all 19 ASHRAE climate zones across 76 global locations. We formalise a comprehensive similarity-based framework with quantitative metrics for assessing transfer feasibility between source and target buildings across multiple context dimensions. Our key contributions: (1) a large-scale, open dataset and tooling enabling systematic, multi-variable transfer studies across 19 climate zones; (2) a quantitative similarity framework spanning geometry, thermal, climate, and function; and (3) zero-shot climate transfer experiments showing why realistic context variation matters for HVAC control.

openalex-author · Nature Biotechnology

Publisher Correction: Deep-learning-based virtual screening of antibacterial compounds

No abstract available from the OpenAlex source record.

openalex-author · Trends in Cognitive Sciences

Identifying indicators of consciousness in AI systems

No abstract available from the OpenAlex source record.

openalex-author · Nature Biotechnology

Deep-learning-based virtual screening of antibacterial compounds

No abstract available from the OpenAlex source record.

openalex-author · SuperIntelligence - Robotics - Safety & Alignment

International Al Safety Report: First Key Update Capabilities and Risk Implications

The field of AI is moving too quickly for a single yearly publication to keep pace. Significant changes can occur on a timescale of months, sometimes weeks. This is why we are releasing Key Updates: shorter, focused reports that highlight the most important developments between full editions of the International AI Safety Report. With these updates, we aim to provide policymakers, researchers, and the public with up-to-date information to support wise decisions about AI governance. This first Key Update focuses on areas where especially significant changes have occurred since January 2025: advances in general-purpose AI systems' capabilities, and the implications for several critical risks. New training techniques have enabled AI systems to reason step-by-step and operate autonomously for longer periods, allowing them to tackle more kinds of work. However, these same advances create new challenges across biological risks, cyber security, and oversight of AI systems themselves. The International AI Safety Report is intended to help readers assess, anticipate, and manage risks from general-purpose AI systems. These Key Updates ensure that critical developments receive timely attention as the field rapidly evolves.

openalex-author · ArXiv.org

A Definition of AGI

The lack of a concrete definition for Artificial General Intelligence (AGI) obscures the gap between today's specialized AI and human-level cognition. This paper introduces a quantifiable framework to address this, defining AGI as matching the cognitive versatility and proficiency of a well-educated adult. To operationalize this, we ground our methodology in Cattell-Horn-Carroll theory, the most empirically validated model of human cognition. The framework dissects general intelligence into ten core cognitive domains-including reasoning, memory, and perception-and adapts established human psychometric batteries to evaluate AI systems. Application of this framework reveals a highly "jagged" cognitive profile in contemporary models. While proficient in knowledge-intensive domains, current AI systems have critical deficits in foundational cognitive machinery, particularly long-term memory storage. The resulting AGI scores (e.g., GPT-4 at 27%, GPT-5 at 57%) concretely quantify both rapid progress and the substantial gap remaining before AGI.

openalex-author · ChemRxiv

Navigating ternary doping in Li-ion cathodes with closed-loop multi-objective Bayesian optimization

No abstract available from the OpenAlex source record.

openalex-author · ArXiv.org

International AI Safety Report 2025: First Key Update: Capabilities and Risk Implications

Since the publication of the first International AI Safety Report, AI capabilities have continued to improve across key domains. New training techniques that teach AI systems to reason step-by-step and inference-time enhancements have primarily driven these advances, rather than simply training larger models. As a result, general-purpose AI systems can solve more complex problems in a range of domains, from scientific research to software development. Their performance on benchmarks that measure performance in coding, mathematics, and answering expert-level science questions has continued to improve, though reliability challenges persist, with systems excelling on some tasks while failing completely on others. These capability improvements also have implications for multiple risks, including risks from biological weapons and cyber attacks. Finally, they pose new challenges for monitoring and controllability. This update examines how AI capabilities have improved since the first Report, then focuses on key risk areas where substantial new evidence warrants updated assessments.

openalex-author · ArXiv.org

Catalyst GFlowNet for electrocatalyst design: A hydrogen evolution reaction case study

Efficient and inexpensive energy storage is essential for accelerating the adoption of renewable energy and ensuring a stable supply, despite fluctuations in sources such as wind and solar. Electrocatalysts play a key role in hydrogen energy storage (HES), allowing the energy to be stored as hydrogen. However, the development of affordable and high-performance catalysts for this process remains a significant challenge. We introduce Catalyst GFlowNet, a generative model that leverages machine learning-based predictors of formation and adsorption energy to design crystal surfaces that act as efficient catalysts. We demonstrate the performance of the model through a proof-of-concept application to the hydrogen evolution reaction, a key reaction in HES, for which we successfully identified platinum as the most efficient known catalyst. In future work, we aim to extend this approach to the oxygen evolution reaction, where current optimal catalysts are expensive metal oxides, and open the search space to discover new materials. This generative modeling framework offers a promising pathway for accelerating the search for novel and efficient catalysts.

openalex-author · Science

Illusions of AI consciousness

The belief that AI is conscious is not without risk.

openalex-author · ArXiv.org

Relative Trajectory Balance is equivalent to Trust-PCL

Recent progress in generative modeling has highlighted the importance of Reinforcement Learning (RL) for fine-tuning, with KL-regularized methods in particular proving to be highly effective for both autoregressive and diffusion models. Complementing this line of work, the Relative Trajectory Balance (RTB) objective was recently introduced in the context of Generative Flow Networks (GFlowNets) to serve the same role of improving fine-tuning in sequential generative models. Building on prior work linking GFlowNets and maximum-entropy RL, we establish in this paper an equivalence between RTB and Trust-PCL, an off-policy RL method with KL regularization. This equivalence situates RTB within the broader theoretical landscape of KL-regularized RL, and clarifies its relationship to earlier methods. Leveraging this insight, we revisit an illustrative example from the RTB paper and show that KL-regularized RL methods achieve comparable performance, offering an alternative perspective to what was previously reported.

openalex-author · SuperIntelligence - Robotics - Safety & Alignment

Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

The leading AI companies are increasingly focused on building generalist AI agents—systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human control. We discuss how these risks arise from current AI training methods. Indeed, various scenarios and experiments have demonstrated the possibility of AI agents engaging in deception or pursuing goals that were not specified by human operators and that conflict with human interests, such as self-preservation. Following the precautionary principle, we see a strong need for safer, yet still useful, alternatives to the current agency-driven trajectory. Accordingly, we propose as a core building block for further advances the development of a non-agentic AI system that is trustworthy and safe by design, which we call Scientist AI. This system is designed to explain the world from observations, as opposed to taking actions in it to imitate or please humans. It comprises a world model that generates theories to explain data and a question-answering inference machine. Both components operate with an explicit notion of uncertainty to mitigate the risks of overconfident predictions. In light of these considerations, a Scientist AI could be used to assist human researchers in accelerating scientific progress, including in AI safety. In particular, our system can be employed as a guardrail against AI agents that might be created despite the risks involved. Ultimately, focusing on non-agentic AI may enable the benefits of AI innovation while avoiding the risks associated with the current trajectory. We hope these arguments will motivate researchers, developers, and policymakers to favor this safer path.

openalex-author · SuperIntelligence - Robotics - Safety & Alignment

In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?

International cooperation is common in AI research, including between geopolitical rivals. While many experts advocate for greater international cooperation on AI safety to address shared global risks, some view cooperation on AI with suspicion, arguing that it can pose unacceptable risks to national security. However, the extent to which cooperation on AI safety poses such risks, as well as provides benefits, depends on the specific area of cooperation. In this paper, we consider technical factors that impact the risks of international cooperation on AI safety research, focusing on the degree to which such cooperation can advance dangerous capabilities, result in the sharing of sensitive information, or provide opportunities for harm. We begin by why nations historically cooperate on strategic technologies and analyse current US-China cooperation in AI as a case study. We further argue that existing frameworks for managing associated risks can be supplemented with consideration of key risks specific to cooperation on technical AI safety research. Through our analysis, we find that research into AI verification mechanisms and shared protocols may be suitable areas for such cooperation. Through this analysis we aim to help researchers and governments identify and mitigate the risks of international cooperation on AI safety research, so that the benefits of cooperation can be fully realised.

openalex-author · SuperIntelligence - Robotics - Safety & Alignment

The Singapore Consensus on Global AI Safety Research Priorities

Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential – it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash. This requires policymakers, industry, researchers and the broader public to collectively work toward securing positive outcomes from AI’s development. AI safety research is a key dimension. Given that the state of science today for building trustworthy AI does not fully cover all risks, accelerated investment in research is required to keep pace with commercially driven growth in system capabilities. Goals: The 2025 Singapore Conference on AI (SCAI): International Scientific Exchange on AI Safety aims to support research in this important space by bringing together AI scientists across geographies to identify and synthesise research priorities in AI safety. The result, The Singapore Consensus on Global AI Safety Research Priorities, builds on the International AI Safety Report-A (IAISR) chaired by Yoshua Bengio and backed by 33 governments. By adopting a defence-in-depth model, this document organises AI safety research domains into three types: challenges with creating trustworthy AI systems (Development), challenges with evaluating their risks (Assessment), and challenges with monitoring and intervening after deployment (Control). Through the Singapore Consensus, we hope to globally facilitate meaningful conversations between AI scientists and AI policymakers for maximally beneficial outcomes. Our goal is to enable more impactful R&D efforts to rapidly develop safety and evaluation mechanisms and foster a trusted ecosystem where AI is harnessed for the public good.

openalex-author · ArXiv.org

Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization

Large language models (LLMs) leverage chain-of-thought (CoT) techniques to tackle complex problems, representing a transformative breakthrough in artificial intelligence (AI). However, their reasoning capabilities have primarily been demonstrated in solving math and coding problems, leaving their potential for domain-specific applications-such as battery discovery-largely unexplored. Inspired by the idea that reasoning mirrors a form of guided search, we introduce ChatBattery, a novel agentic framework that integrates domain knowledge to steer LLMs toward more effective reasoning in materials design. Using ChatBattery, we successfully identify, synthesize, and characterize three novel lithium-ion battery cathode materials, which achieve practical capacity improvements of 28.8%, 25.2%, and 18.5%, respectively, over the widely used cathode material, LiNi0.8Mn0.1Co0.1O2 (NMC811). Beyond this discovery, ChatBattery paves a new path by showing a successful LLM-driven and reasoning-based platform for battery materials invention. This complete AI-driven cycle-from design to synthesis to characterization-demonstrates the transformative potential of AI-driven reasoning in revolutionizing materials discovery.

openalex-author · ArXiv.org

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

openalex-author · ArXiv.org

Torsional-GFN: a conditional conformation generator for small molecules

Generating stable molecular conformations is crucial in several drug discovery applications, such as estimating the binding affinity of a molecule to a target. Recently, generative machine learning methods have emerged as a promising, more efficient method than molecular dynamics for sampling of conformations from the Boltzmann distribution. In this paper, we introduce Torsional-GFN, a conditional GFlowNet specifically designed to sample conformations of molecules proportionally to their Boltzmann distribution, using only a reward function as training signal. Conditioned on a molecular graph and its local structure (bond lengths and angles), Torsional-GFN samples rotations of its torsion angles. Our results demonstrate that Torsional-GFN is able to sample conformations approximately proportional to the Boltzmann distribution for multiple molecules with a single model, and allows for zero-shot generalization to unseen bond lengths and angles coming from the MD simulations for such molecules. Our work presents a promising avenue for scaling the proposed approach to larger molecular systems, achieving zero-shot generalization to unseen molecules, and including the generation of the local structure into the GFlowNet model.

openalex-author · ArXiv.org

Bringing SAM to new heights: Leveraging elevation data for tree crown segmentation from drone imagery

Information on trees at the individual level is crucial for monitoring forest ecosystems and planning forest management. Current monitoring methods involve ground measurements, requiring extensive cost, time and labor. Advances in drone remote sensing and computer vision offer great potential for mapping individual trees from aerial imagery at broad-scale. Large pre-trained vision models, such as the Segment Anything Model (SAM), represent a particularly compelling choice given limited labeled data. In this work, we compare methods leveraging SAM for the task of automatic tree crown instance segmentation in high resolution drone imagery in three use cases: 1) boreal plantations, 2) temperate forests and 3) tropical forests. We also study the integration of elevation data into models, in the form of Digital Surface Model (DSM) information, which can readily be obtained at no additional cost from RGB drone imagery. We present BalSAM, a model leveraging SAM and DSM information, which shows potential over other methods, particularly in the context of plantations. We find that methods using SAM out-of-the-box do not outperform a custom Mask R-CNN, even with well-designed prompts. However, efficiently tuning SAM end-to-end and integrating DSM information are both promising avenues for tree crown instance segmentation models.

openalex-author · SuperIntelligence - Robotics - Safety & Alignment

The First International AI Safety Report

This is the first International AI Safety Report. Following an interim publication in May 2024, a diverse group of 96 Artificial Intelligence (AI) experts contributed to this first full report, including an international Expert Advisory Panel nominated by 30 countries, the Organisation for Economic Co-operation and Development (OECD), the European Union (EU), and the United Nations (UN). The report aims to provide scientific information that will support informed policymaking. It does not recommend specific policies…. This report summarises the scientific evidence on the safety of general-purpose AI. The purpose of this report is to help create a shared international understanding of risks from advanced AI and how they can be mitigated. To achieve this, this report focuses on general-purpose AI – or AI that can perform a wide variety of tasks – since this type of AI has advanced particularly rapidly in recent years and has been deployed widely by technology companies for a range of consumer and business purposes. The report synthesises the state of scientific understanding of general-purpose AI, with a focus on understanding and managing its risks. Amid rapid advancements, research on general-purpose AI is currently in a time of scientific discovery, and – in many cases – is not yet settled science. The report provides a snapshot of the current scientific understanding of general-purpose AI and its risks. This includes identifying areas of scientific consensus and areas where there are different views or gaps in the current scientific understanding. People around the world will only be able to fully enjoy the potential benefits of general-purpose AI safely if its risks are appropriately managed. This report focuses on identifying those risks and evaluating technical methods for assessing and mitigating them, including ways that general-purpose AI itself can be used to mitigate risks. Y. Bengio, S. Mindermann, D. Privitera, T. Besiroglu, R. Bommasani, S. Casper, Y. Choi, P. Fox, B. Garfinkel, D. Goldfarb, H. Heidari, A. Ho, S. Kapoor, L. Khalatbari, S. Longpre, S. Manning, V. Mavroudis, M. Mazeika, J. Michael, J. Newman, K. Y. Ng, C. T. Okolo, D. Raji, G. Sastry, E. Seger, T. Skeadas, T. South, E. Strubell, F. Tramèr, L. Velasco, N. Wheeler, D. Acemoglu, O. Adekanmbi, D. Dalrymple, T. G. Dietterich, P. Fung, P.-O. Gourinchas, F. Heintz, G. Hinton, N. Jennings, A. Krause, S. Leavy, P. Liang, T. Ludermir, V. Marda, H. Margetts, J. McDermid, J. Munga, A. Narayanan, A. Nelson, C. Neppel, A. Oh, G. Ramchurn, S. Russell, M. Schaake, B. Schölkopf, D. Song, A. Soto, L. Tiedrich, G. Varoquaux, E. W. Felten, A. Yao, Y.-Q. Zhang, O. Ajala, F. Albalawi, M. Alserkal, G. Avrin, C. Busch, A. C. P. de L. F. de Carvalho, B. Fox, A. S. Gill, A. H. Hatip, J. Heikkilä, C. Johnson, G. Jolly, Z. Katzir, S. M. Khan, H. Kitano, A. Krüger, K. M. Lee, D. V. Ligot, J. R. López Portillo, D., O. Molchanovskyi, A. Monti, N. Mwamanzi, M. Nemer, N. Oliver, R. Pezoa Rivera, B. Ravindran, H. Riza, C. Rugege, C. Seoighe, H. Sheikh, J. Sheehan, D. Wong, Y. Zeng, “International AI Safety Report” (DSIT 2025/001, 2025); https://www.gov.uk/government/publications/international-ai-safety-report-2025

openalex-author · ArXiv.org

Structure-Aligned Protein Language Model

Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but often lack the structural knowledge essential for some biological applications. To address this, we introduce a method to enrich pLMs with structural knowledge by leveraging pre-trained protein graph neural networks (pGNNs). First, a latent-level contrastive learning task aligns residue representations from pLMs with those from pGNNs across multiple proteins, injecting inter-protein structural information. Additionally, a physical-level task integrates intra-protein information by training pLMs to predict structure tokens. Together, the proposed dual-task framework effectively incorporates both inter- and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in PDB, we further introduce a residue loss selection module that uses a small model trained on high-quality structures to select reliable yet challenging residue losses for the pLM to learn. Applying our structure alignment method as a simple, lightweight post-training step to the state-of-the-art ESM2 and AMPLIFY yields notable performance gains. These improvements are consistent across a wide range of tasks, including substantial gains in deep mutational scanning (DMS) fitness prediction and a 59% increase in P@L for ESM2 650M contact prediction on CASP16. Furthermore, we demonstrate that these performance gains are robust, scaling with model sizes from 8M to 650M and extending to different downstream tasks.

openalex-author · Nature Genetics

Causal machine learning for single-cell genomics

No abstract available from the OpenAlex source record.

openalex-author · ArXiv.org

Extendable Planning via Multiscale Diffusion

Long-horizon planning is crucial in complex environments, but diffusion-based planners like Diffuser are limited by the trajectory lengths observed during training. This creates a dilemma: long trajectories are needed for effective planning, yet they degrade model performance. In this paper, we introduce this extendable long-horizon planning challenge and propose a two-phase solution. First, Progressive Trajectory Extension incrementally constructs longer trajectories through multi-round compositional stitching. Second, the Hierarchical Multiscale Diffuser enables efficient training and inference over long horizons by reasoning across temporal scales. To avoid the need for multiple separate models, we propose Adaptive Plan Pondering and the Recursive HM-Diffuser, which unify hierarchical planning within a single model. Experiments show our approach yields strong performance gains, advancing scalable and efficient decision-making over long-horizons.

openalex-author · ArXiv.org

A scalable gene network model of regulatory dynamics in single cells

Single-cell data provide high-dimensional measurements of the transcriptional states of cells, but extracting insights into the regulatory functions of genes, particularly identifying transcriptional mechanisms affected by biological perturbations, remains a challenge. Many perturbations induce compensatory cellular responses, making it difficult to distinguish direct from indirect effects on gene regulation. Modeling how gene regulatory functions shape the temporal dynamics of these responses is key to improving our understanding of biological perturbations. Dynamical models based on differential equations offer a principled way to capture transcriptional dynamics, but their application to single-cell data has been hindered by computational constraints, stochasticity, sparsity, and noise. Existing methods either rely on low-dimensional representations or make strong simplifying assumptions, limiting their ability to model transcriptional dynamics at scale. We introduce a Functional and Learnable model of Cell dynamicS, FLeCS, that incorporates gene network structure into coupled differential equations to model gene regulatory functions. Given (pseudo)time-series single-cell data, FLeCS accurately infers cell dynamics at scale, provides improved functional insights into transcriptional mechanisms perturbed by gene knockouts, both in myeloid differentiation and K562 Perturb-seq experiments, and simulates single-cell trajectories of A549 cells following small-molecule perturbations.

openalex-author · ArXiv.org

Solving Bayesian inverse problems with diffusion priors and off-policy RL

This paper presents a practical application of Relative Trajectory Balance (RTB), a recently introduced off-policy reinforcement learning (RL) objective that can asymptotically solve Bayesian inverse problems optimally. We extend the original work by using RTB to train conditional diffusion model posteriors from pretrained unconditional priors for challenging linear and non-linear inverse problems in vision, and science. We use the objective alongside techniques such as off-policy backtracking exploration to improve training. Importantly, our results show that existing training-free diffusion posterior methods struggle to perform effective posterior inference in latent space due to inherent biases.

openalex-author · Nature Neuroscience

What makes a theory of consciousness unscientific?

No abstract available from the OpenAlex source record.

openalex-author · SuperIntelligence - Robotics - Safety & Alignment

Can a Bayesian Oracle Prevent Harm from an Agent?

Is there a way to design powerful AI systems based on machine learning methods that would satisfy probabilistic safety guarantees? With the long-term goal of obtaining a probabilistic guarantee that would apply in every context, we consider estimating a context-dependent bound on the probability of violating a given safety specification. Such a risk evaluation would need to be performed at run-time to provide a guardrail against dangerous actions of an AI. Noting that different plausible hypotheses about the world could produce very different outcomes, and because we do not know which one is right, we derive bounds on the safety violation probability predicted under the true but unknown hypothesis. Such bounds could be used to reject potentially dangerous actions. Our main results involve searching for cautious but plausible hypotheses, obtained by a maximization that involves Bayesian posteriors over hypotheses. We consider two forms of this result, in the i.i.d. case and in the non-i.i.d. case, and conclude with open problems towards turning such theoretical results into practical AI guardrails.

openalex-author · 2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)

EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision

This paper presents EarthView, a comprehensive dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks. The dataset spans 15 tera pixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic. Our dataset provides a wide spectrum of image data with varying resolutions, harnessed from different sensors and organized coherently into an accessible HuggingFace dataset1 1Available at https://huggingface.co/datasets/satellogic/EarthView in parquet format. This data spans five years, from 2017 to 2022. Accompanying the dataset, we introduce Earth-MAE, a tailored Masked Autoencoder, developed to tackle the distinct challenges of remote sensing data. Trained in a self-supervised fashion, EarthMAE effectively processes different data modalities such as hyperspectral, multispectral, topographical data, segmentation maps, and temporal structure. This model helps us show that pre-training on Satellogic data improves performance on downstream tasks. While there is still a gap to fill in MAE for heterogeneous data, we regard this innovative combination of an expansive, diverse dataset and a versatile model adapted for self-supervised learning as a stride forward in deep learning for Earth monitoring.

openalex-author · ArXiv.org

Scientist AI Needs a Government: Structural Governance as the Missing Layer in Non-Agentic AI Safety

Bengio et al. (2025) propose Scientist AI — a non-agentic, Bayesian AI architecture — as a safer alternative to autonomous AI agents. This paper identifies structural governance — governance embedded in system architecture rather than imposed by external institutions — as the missing layer. Drawing on the Neural Knowledge Base (NKB), a production knowledge graph with 1.27 million nodes and a functioning citizenship architecture, we propose a three-layer model for AI safety: honest computation (Scientist AI), structural governance (citizenship architecture), and institutional regulation. No single layer is sufficient.

openalex-author · ArXiv.org

OBELiX: A Curated Dataset of Crystal Structures and Experimentally Measured Ionic Conductivities for Lithium Solid-State Electrolytes

Solid-state electrolyte batteries are expected to replace liquid electrolyte lithium-ion batteries in the near future thanks to their higher theoretical energy density and improved safety. However, their adoption is currently hindered by their lower effective ionic conductivity, a quantity that governs charge and discharge rates. Identifying highly ion-conductive materials using conventional theoretical calculations and experimental validation is both time-consuming and resource-intensive. While machine learning holds the promise to expedite this process, relevant ionic conductivity and structural data is scarce. Here, we present OBELiX, a database of $\sim$600 synthesized solid electrolyte materials and their experimentally measured room temperature ionic conductivities gathered from literature and curated by domain experts. Each material is described by their measured composition, space group and lattice parameters. A full-crystal description in the form of a crystallographic information file (CIF) is provided for $\sim$320 structures for which atomic positions were available. We discuss various statistics and features of the dataset and provide training and testing splits carefully designed to avoid data leakage. Finally, we benchmark seven existing ML models on the task of predicting ionic conductivity and discuss their performance. The goal of this work is to facilitate the use of machine learning for solid-state electrolyte materials discovery.

openalex-author · arXiv (Cornell University)

In-Context Parametric Inference: Point or Distribution Estimators?

Bayesian and frequentist inference are two fundamental paradigms in statistical estimation. Bayesian methods treat hypotheses as random variables, incorporating priors and updating beliefs via Bayes' theorem, whereas frequentist methods assume fixed but unknown hypotheses, relying on estimators like maximum likelihood. While extensive research has compared these approaches, the frequentist paradigm of obtaining point estimates has become predominant in deep learning, as Bayesian inference is challenging due to the computational complexity and the approximation gap of posterior estimation methods. However, a good understanding of trade-offs between the two approaches is lacking in the regime of amortized estimators, where in-context learners are trained to estimate either point values via maximum likelihood or maximum a posteriori estimation, or full posteriors using normalizing flows, score-based diffusion samplers, or diagonal Gaussian approximations, conditioned on observations. To help resolve this, we conduct a rigorous comparative analysis spanning diverse problem settings, from linear models to shallow neural networks, with a robust evaluation framework assessing both in-distribution and out-of-distribution generalization on tractable tasks. Our experiments indicate that amortized point estimators generally outperform posterior inference, though the latter remain competitive in some low-dimensional problems, and we further discuss why this might be the case.

openalex-author · ArXiv.org

Shaping Inductive Bias in Diffusion Models through Frequency-Based Noise Control

Diffusion Probabilistic Models (DPMs) are powerful generative models that have achieved unparalleled success in a number of generative tasks. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. For topologically structured data, we devise a frequency-based noising operator to purposefully manipulate, and set, these inductive biases. We first show that appropriate manipulations of the noising forward process can lead DPMs to focus on particular aspects of the distribution to learn. We show that different datasets necessitate different inductive biases, and that appropriate frequency-based noise control induces increased generative performance compared to standard diffusion. Finally, we demonstrate the possibility of ignoring information at particular frequencies while learning. We show this in an image corruption and recovery task, where we train a DPM to recover the original target distribution after severe noise corruption.

openalex-author · ArXiv.org

Outsourced diffusion sampling: Efficient posterior inference in latent spaces of generative models

Any well-behaved generative model over a variable $\mathbf{x}$ can be expressed as a deterministic transformation of an exogenous ('outsourced') Gaussian noise variable $\mathbf{z}$: $\mathbf{x}=f_θ(\mathbf{z})$. In such a model (\eg, a VAE, GAN, or continuous-time flow-based model), sampling of the target variable $\mathbf{x} \sim p_θ(\mathbf{x})$ is straightforward, but sampling from a posterior distribution of the form $p(\mathbf{x}\mid\mathbf{y}) \propto p_θ(\mathbf{x})r(\mathbf{x},\mathbf{y})$, where $r$ is a constraint function depending on an auxiliary variable $\mathbf{y}$, is generally intractable. We propose to amortize the cost of sampling from such posterior distributions with diffusion models that sample a distribution in the noise space ($\mathbf{z}$). These diffusion samplers are trained by reinforcement learning algorithms to enforce that the transformed samples $f_θ(\mathbf{z})$ are distributed according to the posterior in the data space ($\mathbf{x}$). For many models and constraints, the posterior in noise space is smoother than in data space, making it more suitable for amortized inference. Our method enables conditional sampling under unconditional GAN, (H)VAE, and flow-based priors, comparing favorably with other inference methods. We demonstrate the proposed outsourced diffusion sampling in several experiments with large pretrained prior models: conditional image generation, reinforcement learning with human feedback, and protein structure generation.

openalex-author · arXiv (Cornell University)

A physics-based data-driven model for CO$_2$ gas diffusion electrodes to drive automated laboratories

The electrochemical reduction of atmospheric CO$_2$ into high-energy molecules with renewable energy is a promising avenue for energy storage that can take advantage of existing infrastructure especially in areas where sustainable alternatives to fossil fuels do not exist. Automated laboratories are currently being developed and used to optimize the composition and operating conditions of gas diffusion electrodes (GDEs), the device in which this reaction takes place. Improving the efficiency of GDEs is crucial for this technology to become viable. Here we present a modeling framework to efficiently explore the high-dimensional parameter space of GDE designs in an active learning context. At the core of the framework is an uncertainty-aware physics model calibrated with experimental data. The model has the flexibility to capture various input parameter spaces and any carbon products which can be modeled with Tafel kinetics. It is interpretable, and a Gaussian process layer can capture deviations of real data from the function space of the physical model itself. We deploy the model in a simulated active learning setup with real electrochemical data gathered by the AdaCarbon automated laboratory and show that it can be used to efficiently traverse the multi-dimensional parameter space.

openalex-author · ArXiv.org

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), lack inductive bias to constrain visual features within the linguistic structure of the LLM's embedding space, making them data-hungry and prone to cross-modal misalignment. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where visual and textual modalities are highly correlated. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods, with larger gains on document understanding tasks and under low-resource setups. We provide further analysis demonstrating its efficiency and robustness to noise.

openalex-author · ArXiv.org

International AI Safety Report

The first International AI Safety Report comprehensively synthesizes the current evidence on the capabilities, risks, and safety of advanced AI systems. The report was mandated by the nations attending the AI Safety Summit in Bletchley, UK. Thirty nations, the UN, the OECD, and the EU each nominated a representative to the report's Expert Advisory Panel. A total of 100 AI experts contributed, representing diverse perspectives and disciplines. Led by the report's Chair, these independent experts collectively had full discretion over the report's content.

openalex-author · The Astrophysical Journal

Causal Discovery in Astrophysics: Unraveling Supermassive Black Hole and Galaxy Coevolution

Abstract Correlation does not imply causation, but patterns of statistical association between variables can be exploited to infer a causal structure (even with purely observational data) with the burgeoning field of causal discovery. As a purely observational science, astrophysics has much to gain by exploiting these new methods. The supermassive black hole (SMBH)–galaxy interaction has long been constrained by observed scaling relations, which is low-scatter correlations between variables such as SMBH mass and the central velocity dispersion of stars in a host galaxy's bulge. This study, using advanced causal discovery techniques and an up-to-date data set, reveals a causal link between galaxy properties and dynamically measured SMBH masses. We apply a score-based Bayesian framework to compute the exact conditional probabilities of every causal structure that could possibly describe our galaxy sample. With the exact posterior distribution, we determine the most likely causal structures and notice a probable causal reversal when separating galaxies by morphology. In elliptical galaxies, bulge properties (built from major mergers) tend to influence SMBH growth, while, in spiral galaxies, SMBHs are seen to affect host galaxy properties, potentially through feedback in gas-rich environments. For spiral galaxies, SMBHs progressively quench star formation, whereas, in elliptical galaxies, quenching is complete, and the causal connection has reversed. Our findings support theoretical models of hierarchical assembly of galaxies and active galactic nuclei feedback regulating galaxy evolution. Our study suggests the potentiality for further exploration of causal links in astrophysical and cosmological scaling relations, as well as any other observational science.

openalex-author · Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Geometric Signatures of Compositionality Across a Language Model’s Lifetime

No abstract available from the OpenAlex source record.

openalex-author · SSRN Electronic Journal

The Future of the AI Summit Series

No abstract available from the OpenAlex source record.

openalex-author · Paper

Author response: A neuronal least-action principle for real-time learning in cortical circuits

A principle from which the neuronal dynamics and synaptic plasticity in arbitrary network architectures can be inferred, so that output errors are online minimized while simultaneously processing sensory input streams.

openalex-author · arXiv (Cornell University)

Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets

While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetune pretrained diffusion models with some reward functions that are either designed by experts or learned from small-scale datasets. Existing post-training methods for reward finetuning of diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. In response to this challenge, we take inspiration from recent successes in generative flow networks (GFlowNets) and propose a reinforcement learning method for diffusion model finetuning, dubbed Nabla-GFlowNet (abbreviated as $\nabla$-GFlowNet), that leverages the rich signal in reward gradients for probabilistic diffusion finetuning. We show that our proposed method achieves fast yet diversity- and prior-preserving finetuning of Stable Diffusion, a large-scale text-conditioned image diffusion model, on different realistic reward functions.

openalex-author · arXiv (Cornell University)

Trajectory Flow Matching with Applications to Clinical Time Series Modeling

Modeling stochastic and irregularly sampled time series is a challenging problem found in a wide range of applications, especially in medicine. Neural stochastic differential equations (Neural SDEs) are an attractive modeling technique for this problem, which parameterize the drift and diffusion terms of an SDE with neural networks. However, current algorithms for training Neural SDEs require backpropagation through the SDE dynamics, greatly limiting their scalability and stability. To address this, we propose Trajectory Flow Matching (TFM), which trains a Neural SDE in a simulation-free manner, bypassing backpropagation through the dynamics. TFM leverages the flow matching technique from generative modeling to model time series. In this work we first establish necessary conditions for TFM to learn time series data. Next, we present a reparameterization trick which improves training stability. Finally, we adapt TFM to the clinical time series setting, demonstrating improved performance on three clinical time series datasets both in terms of absolute performance and uncertainty prediction.

openalex-author · arXiv (Cornell University)

Structure Language Models for Protein Conformation Generation

Proteins adopt multiple structural conformations to perform their diverse biological functions, and understanding these conformations is crucial for advancing drug discovery. Traditional physics-based simulation methods often struggle with sampling equilibrium conformations and are computationally expensive. Recently, deep generative models have shown promise in generating protein conformations as a more efficient alternative. However, these methods predominantly rely on the diffusion process within a 3D geometric space, which typically centers around the vicinity of metastable states and is often inefficient in terms of runtime. In this paper, we introduce Structure Language Modeling (SLM) as a novel framework for efficient protein conformation generation. Specifically, the protein structures are first encoded into a compact latent space using a discrete variational auto-encoder, followed by conditional language modeling that effectively captures sequence-specific conformation distributions. This enables a more efficient and interpretable exploration of diverse ensemble modes compared to existing methods. Based on this general framework, we instantiate SLM with various popular LM architectures as well as proposing the ESMDiff, a novel BERT-like structure language model fine-tuned from ESM3 with masked diffusion. We verify our approach in various scenarios, including the equilibrium dynamics of BPTI, conformational change pairs, and intrinsically disordered proteins. SLM provides a highly efficient solution, offering a 20-100x speedup than existing methods in generating diverse conformations, shedding light on promising avenues for future research.

openalex-author · Ethics and Information Technology

Correction: AI content detection in the emerging information ecosystem: new obligations for media and tech companies

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases

Unsupervised object-centric learning from videos is a promising approach towards learning compositional representations that can be applied to various downstream tasks, such as prediction and reasoning. Recently, it was shown that pretrained Vision Transformers (ViTs) can be useful to learn object-centric representations on real-world video datasets. However, while these approaches succeed at extracting objects from the scenes, the slot-based representations fail to maintain temporal consistency across consecutive frames in a video, i.e. the mapping of objects to slots changes across the video. To address this, we introduce Conditional Autoregressive Slot Attention (CA-SA), a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks. Leveraging an autoregressive prior network to condition representations on previous timesteps and a novel consistency loss function, CA-SA predicts future slot representations and imposes consistency across frames. We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks, such as video prediction and visual question-answering tasks.

openalex-author · arXiv (Cornell University)

Action abstractions for amortized sampling

As trajectories sampled by policies used by reinforcement learning (RL) and generative flow networks (GFlowNets) grow longer, credit assignment and exploration become more challenging, and the long planning horizon hinders mode discovery and generalization. The challenge is particularly pronounced in entropy-seeking RL methods, such as generative flow networks, where the agent must learn to sample from a structured distribution and discover multiple high-reward states, each of which take many steps to reach. To tackle this challenge, we propose an approach to incorporate the discovery of action abstractions, or high-level actions, into the policy optimization process. Our approach involves iteratively extracting action subsequences commonly used across many high-reward trajectories and `chunking' them into a single action that is added to the action space. In empirical evaluation on synthetic and real-world environments, our approach demonstrates improved sample efficiency performance in discovering diverse high-reward objects, especially on harder exploration problems. We also observe that the abstracted high-order actions are interpretable, capturing the latent structure of the reward landscape of the action space. This work provides a cognitively motivated approach to action abstraction in RL and is the first demonstration of hierarchical planning in amortized sequential sampling.

openalex-author · arXiv (Cornell University)

A Complexity-Based Theory of Compositionality

Compositionality is believed to be fundamental to intelligence. In humans, it underlies the structure of thought, language, and higher-level reasoning. In AI, compositional representations can enable a powerful form of out-of-distribution generalization, in which a model systematically adapts to novel combinations of known concepts. However, while we have strong intuitions about what compositionality is, we lack satisfying formal definitions for it that are measurable and mathematical. Here, we propose such a definition, which we call representational compositionality, that accounts for and extends our intuitions about compositionality. The definition is conceptually simple, quantitative, grounded in algorithmic information theory, and applicable to any representation. Intuitively, representational compositionality states that a compositional representation satisfies three properties. First, it must be expressive. Second, it must be possible to re-describe the representation as a function of discrete symbolic sequences with re-combinable parts, analogous to sentences in natural language. Third, the function that relates these symbolic sequences to the representation, analogous to semantics in natural language, must be simple. Through experiments on both synthetic and real world data, we validate our definition of compositionality and show how it unifies disparate intuitions from across the literature in both AI and cognitive science. We also show that representational compositionality, while theoretically intractable, can be readily estimated using standard deep learning tools. We hope that our definition can inspire the design of novel, theoretically-driven models that better capture the mechanisms of compositional thought. We make our code available at https://github.com/EricElmoznino/complexity_compositionality.

openalex-author · arXiv (Cornell University)

Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction

Generative modeling of discrete data underlies important applications spanning text-based agents like ChatGPT to the design of the very building blocks of life in protein sequences. However, application domains need to exert control over the generated data by steering the generative process - typically via RLHF - to satisfy a specified property, reward, or affinity metric. In this paper, we study the problem of steering Masked Diffusion Models (MDMs), a recent class of discrete diffusion models that offer a compelling alternative to traditional autoregressive models. We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference by learning to sample from a target Bayesian posterior. Our DDPP framework leads to a family of three novel objectives that are all simulation-free, and thus scalable while applying to general non-differentiable reward functions. Empirically, we instantiate DDPP by steering MDMs to perform class-conditional pixel-level image modeling, RLHF-based alignment of MDMs using text-based rewards, and finetuning protein language models to generate more diverse secondary structures and shorter proteins. We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.

openalex-author · The Journal of Chemical Physics

Path-filtering in path-integral simulations of open quantum systems using GFlowNets

An important class of methods for modeling dynamics in open quantum systems is based on the well-known influence functional (IF) approach to solving path-integral equations of motion. Within this paradigm, path-filtering schemes based on the removal of IF elements that fall below a certain threshold aim to reduce the effort needed to calculate and store the influence functional, making very challenging simulations possible. A filtering protocol of this type is considered acceptable as long as the simulation remains mathematically stable. This, however, does not guarantee that the approximated dynamics preserve the physics of the simulated process. In this paper, we explore the possibility of training Generative Flow Networks (GFlowNets) to produce filtering protocols while optimizing for mathematical stability and for physical accuracy. Trained using the trajectory balance objective, the model produces sets of paths to be added to a truncated initial set; it is rewarded if the combined set of paths gives rise to solutions in which the trace of the density matrix is conserved, the populations remain real, and the dynamics approach the exact reference. Using a simple two-level system coupled to a dissipative reservoir, we perform proof-of-concept simulations and demonstrate the elegant and surprising filtering solutions proposed by the GFlowNet.

openalex-author · arXiv (Cornell University)

RL, but don't do anything I wouldn't do

In reinforcement learning, if the agent's reward differs from the designers' true utility, even only rarely, the state distribution resulting from the agent's policy can be very bad, in theory and in practice. When RL policies would devolve into undesired behavior, a common countermeasure is KL regularization to a trusted policy ("Don't do anything I wouldn't do"). All current cutting-edge language models are RL agents that are KL-regularized to a "base policy" that is purely predictive. Unfortunately, we demonstrate that when this base policy is a Bayesian predictive model of a trusted policy, the KL constraint is no longer reliable for controlling the behavior of an advanced RL agent. We demonstrate this theoretically using algorithmic information theory, and while systems today are too weak to exhibit this theorized failure precisely, we RL-finetune a language model and find evidence that our formal results are plausibly relevant in practice. We also propose a theoretical alternative that avoids this problem by replacing the "Don't do anything I wouldn't do" principle with "Don't do anything I mightn't do".

openalex-author · arXiv (Cornell University)

Were RNNs All We Needed?

The introduction of Transformers in 2017 reshaped the landscape of deep learning. Originally proposed for sequence modelling, Transformers have since achieved widespread success across various domains. However, the scalability limitations of Transformers - particularly with respect to sequence length - have sparked renewed interest in novel recurrent models that are parallelizable during training, offer comparable performance, and scale more effectively. In this work, we revisit sequence modelling from a historical perspective, focusing on Recurrent Neural Networks (RNNs), which dominated the field for two decades before the rise of Transformers. Specifically, we examine LSTMs (1997) and GRUs (2014). We demonstrate that by simplifying these models, we can derive minimal versions (minLSTMs and minGRUs) that (1) use fewer parameters than their traditional counterparts, (2) are fully parallelizable during training, and (3) achieve surprisingly competitive performance on a range of tasks, rivalling recent models including Transformers.

openalex-author · arXiv (Cornell University)

Adaptive teachers for amortized samplers

Amortized inference is the task of training a parametric model, such as a neural network, to approximate a distribution with a given unnormalized density where exact sampling is intractable. When sampling is implemented as a sequential decision-making process, reinforcement learning (RL) methods, such as generative flow networks, can be used to train the sampling policy. Off-policy RL training facilitates the discovery of diverse, high-reward candidates, but existing methods still face challenges in efficient exploration. We propose to use an adaptive training distribution (the \teacher) to guide the training of the primary amortized sampler (the \student). The \teacher, an auxiliary behavior model, is trained to sample high-loss regions of the \student and can generalize across unexplored modes, thereby enhancing mode coverage by providing an efficient training curriculum. We validate the effectiveness of this approach in a synthetic environment designed to present an exploration challenge, two diffusion-based sampling tasks, and four biochemical discovery tasks demonstrating its ability to improve sample efficiency and mode coverage. Source code is available at https://github.com/alstn12088/adaptive-teacher.

openalex-author · eLife

A neuronal least-action principle for real-time learning in cortical circuits

One of the most fundamental laws of physics is the principle of least action. Motivated by its predictive power, we introduce a neuronal least-action principle for cortical processing of sensory streams to produce appropriate behavioral outputs in real time. The principle postulates that the voltage dynamics of cortical pyramidal neurons prospectively minimizes the local somato-dendritic mismatch error within individual neurons. For output neurons, the principle implies minimizing an instantaneous behavioral error. For deep network neurons, it implies the prospective firing to overcome integration delays and correct for possible output errors right in time. The neuron-specific errors are extracted in the apical dendrites of pyramidal neurons through a cortical microcircuit that tries to explain away the feedback from the periphery, and correct the trajectory on the fly. Any motor output is in a moving equilibrium with the sensory input and the motor feedback during the ongoing sensory-motor transform. Online synaptic plasticity reduces the somatodendritic mismatch error within each cortical neuron and performs gradient descent on the output cost at any moment in time. The neuronal least-action principle offers an axiomatic framework to derive local neuronal and synaptic laws for global real-time computation and learning in the brain.

openalex-author · Ethics and Information Technology

AI content detection in the emerging information ecosystem: new obligations for media and tech companies

The world is about to be swamped by an unprecedented wave of AI-generated content. We need reliable ways of identifying such content, to supplement the many existing social institutions that enable trust between people and organisations and ensure social resilience. In this paper, we begin by highlighting an important new development: providers of AI content generators have new obligations to support the creation of reliable detectors for the content they generate. These new obligations arise mainly from the EU’s newly finalised AI Act, but they are enhanced by the US President’s recent Executive Order on AI, and by several considerations of self-interest. These new steps towards reliable detection mechanisms are by no means a panacea—but we argue they will usher in a new adversarial landscape, in which reliable methods for identifying AI-generated content are commonly available. In this landscape, many new questions arise for policymakers. Firstly, if reliable AI-content detection mechanisms are available, who should be required to use them? And how should they be used? We argue that new duties arise for media and Web search companies arise for media companies, and for Web search companies, in the deployment of AI-content detectors. Secondly, what broader regulation of the tech ecosystem will maximise the likelihood of reliable AI-content detectors? We argue for a range of new duties, relating to provenance-authentication protocols, open-source AI generators, and support for research and enforcement. Along the way, we consider how the production of AI-generated content relates to ‘free expression’, and discuss the important case of content that is generated jointly by humans and AIs.

openalex-author · bioRxiv

A high-throughput phenotypic screen combined with an ultra-large-scale deep learning-based virtual screening reveals novel scaffolds of antibacterial compounds

ABSTRACT The proliferation of multi-drug-resistant bacteria underscores an urgent need for novel antibiotics. Traditional discovery methods face challenges due to limited chemical diversity, high costs, and difficulties in identifying structurally novel compounds. Here, we explore the integration of small molecule high-throughput screening with a deep learning-based virtual screening approach to uncover new antibacterial compounds. Leveraging a diverse library of nearly 2 million small molecules, we conducted comprehensive phenotypic screening against a sensitized Escherichia coli strain that, at a low hit rate, yielded thousands of hits. We trained a deep learning model, GNEprop, to predict antibacterial activity, ensuring robustness through out-of-distribution generalization techniques. Virtual screening of over 1.4 billion compounds identified potential candidates, of which 82 exhibited antibacterial activity, illustrating a 90X improved hit rate over the high-throughput screening experiment GNEprop was trained on. Importantly, a significant portion of these newly identified compounds exhibited high dissimilarity to known antibiotics, indicating promising avenues for further exploration in antibiotic discovery.

openalex-author · arXiv (Cornell University)

Meta Flow Matching: Integrating Vector Fields on the Wasserstein Manifold

Numerous biological and physical processes can be modeled as systems of interacting entities evolving continuously over time, e.g. the dynamics of communicating cells or physical particles. Learning the dynamics of such systems is essential for predicting the temporal evolution of populations across novel samples and unseen environments. Flow-based models allow for learning these dynamics at the population level - they model the evolution of the entire distribution of samples. However, current flow-based models are limited to a single initial population and a set of predefined conditions which describe different dynamics. We argue that multiple processes in natural sciences have to be represented as vector fields on the Wasserstein manifold of probability densities. That is, the change of the population at any moment in time depends on the population itself due to the interactions between samples. In particular, this is crucial for personalized medicine where the development of diseases and their respective treatment response depends on the microenvironment of cells specific to each patient. We propose Meta Flow Matching (MFM), a practical approach to integrating along these vector fields on the Wasserstein manifold by amortizing the flow model over the initial populations. Namely, we embed the population of samples using a Graph Neural Network (GNN) and use these embeddings to train a Flow Matching model. This gives MFM the ability to generalize over the initial distributions unlike previously proposed methods. We demonstrate the ability of MFM to improve prediction of individual treatment responses on a large scale multi-patient single-cell drug screen dataset.

openalex-author · arXiv (Cornell University)

Zero-Shot Object-Centric Representation Learning

The goal of object-centric representation learning is to decompose visual scenes into a structured representation that isolates the entities. Recent successes have shown that object-centric representation learning can be scaled to real-world scenes by utilizing pre-trained self-supervised features. However, so far, object-centric methods have mostly been applied in-distribution, with models trained and evaluated on the same dataset. This is in contrast to the wider trend in machine learning towards general-purpose models directly applicable to unseen data and tasks. Thus, in this work, we study current object-centric methods through the lens of zero-shot generalization by introducing a benchmark comprising eight different synthetic and real-world datasets. We analyze the factors influencing zero-shot performance and find that training on diverse real-world images improves transferability to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets.

openalex-author · arXiv (Cornell University)

Cell Morphology-Guided Small Molecule Generation with GFlowNets

High-content phenotypic screening, including high-content imaging (HCI), has gained popularity in the last few years for its ability to characterize novel therapeutics without prior knowledge of the protein target. When combined with deep learning techniques to predict and represent molecular-phenotype interactions, these advancements hold the potential to significantly accelerate and enhance drug discovery applications. This work focuses on the novel task of HCI-guided molecular design. Generative models for molecule design could be guided by HCI data, for example with a supervised model that links molecules to phenotypes of interest as a reward function. However, limited labeled data, combined with the high-dimensional readouts, can make training these methods challenging and impractical. We consider an alternative approach in which we leverage an unsupervised multimodal joint embedding to define a latent similarity as a reward for GFlowNets. The proposed model learns to generate new molecules that could produce phenotypic effects similar to those of the given image target, without relying on pre-annotated phenotypic labels. We demonstrate that the proposed method generates molecules with high morphological and structural similarity to the target, increasing the likelihood of similar biological activity, as confirmed by an independent oracle model.

openalex-author · arXiv (Cornell University)

AI-Assisted Generation of Difficult Math Questions

Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. We leverage LLM metacognition skills [Didolkar et al., 2024] of a strong LLM to extract core "skills" from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. The use of two different skills within each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multiturn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via further LLM interactions. Applying this pipeline on skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH$^2$ - a dataset of higher-quality math questions, as evidenced by: (a) Lower performance of all models on MATH$^2$ than on MATH (b) Higher performance on MATH when using MATH$^2$ questions as in-context examples. Although focused on mathematics, our methodology seems applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight. Also of interest is a striking relationship observed between models' performance on the new dataset: the success rate on MATH$^2$ is the square on MATH, suggesting that successfully solving the question in MATH$^2$ requires a nontrivial combination of two distinct math skills.

openalex-author · arXiv (Cornell University)

Open Problems in Technical AI Governance

AI progress is creating a growing range of risks and opportunities, but it is often unclear how they should be navigated. In many cases, the barriers and uncertainties faced are at least partly technical. Technical AI governance, referring to technical analysis and tools for supporting the effective governance of AI, seeks to address such challenges. It can help to (a) identify areas where intervention is needed, (b) identify and assess the efficacy of potential governance actions, and (c) enhance governance options by designing mechanisms for enforcement, incentivization, or compliance. In this paper, we explain what technical AI governance is, why it is important, and present a taxonomy and incomplete catalog of its open problems. This paper is intended as a resource for technical researchers or research funders looking to contribute to AI governance.

openalex-author · arXiv (Cornell University)

On Generalization for Generative Flow Networks

Generative Flow Networks (GFlowNets) have emerged as an innovative learning paradigm designed to address the challenge of sampling from an unnormalized probability distribution, called the reward function. This framework learns a policy on a constructed graph, which enables sampling from an approximation of the target probability distribution through successive steps of sampling from the learned policy. To achieve this, GFlowNets can be trained with various objectives, each of which can lead to the model s ultimate goal. The aspirational strength of GFlowNets lies in their potential to discern intricate patterns within the reward function and their capacity to generalize effectively to novel, unseen parts of the reward function. This paper attempts to formalize generalization in the context of GFlowNets, to link generalization with stability, and also to design experiments that assess the capacity of these models to uncover unseen parts of the reward function. The experiments will focus on length generalization meaning generalization to states that can be constructed only by longer trajectories than those seen in training.

openalex-author · arXiv (Cornell University)

MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation

Model merging has emerged as an effective approach to combine multiple single-task models into a multitask model. This process typically involves computing a weighted average of the model parameters without any additional training. Existing model-merging methods focus on enhancing average task accuracy. However, interference and conflicts between the objectives of different tasks can lead to trade-offs during the merging process. In real-world applications, a set of solutions with various trade-offs can be more informative, helping practitioners make decisions based on diverse preferences. In this paper, we introduce a novel and low-compute algorithm, Model Merging with Amortized Pareto Front (MAP). MAP efficiently identifies a Pareto set of scaling coefficients for merging multiple models, reflecting the trade-offs involved. It amortizes the substantial computational cost of evaluations needed to estimate the Pareto front by using quadratic approximation surrogate models derived from a pre-selected set of scaling coefficients. Experimental results on vision and natural language processing tasks demonstrate that MAP can accurately identify the Pareto front, providing practitioners with flexible solutions to balance competing task objectives. We also introduce Bayesian MAP for scenarios with a relatively low number of tasks and Nested MAP for situations with a high number of tasks, further reducing the computational cost of evaluation.

openalex-author · arXiv (Cornell University)

Baking Symmetry into GFlowNets

GFlowNets have exhibited promising performance in generating diverse candidates with high rewards. These networks generate objects incrementally and aim to learn a policy that assigns probability of sampling objects in proportion to rewards. However, the current training pipelines of GFlowNets do not consider the presence of isomorphic actions, which are actions resulting in symmetric or isomorphic states. This lack of symmetry increases the amount of samples required for training GFlowNets and can result in inefficient and potentially incorrect flow functions. As a consequence, the reward and diversity of the generated objects decrease. In this study, our objective is to integrate symmetries into GFlowNets by identifying equivalent actions during the generation process. Experimental results using synthetic data demonstrate the promising performance of our proposed approaches.

openalex-author · arXiv (Cornell University)

Attention as an RNN

The advent of Transformers marked a significant breakthrough in sequence modelling, providing a highly performant architecture capable of leveraging GPU parallelism. However, Transformers are computationally expensive at inference time, limiting their applications, particularly in low-resource settings (e.g., mobile and embedded devices). Addressing this, we (1) begin by showing that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its \textit{many-to-one} RNN output efficiently. We then (2) show that popular attention-based models such as Transformers can be viewed as RNN variants. However, unlike traditional RNNs (e.g., LSTMs), these models cannot be updated efficiently with new tokens, an important property in sequence modelling. Tackling this, we (3) introduce a new efficient method of computing attention's \textit{many-to-many} RNN output based on the parallel prefix scan algorithm. Building on the new attention formulation, we (4) introduce \textbf{Aaren}, an attention-based module that can not only (i) be trained in parallel (like Transformers) but also (ii) be updated efficiently with new tokens, requiring only constant memory for inferences (like traditional RNNs). Empirically, we show Aarens achieve comparable performance to Transformers on $38$ datasets spread across four popular sequential problem settings: reinforcement learning, event forecasting, time series classification, and time series forecasting tasks while being more time and memory-efficient.

openalex-author · Science

Managing extreme AI risks amid rapid progress

Preparation requires technical research and development, as well as adaptive, proactive governance.

openalex-author · 2024 IEEE International Symposium on Circuits and Systems (ISCAS)

BitPruning: Learning Bitlengths for Aggressive and Accurate Quantization

BitPruning is a training method for minimizing inference bitlengths at any granularity while maintaining accuracy. BitPruning extends the meaning of fixed-point bitlenghts into the continuous domain by interpolating between the nearest two integers, enabling gradient descent to learn bitlengths together with other parameters. A novel regularizer penalizes large bitlength representations and can be modified to minimize other quantifiable criteria, such as number of operations or memory footprint. BitPruning learns thrifty representations while maintaining accuracy: With ImageNet, it produces an average per layer bitlength of 3.76 and 4.36 bits on ResNet18 and MobileNet V2 respectively, remaining within 0.5% of the base TOP-1 accuracy. Simple modifications of the BitPruning regularizer can be used to further reduce compute workload by up to 24%, as well as memory footprint in activation or weight-heavy tasks by up to 14% and 8% respectively.

openalex-author · Bulletin of the American Mathematical Society

Machine Learning and Information Theory Concepts towards an AI Mathematician

The current state of the art in artificial intelligence is impressive, especially in terms of mastery of language, but not so much in terms of mathematical reasoning. What could be missing? Can we learn something useful about that gap from how the brains of mathematicians go about their craft? This essay builds on the idea that current deep learning mostly succeeds at system 1 abilities—which correspond to our intuition and habitual behaviors—but still lacks something important regarding system 2 abilities—which include reasoning and robust uncertainty estimation. It takes an information-theoretical posture to ask questions about what constitutes an interesting mathematical statement, which could guide future work in crafting an AI mathematician. The focus is not on proving a given theorem but on discovering new and interesting <italic>conjectures</italic> . The central hypothesis is that a desirable body of theorems better summarizes the set of all provable statements, for example, by having a small description length while at the same time being close (in terms of number of derivation steps) to many provable statements.

openalex-author · arXiv (Cornell University)

Divergent Creativity in Humans and Large Language Models

The recent surge of Large Language Models (LLMs) has led to claims that they are approaching a level of creativity akin to human capabilities. This idea has sparked a blend of excitement and apprehension. However, a critical piece that has been missing in this discourse is a systematic evaluation of LLMs' semantic diversity, particularly in comparison to human divergent thinking. To bridge this gap, we leverage recent advances in computational creativity to analyze semantic divergence in both state-of-the-art LLMs and a substantial dataset of 100,000 humans. We found evidence that LLMs can surpass average human performance on the Divergent Association Task, and approach human creative writing abilities, though they fall short of the typical performance of highly creative humans. Notably, even the top performing LLMs are still largely surpassed by highly creative individuals, underscoring a ceiling that current LLMs still fail to surpass. Our human-machine benchmarking framework addresses the polemic surrounding the imminent replacement of human creative labour by AI, disentangling the quality of the respective creative linguistic outputs using established objective measures. While prompting deeper exploration of the distinctive elements of human inventive thought compared to those of AI systems, we lay out a series of techniques to improve their outputs with respect to semantic diversity, such as prompt design and hyper-parameter tuning.

openalex-author · arXiv (Cornell University)

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Ensuring that AI systems reliably and robustly avoid harmful or dangerous behaviours is a crucial challenge, especially for AI systems with a high degree of autonomy and general intelligence, or systems used in safety-critical contexts. In this paper, we will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI. The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees. This is achieved by the interplay of three core components: a world model (which provides a mathematical description of how the AI system affects the outside world), a safety specification (which is a mathematical description of what effects are acceptable), and a verifier (which provides an auditable proof certificate that the AI satisfies the safety specification relative to the world model). We outline a number of approaches for creating each of these three core components, describe the main technical challenges, and suggest a number of potential solutions to them. We also argue for the necessity of this approach to AI safety, and for the inadequacy of the main alternative approaches.

openalex-author · arXiv (Cornell University)

Towards DNA-Encoded Library Generation with GFlowNets

DNA-encoded libraries (DELs) are a powerful approach for rapidly screening large numbers of diverse compounds. One of the key challenges in using DELs is library design, which involves choosing the building blocks that will be combinatorially combined to produce the final library. In this paper we consider the task of protein-protein interaction (PPI) biased DEL design. To this end, we evaluate several machine learning algorithms on the PPI modulation task and use them as a reward for the proposed GFlowNet-based generative approach. We additionally investigate the possibility of using structural information about building blocks to design a hierarchical action space for the GFlowNet. The observed results indicate that GFlowNets are a promising approach for generating diverse combinatorial library candidates.

openalex-author · arXiv (Cornell University)

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose $200+$ concrete research questions.

openalex-author · Harvard Data Science Review

Government Interventions to Avert Future Catastrophic AI Risks

This essay is a revised transcription of Yoshua Bengio's July 2023 testimony in front of the US Senate Subcommittee on Privacy, Technology, and the Law meeting on the topic of oversight of AI. It argues for caution and government interventions in regulation and research investments to mitigate the potentially catastrophic outcomes from future advances in AI as the technology approaches human-level cognitive abilities. It summarizes the trends in advancing capabilities and the uncertain timeline to these future advances, as well as the different types of catastrophic scenarios that could follow, including both intentional and unintentional cases, misuse by bad actors and intentional as well as unintended loss of control of powerful AIs. It makes public policy recommendations that include national regulation, international agreements, public research investments in AI safety as well as classified research investments to design aligned AI systems that can safely protect us from bad actors and uncontrolled dangerous AI systems. It highlights the need for strong democratic governance processes to control the safety and ethical use of future powerful AI systems, whether they are in private hands or under government authority.

openalex-author · Paper

Reviewer #2 (Public Review): A neuronal least-action principle for real-time learning in cortical circuits

One of the most fundamental laws of physics is the principle of least action. Motivated by its predictive power, we introduce a neuronal least-action principle for cortical processing of sensory streams to produce appropriate behavioural outputs in real time. The principle postulates that the voltage dynamics of cortical pyramidal neurons prospectively minimize the local somato-dendritic mismatch error within individual neurons. For motor output neurons, it implies minimizing an instantaneous behavioural error. For deep network neurons, it implies a prospective firing to overcome integration delays and correct for possible output errors right in time. The neuron-specific errors are extracted in the apical dendrites of pyramidal neurons through a cortical microcircuit that tries to explain away the feedback from the periphery, and correct the trajectory on the fly. Any motor output is in a moving equilibrium with the sensory inputs and the motor feedback during the whole sensory-motor trajectory. Ongoing synaptic plasticity reduces the somato-dendritic mismatch error within each cortical neuron and performs gradient descent on the output cost at any moment in time. The neuronal least-action principle offers an axiomatic framework to derive local neuronal and synaptic dynamics for global real-time computation and learning in the brain and in physical substrates in general.

openalex-author · Paper

Reviewer #1 (Public Review): A neuronal least-action principle for real-time learning in cortical circuits

One of the most fundamental laws of physics is the principle of least action. Motivated by its predictive power, we introduce a neuronal least-action principle for cortical processing of sensory streams to produce appropriate behavioural outputs in real time. The principle postulates that the voltage dynamics of cortical pyramidal neurons prospectively minimize the local somato-dendritic mismatch error within individual neurons. For motor output neurons, it implies minimizing an instantaneous behavioural error. For deep network neurons, it implies a prospective firing to overcome integration delays and correct for possible output errors right in time. The neuron-specific errors are extracted in the apical dendrites of pyramidal neurons through a cortical microcircuit that tries to explain away the feedback from the periphery, and correct the trajectory on the fly. Any motor output is in a moving equilibrium with the sensory inputs and the motor feedback during the whole sensory-motor trajectory. Ongoing synaptic plasticity reduces the somato-dendritic mismatch error within each cortical neuron and performs gradient descent on the output cost at any moment in time. The neuronal least-action principle offers an axiomatic framework to derive local neuronal and synaptic dynamics for global real-time computation and learning in the brain and in physical substrates in general.

openalex-author · Science

Regulating advanced artificial agents

Governance frameworks should address the prospect of AI systems that cannot be safely tested.

openalex-author · Observatoire international sur les impacts sociétaux de l'IA et du numérique

Interdisciplinary Dialogues: The Major Risks of Generative AI

In an exciting series of Interdisciplinary Dialogues on the societal impacts of AI, we invite a guest speaker and panellists from the fields of science and engineering, health and humanities and social sciences to discuss the advances, challenges and opportunities raised by AI. The first dialogue in this series began with Yoshua Bengio, who, concerned about developments in generative AI and the major risks they pose for society, initiated the organization of a conference on the subject. The event took place on August 14, 2023 in Montreal, and was aimed at initiating collective, interdisciplinary reflection on the issues and risks posed by recent developments in AI. The conference took the form of a panel, moderated by Juliette Powell, to which seven specialists were invited who cover a variety of disciplines, including: computer science (Yoshua Bengio and Golnoosh Farnadi), law (Caroline Lequesne and Claire Boine), philosophy (Jocelyn Maclure), communication (Sonja Solomun) and political science (Hugo Loiseau). This document is the result of this first interdisciplinary dialogue on the societal impacts of AI. The speakers were invited to respond concisely, in the language of their choice, to questions raised during the event. Immerse yourself in reading these fascinating conversations, presented in a Q&A format that transcends disciplinary boundaries. The aim of these dialogues is to offer a critical and diverse perspective on the impact of AI on our everchanging world.

openalex-author · Observatoire international sur les impacts sociétaux de l'IA et du numérique

Dialogues interdisciplinaires : les risques majeurs de l'IA générative

À travers une série captivante de Dialogues interdisciplinaires sur les impacts sociétaux de l’IA, nous convions une ou un invité et des intervenantes et intervenants, provenant des sciences et génies, de la santé et des sciences humaines et sociales, à venir discuter des avancées, des défis et des opportunités soulevés par l’IA. Le premier dialogue de cette série débute avec Yoshua Bengio, qui, préoccupé par les développements de l’IA générative et des risques majeurs qu’ils engendrent pour la société, a initié l’organisation d’une conférence à ce sujet. Cette activité s’est déroulée le 14 août 2023 à Montréal et avait pour but d’engager une réflexion collective et interdisciplinaire sur les enjeux et risques posés par les récents développements de l’IA. La conférence a pris la forme d’un panel, animé par Juliette Powell, auquel était convié sept spécialistes provenant de disciplines variées : informatique (Yoshua Bengio et Golnoosh Farnadi), droit (Caroline Lequesne et Claire Boine), philosophie (Jocelyn Maclure), communication (Sonja Solomun) et science politique (Hugo Loiseau). Le présent document est ainsi issu de ce premier dialogue interdisciplinaire sur les impacts sociétaux de l’IA. Par la suite, les intervenantes et intervenants ont été invités à répondre de manière concise, dans la langue de leur choix, à des questions soulevées lors de cette activité.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Regeneration Learning: A Learning Paradigm for Data Generation

Machine learning methods for conditional data generation usually build a mapping from source conditional data X to target data Y. The target Y (e.g., text, speech, music, image, video) is usually high-dimensional and complex, and contains information that does not exist in source data, which hinders effective and efficient learning on the source-target mapping. In this paper, we present a learning paradigm called regeneration learning for data generation, which first generates Y' (an abstraction/representation of Y) from X and then generates Y from Y'. During training, Y' is obtained from Y through either handcrafted rules or self-supervised learning and is used to learn X-->Y' and Y'-->Y. Regeneration learning extends the concept of representation learning to data generation tasks, and can be regarded as a counterpart of traditional representation learning, since 1) regeneration learning handles the abstraction (Y') of the target data Y for data generation while traditional representation learning handles the abstraction (X') of source data X for data understanding; 2) both the processes of Y'-->Y in regeneration learning and X-->X' in representation learning can be learned in a self-supervised way (e.g., pre-training); 3) both the mappings from X to Y' in regeneration learning and from X' to Y in representation learning are simpler than the direct mapping from X to Y. We show that regeneration learning can be a widely-used paradigm for data generation (e.g., text generation, speech recognition, speech synthesis, music composition, image generation, and video generation) and can provide valuable insights into developing data generation methods.

openalex-author · arXiv (Cornell University)

Language Models Can Reduce Asymmetry in Information Markets

This work addresses the buyer's inspection paradox for information markets. The paradox is that buyers need to access information to determine its value, while sellers need to limit access to prevent theft. To study this, we introduce an open-source simulated digital marketplace where intelligent agents, powered by language models, buy and sell information on behalf of external participants. The central mechanism enabling this marketplace is the agents' dual capabilities: they not only have the capacity to assess the quality of privileged information but also come equipped with the ability to forget. This ability to induce amnesia allows vendors to grant temporary access to proprietary information, significantly reducing the risk of unauthorized retention while enabling agents to accurately gauge the information's relevance to specific queries or tasks. To perform well, agents must make rational decisions, strategically explore the marketplace through generated sub-queries, and synthesize answers from purchased information. Concretely, our experiments (a) uncover biases in language models leading to irrational behavior and evaluate techniques to mitigate these biases, (b) investigate how price affects demand in the context of informational goods, and (c) show that inspection and higher budgets both lead to higher quality outcomes.

openalex-author · arXiv (Cornell University)

Ant Colony Sampling with GFlowNets for Combinatorial Optimization

We present the Generative Flow Ant Colony Sampler (GFACS), a novel meta-heuristic method that hierarchically combines amortized inference and parallel stochastic search. Our method first leverages Generative Flow Networks (GFlowNets) to amortize a \emph{multi-modal} prior distribution over combinatorial solution space that encompasses both high-reward and diversified solutions. This prior is iteratively updated via parallel stochastic search in the spirit of Ant Colony Optimization (ACO), leading to the posterior distribution that generates near-optimal solutions. Extensive experiments across seven combinatorial optimization problems demonstrate GFACS's promising performances.

openalex-author · arXiv (Cornell University)

Discrete Probabilistic Inference as Control in Multi-path Environments

We consider the problem of sampling from a discrete and structured distribution as a sequential decision problem, where the objective is to find a stochastic policy such that objects are sampled at the end of this sequential process proportionally to some predefined reward. While we could use maximum entropy Reinforcement Learning (MaxEnt RL) to solve this problem for some distributions, it has been shown that in general, the distribution over states induced by the optimal policy may be biased in cases where there are multiple ways to generate the same object. To address this issue, Generative Flow Networks (GFlowNets) learn a stochastic policy that samples objects proportionally to their reward by approximately enforcing a conservation of flows across the whole Markov Decision Process (MDP). In this paper, we extend recent methods correcting the reward in order to guarantee that the marginal distribution induced by the optimal MaxEnt RL policy is proportional to the original reward, regardless of the structure of the underlying MDP. We also prove that some flow-matching objectives found in the GFlowNet literature are in fact equivalent to well-established MaxEnt RL algorithms with a corrected reward. Finally, we study empirically the performance of multiple MaxEnt RL and GFlowNet algorithms on multiple problems involving sampling from discrete distributions.

openalex-author · arXiv (Cornell University)

Computing Power and the Governance of Artificial Intelligence

Computing power, or "compute," is crucial for the development and deployment of artificial intelligence (AI) capabilities. As a result, governments and companies have started to leverage compute as a means to govern AI. For example, governments are investing in domestic compute capacity, controlling the flow of compute to competing countries, and subsidizing compute access to certain sectors. However, these efforts only scratch the surface of how compute can be used to govern AI development and deployment. Relative to other key inputs to AI (data and algorithms), AI-relevant compute is a particularly effective point of intervention: it is detectable, excludable, and quantifiable, and is produced via an extremely concentrated supply chain. These characteristics, alongside the singular importance of compute for cutting-edge AI models, suggest that governing compute can contribute to achieving common policy objectives, such as ensuring the safety and beneficial use of AI. More precisely, policymakers could use compute to facilitate regulatory visibility of AI, allocate resources to promote beneficial outcomes, and enforce restrictions against irresponsible or malicious AI development and usage. However, while compute-based policies and technologies have the potential to assist in these areas, there is significant variation in their readiness for implementation. Some ideas are currently being piloted, while others are hindered by the need for fundamental research. Furthermore, naive or poorly scoped approaches to compute governance carry significant risks in areas like privacy, economic impacts, and centralization of power. We end by suggesting guardrails to minimize these risks from compute governance.

openalex-author · arXiv (Cornell University)

Iterated Denoising Energy Matching for Sampling from Boltzmann Densities

Efficiently generating statistically independent samples from an unnormalized probability distribution, such as equilibrium samples of many-body systems, is a foundational problem in science. In this paper, we propose Iterated Denoising Energy Matching (iDEM), an iterative algorithm that uses a novel stochastic score matching objective leveraging solely the energy function and its gradient -- and no data samples -- to train a diffusion-based sampler. Specifically, iDEM alternates between (I) sampling regions of high model density from a diffusion-based sampler and (II) using these samples in our stochastic matching objective to further improve the sampler. iDEM is scalable to high dimensions as the inner matching objective, is simulation-free, and requires no MCMC samples. Moreover, by leveraging the fast mode mixing behavior of diffusion, iDEM smooths out the energy landscape enabling efficient exploration and learning of an amortized sampler. We evaluate iDEM on a suite of tasks ranging from standard synthetic energy functions to invariant $n$-body particle systems. We show that the proposed approach achieves state-of-the-art performance on all metrics and trains $2-5\times$ faster, which allows it to be the first method to train using energy on the challenging $55$-particle Lennard-Jones system.

openalex-author · arXiv (Cornell University)

Efficient Causal Graph Discovery Using Large Language Models

We propose a novel framework that leverages LLMs for full causal graph discovery. While previous LLM-based methods have used a pairwise query approach, this requires a quadratic number of queries which quickly becomes impractical for larger causal graphs. In contrast, the proposed framework uses a breadth-first search (BFS) approach which allows it to use only a linear number of queries. We also show that the proposed method can easily incorporate observational data when available, to improve performance. In addition to being more time and data-efficient, the proposed framework achieves state-of-the-art results on real-world causal graphs of varying sizes. The results demonstrate the effectiveness and efficiency of the proposed method in discovering causal relationships, showcasing its potential for broad applicability in causal graph discovery tasks across different domains.

openalex-author · Advances in Neural Information Processing Systems 37

Amortizing intractable inference in diffusion models for vision, language, and control

No abstract available from the OpenAlex source record.

openalex-author · Advances in Neural Information Processing Systems 37

Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving

No abstract available from the OpenAlex source record.

openalex-author · Advances in Neural Information Processing Systems 37

Improved off-policy training of diffusion samplers

No abstract available from the OpenAlex source record.

openalex-author · Advances in Neural Information Processing Systems 37

Trajectory Flow Matching with Applications to Clinical Time Series Modelling

No abstract available from the OpenAlex source record.

openalex-author · Neuroscience of Consciousness

Sources of richness and ineffability for phenomenally conscious states

Conscious states-state that there is something it is like to be in-seem both rich or full of detail and ineffable or hard to fully describe or recall. The problem of ineffability, in particular, is a longstanding issue in philosophy that partly motivates the explanatory gap: the belief that consciousness cannot be reduced to underlying physical processes. Here, we provide an information theoretic dynamical systems perspective on the richness and ineffability of consciousness. In our framework, the richness of conscious experience corresponds to the amount of information in a conscious state and ineffability corresponds to the amount of information lost at different stages of processing. We describe how attractor dynamics in working memory would induce impoverished recollections of our original experiences, how the discrete symbolic nature of language is insufficient for describing the rich and high-dimensional structure of experiences, and how similarity in the cognitive function of two individuals relates to improved communicability of their experiences to each other. While our model may not settle all questions relating to the explanatory gap, it makes progress toward a fully physicalist explanation of the richness and ineffability of conscious experience-two important aspects that seem to be part of what makes qualitative character so puzzling.

openalex-author · Digital Discovery

Towards equilibrium molecular conformation generation with GFlowNets

GFlowNets allow for sampling diverse, thermodynamically feasible molecular conformations from the Boltzmann distribution.

openalex-author · arXiv (Cornell University)

A Hitchhiker's Guide to Geometric GNNs for 3D Atomic Systems

Recent advances in computational modelling of atomic systems, spanning molecules, proteins, and materials, represent them as geometric graphs with atoms embedded as nodes in 3D Euclidean space. In these graphs, the geometric attributes transform according to the inherent physical symmetries of 3D atomic systems, including rotations and translations in Euclidean space, as well as node permutations. In recent years, Geometric Graph Neural Networks have emerged as the preferred machine learning architecture powering applications ranging from protein structure prediction to molecular simulations and material generation. Their specificity lies in the inductive biases they leverage - such as physical symmetries and chemical properties - to learn informative representations of these geometric graphs. In this opinionated paper, we provide a comprehensive and self-contained overview of the field of Geometric GNNs for 3D atomic systems. We cover fundamental background material and introduce a pedagogical taxonomy of Geometric GNN architectures: (1) invariant networks, (2) equivariant networks in Cartesian basis, (3) equivariant networks in spherical basis, and (4) unconstrained networks. Additionally, we outline key datasets and application areas and suggest future research directions. The objective of this work is to present a structured perspective on the field, making it accessible to newcomers and aiding practitioners in gaining an intuition for its mathematical abstractions.

openalex-author · arXiv (Cornell University)

Improving Gradient-guided Nested Sampling for Posterior Inference

We present a performant, general-purpose gradient-guided nested sampling algorithm, ${\tt GGNS}$, combining the state of the art in differentiable programming, Hamiltonian slice sampling, clustering, mode separation, dynamic nested sampling, and parallelization. This unique combination allows ${\tt GGNS}$ to scale well with dimensionality and perform competitively on a variety of synthetic and real-world problems. We also show the potential of combining nested sampling with generative flow networks to obtain large amounts of high-quality samples from the posterior distribution. This combination leads to faster mode discovery and more accurate estimates of the partition function.

openalex-author · arXiv (Cornell University)

Unlearning via Sparse Representations

Machine \emph{unlearning}, which involves erasing knowledge about a \emph{forget set} from a trained model, can prove to be costly and infeasible by existing techniques. We propose a nearly compute-free zero-shot unlearning technique based on a discrete representational bottleneck. We show that the proposed technique efficiently unlearns the forget set and incurs negligible damage to the model's performance on the rest of the data set. We evaluate the proposed technique on the problem of \textit{class unlearning} using three datasets: CIFAR-10, CIFAR-100, and LACUNA-100. We compare the proposed technique to SCRUB, a state-of-the-art approach which uses knowledge distillation for unlearning. Across all three datasets, the proposed technique performs as well as, if not better than SCRUB while incurring almost no computational cost.

openalex-author · arXiv (Cornell University)

Mitigating Shortcut Learning with Diffusion Counterfactuals and Diverse Ensembles

Spurious correlations in the data, where multiple cues are predictive of the target labels, often lead to a phenomenon known as shortcut learning, where a model relies on erroneous, easy-to-learn cues while ignoring reliable ones. In this work, we propose DiffDiv an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs) to mitigate this form of bias. We show that at particular training intervals, DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features. We leverage this crucial property to generate synthetic counterfactuals to increase model diversity via ensemble disagreement. We show that DPM-guided diversification is sufficient to remove dependence on shortcut cues, without a need for additional supervised signals. We further empirically quantify its efficacy on several diversification objectives, and finally show improved generalization and diversification on par with prior work that relies on auxiliary data collection.

openalex-author · The Journal of Neuroscience

Responses to Pattern-Violating Visual Stimuli Evolve Differently Over Days in Somata and Distal Apical Dendrites

Scientists have long conjectured that the neocortex learns patterns in sensory data to generate top-down predictions of upcoming stimuli. In line with this conjecture, different responses to pattern-matching vs pattern-violating visual stimuli have been observed in both spiking and somatic calcium imaging data. However, it remains unknown whether these pattern-violation signals are different between the distal apical dendrites, which are heavily targeted by top-down signals, and the somata, where bottom-up information is primarily integrated. Furthermore, it is unknown how responses to pattern-violating stimuli evolve over time as an animal gains more experience with them. Here, we address these unanswered questions by analyzing responses of individual somata and dendritic branches of layer 2/3 and layer 5 pyramidal neurons tracked over multiple days in primary visual cortex of awake, behaving female and male mice. We use sequences of Gabor patches with patterns in their orientations to create pattern-matching and pattern-violating stimuli, and two-photon calcium imaging to record neuronal responses. Many neurons in both layers show large differences between their responses to pattern-matching and pattern-violating stimuli. Interestingly, these responses evolve in opposite directions in the somata and distal apical dendrites, with somata becoming less sensitive to pattern-violating stimuli and distal apical dendrites more sensitive. These differences between the somata and distal apical dendrites may be important for hierarchical computation of sensory predictions and learning, since these two compartments tend to receive bottom-up and top-down information, respectively.

openalex-author · arXiv (Cornell University)

SatBird: Bird Species Distribution Modeling with Remote Sensing and Citizen Science Data

Biodiversity is declining at an unprecedented rate, impacting ecosystem services necessary to ensure food, water, and human health and well-being. Understanding the distribution of species and their habitats is crucial for conservation policy planning. However, traditional methods in ecology for species distribution models (SDMs) generally focus either on narrow sets of species or narrow geographical areas and there remain significant knowledge gaps about the distribution of species. A major reason for this is the limited availability of data traditionally used, due to the prohibitive amount of effort and expertise required for traditional field monitoring. The wide availability of remote sensing data and the growing adoption of citizen science tools to collect species observations data at low cost offer an opportunity for improving biodiversity monitoring and enabling the modelling of complex ecosystems. We introduce a novel task for mapping bird species to their habitats by predicting species encounter rates from satellite images, and present SatBird, a satellite dataset of locations in the USA with labels derived from presence-absence observation data from the citizen science database eBird, considering summer (breeding) and winter seasons. We also provide a dataset in Kenya representing low-data regimes. We additionally provide environmental data and species range maps for each location. We benchmark a set of baselines on our dataset, including SOTA models for remote sensing tasks. SatBird opens up possibilities for scalably modelling properties of ecosystems worldwide.

openalex-author · arXiv (Cornell University)

Object-centric architectures enable efficient causal representation learning

Causal representation learning has showed a variety of settings in which we can disentangle latent variables with identifiability guarantees (up to some reasonable equivalence class). Common to all of these approaches is the assumption that (1) the latent variables are represented as $d$-dimensional vectors, and (2) that the observations are the output of some injective generative function of these latent variables. While these assumptions appear benign, we show that when the observations are of multiple objects, the generative function is no longer injective and disentanglement fails in practice. We can address this failure by combining recent developments in object-centric learning and causal representation learning. By modifying the Slot Attention architecture arXiv:2006.15055, we develop an object-centric architecture that leverages weak supervision from sparse perturbations to disentangle each object's properties. This approach is more data-efficient in the sense that it requires significantly fewer perturbations than a comparable approach that encodes to a Euclidean space and we show that this approach successfully disentangles the properties of a set of objects in a series of simple image-based disentanglement experiments.

openalex-author · Ethics and Information Technology

Generative AI models should include detection mechanisms as a condition for public release

Abstract The new wave of ‘foundation models’—general-purpose generative AI models, for production of text (e.g., ChatGPT) or images (e.g., MidJourney)—represent a dramatic advance in the state of the art for AI. But their use also introduces a range of new risks, which has prompted an ongoing conversation about possible regulatory mechanisms. Here we propose a specific principle that should be incorporated into legislation: that any organization developing a foundation model intended for public use must demonstrate a reliable detection mechanism for the content it generates, as a condition of its public release. The detection mechanism should be made publicly available in a tool that allows users to query, for an arbitrary item of content, whether the item was generated (wholly or partly) by the model. In this paper, we argue that this requirement is technically feasible and would play an important role in reducing certain risks from new AI models in many domains. We also outline a number of options for the tool’s design, and summarize a number of points where further input from policymakers and researchers would be required.

openalex-author · Massaroli, S, Poli, M, Fu, D Y, Kumbong, H, Parnichkun, R N, Timalsina, A, Romero, D W, McIntyre, Q, Chen, B, Rudra, A, Zhang, C, Re, C, Ermon, S & Bengio, Y 20

Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions

Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers. In particular, long convolution sequence models have achieved state-of-the-art performance in many domains, but incur a significant cost during auto-regressive inference workloads -- naively requiring a full pass (or caching of activations) over the input sequence for each generated token -- similarly to attention-based models. In this paper, we seek to enable $\mathcal O(1)$ compute and memory cost per token in any pre-trained long convolution architecture to reduce memory footprint and increase throughput during generation. Concretely, our methods consist in extracting low-dimensional linear state-space models from each convolution layer, building upon rational interpolation and model-order reduction techniques. We further introduce architectural improvements to convolution-based layers such as Hyena: by weight-tying the filters across channels into heads, we achieve higher pre-training quality and reduce the number of filters to be distilled. The resulting model achieves 10x higher throughput than Transformers and 1.5x higher than Hyena at 1.3B parameters, without any loss in quality after distillation.

openalex-author · arXiv (Cornell University)

OC-NMN: Object-centric Compositional Neural Module Network for Generative Visual Analogical Reasoning

A key aspect of human intelligence is the ability to imagine -- composing learned concepts in novel ways -- to make sense of new scenarios. Such capacity is not yet attained for machine learning systems. In this work, in the context of visual reasoning, we show how modularity can be leveraged to derive a compositional data augmentation framework inspired by imagination. Our method, denoted Object-centric Compositional Neural Module Network (OC-NMN), decomposes visual generative reasoning tasks into a series of primitives applied to objects without using a domain-specific language. We show that our modular architectural choices can be used to generate new training tasks that lead to better out-of-distribution generalization. We compare our model to existing and new baselines in proposed visual reasoning benchmark that consists of applying arithmetic operations to MNIST digits.

openalex-author · arXiv (Cornell University)

PhyloGFN: Phylogenetic inference with generative flow networks

Phylogenetics is a branch of computational biology that studies the evolutionary relationships among biological entities. Its long history and numerous applications notwithstanding, inference of phylogenetic trees from sequence data remains challenging: the high complexity of tree space poses a significant obstacle for the current combinatorial and probabilistic techniques. In this paper, we adopt the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. Because GFlowNets are well-suited for sampling complex combinatorial structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies and evolutionary distances. We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets. PhyloGFN is competitive with prior works in marginal likelihood estimation and achieves a closer fit to the target distribution than state-of-the-art variational inference methods. Our code is available at https://github.com/zmy1116/phylogfn.

openalex-author · arXiv (Cornell University)

A cry for help: Early detection of brain injury in newborns

Since the 1960s, neonatal clinicians have known that newborns suffering from certain neurological conditions exhibit altered crying patterns such as the high-pitched cry in birth asphyxia. Despite an annual burden of over 1.5 million infant deaths and disabilities, early detection of neonatal brain injuries due to asphyxia remains a challenge, particularly in developing countries where the majority of births are not attended by a trained physician. Here we report on the first inter-continental clinical study to demonstrate that neonatal brain injury can be reliably determined from recorded infant cries using an AI algorithm we call Roseline. Previous and recent work has been limited by the lack of a large, high-quality clinical database of cry recordings, constraining the application of state-of-the-art machine learning. We develop a new training methodology for audio-based pathology detection models and evaluate this system on a large database of newborn cry sounds acquired from geographically diverse settings -- 5 hospitals across 3 continents. Our system extracts interpretable acoustic biomarkers that support clinical decisions and is able to accurately detect neurological injury from newborns' cries with an AUC of 92.5% (88.7% sensitivity at 80% specificity). Cry-based neurological monitoring opens the door for low-cost, easy-to-use, non-invasive and contact-free screening of at-risk babies, especially when integrated into simple devices like smartphones or neonatal ICU monitors. This would provide a reliable tool where there are no alternatives, but also curtail the need to regularly exert newborns to physically-exhausting or radiation-exposing assessments such as brain CT scans. This work sets the stage for embracing the infant cry as a vital sign and indicates the potential of AI-driven sound monitoring for the future of affordable healthcare.

openalex-author · arXiv (Cornell University)

On the importance of catalyst-adsorbate 3D interactions for relaxed energy predictions

The use of machine learning for material property prediction and discovery has traditionally centered on graph neural networks that incorporate the geometric configuration of all atoms. However, in practice not all this information may be readily available, e.g.~when evaluating the potentially unknown binding of adsorbates to catalyst. In this paper, we investigate whether it is possible to predict a system's relaxed energy in the OC20 dataset while ignoring the relative position of the adsorbate with respect to the electro-catalyst. We consider SchNet, DimeNet++ and FAENet as base architectures and measure the impact of four modifications on model performance: removing edges in the input graph, pooling independent representations, not sharing the backbone weights and using an attention mechanism to propagate non-geometric relative information. We find that while removing binding site information impairs accuracy as expected, modified models are able to predict relaxed energies with remarkably decent MAE. Our work suggests future research directions in accelerated materials discovery where information on reactant configurations can be reduced or altogether omitted.

openalex-author · ArXiv.org

Crystal-GFN: sampling crystals with desirable properties and constraints

The discovery of novel solid-state materials, such as electrocatalysts, super-ionic conductors, or photovoltaic materials, plays a critical role in addressing various global challenges. It has, for instance, the potential to significantly improve the efficiency of renewable energy production and storage, thereby making substantial contributions to climate crisis mitigation strategies. In this paper, we introduce Crystal-GFN, a generative model of crystal structures possessing desirable properties and constraints. Operating as a multi-environment, continuous-discrete GFlowNet, it sequentially samples structural attributes of crystalline materials, namely space group, composition and lattice parameters. This domain-inspired approach enables the flexible incorporation of physicochemical and geometric hard constraints. We demonstrate the capabilities of Crystal-GFN to efficiently discover diverse and valid crystals with various properties: low predicted formation energy (median -3.2 eV/atom), band gap close to a target value and high density. Overall, Crystal-GFN is a crystal generation method that addresses several existing challenges in the literature and opens promising paths for accelerating materials discovery with machine learning.

openalex-author · arXiv (Cornell University)

Amortizing intractable inference in large language models

Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest -- including sequence continuation, infilling, and other forms of constrained generation -- involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use.

openalex-author · arXiv (Cornell University)

Causal Inference in Gene Regulatory Networks with GFlowNet: Towards Scalability in Large Systems

Understanding causal relationships within Gene Regulatory Networks (GRNs) is essential for unraveling the gene interactions in cellular processes. However, causal discovery in GRNs is a challenging problem for multiple reasons including the existence of cyclic feedback loops and uncertainty that yields diverse possible causal structures. Previous works in this area either ignore cyclic dynamics (assume acyclic structure) or struggle with scalability. We introduce Swift-DynGFN as a novel framework that enhances causal structure learning in GRNs while addressing scalability concerns. Specifically, Swift-DynGFN exploits gene-wise independence to boost parallelization and to lower computational cost. Experiments on real single-cell RNA velocity and synthetic GRN datasets showcase the advancement in learning causal structure in GRNs and scalability in larger systems.

openalex-author · arXiv (Cornell University)

Pre-Training and Fine-Tuning Generative Flow Networks

Generative Flow Networks (GFlowNets) are amortized samplers that learn stochastic policies to sequentially generate compositional objects from a given unnormalized reward distribution. They can generate diverse sets of high-reward objects, which is an important consideration in scientific discovery tasks. However, as they are typically trained from a given extrinsic reward function, it remains an important open challenge about how to leverage the power of pre-training and train GFlowNets in an unsupervised fashion for efficient adaptation to downstream tasks. Inspired by recent successes of unsupervised pre-training in various domains, we introduce a novel approach for reward-free pre-training of GFlowNets. By framing the training as a self-supervised problem, we propose an outcome-conditioned GFlowNet (OC-GFN) that learns to explore the candidate space. Specifically, OC-GFN learns to reach any targeted outcomes, akin to goal-conditioned policies in reinforcement learning. We show that the pre-trained OC-GFN model can allow for a direct extraction of a policy capable of sampling from any new reward functions in downstream tasks. Nonetheless, adapting OC-GFN on a downstream task-specific reward involves an intractable marginalization over possible outcomes. We propose a novel way to approximate this marginalization by learning an amortized predictor enabling efficient fine-tuning. Extensive experimental results validate the efficacy of our approach, demonstrating the effectiveness of pre-training the OC-GFN, and its ability to swiftly adapt to downstream tasks and discover modes more efficiently. This work may serve as a foundation for further exploration of pre-training strategies in the context of GFlowNets.

openalex-author · arXiv (Cornell University)

Learning to Scale Logits for Temperature-Conditional GFlowNets

GFlowNets are probabilistic models that sequentially generate compositional structures through a stochastic policy. Among GFlowNets, temperature-conditional GFlowNets can introduce temperature-based controllability for exploration and exploitation. We propose \textit{Logit-scaling GFlowNets} (Logit-GFN), a novel architectural design that greatly accelerates the training of temperature-conditional GFlowNets. It is based on the idea that previously proposed approaches introduced numerical challenges in the deep network training, since different temperatures may give rise to very different gradient profiles as well as magnitudes of the policy's logits. We find that the challenge is greatly reduced if a learned function of the temperature is used to scale the policy's logits directly. Also, using Logit-GFN, GFlowNets can be improved by having better generalization capabilities in offline learning and mode discovery capabilities in online learning, which is empirically verified in various biological and chemical tasks. Our code is available at \url{https://github.com/dbsxodud-11/logit-gfn}

openalex-author · arXiv (Cornell University)

Local Search GFlowNets

Generative Flow Networks (GFlowNets) are amortized sampling methods that learn a distribution over discrete objects proportional to their rewards. GFlowNets exhibit a remarkable ability to generate diverse samples, yet occasionally struggle to consistently produce samples with high rewards due to over-exploration on wide sample space. This paper proposes to train GFlowNets with local search, which focuses on exploiting high-rewarded sample space to resolve this issue. Our main idea is to explore the local neighborhood via backtracking and reconstruction guided by backward and forward policies, respectively. This allows biasing the samples toward high-reward solutions, which is not possible for a typical GFlowNet solution generation scheme, which uses the forward policy to generate the solution from scratch. Extensive experiments demonstrate a remarkable performance improvement in several biochemical tasks. Source code is available: \url{https://github.com/dbsxodud-11/ls_gfn}.

openalex-author · arXiv (Cornell University)

Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization

We tackle the problem of sampling from intractable high-dimensional density functions, a fundamental task that often appears in machine learning and statistics. We extend recent sampling-based approaches that leverage controlled stochastic processes to model approximate samples from these target densities. The main drawback of these approaches is that the training objective requires full trajectories to compute, resulting in sluggish credit assignment issues due to use of entire trajectories and a learning signal present only at the terminal time. In this work, we present Diffusion Generative Flow Samplers (DGFS), a sampling-based framework where the learning process can be tractably broken down into short partial trajectory segments, via parameterizing an additional "flow function". Our method takes inspiration from the theory developed for generative flow networks (GFlowNets), allowing us to make use of intermediate learning signals. Through various challenging experiments, we demonstrate that DGFS achieves more accurate estimates of the normalization constant than closely-related prior methods.

openalex-author · arXiv (Cornell University)

Expected flow networks in stochastic environments and two-player zero-sum games

Generative flow networks (GFlowNets) are sequential sampling models trained to match a given distribution. GFlowNets have been successfully applied to various structured object generation tasks, sampling a diverse set of high-reward objects quickly. We propose expected flow networks (EFlowNets), which extend GFlowNets to stochastic environments. We show that EFlowNets outperform other GFlowNet formulations in stochastic tasks such as protein design. We then extend the concept of EFlowNets to adversarial environments, proposing adversarial flow networks (AFlowNets) for two-player zero-sum games. We show that AFlowNets learn to find above 80% of optimal moves in Connect-4 via self-play and outperform AlphaZero in tournaments.

openalex-author · arXiv (Cornell University)

Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts in Underspecified Visual Tasks

Spurious correlations in the data, where multiple cues are predictive of the target labels, often lead to shortcut learning phenomena, where a model may rely on erroneous, easy-to-learn, cues while ignoring reliable ones. In this work, we propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs). We discover that DPMs have the inherent capability to represent multiple visual cues independently, even when they are largely correlated in the training data. We leverage this characteristic to encourage model diversity and empirically show the efficacy of the approach with respect to several diversification objectives. We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.

openalex-author · arXiv (Cornell University)

Discrete, compositional, and symbolic representations through attractor dynamics

Symbolic systems are powerful frameworks for modeling cognitive processes as they encapsulate the rules and relationships fundamental to many aspects of human reasoning and behavior. Central to these models are systematicity, compositionality, and productivity, making them invaluable in both cognitive science and artificial intelligence. However, certain limitations remain. For instance, the integration of structured symbolic processes and latent sub-symbolic processes has been implemented at the computational level through fiat methods such as quantization or softmax sampling, which assume, rather than derive, the operations underpinning discretization and symbolicization. In this work, we introduce a novel neural stochastic dynamical systems model that integrates attractor dynamics with symbolic representations to model cognitive processes akin to the probabilistic language of thought (PLoT). Our model segments the continuous representational space into discrete basins, with attractor states corresponding to symbolic sequences, that reflect the semanticity and compositionality characteristic of symbolic systems through unsupervised learning, rather than relying on pre-defined primitives. Moreover, like PLoT, our model learns to sample a diverse distribution of attractor states that reflect the mutual information between the input data and the symbolic encodings. This approach establishes a unified framework that integrates both symbolic and sub-symbolic processing through neural dynamics, a neuro-plausible substrate with proven expressivity in AI, offering a more comprehensive model that mirrors the complex duality of cognitive operations.

openalex-author · arXiv (Cornell University)

Delta-AI: Local objectives for amortized inference in sparse graphical models

We present a new algorithm for amortized inference in sparse probabilistic graphical models (PGMs), which we call $Δ$-amortized inference ($Δ$-AI). Our approach is based on the observation that when the sampling of variables in a PGM is seen as a sequence of actions taken by an agent, sparsity of the PGM enables local credit assignment in the agent's policy learning objective. This yields a local constraint that can be turned into a local loss in the style of generative flow networks (GFlowNets) that enables off-policy training but avoids the need to instantiate all the random variables for each parameter update, thus speeding up training considerably. The $Δ$-AI objective matches the conditional distribution of a variable given its Markov blanket in a tractable learned sampler, which has the structure of a Bayesian network, with the same conditional distribution under the target PGM. As such, the trained sampler recovers marginals and conditional distributions of interest and enables inference of partial subsets of variables. We illustrate $Δ$-AI's effectiveness for sampling from synthetic PGMs and training latent variable models with sparse factor structure.

openalex-author · Cell Reports Methods

RECOVER identifies synergistic drug combinations in vitro through sequential model optimization

openalex-author · arXiv (Cornell University)

Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning

Inspired by human conscious planning, we propose Skipper, a model-based reinforcement learning framework utilizing spatio-temporal abstractions to generalize better in novel situations. It automatically decomposes the given task into smaller, more manageable subtasks, and thus enables sparse decision-making and focused computation on the relevant parts of the environment. The decomposition relies on the extraction of an abstracted proxy problem represented as a directed graph, in which vertices and edges are learned end-to-end from hindsight. Our theoretical analyses provide performance guarantees under appropriate assumptions and establish where our approach is expected to be helpful. Generalization-focused experiments validate Skipper's significant advantage in zero-shot generalization, compared to some existing state-of-the-art hierarchical planning methods.

openalex-author · arXiv (Cornell University)

Tree Cross Attention

Cross Attention is a popular method for retrieving information from a set of context tokens for making predictions. At inference time, for each prediction, Cross Attention scans the full set of $\mathcal{O}(N)$ tokens. In practice, however, often only a small subset of tokens are required for good performance. Methods such as Perceiver IO are cheap at inference as they distill the information to a smaller-sized set of latent tokens $L < N$ on which cross attention is then applied, resulting in only $\mathcal{O}(L)$ complexity. However, in practice, as the number of input tokens and the amount of information to distill increases, the number of latent tokens needed also increases significantly. In this work, we propose Tree Cross Attention (TCA) - a module based on Cross Attention that only retrieves information from a logarithmic $\mathcal{O}(\log(N))$ number of tokens for performing inference. TCA organizes the data in a tree structure and performs a tree search at inference time to retrieve the relevant tokens for prediction. Leveraging TCA, we introduce ReTreever, a flexible architecture for token-efficient inference. We show empirically that Tree Cross Attention (TCA) performs comparable to Cross Attention across various classification and uncertainty regression tasks while being significantly more token-efficient. Furthermore, we compare ReTreever against Perceiver IO, showing significant gains while using the same number of tokens for inference.

openalex-author · Nature

Publisher Correction: Scientific discovery in the age of artificial intelligence

No abstract available from the OpenAlex source record.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

GEO-Bench: m-chesapeake-lancover

GEO-Bench: m-chesapeake-landcover dataset This dataset has been modified to be included in the GEO-Bench dataset. All changes with respect to the original version are documented at https://github.com/ServiceNow/geo-bench. The original version of this dataset is due to Robinson et al. (2019) and is available at: https://mlhub.earth/data/microsoft_chesapeake. See the LICENSE file provided alongside this dataset for applicable licensing information.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

GEO-Bench: m-NeonTree

GEO-Bench: m-NeonTree dataset This dataset has been modified to be included in the GEO-Bench dataset. All changes with respect to the original version are documented at https://github.com/ServiceNow/geo-bench. The original version of this dataset is due to Weinstein et al. (2019) and is available at: https://github.com/weecology/NeonTreeEvaluation. See the LICENSE file provided alongside this dataset for applicable licensing information.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

GEO-Bench: test-dataset

GEO-Bench: test dataset

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

GEO-Bench: m-pv4ger-seg

GEO-Bench: m-pv4ger-seg dataset This dataset has been modified to be included in the GEO-Bench dataset. All changes with respect to the original version are documented at https://github.com/ServiceNow/geo-bench. The original version of this dataset is due to Mayer et al. (2022) and is available at: https://github.com/kdmayer/3D-PV-Locator. See the LICENSE file provided alongside this dataset for applicable licensing information.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

GEO-Bench: m-SA-crop-type

GEO-Bench: m-SA-crop-type dataset This dataset has been modified to be included in the GEO-Bench dataset. All changes with respect to the original version are documented at https://github.com/ServiceNow/geo-bench. The original version of this dataset is due to the Western Cape Department of Agriculture and the Radiant Earth Foundation (2021) and is available at: https://mlhub.earth/data/ref_south_africa_crops_competition_v1. See the LICENSE file provided alongside this dataset for applicable licensing information.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

GEO-Bench: m-pv4ger

GEO-Bench: m-pv4ger dataset This dataset has been modified to be included in the GEO-Bench dataset. All changes with respect to the original version are documented at https://github.com/ServiceNow/geo-bench. The original version of this dataset is due to Mayer et al. (2022) and is available at: https://github.com/kdmayer/3D-PV-Locator. See the LICENSE file provided alongside this dataset for applicable licensing information.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

GEO-Bench: m-so2sat

GEO-Bench: m-so2sat dataset This dataset has been modified to be included in the GEO-Bench dataset. All changes with respect to the original version are documented at https://github.com/ServiceNow/geo-bench. The original version of this dataset is due to Zhu et al. (2019) and is available at: https://doi.org/10.14459/2018mp1483140. See the LICENSE file provided alongside this dataset for applicable licensing information.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

GEO-Bench: m-forestnet

GEO-Bench: m-forestnet dataset This dataset has been modified to be included in the GEO-Bench dataset. All changes with respect to the original version are documented at https://github.com/ServiceNow/geo-bench. The original version of this dataset is due to Irvin et al. (2020) and is available at: https://stanfordmlgroup.github.io/projects/forestnet/. See the LICENSE file provided alongside this dataset for applicable licensing information.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

GEO-Bench: m-bigearthnet

GEO-Bench: m-bigearthnet dataset This dataset has been modified to be included in the GEO-Bench dataset. All changes with respect to the original version are documented at https://github.com/ServiceNow/geo-bench. The original version of this dataset is due to Sumbul et al. (2019) and is available at: https://mlhub.earth/10.14279/depositonce-10149. See the LICENSE file provided alongside this dataset for applicable licensing information.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

GEO-Bench: m-nz-cattle

GEO-Bench: m-nz-cattle dataset This dataset has been modified to be included in the GEO-Bench dataset. All changes with respect to the original version are documented at https://github.com/ServiceNow/geo-bench. The original version of this dataset is due to Abuaiadah et al. (2022) and is available at: https://zenodo.org/record/5908869. See the LICENSE file provided alongside this dataset for applicable licensing information.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

GEO-Bench: m-cashew-plantation

GEO-Bench: m-cashew-plantation dataset This dataset has been modified to be included in the GEO-Bench dataset. All changes with respect to the original version are documented at https://github.com/ServiceNow/geo-bench. The original version of this dataset is due to Jin et al. (2021) and is available at: https://mlhub.earth/data/ts_cashew_benin. See the LICENSE file provided alongside this dataset for applicable licensing information.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

GEO-Bench: m-brick-kiln

GEO-Bench: m-brick-kiln dataset This dataset has been modified to be included in the GEO-Bench dataset. All changes with respect to the original version are documented at https://github.com/ServiceNow/geo-bench. The original version of this dataset is due to Lee et al. (2021) and is available at: https://sustainlab-group.github.io/sustainbench/docs/datasets/sdg13/brick_kiln.html. See the LICENSE file provided alongside this dataset for applicable licensing information.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

GEO-Bench: m-eurosat

GEO-Bench: m-eurosat dataset This dataset has been modified to be included in the GEO-Bench dataset. All changes with respect to the original version are documented at https://github.com/ServiceNow/geo-bench. The original version of this dataset is due to Helber et al. (2019) and is available at: https://github.com/phelber/eurosat. See the LICENSE file provided alongside this dataset for applicable licensing information.

openalex-author · arXiv (Cornell University)

Consciousness in Artificial Intelligence: Insights from the Science of Consciousness

Whether current or near-term AI systems could be conscious is a topic of scientific interest and increasing public concern. This report argues for, and exemplifies, a rigorous and empirically grounded approach to AI consciousness: assessing existing AI systems in detail, in light of our best-supported neuroscientific theories of consciousness. We survey several prominent scientific theories of consciousness, including recurrent processing theory, global workspace theory, higher-order theories, predictive processing, and attention schema theory. From these theories we derive "indicator properties" of consciousness, elucidated in computational terms that allow us to assess AI systems for these properties. We use these indicator properties to assess several recent AI systems, and we discuss how future systems might implement them. Our analysis suggests that no current AI systems are conscious, but also suggests that there are no obvious technical barriers to building AI systems which satisfy these indicators.

openalex-author · Nature

Scientific discovery in the age of artificial intelligence

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Benchmarking Bayesian Causal Discovery Methods for Downstream Treatment Effect Estimation

The practical utility of causality in decision-making is widespread and brought about by the intertwining of causal discovery and causal inference. Nevertheless, a notable gap exists in the evaluation of causal discovery methods, where insufficient emphasis is placed on downstream inference. To address this gap, we evaluate seven established baseline causal discovery methods including a newly proposed method based on GFlowNets, on the downstream task of treatment effect estimation. Through the implementation of a distribution-level evaluation, we offer valuable and unique insights into the efficacy of these causal discovery methods for treatment effect estimation, considering both synthetic and real-world scenarios, as well as low-data scenarios. The results of our study demonstrate that some of the algorithms studied are able to effectively capture a wide range of useful and diverse ATE modes, while some tend to learn many low-probability modes which impacts the (unrelaxed) recall and precision.

openalex-author · arXiv (Cornell University)

International Institutions for Advanced AI

International institutions may have an important role to play in ensuring advanced AI systems benefit humanity. International collaborations can unlock AI's ability to further sustainable development, and coordination of regulatory efforts can reduce obstacles to innovation and the spread of benefits. Conversely, the potential dangerous capabilities of powerful and general-purpose AI systems create global externalities in their development and deployment, and international efforts to further responsible AI practices could help manage the risks they pose. This paper identifies a set of governance functions that could be performed at an international level to address these challenges, ranging from supporting access to frontier AI systems to setting international safety standards. It groups these functions into four institutional models that exhibit internal synergies and have precedents in existing organizations: 1) a Commission on Frontier AI that facilitates expert consensus on opportunities and risks from advanced AI, 2) an Advanced AI Governance Organization that sets international standards to manage global threats from advanced models, supports their implementation, and possibly monitors compliance with a future governance regime, 3) a Frontier AI Collaborative that promotes access to cutting-edge AI, and 4) an AI Safety Project that brings together leading researchers and engineers to further AI safety research. We explore the utility of these models and identify open questions about their viability.

openalex-author · arXiv (Cornell University)

Simulation-free Schrödinger bridges via score and flow matching

We present simulation-free score and flow matching ([SF]$^2$M), a simulation-free objective for inferring stochastic dynamics given unpaired samples drawn from arbitrary source and target distributions. Our method generalizes both the score-matching loss used in the training of diffusion models and the recently proposed flow matching loss used in the training of continuous normalizing flows. [SF]$^2$M interprets continuous-time stochastic generative modeling as a Schrödinger bridge problem. It relies on static entropy-regularized optimal transport, or a minibatch approximation, to efficiently learn the SB without simulating the learned stochastic process. We find that [SF]$^2$M is more efficient and gives more accurate solutions to the SB problem than simulation-based methods from prior work. Finally, we apply [SF]$^2$M to the problem of learning cell dynamics from snapshot data. Notably, [SF]$^2$M is the first method to accurately model cell dynamics in high dimensions and can recover known gene regulatory networks from simulated data. Our code is available in the TorchCFM package at https://github.com/atong01/conditional-flow-matching.

openalex-author · arXiv (Cornell University)

Generative Flow Networks: a Markov Chain Perspective

While Markov chain Monte Carlo methods (MCMC) provide a general framework to sample from a probability distribution defined up to normalization, they often suffer from slow convergence to the target distribution when the latter is highly multi-modal. Recently, Generative Flow Networks (GFlowNets) have been proposed as an alternative framework to mitigate this issue when samples have a clear compositional structure, by treating sampling as a sequential decision making problem. Although they were initially introduced from the perspective of flow networks, the recent advances of GFlowNets draw more and more inspiration from the Markov chain literature, bypassing completely the need for flows. In this paper, we formalize this connection and offer a new perspective for GFlowNets using Markov chains, showing a unifying view for GFlowNets regardless of the nature of the state space as recurrent Markov chains. Positioning GFlowNets under the same theoretical framework as MCMC methods also allows us to identify the similarities between both frameworks, and most importantly to highlight their

openalex-author · arXiv (Cornell University)

Thompson sampling for improved exploration in GFlowNets

Generative flow networks (GFlowNets) are amortized variational inference algorithms that treat sampling from a distribution over compositional objects as a sequential decision-making problem with a learnable action policy. Unlike other algorithms for hierarchical sampling that optimize a variational bound, GFlowNet algorithms can stably run off-policy, which can be advantageous for discovering modes of the target distribution. Despite this flexibility in the choice of behaviour policy, the optimal way of efficiently selecting trajectories for training has not yet been systematically explored. In this paper, we view the choice of trajectories for training as an active learning problem and approach it using Bayesian techniques inspired by methods for multi-armed bandits. The proposed algorithm, Thompson sampling GFlowNets (TS-GFN), maintains an approximate posterior distribution over policies and samples trajectories from this posterior for training. We show in two domains that TS-GFN yields improved exploration and thus faster convergence to the target distribution than the off-policy exploration strategies used in past work.

openalex-author · arXiv (Cornell University)

HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single nucleotide resolution where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large language model based on implicit convolutions was shown to match attention in quality while allowing longer context lengths and lower time complexity. Leveraging Hyena's new long-range capabilities, we present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level - an up to 500x increase over previous dense attention-based models. HyenaDNA scales sub-quadratically in sequence length (training up to 160x faster than Transformer), uses single nucleotide tokens, and has full global context at each layer. We explore what longer context enables - including the first use of in-context learning in genomics. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude less parameters and pretraining data. On the GenomicBenchmarks, HyenaDNA surpasses SotA on 7 of 8 datasets on average by +10 accuracy points. Code at https://github.com/HazyResearch/hyena-dna.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

The Effect of Diversity in Meta-Learning

Recent studies show that task distribution plays a vital role in the meta-learner's performance. Conventional wisdom is that task diversity should improve the performance of meta-learning. In this work, we find evidence to the contrary; (i) our experiments draw into question the efficacy of our learned models: similar manifolds can be learned with a subset of the data (lower task diversity). This finding questions the advantage of providing more data to the model, and (ii) adding diversity to the task distribution (higher task diversity) sometimes hinders the model and does not lead to a significant improvement in performance as previously believed. To strengthen our findings, we provide both empirical and theoretical evidence.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization for Heterogeneous Representational Coarseness

Vector Quantization (VQ) is a method for discretizing latent representations and has become a major part of the deep learning toolkit. It has been theoretically and empirically shown that discretization of representations leads to improved generalization, including in reinforcement learning where discretization can be used to bottleneck multi-agent communication to promote agent specialization and robustness. The discretization tightness of most VQ-based methods is defined by the number of discrete codes in the representation vector and the codebook size, which are fixed as hyperparameters. In this work, we propose learning to dynamically select discretization tightness conditioned on inputs, based on the hypothesis that data naturally contains variations in complexity that call for different levels of representational coarseness which is observed in many heterogeneous data sets. We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks with heterogeneity in representations.

openalex-author · arXiv (Cornell University)

BatchGFN: Generative Flow Networks for Batch Active Learning

We introduce BatchGFN -- a novel approach for pool-based active learning that uses generative flow networks to sample sets of data points proportional to a batch reward. With an appropriate reward function to quantify the utility of acquiring a batch, such as the joint mutual information between the batch and the model parameters, BatchGFN is able to construct highly informative batches for active learning in a principled way. We show our approach enables sampling near-optimal utility batches at inference time with a single forward pass per point in the batch in toy regression problems. This alleviates the computational complexity of batch-aware algorithms and removes the need for greedy approximations to find maximizers for the batch reward. We also present early results for amortizing training across acquisition steps, which will enable scaling to real-world tasks.

openalex-author · arXiv (Cornell University)

Constant Memory Attention Block

Modern foundation model architectures rely on attention mechanisms to effectively capture context. However, these methods require linear or quadratic memory in terms of the number of inputs/datapoints, limiting their applicability in low-compute domains. In this work, we propose Constant Memory Attention Block (CMAB), a novel general-purpose attention block that computes its output in constant memory and performs updates in constant computation. Highlighting CMABs efficacy, we introduce methods for Neural Processes and Temporal Point Processes. Empirically, we show our proposed methods achieve results competitive with state-of-the-art while being significantly more memory efficient.

openalex-author · arXiv (Cornell University)

Multi-Fidelity Active Learning with GFlowNets

In the last decades, the capacity to generate large amounts of data in science and engineering applications has been growing steadily. Meanwhile, machine learning has progressed to become a suitable tool to process and utilise the available data. Nonetheless, many relevant scientific and engineering problems present challenges where current machine learning methods cannot yet efficiently leverage the available data and resources. For example, in scientific discovery, we are often faced with the problem of exploring very large, structured and high-dimensional spaces. Moreover, the high fidelity, black-box objective function is often very expensive to evaluate. Progress in machine learning methods that can efficiently tackle such challenges would help accelerate currently crucial areas such as drug and materials discovery. In this paper, we propose a multi-fidelity active learning algorithm with GFlowNets as a sampler, to efficiently discover diverse, high-scoring candidates where multiple approximations of the black-box function are available at lower fidelity and cost. Our evaluation on molecular discovery tasks shows that multi-fidelity active learning with GFlowNets can discover high-scoring candidates at a fraction of the budget of its single-fidelity counterpart while maintaining diversity, unlike RL-based alternatives. These results open new avenues for multi-fidelity active learning to accelerate scientific discovery and engineering design.

openalex-author · arXiv (Cornell University)

GEO-Bench: Toward Foundation Models for Earth Monitoring

Recent progress in self-supervision has shown that pre-training large neural networks on vast amounts of unsupervised data can lead to substantial increases in generalization to downstream tasks. Such models, recently coined foundation models, have been transformational to the field of natural language processing. Variants have also been proposed for image data, but their applicability to remote sensing tasks is limited. To stimulate the development of foundation models for Earth monitoring, we propose a benchmark comprised of six classification and six segmentation tasks, which were carefully curated and adapted to be both relevant to the field and well-suited for model evaluation. We accompany this benchmark with a robust methodology for evaluating models and reporting aggregated results to enable a reliable assessment of progress. Finally, we report results for 20 baselines to gain information about the performance of existing models. We believe that this benchmark will be a driver of progress across a variety of Earth monitoring tasks.

openalex-author · arXiv (Cornell University)

Cycle Consistency Driven Object Discovery

Developing deep learning models that effectively learn object-centric representations, akin to human cognition, remains a challenging task. Existing approaches facilitate object discovery by representing objects as fixed-size vectors, called ``slots'' or ``object files''. While these approaches have shown promise in certain scenarios, they still exhibit certain limitations. First, they rely on architectural priors which can be unreliable and usually require meticulous engineering to identify the correct objects. Second, there has been a notable gap in investigating the practical utility of these representations in downstream tasks. To address the first limitation, we introduce a method that explicitly optimizes the constraint that each object in a scene should be associated with a distinct slot. We formalize this constraint by introducing consistency objectives which are cyclic in nature. By integrating these consistency objectives into various existing slot-based object-centric methods, we showcase substantial improvements in object-discovery performance. These enhancements consistently hold true across both synthetic and real-world scenes, underscoring the effectiveness and adaptability of the proposed approach. To tackle the second limitation, we apply the learned object-centric representations from the proposed method to two downstream reinforcement learning tasks, demonstrating considerable performance enhancements compared to conventional slot-based and monolithic representation learning methods. Our results suggest that the proposed approach not only improves object discovery, but also provides richer features for downstream tasks.

openalex-author · arXiv (Cornell University)

Improving day-ahead Solar Irradiance Time Series Forecasting by Leveraging Spatio-Temporal Context

Solar power harbors immense potential in mitigating climate change by substantially reducing CO$_{2}$ emissions. Nonetheless, the inherent variability of solar irradiance poses a significant challenge for seamlessly integrating solar power into the electrical grid. While the majority of prior research has centered on employing purely time series-based methodologies for solar forecasting, only a limited number of studies have taken into account factors such as cloud cover or the surrounding physical context. In this paper, we put forth a deep learning architecture designed to harness spatio-temporal context using satellite data, to attain highly accurate \textit{day-ahead} time-series forecasting for any given station, with a particular emphasis on forecasting Global Horizontal Irradiance (GHI). We also suggest a methodology to extract a distribution for each time step prediction, which can serve as a very valuable measure of uncertainty attached to the forecast. When evaluating models, we propose a testing scheme in which we separate particularly difficult examples from easy ones, in order to capture the model performances in crucial situations, which in the case of this study are the days suffering from varying cloudy conditions. Furthermore, we present a new multi-modal dataset gathering satellite imagery over a large zone and time series for solar irradiance and other related physical variables from multiple geographically diverse solar stations. Our approach exhibits robust performance in solar irradiance forecasting, including zero-shot generalization tests at unobserved solar stations, and holds great promise in promoting the effective integration of solar power into the grid.

openalex-author · arXiv (Cornell University)

Spotlight Attention: Robust Object-Centric Learning With a Spatial Locality Prior

The aim of object-centric vision is to construct an explicit representation of the objects in a scene. This representation is obtained via a set of interchangeable modules called \emph{slots} or \emph{object files} that compete for local patches of an image. The competition has a weak inductive bias to preserve spatial continuity; consequently, one slot may claim patches scattered diffusely throughout the image. In contrast, the inductive bias of human vision is strong, to the degree that attention has classically been described with a spotlight metaphor. We incorporate a spatial-locality prior into state-of-the-art object-centric vision models and obtain significant improvements in segmenting objects in both synthetic and real-world datasets. Similar to human visual attention, the combination of image content and spatial constraints yield robust unsupervised object-centric learning, including less sensitivity to model hyperparameters.

openalex-author · arXiv (Cornell University)

Joint Bayesian Inference of Graphical Structure and Parameters with a Single Generative Flow Network

Generative Flow Networks (GFlowNets), a class of generative models over discrete and structured sample spaces, have been previously applied to the problem of inferring the marginal posterior distribution over the directed acyclic graph (DAG) of a Bayesian Network, given a dataset of observations. Based on recent advances extending this framework to non-discrete sample spaces, we propose in this paper to approximate the joint posterior over not only the structure of a Bayesian Network, but also the parameters of its conditional probability distributions. We use a single GFlowNet whose sampling policy follows a two-phase process: the DAG is first generated sequentially one edge at a time, and then the corresponding parameters are picked once the full structure is known. Since the parameters are included in the posterior distribution, this leaves more flexibility for the local probability models of the Bayesian Network, making our approach applicable even to non-linear models parametrized by neural networks. We show that our method, called JSP-GFN, offers an accurate approximation of the joint posterior, while comparing favorably against existing methods on both simulated and real data.

openalex-author · arXiv (Cornell University)

Attention Schema in Neural Agents

Attention has become a common ingredient in deep learning architectures. It adds a dynamical selection of information on top of the static selection of information supported by weights. In the same way, we can imagine a higher-order informational filter built on top of attention: an Attention Schema (AS), namely, a descriptive and predictive model of attention. In cognitive neuroscience, Attention Schema Theory (AST) supports this idea of distinguishing attention from AS. A strong prediction of this theory is that an agent can use its own AS to also infer the states of other agents' attention and consequently enhance coordination with other agents. As such, multi-agent reinforcement learning would be an ideal setting to experimentally test the validity of AST. We explore different ways in which attention and AS interact with each other. Our preliminary results indicate that agents that implement the AS as a recurrent internal control achieve the best performance. In general, these exploratory experiments suggest that equipping artificial agents with a model of attention can enhance their social intelligence.

openalex-author · arXiv (Cornell University)

Let the Flows Tell: Solving Graph Combinatorial Optimization Problems with GFlowNets

Combinatorial optimization (CO) problems are often NP-hard and thus out of reach for exact algorithms, making them a tempting domain to apply machine learning methods. The highly structured constraints in these problems can hinder either optimization or sampling directly in the solution space. On the other hand, GFlowNets have recently emerged as a powerful machinery to efficiently sample from composite unnormalized densities sequentially and have the potential to amortize such solution-searching processes in CO, as well as generate diverse solution candidates. In this paper, we design Markov decision processes (MDPs) for different combinatorial problems and propose to train conditional GFlowNets to sample from the solution space. Efficient training techniques are also developed to benefit long-range credit assignment. Through extensive experiments on a variety of different CO tasks with synthetic and realistic data, we demonstrate that GFlowNet policies can efficiently find high-quality solutions. Our implementation is open-sourced at https://github.com/zdhNarsil/GFlowNet-CombOpt.

openalex-author · arXiv (Cornell University)

Model evaluation for extreme risks

Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.

openalex-author · arXiv (Cornell University)

Memory Efficient Neural Processes via Constant Memory Attention Block

Neural Processes (NPs) are popular meta-learning methods for efficiently modelling predictive uncertainty. Recent state-of-the-art methods, however, leverage expensive attention mechanisms, limiting their applications, particularly in low-resource settings. In this work, we propose Constant Memory Attentive Neural Processes (CMANPs), an NP variant that only requires constant memory. To do so, we first propose an efficient update operation for Cross Attention. Leveraging the update operation, we propose Constant Memory Attention Block (CMAB), a novel attention block that (i) is permutation invariant, (ii) computes its output in constant memory, and (iii) performs constant computation updates. Finally, building on CMAB, we detail Constant Memory Attentive Neural Processes. Empirically, we show CMANPs achieve state-of-the-art results on popular NP benchmarks while being significantly more memory efficient than prior methods.

openalex-author · Scientific Data

Responses of pyramidal cell somata and apical dendrites in mouse visual cortex over multiple days

The apical dendrites of pyramidal neurons in sensory cortex receive primarily top-down signals from associative and motor regions, while cell bodies and nearby dendrites are heavily targeted by locally recurrent or bottom-up inputs from the sensory periphery. Based on these differences, a number of theories in computational neuroscience postulate a unique role for apical dendrites in learning. However, due to technical challenges in data collection, little data is available for comparing the responses of apical dendrites to cell bodies over multiple days. Here we present a dataset collected through the Allen Institute Mindscope's OpenScope program that addresses this need. This dataset comprises high-quality two-photon calcium imaging from the apical dendrites and the cell bodies of visual cortical pyramidal neurons, acquired over multiple days in awake, behaving mice that were presented with visual stimuli. Many of the cell bodies and dendrite segments were tracked over days, enabling analyses of how their responses change over time. This dataset allows neuroscientists to explore the differences between apical and somatic processing and plasticity.

openalex-author · Nature Communications

Catalyzing next-generation Artificial Intelligence through NeuroAI

Neuroscience has long been an essential driver of progress in artificial intelligence (AI). We propose that to accelerate progress in AI, we must invest in fundamental research in NeuroAI. A core component of this is the embodied Turing test, which challenges AI animal models to interact with the sensorimotor world at skill levels akin to their living counterparts. The embodied Turing test shifts the focus from those capabilities like game playing and language that are especially well-developed or uniquely human to those capabilities - inherited from over 500 million years of evolution - that are shared with all animals. Building models that can pass the embodied Turing test will provide a roadmap for the next generation of AI.

openalex-author · PLOS Digital Health

Proactive Contact Tracing

The COVID-19 pandemic has spurred an unprecedented demand for interventions that can reduce disease spread without excessively restricting daily activity, given negative impacts on mental health and economic outcomes. Digital contact tracing (DCT) apps have emerged as a component of the epidemic management toolkit. Existing DCT apps typically recommend quarantine to all digitally-recorded contacts of test-confirmed cases. Over-reliance on testing may, however, impede the effectiveness of such apps, since by the time cases are confirmed through testing, onward transmissions are likely to have occurred. Furthermore, most cases are infectious over a short period; only a subset of their contacts are likely to become infected. These apps do not fully utilize data sources to base their predictions of transmission risk during an encounter, leading to recommendations of quarantine to many uninfected people and associated slowdowns in economic activity. This phenomenon, commonly termed as "pingdemic," may additionally contribute to reduced compliance to public health measures. In this work, we propose a novel DCT framework, Proactive Contact Tracing (PCT), which uses multiple sources of information (e.g. self-reported symptoms, received messages from contacts) to estimate app users' infectiousness histories and provide behavioral recommendations. PCT methods are by design proactive, predicting spread before it occurs. We present an interpretable instance of this framework, the Rule-based PCT algorithm, designed via a multi-disciplinary collaboration among epidemiologists, computer scientists, and behavior experts. Finally, we develop an agent-based model that allows us to compare different DCT methods and evaluate their performance in negotiating the trade-off between epidemic control and restricting population mobility. Performing extensive sensitivity analysis across user behavior, public health policy, and virological parameters, we compare Rule-based PCT to i) binary contact tracing (BCT), which exclusively relies on test results and recommends a fixed-duration quarantine, and ii) household quarantine (HQ). Our results suggest that both BCT and Rule-based PCT improve upon HQ, however, Rule-based PCT is more efficient at controlling spread of disease than BCT across a range of scenarios. In terms of cost-effectiveness, we show that Rule-based PCT pareto-dominates BCT, as demonstrated by a decrease in Disability Adjusted Life Years, as well as Temporary Productivity Loss. Overall, we find that Rule-based PCT outperforms existing approaches across a varying range of parameters. By leveraging anonymized infectiousness estimates received from digitally-recorded contacts, PCT is able to notify potentially infected users earlier than BCT methods and prevent onward transmissions. Our results suggest that PCT-based applications could be a useful tool in managing future epidemics.

openalex-author · Journal of the Canadian Association of Gastroenterology

A108 AUTOMATED DETECTION OF ILEOCECAL VALVE, APPENDICEAL ORIFICE, AND POLYP DURING COLONOSCOPY USING A DEEP LEARNING MODEL

Abstract Background Identification and photo-documentation of the ileocecal valve (ICV) and appendiceal orifice (AO) confirm completeness of colonoscopy examinations. We hypothesized that an artificial intelligence (AI)-empowered solution could help us automatically differentiate anatomical landmarks such as AO and ICV from polyps and normal colon mucosa. Purpose We aimed to develop and test a deep convolutional neural network (DCNN) model that can automatically identify ICV and AO, and differentiate these landmarks from normal mucosa and colorectal polyps. Method We prospectively collected annotated full-length colonoscopy videos of 318 patients undergoing outpatient colonoscopies. We created three non-overlapping training, validation, and test datasets with 25,444 unaltered frames extracted from the colonoscopy videos showing four landmarks/image classes (AO, ICV, normal mucosa, and polyps). For each landmark, we extracted an average of 30 frames for each time of its appearance. All the extracted frames were reviewed and annotated by a team of three clinicians. Using a quality assessment tool, the clinicians examined a total of 86,754 frames (7982 AO, 8374 ICV, 32,971 polyps, and 37,427 normal mucosa) and verified whether or not the frame contained one unique landmark. For this research, all frames were extracted from the white-light colonoscopies, and all narrow-band imaging frames were excluded. A DCNN classification model was developed, validated, and tested in separate datasets of images. The primary outcome was the proportion of patients in whom the AI model could identify both ICV and AO, and differentiate them from polyps and normal mucosa, with an accuracy of detecting both AO and ICV above a threshold of 40% (representing a value in which reliable identification of the landmarks can be assumed without increasing false-positive alerts). Result(s) We trained a DCNN AI model on 21,503 unaltered frames extracted from the recorded colonoscopy videos of 272 patients, and validated and tested the model on 1,924 (25 patients) and 2,017 (21 patients) unaltered frames, respectively. We applied a transfer learning technique to fine-tune the model parameters to the endoscopic images using a cross-entropy loss function and back-propagation algorithm. After training and validation, the DCNN model could identify both AO and ICV in 18 out of 21 patients (85.71%), if accuracies were above the threshold of 40%. The accuracy of the model for differentiating AO from normal mucosa, and ICV from normal mucosa were 86.37% (95% CI 84.06% to 88.45%), and 86.44% (95% CI 84.06% to 88.59%), respectively. Furthermore, the accuracy of the model for differentiating polyps from normal mucosa was 88.57% (95% CI 86.60% to 90.33%). Conclusion(s) The model can reliably distinguish these anatomical landmarks from normal mucosa and colorectal polyps. It can be implemented into automated colonoscopy report generation, photo-documentation, and quality auditing solutions to improve colonoscopy reporting quality. Please acknowledge all funding agencies by checking the applicable boxes below Other Please indicate your source of funding; MEDTEQ Disclosure of Interest M. Taghiakbari: None Declared, S. Hamidi Ghalehjegh Employee of: Imagia Canexia Health Inc. , E. Jehanno Employee of: Imagia Canexia Health Inc. , T. Berthier Employee of: Imagia Canexia Health Inc. , L. di Jorio Employee of: Imagia Canexia Health Inc. , A. N. Barkun Grant / Research support from: co-awardee in funded research projects with Imagia Canexia Health Inc., Consultant of: Medtronic Inc. and A.I. VALI Inc, E. Deslandres: None Declared, S. Bouchard: None Declared, S. Sidani: None Declared, Y. Bengio: None Declared, D. von Renteln Grant / Research support from: ERBE, Ventage, Pendopharm, and Pentax, Consultant of: Boston Scientific and Pendopharm

openalex-author · arXiv (Cornell University)

Reusable Slotwise Mechanisms

Agents with the ability to comprehend and reason about the dynamics of objects would be expected to exhibit improved robustness and generalization in novel scenarios. However, achieving this capability necessitates not only an effective scene representation but also an understanding of the mechanisms governing interactions among object subsets. Recent studies have made significant progress in representing scenes using object slots. In this work, we introduce Reusable Slotwise Mechanisms, or RSM, a framework that models object dynamics by leveraging communication among slots along with a modular architecture capable of dynamically selecting reusable mechanisms for predicting the future states of each object slot. Crucially, RSM leverages the Central Contextual Information (CCI), enabling selected mechanisms to access the remaining slots through a bottleneck, effectively allowing for modeling of higher order and complex interactions that might require a sparse subset of objects. Experimental results demonstrate the superior performance of RSM compared to state-of-the-art methods across various future prediction and related downstream tasks, including Visual Question Answering and action planning. Furthermore, we showcase RSM's Out-of-Distribution generalization ability to handle scenes in intricate scenarios.

openalex-author · arXiv (Cornell University)

Hyena Hierarchy: Towards Larger Convolutional Language Models

Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.

openalex-author · arXiv (Cornell University)

Stochastic Generative Flow Networks

Generative Flow Networks (or GFlowNets for short) are a family of probabilistic agents that learn to sample complex combinatorial structures through the lens of "inference as control". They have shown great potential in generating high-quality and diverse candidates from a given energy landscape. However, existing GFlowNets can be applied only to deterministic environments, and fail in more general tasks with stochastic dynamics, which can limit their applicability. To overcome this challenge, this paper introduces Stochastic GFlowNets, a new algorithm that extends GFlowNets to stochastic environments. By decomposing state transitions into two steps, Stochastic GFlowNets isolate environmental stochasticity and learn a dynamics model to capture it. Extensive experimental results demonstrate that Stochastic GFlowNets offer significant advantages over standard GFlowNets as well as MCMC- and RL-based approaches, on a variety of standard benchmarks with stochastic dynamics.

openalex-author · arXiv (Cornell University)

GFlowNet-EM for learning compositional latent variable models

Latent variable models (LVMs) with discrete compositional latents are an important but challenging setting due to a combinatorially large number of possible configurations of the latents. A key tradeoff in modeling the posteriors over latents is between expressivity and tractable optimization. For algorithms based on expectation-maximization (EM), the E-step is often intractable without restrictive approximations to the posterior. We propose the use of GFlowNets, algorithms for sampling from an unnormalized density by learning a stochastic policy for sequential construction of samples, for this intractable E-step. By training GFlowNets to sample from the posterior over latents, we take advantage of their strengths as amortized variational inference algorithms for complex distributions over discrete structures. Our approach, GFlowNet-EM, enables the training of expressive LVMs with discrete compositional latents, as shown by experiments on non-context-free grammar induction and on images using discrete variational autoencoders (VAEs) without conditional independence enforced in the encoder.

openalex-author · arXiv (Cornell University)

Distributional GFlowNets with Quantile Flows

Generative Flow Networks (GFlowNets) are a new family of probabilistic samplers where an agent learns a stochastic policy for generating complex combinatorial structure through a series of decision-making steps. Despite being inspired from reinforcement learning, the current GFlowNet framework is relatively limited in its applicability and cannot handle stochasticity in the reward function. In this work, we adopt a distributional paradigm for GFlowNets, turning each flow function into a distribution, thus providing more informative learning signals during training. By parameterizing each edge flow through their quantile functions, our proposed \textit{quantile matching} GFlowNet learning algorithm is able to learn a risk-sensitive policy, an essential component for handling scenarios with risk uncertainty. Moreover, we find that the distributional approach can achieve substantial improvement on existing benchmarks compared to prior methods due to our enhanced training algorithm, even in settings with deterministic rewards.

openalex-author · arXiv (Cornell University)

DynGFN: Towards Bayesian Inference of Gene Regulatory Networks with GFlowNets

One of the grand challenges of cell biology is inferring the gene regulatory network (GRN) which describes interactions between genes and their products that control gene expression and cellular function. We can treat this as a causal discovery problem but with two non-standard challenges: (1) regulatory networks are inherently cyclic so we should not model a GRN as a directed acyclic graph (DAG), and (2) observations have significant measurement noise, so for typical sample sizes there will always be a large equivalence class of graphs that are likely given the data, and we want methods that capture this uncertainty. Existing methods either focus on challenge (1), identifying cyclic structure from dynamics, or on challenge (2) learning complex Bayesian posteriors over DAGs, but not both. In this paper we leverage the fact that it is possible to estimate the "velocity" of gene expression with RNA velocity techniques to develop an approach that addresses both challenges. Because we have access to velocity information, we can treat the Bayesian structure learning problem as a problem of sparse identification of a dynamical system, capturing cyclic feedback loops through time. Since our objective is to model uncertainty over discrete structures, we leverage Generative Flow Networks (GFlowNets) to estimate the posterior distribution over the combinatorial space of possible sparse dependencies. Our results indicate that our method learns posteriors that better encapsulate the distributions of cyclic structures compared to counterpart state-of-the-art Bayesian structure learning approaches.

openalex-author · arXiv (Cornell University)

Better Training of GFlowNets with Local Credit and Incomplete Trajectories

Generative Flow Networks or GFlowNets are related to Monte-Carlo Markov chain methods (as they sample from a distribution specified by an energy function), reinforcement learning (as they learn a policy to sample composed objects through a sequence of steps), generative models (as they learn to represent and sample from a distribution) and amortized variational methods (as they can be used to learn to approximate and sample from an otherwise intractable posterior, given a prior and a likelihood). They are trained to generate an object $x$ through a sequence of steps with probability proportional to some reward function $R(x)$ (or $\exp(-\mathcal{E}(x))$ with $\mathcal{E}(x)$ denoting the energy function), given at the end of the generative trajectory. Like for other RL settings where the reward is only given at the end, the efficiency of training and credit assignment may suffer when those trajectories are longer. With previous GFlowNet work, no learning was possible from incomplete trajectories (lacking a terminal state and the computation of the associated reward). In this paper, we consider the case where the energy function can be applied not just to terminal states but also to intermediate states. This is for example achieved when the energy function is additive, with terms available along the trajectory. We show how to reparameterize the GFlowNet state flow function to take advantage of the partial reward already accrued at each state. This enables a training objective that can be applied to update parameters even with incomplete trajectories. Even when complete trajectories are available, being able to obtain more localized credit and gradients is found to speed up training convergence, as demonstrated across many simulations.

openalex-author · arXiv (Cornell University)

Improving and generalizing flow-based generative models with minibatch optimal transport

Continuous normalizing flows (CNFs) are an attractive generative modeling technique, but they have been held back by limitations in their simulation-based maximum likelihood training. We introduce the generalized conditional flow matching (CFM) technique, a family of simulation-free training objectives for CNFs. CFM features a stable regression objective like that used to train the stochastic flow in diffusion models but enjoys the efficient inference of deterministic flow models. In contrast to both diffusion models and prior CNF training algorithms, CFM does not require the source distribution to be Gaussian or require evaluation of its density. A variant of our objective is optimal transport CFM (OT-CFM), which creates simpler flows that are more stable to train and lead to faster inference, as evaluated in our experiments. Furthermore, we show that when the true OT plan is available, our OT-CFM method approximates dynamic OT. Training CNFs with CFM improves results on a variety of conditional and unconditional generation tasks, such as inferring single cell dynamics, unsupervised image translation, and Schrödinger bridge inference.

openalex-author · arXiv (Cornell University)

A theory of continuous generative flow networks

Generative flow networks (GFlowNets) are amortized variational inference algorithms that are trained to sample from unnormalized target distributions over compositional objects. A key limitation of GFlowNets until this time has been that they are restricted to discrete spaces. We present a theory for generalized GFlowNets, which encompasses both existing discrete GFlowNets and ones with continuous or hybrid state spaces, and perform experiments with two goals in mind. First, we illustrate critical points of the theory and the importance of various assumptions. Second, we empirically demonstrate how observations about discrete GFlowNets transfer to the continuous case and show strong results compared to non-GFlowNet baselines on several previously studied tasks. This work greatly widens the perspectives for the application of GFlowNets in probabilistic inference and various modeling settings.

openalex-author · arXiv (Cornell University)

Leveraging the Third Dimension in Contrastive Learning

Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks. Most SSL methods rely on augmentations obtained by transforming the 2D image pixel map. These augmentations ignore the fact that biological vision takes place in an immersive three-dimensional, temporally contiguous environment, and that low-level biological vision relies heavily on depth cues. Using a signal provided by a pretrained state-of-the-art monocular RGB-to-depth model (the \emph{Depth Prediction Transformer}, Ranftl et al., 2021), we explore two distinct approaches to incorporating depth signals into the SSL framework. First, we evaluate contrastive learning using an RGB+depth input representation. Second, we use the depth signal to generate novel views from slightly different camera positions, thereby producing a 3D augmentation for contrastive learning. We evaluate these two approaches on three different SSL methods -- BYOL, SimSiam, and SwAV -- using ImageNette (10 class subset of ImageNet), ImageNet-100 and ImageNet-1k datasets. We find that both approaches to incorporating depth signals improve the robustness and generalization of the baseline SSL methods, though the first approach (with depth-channel concatenation) is superior. For instance, BYOL with the additional depth channel leads to an increase in downstream classification accuracy from 85.3\% to 88.0\% on ImageNette and 84.1\% to 87.0\% on ImageNet-C.

openalex-author · Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Combining Parameter-efficient Modules for Task-level Generalisation

A modular design encourages neural models to disentangle and recombine different facets of knowledge to generalise more systematically to new tasks. In this work, we assume that each task is associated with a subset of latent skills from an (arbitrary size) inventory. In turn, each skill corresponds to a parameter-efficient (sparse / low-rank) model adapter. By jointly learning adapters and a routing function that allocates skills to each task, the full network is instantiated as the average of the parameters of active skills. We propose several inductive biases that encourage re-usage and composition of the skills, including variable-size skill allocation and a dual-speed learning rate. We evaluate our latent-skill model in two main settings: 1) multitask reinforcement learning for instruction following on 8 levels of the BabyAI platform; and 2) few-shot fine-tuning of language models on 160 NLP tasks of the CrossFit benchmark. We find that the modular design of our network enhances sample efficiency in reinforcement learning and few-shot generalisation in supervised learning, compared to a series of baselines. These include models where parameters are fully shared, task-specific, conditionally generated (HyperFormer), or sparse mixture-of-experts (TaskMoE).

openalex-author · Digital Discovery

GFlowNets for AI-driven scientific discovery

GFlowNets provide a general probabilistic framework for accelerating the computational phase of the scientific discovery process, which is crucial for tackling pressing challenges posed by global pandemics and the climate crisis.

openalex-author · The Journal of Physical Chemistry B

Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine Learning

Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria. Relevant criteria may be, for example, the presence of specific folding motifs, binding to molecular ligands, sensing properties, and so on. Most practical approaches to aptamer design identify a small set of promising candidate sequences using high-throughput experiments (e.g., SELEX) and then optimize performance by introducing only minor modifications to the empirically found candidates. Sequences that possess the desired properties but differ drastically in chemical composition will add diversity to the search space and facilitate the discovery of useful nucleic acid aptamers. Systematic diversification protocols are needed. Here we propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity. We start by training a Potts model using the maximum entropy principle on a small set of empirically identified sequences unified by a common feature. To generate new candidate sequences with a controllable degree of diversity, we take advantage of the model's spectral feature: an "energy" bandgap separating sequences that are similar to the training set from those that are distinct. By controlling the Potts energy range that is sampled, we generate sequences that are distinct from the training set yet still likely to have the encoded features. To demonstrate performance, we apply our approach to design diverse pools of sequences with specified secondary structure motifs in 30-mer RNA and DNA aptamers.

openalex-author · arXiv (Cornell University)

MixupE: Understanding and Improving Mixup from Directional Derivative Perspective

Mixup is a popular data augmentation technique for training deep neural networks where additional samples are generated by linearly interpolating pairs of inputs and their labels. This technique is known to improve the generalization performance in many learning paradigms and applications. In this work, we first analyze Mixup and show that it implicitly regularizes infinitely many directional derivatives of all orders. Based on this new insight, we propose an improved version of Mixup, theoretically justified to deliver better generalization performance than the vanilla Mixup. To demonstrate the effectiveness of the proposed method, we conduct experiments across various domains such as images, tabular data, speech, and graphs. Our results show that the proposed method improves Mixup across multiple datasets using a variety of architectures, for instance, exhibiting an improvement over Mixup by 0.8% in ImageNet top-1 accuracy.

openalex-author · Mathematical Aspects of Deep Learning

Generalization in Deep Learning

International audience

openalex-author · arXiv (Cornell University)

Synergies between Disentanglement and Sparsity: Generalization and Identifiability in Multi-Task Learning

Although disentangled representations are often said to be beneficial for downstream tasks, current empirical and theoretical understanding is limited. In this work, we provide evidence that disentangled representations coupled with sparse base-predictors improve generalization. In the context of multi-task learning, we prove a new identifiability result that provides conditions under which maximally sparse base-predictors yield disentangled representations. Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem. Finally, we explore a meta-learning version of this algorithm based on group Lasso multiclass SVM base-predictors, for which we derive a tractable dual formulation. It obtains competitive results on standard few-shot classification benchmarks, while each task is using only a fraction of the learned representations.

openalex-author · arXiv (Cornell University)

PhAST: Physics-Aware, Scalable, and Task-specific GNNs for Accelerated Catalyst Design

Mitigating the climate crisis requires a rapid transition towards lower-carbon energy. Catalyst materials play a crucial role in the electrochemical reactions involved in numerous industrial processes key to this transition, such as renewable energy storage and electrofuel synthesis. To reduce the energy spent on such activities, we must quickly discover more efficient catalysts to drive electrochemical reactions. Machine learning (ML) holds the potential to efficiently model materials properties from large amounts of data, accelerating electrocatalyst design. The Open Catalyst Project OC20 dataset was constructed to that end. However, ML models trained on OC20 are still neither scalable nor accurate enough for practical applications. In this paper, we propose task-specific innovations applicable to most architectures, enhancing both computational efficiency and accuracy. This includes improvements in (1) the graph creation step, (2) atom representations, (3) the energy prediction head, and (4) the force prediction head. We describe these contributions, referred to as PhAST, and evaluate them thoroughly on multiple architectures. Overall, PhAST improves energy MAE by 4 to 42$\%$ while dividing compute time by 3 to 8$\times$ depending on the targeted task/model. PhAST also enables CPU training, leading to 40$\times$ speedups in highly parallelized settings. Python package: \url{https://phast.readthedocs.io}.

openalex-author · arXiv (Cornell University)

Latent Bottlenecked Attentive Neural Processes

Neural Processes (NPs) are popular methods in meta-learning that can estimate predictive uncertainty on target datapoints by conditioning on a context dataset. Previous state-of-the-art method Transformer Neural Processes (TNPs) achieve strong performance but require quadratic computation with respect to the number of context datapoints, significantly limiting its scalability. Conversely, existing sub-quadratic NP variants perform significantly worse than that of TNPs. Tackling this issue, we propose Latent Bottlenecked Attentive Neural Processes (LBANPs), a new computationally efficient sub-quadratic NP variant, that has a querying computational complexity independent of the number of context datapoints. The model encodes the context dataset into a constant number of latent vectors on which self-attention is performed. When making predictions, the model retrieves higher-order information from the context dataset via multiple cross-attention mechanisms on the latent vectors. We empirically show that LBANPs achieve results competitive with the state-of-the-art on meta-regression, image completion, and contextual multi-armed bandits. We demonstrate that LBANPs can trade-off the computational cost and performance according to the number of latent vectors. Finally, we show LBANPs can scale beyond existing attention-based NP variants to larger dataset settings.

openalex-author · arXiv (Cornell University)

Equivariance with Learned Canonicalization Functions

Symmetry-based neural networks often constrain the architecture in order to achieve invariance or equivariance to a group of transformations. In this paper, we propose an alternative that avoids this architectural constraint by learning to produce canonical representations of the data. These canonicalization functions can readily be plugged into non-equivariant backbone architectures. We offer explicit ways to implement them for some groups of interest. We show that this approach enjoys universality while providing interpretable insights. Our main hypothesis, supported by our empirical results, is that learning a small neural network to perform canonicalization is better than using predefined heuristics. Our experiments show that learning the canonicalization function is competitive with existing techniques for learning equivariant functions across many tasks, including image classification, $N$-body dynamics prediction, point cloud classification and part segmentation, while being faster across the board.

openalex-author · arXiv (Cornell University)

Posterior samples of source galaxies in strong gravitational lenses with score-based priors

Inferring accurate posteriors for high-dimensional representations of the brightness of gravitationally-lensed sources is a major challenge, in part due to the difficulties of accurately quantifying the priors. Here, we report the use of a score-based model to encode the prior for the inference of undistorted images of background galaxies. This model is trained on a set of high-resolution images of undistorted galaxies. By adding the likelihood score to the prior score and using a reverse-time stochastic differential equation solver, we obtain samples from the posterior. Our method produces independent posterior samples and models the data almost down to the noise level. We show how the balance between the likelihood and the prior meet our expectations in an experiment with out-of-distribution data.

openalex-author · arXiv (Cornell University)

A General Purpose Neural Architecture for Geospatial Systems

Geospatial Information Systems are used by researchers and Humanitarian Assistance and Disaster Response (HADR) practitioners to support a wide variety of important applications. However, collaboration between these actors is difficult due to the heterogeneous nature of geospatial data modalities (e.g., multi-spectral images of various resolutions, timeseries, weather data) and diversity of tasks (e.g., regression of human activity indicators or detecting forest fires). In this work, we present a roadmap towards the construction of a general-purpose neural architecture (GPNA) with a geospatial inductive bias, pre-trained on large amounts of unlabelled earth observation data in a self-supervised manner. We envision how such a model may facilitate cooperation between members of the community. We show preliminary results on the first step of the roadmap, where we instantiate an architecture that can process a wide variety of geospatial data modalities and demonstrate that it can achieve competitive performance with domain-specific architectures on tasks relating to the U.N.'s Sustainable Development Goals.

openalex-author · arXiv (Cornell University)

Bayesian learning of Causal Structure and Mechanisms with GFlowNets and Variational Bayes

Bayesian causal structure learning aims to learn a posterior distribution over directed acyclic graphs (DAGs), and the mechanisms that define the relationship between parent and child variables. By taking a Bayesian approach, it is possible to reason about the uncertainty of the causal model. The notion of modelling the uncertainty over models is particularly crucial for causal structure learning since the model could be unidentifiable when given only a finite amount of observational data. In this paper, we introduce a novel method to jointly learn the structure and mechanisms of the causal model using Variational Bayes, which we call Variational Bayes-DAG-GFlowNet (VBG). We extend the method of Bayesian causal structure learning using GFlowNets to learn not only the posterior distribution over the structure, but also the parameters of a linear-Gaussian model. Our results on simulated data suggest that VBG is competitive against several baselines in modelling the posterior over DAGs and mechanisms, while offering several advantages over existing methods, including the guarantee to sample acyclic graphs, and the flexibility to generalize to non-linear causal mechanisms.

openalex-author · arXiv (Cornell University)

Consistent Training via Energy-Based GFlowNets for Modeling Discrete Joint Distributions

Generative Flow Networks (GFlowNets) have demonstrated significant performance improvements for generating diverse discrete objects $x$ given a reward function $R(x)$, indicating the utility of the object and trained independently from the GFlowNet by supervised learning to predict a desirable property $y$ given $x$. We hypothesize that this can lead to incompatibility between the inductive optimization biases in training $R$ and in training the GFlowNet, potentially leading to worse samples and slow adaptation to changes in the distribution. In this work, we build upon recent work on jointly learning energy-based models with GFlowNets and extend it to learn the joint over multiple variables, which we call Joint Energy-Based GFlowNets (JEBGFNs), such as peptide sequences and their antimicrobial activity. Joint learning of the energy-based model, used as a reward for the GFlowNet, can resolve the issues of incompatibility since both the reward function $R$ and the GFlowNet sampler are trained jointly. We find that this joint training or joint energy-based formulation leads to significant improvements in generating anti-microbial peptides. As the training sequences arose out of evolutionary or artificial selection for high antibiotic activity, there is presumably some structure in the distribution of sequences that reveals information about the antibiotic activity. This results in an advantage to modeling their joint generatively vs. pure discriminative modeling. We also evaluate JEBGFN in an active learning setting for discovering anti-microbial peptides.

openalex-author · arXiv (Cornell University)

Discrete Factorial Representations as an Abstraction for Goal Conditioned Reinforcement Learning

Goal-conditioned reinforcement learning (RL) is a promising direction for training agents that are capable of solving multiple tasks and reach a diverse set of objectives. How to \textit{specify} and \textit{ground} these goals in such a way that we can both reliably reach goals during training as well as generalize to new goals during evaluation remains an open area of research. Defining goals in the space of noisy and high-dimensional sensory inputs poses a challenge for training goal-conditioned agents, or even for generalization to novel goals. We propose to address this by learning factorial representations of goals and processing the resulting representation via a discretization bottleneck, for coarser goal specification, through an approach we call DGRL. We show that applying a discretizing bottleneck can improve performance in goal-conditioned RL setups, by experimentally evaluating this method on tasks ranging from maze environments to complex robotic navigation and manipulation. Additionally, we prove a theorem lower-bounding the expected return on out-of-distribution goals, while still allowing for specifying goals with expressive combinatorial structure.

openalex-author · arXiv (Cornell University)

FL Games: A Federated Learning Framework for Distribution Shifts

Federated learning aims to train predictive models for data that is distributed across clients, under the orchestration of a server. However, participating clients typically each hold data from a different distribution, which can yield to catastrophic generalization on data from a different client, which represents a new domain. In this work, we argue that in order to generalize better across non-i.i.d. clients, it is imperative to only learn correlations that are stable and invariant across domains. We propose FL GAMES, a game-theoretic framework for federated learning that learns causal features that are invariant across clients. While training to achieve the Nash equilibrium, the traditional best response strategy suffers from high-frequency oscillations. We demonstrate that FL GAMES effectively resolves this challenge and exhibits smooth performance curves. Further, FL GAMES scales well in the number of clients, requires significantly fewer communication rounds, and is agnostic to device heterogeneity. Through empirical evaluation, we demonstrate that FL GAMES achieves high out-of-distribution performance on various benchmarks.

openalex-author · arXiv (Cornell University)

GFlowOut: Dropout with Generative Flow Networks

Bayesian Inference offers principled tools to tackle many critical problems with modern neural networks such as poor calibration and generalization, and data inefficiency. However, scaling Bayesian inference to large architectures is challenging and requires restrictive approximations. Monte Carlo Dropout has been widely used as a relatively cheap way for approximate Inference and to estimate uncertainty with deep neural networks. Traditionally, the dropout mask is sampled independently from a fixed distribution. Recent works show that the dropout mask can be viewed as a latent variable, which can be inferred with variational inference. These methods face two important challenges: (a) the posterior distribution over masks can be highly multi-modal which can be difficult to approximate with standard variational inference and (b) it is not trivial to fully utilize sample-dependent information and correlation among dropout masks to improve posterior estimation. In this work, we propose GFlowOut to address these issues. GFlowOut leverages the recently proposed probabilistic framework of Generative Flow Networks (GFlowNets) to learn the posterior distribution over dropout masks. We empirically demonstrate that GFlowOut results in predictive distributions that generalize better to out-of-distribution data, and provide uncertainty estimates which lead to better performance in downstream tasks.

openalex-author · arXiv (Cornell University)

Multi-Objective GFlowNets

We study the problem of generating diverse candidates in the context of Multi-Objective Optimization. In many applications of machine learning such as drug discovery and material design, the goal is to generate candidates which simultaneously optimize a set of potentially conflicting objectives. Moreover, these objectives are often imperfect evaluations of some underlying property of interest, making it important to generate diverse candidates to have multiple options for expensive downstream evaluations. We propose Multi-Objective GFlowNets (MOGFNs), a novel method for generating diverse Pareto optimal solutions, based on GFlowNets. We introduce two variants of MOGFNs: MOGFN-PC, which models a family of independent sub-problems defined by a scalarization function, with reward-conditional GFlowNets, and MOGFN-AL, which solves a sequence of sub-problems defined by an acquisition function in an active learning loop. Our experiments on wide variety of synthetic and benchmark tasks demonstrate advantages of the proposed methods in terms of the Pareto performance and importantly, improved candidate diversity, which is the main contribution of this work.

openalex-author · arXiv (Cornell University)

Toward Next-Generation Artificial Intelligence: Catalyzing the NeuroAI Revolution

Neuroscience has long been an essential driver of progress in artificial intelligence (AI). We propose that to accelerate progress in AI, we must invest in fundamental research in NeuroAI. A core component of this is the embodied Turing test, which challenges AI animal models to interact with the sensorimotor world at skill levels akin to their living counterparts. The embodied Turing test shifts the focus from those capabilities like game playing and language that are especially well-developed or uniquely human to those capabilities, inherited from over 500 million years of evolution, that are shared with all animals. Building models that can pass the embodied Turing test will provide a roadmap for the next generation of AI.

openalex-author · arXiv (Cornell University)

Neural Attentive Circuits

Recent work has seen the development of general purpose neural architectures that can be trained to perform tasks across diverse data modalities. General purpose models typically make few assumptions about the underlying data-structure and are known to perform well in the large-data regime. At the same time, there has been growing interest in modular neural architectures that represent the data using sparsely interacting modules. These models can be more robust out-of-distribution, computationally efficient, and capable of sample-efficient adaptation to new data. However, they tend to make domain-specific assumptions about the data, and present challenges in how module behavior (i.e., parameterization) and connectivity (i.e., their layout) can be jointly learned. In this work, we introduce a general purpose, yet modular neural architecture called Neural Attentive Circuits (NACs) that jointly learns the parameterization and a sparse connectivity of neural modules without using domain knowledge. NACs are best understood as the combination of two systems that are jointly trained end-to-end: one that determines the module configuration and the other that executes it on an input. We demonstrate qualitatively that NACs learn diverse and meaningful module configurations on the NLVR2 dataset without additional supervision. Quantitatively, we show that by incorporating modularity in this way, NACs improve upon a strong non-modular baseline in terms of low-shot adaptation on CIFAR and CUBs dataset by about 10%, and OOD robustness on Tiny ImageNet-R by about 2.5%. Further, we find that NACs can achieve an 8x speedup at inference time while losing less than 3% performance. Finally, we find NACs to yield competitive results on diverse data modalities spanning point-cloud classification, symbolic processing and text-classification from ASCII bytes, thereby confirming its general purpose nature.

openalex-author · arXiv (Cornell University)

Contrastive Retrospection: honing in on critical steps for rapid learning and generalization in RL

In real life, success is often contingent upon multiple critical steps that are distant in time from each other and from the final reward. These critical steps are challenging to identify with traditional reinforcement learning (RL) methods that rely on the Bellman equation for credit assignment. Here, we present a new RL algorithm that uses offline contrastive learning to hone in on these critical steps. This algorithm, which we call Contrastive Retrospection (ConSpec), can be added to any existing RL algorithm. ConSpec learns a set of prototypes for the critical steps in a task by a novel contrastive loss and delivers an intrinsic reward when the current state matches one of the prototypes. The prototypes in ConSpec provide two key benefits for credit assignment: (i) They enable rapid identification of all the critical steps. (ii) They do so in a readily interpretable manner, enabling out-of-distribution generalization when sensory features are altered. Distinct from other contemporary RL approaches to credit assignment, ConSpec takes advantage of the fact that it is easier to retrospectively identify the small set of steps that success is contingent upon (and ignoring other states) than it is to prospectively predict reward at every taken step. ConSpec greatly improves learning in a diverse set of RL tasks. The code is available at the link: https://github.com/sunchipsster1/ConSpec

openalex-author · arXiv (Cornell University)

Robust and Controllable Object-Centric Learning through Energy-based Models

Humans are remarkably good at understanding and reasoning about complex visual scenes. The capability to decompose low-level observations into discrete objects allows us to build a grounded abstract representation and identify the compositional structure of the world. Accordingly, it is a crucial step for machine learning models to be capable of inferring objects and their properties from visual scenes without explicit supervision. However, existing works on object-centric representation learning either rely on tailor-made neural network modules or strong probabilistic assumptions in the underlying generative and inference processes. In this work, we present \ours, a conceptually simple and general approach to learning object-centric representations through an energy-based model. By forming a permutation-invariant energy function using vanilla attention blocks readily available in Transformers, we can infer object-centric latent variables via gradient-based MCMC methods where permutation equivariance is automatically guaranteed. We show that \ours can be easily integrated into existing architectures and can effectively extract high-quality object-centric representations, leading to better segmentation accuracy and competitive downstream task performance. Further, empirical evaluations show that \ours's learned representations are robust against distribution shift. Finally, we demonstrate the effectiveness of \ours in systematic compositional generalization, by re-composing learned energy functions for novel scene generation and manipulation.

openalex-author · arXiv (Cornell University)

MAgNet: Mesh Agnostic Neural PDE Solver

The computational complexity of classical numerical methods for solving Partial Differential Equations (PDE) scales significantly as the resolution increases. As an important example, climate predictions require fine spatio-temporal resolutions to resolve all turbulent scales in the fluid simulations. This makes the task of accurately resolving these scales computationally out of reach even with modern supercomputers. As a result, current numerical modelers solve PDEs on grids that are too coarse (3km to 200km on each side), which hinders the accuracy and usefulness of the predictions. In this paper, we leverage the recent advances in Implicit Neural Representations (INR) to design a novel architecture that predicts the spatially continuous solution of a PDE given a spatial position query. By augmenting coordinate-based architectures with Graph Neural Networks (GNN), we enable zero-shot generalization to new non-uniform meshes and long-term predictions up to 250 frames ahead that are physically consistent. Our Mesh Agnostic Neural PDE Solver (MAgNet) is able to make accurate predictions across a variety of PDE simulation datasets and compares favorably with existing baselines. Moreover, MAgNet generalizes well to different meshes and resolutions up to four times those trained on.

openalex-author · arXiv (Cornell University)

Generative Augmented Flow Networks

The Generative Flow Network is a probabilistic framework where an agent learns a stochastic policy for object generation, such that the probability of generating an object is proportional to a given reward function. Its effectiveness has been shown in discovering high-quality and diverse solutions, compared to reward-maximizing reinforcement learning-based methods. Nonetheless, GFlowNets only learn from rewards of the terminal states, which can limit its applicability. Indeed, intermediate rewards play a critical role in learning, for example from intrinsic motivation to provide intermediate feedback even in particularly challenging sparse reward tasks. Inspired by this, we propose Generative Augmented Flow Networks (GAFlowNets), a novel learning framework to incorporate intermediate rewards into GFlowNets. We specify intermediate rewards by intrinsic motivation to tackle the exploration problem in sparse reward environments. GAFlowNets can leverage edge-based and state-based intrinsic rewards in a joint way to improve exploration. Based on extensive experiments on the GridWorld task, we demonstrate the effectiveness and efficiency of GAFlowNet in terms of convergence, performance, and diversity of solutions. We further show that GAFlowNet is scalable to a more complex and large-scale molecule generation domain, where it achieves consistent and significant performance improvement.

openalex-author · arXiv (Cornell University)

Stateful active facilitator: Coordination and Environmental Heterogeneity in Cooperative Multi-Agent Reinforcement Learning

In cooperative multi-agent reinforcement learning, a team of agents works together to achieve a common goal. Different environments or tasks may require varying degrees of coordination among agents in order to achieve the goal in an optimal way. The nature of coordination will depend on the properties of the environment -- its spatial layout, distribution of obstacles, dynamics, etc. We term this variation of properties within an environment as heterogeneity. Existing literature has not sufficiently addressed the fact that different environments may have different levels of heterogeneity. We formalize the notions of coordination level and heterogeneity level of an environment and present HECOGrid, a suite of multi-agent RL environments that facilitates empirical evaluation of different MARL approaches across different levels of coordination and environmental heterogeneity by providing a quantitative control over coordination and heterogeneity levels of the environment. Further, we propose a Centralized Training Decentralized Execution learning approach called Stateful Active Facilitator (SAF) that enables agents to work efficiently in high-coordination and high-heterogeneity environments through a differentiable and shared knowledge source used during training and dynamic selection from a shared pool of policies. We evaluate SAF and compare its performance against baselines IPPO and MAPPO on HECOGrid. Our results show that SAF consistently outperforms the baselines across different tasks and different heterogeneity and coordination levels. We release the code for HECOGrid as well as all our experiments.

openalex-author · arXiv (Cornell University)

Latent State Marginalization as a Low-cost Approach for Improving Exploration

While the maximum entropy (MaxEnt) reinforcement learning (RL) framework -- often touted for its exploration and robustness capabilities -- is usually motivated from a probabilistic perspective, the use of deep probabilistic models has not gained much traction in practice due to their inherent complexity. In this work, we propose the adoption of latent variable policies within the MaxEnt framework, which we show can provably approximate any policy distribution, and additionally, naturally emerges under the use of world models with a latent belief state. We discuss why latent variable policies are difficult to train, how naive approaches can fail, then subsequently introduce a series of improvements centered around low-cost marginalization of the latent state, allowing us to make full use of the latent state at minimal additional cost. We instantiate our method under the actor-critic framework, marginalizing both the actor and critic. The resulting algorithm, referred to as Stochastic Marginal Actor-Critic (SMAC), is simple yet effective. We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training. Our implementation is open sourced at https://github.com/zdhNarsil/Stochastic-Marginal-Actor-Critic.

openalex-author · arXiv (Cornell University)

GFlowNets and variational inference

This paper builds bridges between two families of probabilistic algorithms: (hierarchical) variational inference (VI), which is typically used to model distributions over continuous spaces, and generative flow networks (GFlowNets), which have been used for distributions over discrete structures such as graphs. We demonstrate that, in certain cases, VI algorithms are equivalent to special cases of GFlowNets in the sense of equality of expected gradients of their learning objectives. We then point out the differences between the two families and show how these differences emerge experimentally. Notably, GFlowNets, which borrow ideas from reinforcement learning, are more amenable than VI to off-policy training without the cost of high gradient variance induced by importance sampling. We argue that this property of GFlowNets can provide advantages for capturing diversity in multimodal target distributions.

openalex-author · arXiv (Cornell University)

Predictive Inference with Feature Conformal Prediction

Conformal prediction is a distribution-free technique for establishing valid prediction intervals. Although conventionally people conduct conformal prediction in the output space, this is not the only possibility. In this paper, we propose feature conformal prediction, which extends the scope of conformal prediction to semantic feature spaces by leveraging the inductive bias of deep representation learning. From a theoretical perspective, we demonstrate that feature conformal prediction provably outperforms regular conformal prediction under mild assumptions. Our approach could be combined with not only vanilla conformal prediction, but also other adaptive conformal prediction methods. Apart from experiments on existing predictive inference benchmarks, we also demonstrate the state-of-the-art performance of the proposed methods on large-scale tasks such as ImageNet classification and Cityscapes image segmentation.The code is available at \url{https://github.com/AlvinWen428/FeatureCP}.

openalex-author · Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences

Inductive biases for deep learning of higher-level cognition

A fascinating hypothesis is that human and animal intelligence could be explained by a few principles (rather than an encyclopaedic list of heuristics). If that hypothesis was correct, we could more easily both understand our own intelligence and build intelligent machines. Just like in physics, the principles themselves would not be sufficient to predict the behaviour of complex systems like brains, and substantial computation might be needed to simulate human-like intelligence. This hypothesis would suggest that studying the kind of inductive biases that humans and animals exploit could help both clarify these principles and provide inspiration for AI research and neuroscience theories. Deep learning already exploits several key inductive biases, and this work considers a larger list, focusing on those which concern mostly higher-level and sequential conscious processing. The objective of clarifying these particular principles is that they could potentially help us build AI systems benefiting from humans’ abilities in terms of flexible out-of-distribution and systematic generalization, which is currently an area where a large gap exists between state-of-the-art machine learning and human intelligence.

openalex-author · arXiv (Cornell University)

Learning GFlowNets from partial episodes for improved convergence and stability

Generative flow networks (GFlowNets) are a family of algorithms for training a sequential sampler of discrete objects under an unnormalized target density and have been successfully used for various probabilistic modeling tasks. Existing training objectives for GFlowNets are either local to states or transitions, or propagate a reward signal over an entire sampling trajectory. We argue that these alternatives represent opposite ends of a gradient bias-variance tradeoff and propose a way to exploit this tradeoff to mitigate its harmful effects. Inspired by the TD($λ$) algorithm in reinforcement learning, we introduce subtrajectory balance or SubTB($λ$), a GFlowNet training objective that can learn from partial action subsequences of varying lengths. We show that SubTB($λ$) accelerates sampler convergence in previously studied and new environments and enables training GFlowNets in environments with longer action sequences and sparser reward landscapes than what was possible before. We also perform a comparative analysis of stochastic gradient dynamics, shedding light on the bias-variance tradeoff in GFlowNet training and the advantages of subtrajectory balance.

openalex-author · arXiv (Cornell University)

Interventional Causal Representation Learning

Causal representation learning seeks to extract high-level latent factors from low-level sensory data. Most existing methods rely on observational data and structural assumptions (e.g., conditional independence) to identify the latent factors. However, interventional data is prevalent across applications. Can interventional data facilitate causal representation learning? We explore this question in this paper. The key observation is that interventional data often carries geometric signatures of the latent factors' support (i.e. what values each latent can possibly take). For example, when the latent factors are causally connected, interventions can break the dependency between the intervened latents' support and their ancestors'. Leveraging this fact, we prove that the latent causal factors can be identified up to permutation and scaling given data from perfect $do$ interventions. Moreover, we can achieve block affine identification, namely the estimated latent factors are only entangled with a few other latents if we have access to data from imperfect interventions. These results highlight the unique power of interventional data in causal representation learning; they can enable provable identification of latent factors without any assumptions about their distributions or dependency structure.

openalex-author · arXiv (Cornell University)

Graph-Based Active Machine Learning Method for Diverse and Novel Antimicrobial Peptides Generation and Selection

As antibiotic-resistant bacterial strains are rapidly spreading worldwide, infections caused by these strains are emerging as a global crisis causing the death of millions of people every year. Antimicrobial Peptides (AMPs) are one of the candidates to tackle this problem because of their potential diversity, and ability to favorably modulate the host immune response. However, large-scale screening of new AMP candidates is expensive, time-consuming, and now affordable in developing countries, which need the treatments the most. In this work, we propose a novel active machine learning-based framework that statistically minimizes the number of wet-lab experiments needed to design new AMPs, while ensuring a high diversity and novelty of generated AMPs sequences, in multi-rounds of wet-lab AMP screening settings. Combining recurrent neural network models and a graph-based filter (GraphCC), our proposed approach delivers novel and diverse candidates and demonstrates better performances according to our defined metrics.

openalex-author · arXiv (Cornell University)

Designing Biological Sequences via Meta-Reinforcement Learning and Bayesian Optimization

The ability to accelerate the design of biological sequences can have a substantial impact on the progress of the medical field. The problem can be framed as a global optimization problem where the objective is an expensive black-box function such that we can query large batches restricted with a limitation of a low number of rounds. Bayesian Optimization is a principled method for tackling this problem. However, the astronomically large state space of biological sequences renders brute-force iterating over all possible sequences infeasible. In this paper, we propose MetaRLBO where we train an autoregressive generative model via Meta-Reinforcement Learning to propose promising sequences for selection via Bayesian Optimization. We pose this problem as that of finding an optimal policy over a distribution of MDPs induced by sampling subsets of the data acquired in the previous rounds. Our in-silico experiments show that meta-learning over such ensembles provides robustness against reward misspecification and achieves competitive results compared to existing strong baselines.

openalex-author · arXiv (Cornell University)

Unifying Generative Models with GFlowNets and Beyond

There are many frameworks for deep generative modeling, each often presented with their own specific training algorithms and inference methods. Here, we demonstrate the connections between existing deep generative models and the recently introduced GFlowNet framework, a probabilistic inference machine which treats sampling as a decision-making process. This analysis sheds light on their overlapping traits and provides a unifying viewpoint through the lens of learning with Markovian trajectories. Our framework provides a means for unifying training and inference algorithms, and provides a route to shine a unifying light over many generative models. Beyond this, we provide a practical and experimentally verified recipe for improving generative modeling with insights from the GFlowNet perspective.

openalex-author · Zhang, T, Williams, A, Wozny, P, Cohrs, K-H, Ponse, K, Jiralerspong, M, Phade, S R, Srinivasa, S, Li, L, Zhang, Y, Gupta, P, Acar, E, Rish, I, Bengio, Y & Zheng

AI for Global Climate Cooperation: Modeling Global Climate Negotiations, Agreements, and Long-Term Cooperation in RICE-N

Comprehensive global cooperation is essential to limit global temperature increases while continuing economic development, e.g., reducing severe inequality or achieving long-term economic growth. Achieving long-term cooperation on climate change mitigation with n strategic agents poses a complex game-theoretic problem. For example, agents may negotiate and reach climate agreements, but there is no central authority to enforce adherence to those agreements. Hence, it is critical to design negotiation and agreement frameworks that foster cooperation, allow all agents to meet their individual policy objectives, and incentivize long-term adherence. This is an interdisciplinary challenge that calls for collaboration between researchers in machine learning, economics, climate science, law, policy, ethics, and other fields. In particular, we argue that machine learning is a critical tool to address the complexity of this domain. To facilitate this research, here we introduce RICE-N, a multi-region integrated assessment model that simulates the global climate and economy, and which can be used to design and evaluate the strategic outcomes for different negotiation and agreement frameworks. We also describe how to use multi-agent reinforcement learning to train rational agents using RICE-N. This framework underpinsAI for Global Climate Cooperation, a working group collaboration and competition on climate negotiation and agreement design. Here, we invite the scientific community to design and evaluate their solutions using RICE-N, machine learning, economic intuition, and other domain knowledge. More information can be found on www.ai4climatecoop.org.

openalex-author · arXiv (Cornell University)

Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints

The performance of trained neural networks is robust to harsh levels of pruning. Coupled with the ever-growing size of deep learning models, this observation has motivated extensive research on learning sparse models. In this work, we focus on the task of controlling the level of sparsity when performing sparse learning. Existing methods based on sparsity-inducing penalties involve expensive trial-and-error tuning of the penalty factor, thus lacking direct control of the resulting model sparsity. In response, we adopt a constrained formulation: using the gate mechanism proposed by Louizos et al. (2018), we formulate a constrained optimization problem where sparsification is guided by the training objective and the desired sparsity target in an end-to-end fashion. Experiments on CIFAR-{10, 100}, TinyImageNet, and ImageNet using WideResNet and ResNet{18, 50} models validate the effectiveness of our proposal and demonstrate that we can reliably achieve pre-determined sparsity targets without compromising on predictive performance.

openalex-author · arXiv (Cornell University)

Discrete Key-Value Bottleneck

Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant. Challenges emerge with non-stationary training data streams such as continual learning. One powerful approach that has addressed this challenge involves pre-training of large encoders on volumes of readily available data, followed by task-specific tuning. Given a new task, however, updating the weights of these encoders is challenging as a large number of weights needs to be fine-tuned, and as a result, they forget information about the previous tasks. In the present work, we propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes. Our paradigm will be to encode; process the representation via a discrete bottleneck; and decode. Here, the input is fed to the pre-trained encoder, the output of the encoder is used to select the nearest keys, and the corresponding values are fed to the decoder to solve the current task. The model can only fetch and re-use a sparse number of these key-value pairs during inference, enabling localized and context-dependent model updates. We theoretically investigate the ability of the discrete key-value bottleneck to minimize the effect of learning under distribution shifts and show that it reduces the complexity of the hypothesis class. We empirically verify the proposed method under challenging class-incremental learning scenarios and show that the proposed model - without any task boundaries - reduces catastrophic forgetting across a wide variety of pre-trained models, outperforming relevant baselines on this task.

openalex-author · Neural Networks

Interpolated Adversarial Training: Achieving robust neural networks without sacrificing too much accuracy

Adversarial robustness has become a central goal in deep learning, both in the theory and the practice. However, successful methods to improve the adversarial robustness (such as adversarial training) greatly hurt generalization performance on the unperturbed data. This could have a major impact on how the adversarial robustness affects real world systems (i.e. many may opt to forego robustness if it can improve accuracy on the unperturbed data). We propose Interpolated Adversarial Training, which employs recently proposed interpolation based training methods in the framework of adversarial training. On CIFAR-10, adversarial training increases the standard test error ( when there is no adversary) from 4.43% to 12.32%, whereas with our Interpolated adversarial training we retain the adversarial robustness while achieving a standard test error of only 6.45%. With our technique, the relative increase in the standard error for the robust model is reduced from 178.1% to just 45.5%. Moreover, we provide mathematical analysis of Interpolated Adversarial Training to confirm its efficiencies and demonstrate its advantages in terms of robustness and generalization.

openalex-author · arXiv (Cornell University)

Lookback for Learning to Branch

The expressive and computationally inexpensive bipartite Graph Neural Networks (GNN) have been shown to be an important component of deep learning based Mixed-Integer Linear Program (MILP) solvers. Recent works have demonstrated the effectiveness of such GNNs in replacing the branching (variable selection) heuristic in branch-and-bound (B&B) solvers. These GNNs are trained, offline and on a collection of MILPs, to imitate a very good but computationally expensive branching heuristic, strong branching. Given that B&B results in a tree of sub-MILPs, we ask (a) whether there are strong dependencies exhibited by the target heuristic among the neighboring nodes of the B&B tree, and (b) if so, whether we can incorporate them in our training procedure. Specifically, we find that with the strong branching heuristic, a child node's best choice was often the parent's second-best choice. We call this the "lookback" phenomenon. Surprisingly, the typical branching GNN of Gasse et al. (2019) often misses this simple "answer". To imitate the target behavior more closely by incorporating the lookback phenomenon in GNNs, we propose two methods: (a) target smoothing for the standard cross-entropy loss function, and (b) adding a Parent-as-Target (PAT) Lookback regularizer term. Finally, we propose a model selection framework to incorporate harder-to-formulate objectives such as solving time in the final models. Through extensive experimentation on standard benchmark instances, we show that our proposal results in up to 22% decrease in the size of the B&B tree and up to 15% improvement in the solving times.

openalex-author · arXiv (Cornell University)

Your Autoregressive Generative Model Can be Better If You Treat It as an Energy-Based One

Autoregressive generative models are commonly used, especially for those tasks involving sequential data. They have, however, been plagued by a slew of inherent flaws due to the intrinsic characteristics of chain-style conditional modeling (e.g., exposure bias or lack of long-range coherence), severely limiting their ability to model distributions properly. In this paper, we propose a unique method termed E-ARM for training autoregressive generative models that takes advantage of a well-designed energy-based learning objective. By leveraging the extra degree of freedom of the softmax operation, we are allowed to make the autoregressive model itself be an energy-based model for measuring the likelihood of input without introducing any extra parameters. Furthermore, we show that E-ARM can be trained efficiently and is capable of alleviating the exposure bias problem and increase temporal coherence for autoregressive generative models. Extensive empirical results, covering benchmarks like language modeling, neural machine translation, and image generation, demonstrate the effectiveness of the proposed approach.

openalex-author · arXiv (Cornell University)

On the Generalization and Adaption Performance of Causal Models

Learning models that offer robust out-of-distribution generalization and fast adaptation is a key challenge in modern machine learning. Modelling causal structure into neural networks holds the promise to accomplish robust zero and few-shot adaptation. Recent advances in differentiable causal discovery have proposed to factorize the data generating process into a set of modules, i.e. one module for the conditional distribution of every variable where only causal parents are used as predictors. Such a modular decomposition of knowledge enables adaptation to distributions shifts by only updating a subset of parameters. In this work, we systematically study the generalization and adaption performance of such modular neural causal models by comparing it to monolithic models and structured models where the set of predictors is not constrained to causal parents. Our analysis shows that the modular neural causal models outperform other models on both zero and few-shot adaptation in low data regimes and offer robust generalization. We also found that the effects are more significant for sparser graphs as compared to denser graphs.

openalex-author · arXiv (Cornell University)

On Neural Architecture Inductive Biases for Relational Tasks

Current deep learning approaches have shown good in-distribution generalization performance, but struggle with out-of-distribution generalization. This is especially true in the case of tasks involving abstract relations like recognizing rules in sequences, as we find in many intelligence tests. Recent work has explored how forcing relational representations to remain distinct from sensory representations, as it seems to be the case in the brain, can help artificial systems. Building on this work, we further explore and formalize the advantages afforded by 'partitioned' representations of relations and sensory details, and how this inductive bias can help recompose learned relational structure in newly encountered settings. We introduce a simple architecture based on similarity scores which we name Compositional Relational Network (CoRelNet). Using this model, we investigate a series of inductive biases that ensure abstract relations are learned and represented distinctly from sensory data, and explore their effects on out-of-distribution generalization for a series of relational psychophysics tasks. We find that simple architectural choices can outperform existing models in out-of-distribution generalization. Together, these results show that partitioning relational representations from other information streams may be a simple way to augment existing network architectures' robustness when performing out-of-distribution relational computations.

openalex-author · arXiv (Cornell University)

Building Robust Ensembles via Margin Boosting

In the context of adversarial robustness, a single model does not usually have enough power to defend against all possible adversarial attacks, and as a result, has sub-optimal robustness. Consequently, an emerging line of work has focused on learning an ensemble of neural networks to defend against adversarial attacks. In this work, we take a principled approach towards building robust ensembles. We view this problem from the perspective of margin-boosting and develop an algorithm for learning an ensemble with maximum margin. Through extensive empirical evaluation on benchmark datasets, we show that our algorithm not only outperforms existing ensembling techniques, but also large models trained in an end-to-end fashion. An important byproduct of our work is a margin-maximizing cross-entropy (MCE) loss, which is a better alternative to the standard cross-entropy (CE) loss. Empirically, we show that replacing the CE loss in state-of-the-art adversarial training techniques with our MCE loss leads to significant performance improvement.

openalex-author · arXiv (Cornell University)

Is a Modular Architecture Enough?

Inspired from human cognition, machine learning systems are gradually revealing advantages of sparser and more modular architectures. Recent work demonstrates that not only do some modular architectures generalize well, but they also lead to better out-of-distribution generalization, scaling properties, learning speed, and interpretability. A key intuition behind the success of such systems is that the data generating system for most real-world settings is considered to consist of sparsely interacting parts, and endowing models with similar inductive biases will be helpful. However, the field has been lacking in a rigorous quantitative assessment of such systems because these real-world data distributions are complex and unknown. In this work, we provide a thorough assessment of common modular architectures, through the lens of simple and known modular data distributions. We highlight the benefits of modularity and sparsity and reveal insights on the challenges faced while optimizing modular systems. In doing so, we propose evaluation metrics that highlight the benefits of modularity, the regimes in which these benefits are substantial, as well as the sub-optimality of current end-to-end learned modular systems as opposed to their claimed potential.

openalex-author · Ahuja, K, Hartford, J S & Bengio, Y 2022, 'Weakly supervised representation learning with sparse perturbations', Advances in Neural Information Processing Syste

Weakly Supervised Representation Learning with Sparse Perturbations

The theory of representation learning aims to build methods that provably invert the data generating process with minimal domain knowledge or any source of supervision. Most prior approaches require strong distributional assumptions on the latent variables and weak supervision (auxiliary information such as timestamps) to provide provable identification guarantees. In this work, we show that if one has weak supervision from observations generated by sparse perturbations of the latent variables--e.g. images in a reinforcement learning environment where actions move individual sprites--identification is achievable under unknown continuous latent distributions. We show that if the perturbations are applied only on mutually exclusive blocks of latents, we identify the latents up to those blocks. We also show that if these perturbation blocks overlap, we identify latents up to the smallest blocks shared across perturbations. Consequently, if there are blocks that intersect in one latent variable only, then such latents are identified up to permutation and scaling. We propose a natural estimation procedure based on this theory and illustrate it on low-dimensional synthetic and image-based experiments.

openalex-author · arXiv (Cornell University)

Agnostic Physics-Driven Deep Learning

This work establishes that a physical system can perform statistical learning without gradient computations, via an Agnostic Equilibrium Propagation (Aeqprop) procedure that combines energy minimization, homeostatic control, and nudging towards the correct response. In Aeqprop, the specifics of the system do not have to be known: the procedure is based only on external manipulations, and produces a stochastic gradient descent without explicit gradient computations. Thanks to nudging, the system performs a true, order-one gradient step for each training sample, in contrast with order-zero methods like reinforcement or evolutionary strategies, which rely on trial and error. This procedure considerably widens the range of potential hardware for statistical learning to any system with enough controllable parameters, even if the details of the system are poorly known. Aeqprop also establishes that in natural (bio)physical systems, genuine gradient-based statistical learning may result from generic, relatively simple mechanisms, without backpropagation and its requirement for analytic knowledge of partial derivatives.

openalex-author · arXiv (Cornell University)

Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning

Recurrent neural networks have a strong inductive bias towards learning temporally compressed representations, as the entire history of a sequence is represented by a single vector. By contrast, Transformers have little inductive bias towards learning temporally compressed representations, as they allow for attention over all previously computed elements in a sequence. Having a more compressed representation of a sequence may be beneficial for generalization, as a high-level representation may be more easily re-used and re-purposed and will contain fewer irrelevant details. At the same time, excessive compression of representations comes at the cost of expressiveness. We propose a solution which divides computation into two streams. A slow stream that is recurrent in nature aims to learn a specialized and compressed representation, by forcing chunks of $K$ time steps into a single representation which is divided into multiple vectors. At the same time, a fast stream is parameterized as a Transformer to process chunks consisting of $K$ time-steps conditioned on the information in the slow-stream. In the proposed approach we hope to gain the expressiveness of the Transformer, while encouraging better compression and structuring of representations in the slow stream. We show the benefits of the proposed method in terms of improved sample efficiency and generalization performance as compared to various competitive baselines for visual perception and sequential decision making tasks.

openalex-author · arXiv (Cornell University)

Coordinating Policies Among Multiple Agents via an Intelligent Communication Channel

In Multi-Agent Reinforcement Learning (MARL), specialized channels are often introduced that allow agents to communicate directly with one another. In this paper, we propose an alternative approach whereby agents communicate through an intelligent facilitator that learns to sift through and interpret signals provided by all agents to improve the agents' collective performance. To ensure that this facilitator does not become a centralized controller, agents are incentivized to reduce their dependence on the messages it conveys, and the messages can only influence the selection of a policy from a fixed set, not instantaneous actions given the policy. We demonstrate the strength of this architecture over existing baselines on several cooperative MARL environments.

openalex-author · arXiv (Cornell University)

FedILC: Weighted Geometric Mean and Invariant Gradient Covariance for Federated Learning on Non-IID Data

Federated learning is a distributed machine learning approach which enables a shared server model to learn by aggregating the locally-computed parameter updates with the training data from spatially-distributed client silos. Though successfully possessing advantages in both scale and privacy, federated learning is hurt by domain shift problems, where the learning models are unable to generalize to unseen domains whose data distribution is non-i.i.d. with respect to the training domains. In this study, we propose the Federated Invariant Learning Consistency (FedILC) approach, which leverages the gradient covariance and the geometric mean of Hessians to capture both inter-silo and intra-silo consistencies of environments and unravel the domain shift problems in federated networks. The benchmark and real-world dataset experiments bring evidence that our proposed algorithm outperforms conventional baselines and similar federated learning algorithms. This is relevant to various fields such as medical healthcare, computer vision, and the Internet of Things (IoT). The code is released at https://github.com/mikemikezhu/FedILC.

openalex-author · Journal of Chemical Information and Modeling

RetroGNN: Fast Estimation of Synthesizability for Virtual Screening and De Novo Design by Learning from Slow Retrosynthesis Software

De novo molecule design algorithms often result in chemically unfeasible or synthetically inaccessible molecules. A natural idea to mitigate this problem is to bias these algorithms toward more easily synthesizable molecules using a proxy score for synthetic accessibility. However, using currently available proxies can still result in highly unrealistic compounds. Here, we propose a novel approach, RetroGNN, to estimate synthesizability. First, we search for routes using synthesis planning software for a large number of random molecules. This information is then used to train a graph neural network to predict the outcome of the synthesis planner given the target molecule, in which the regression task can be used as a synthesizability scorer. We highlight how RetroGNN can be used in generative molecule-discovery pipelines together with other scoring functions. We evaluate our approach on several QSAR-based molecule design benchmarks, for which we find synthesizable molecules with state-of-the-art scores. Compared to the virtual screening of 5 million existing molecules from the ZINC database, using RetroGNNScore with a simple fragment-based de novo design algorithm finds molecules predicted to be more likely to possess the desired activity exponentially faster, while maintaining good druglike properties and being easier to synthesize. Importantly, our deep neural network can successfully filter out hard to synthesize molecules while achieving a 105 times speedup over using retrosynthesis planning software.

openalex-author · arXiv (Cornell University)

Temporal Abstractions-Augmented Temporally Contrastive Learning: An Alternative to the Laplacian in RL

In reinforcement learning, the graph Laplacian has proved to be a valuable tool in the task-agnostic setting, with applications ranging from skill discovery to reward shaping. Recently, learning the Laplacian representation has been framed as the optimization of a temporally-contrastive objective to overcome its computational limitations in large (or continuous) state spaces. However, this approach requires uniform access to all states in the state space, overlooking the exploration problem that emerges during the representation learning process. In this work, we propose an alternative method that is able to recover, in a non-uniform-prior setting, the expressiveness and the desired properties of the Laplacian representation. We do so by combining the representation learning with a skill-based covering policy, which provides a better training distribution to extend and refine the representation. We also show that a simple augmentation of the representation objective with the learned temporal abstractions improves dynamics-awareness and helps exploration. We find that our method succeeds as an alternative to the Laplacian in the non-uniform setting and scales to challenging continuous control environments. Finally, even if our method is not optimized for skill discovery, the learned skills can successfully solve difficult continuous navigation tasks with sparse rewards, where standard skill discovery approaches are no so effective.

openalex-author · arXiv (Cornell University)

A New Era: Intelligent Tutoring Systems Will Transform Online Learning for Millions

Despite artificial intelligence (AI) having transformed major aspects of our society, less than a fraction of its potential has been explored, let alone deployed, for education. AI-powered learning can provide millions of learners with a highly personalized, active and practical learning experience, which is key to successful learning. This is especially relevant in the context of online learning platforms. In this paper, we present the results of a comparative head-to-head study on learning outcomes for two popular online learning platforms (n=199 participants): A MOOC platform following a traditional model delivering content using lecture videos and multiple-choice quizzes, and the Korbit learning platform providing a highly personalized, active and practical learning experience. We observe a huge and statistically significant increase in the learning outcomes, with students on the Korbit platform providing full feedback resulting in higher course completion rates and achieving learning gains 2 to 2.5 times higher than both students on the MOOC platform and students in a control group who don't receive personalized feedback on the Korbit platform. The results demonstrate the tremendous impact that can be achieved with a personalized, active learning AI-powered system. Making this technology and learning experience available to millions of learners around the world will represent a significant leap forward towards the democratization of education.

openalex-author · arXiv (Cornell University)

Continuous-Time Meta-Learning with Forward Mode Differentiation

Drawing inspiration from gradient-based meta-learning methods with infinitely small gradient steps, we introduce Continuous-Time Meta-Learning (COMLN), a meta-learning algorithm where adaptation follows the dynamics of a gradient vector field. Specifically, representations of the inputs are meta-learned such that a task-specific linear classifier is obtained as a solution of an ordinary differential equation (ODE). Treating the learning process as an ODE offers the notable advantage that the length of the trajectory is now continuous, as opposed to a fixed and discrete number of gradient steps. As a consequence, we can optimize the amount of adaptation necessary to solve a new task using stochastic gradient descent, in addition to learning the initial conditions as is standard practice in gradient-based meta-learning. Importantly, in order to compute the exact meta-gradients required for the outer-loop updates, we devise an efficient algorithm based on forward mode differentiation, whose memory requirements do not scale with the length of the learning trajectory, thus allowing longer adaptation in constant memory. We provide analytical guarantees for the stability of COMLN, we show empirically its efficiency in terms of runtime and memory usage, and we illustrate its effectiveness on a range of few-shot image classification problems.

openalex-author · arXiv (Cornell University)

Biological Sequence Design with GFlowNets

Design of de novo biological sequences with desired properties, like protein and DNA sequences, often involves an active loop with several rounds of molecule ideation and expensive wet-lab evaluations. These experiments can consist of multiple stages, with increasing levels of precision and cost of evaluation, where candidates are filtered. This makes the diversity of proposed candidates a key consideration in the ideation phase. In this work, we propose an active learning algorithm leveraging epistemic uncertainty estimation and the recently proposed GFlowNets as a generator of diverse candidate solutions, with the objective to obtain a diverse batch of useful (as defined by some utility function, for example, the predicted anti-microbial activity of a peptide) and informative candidates after each round. We also propose a scheme to incorporate existing labeled datasets of candidates, in addition to a reward function, to speed up learning in GFlowNets. We present empirical results on several biological sequence design tasks, and we find that our method generates more diverse and novel batches with high scoring candidates compared to existing approaches.

openalex-author · arXiv (Cornell University)

Bayesian Structure Learning with Generative Flow Networks

In Bayesian structure learning, we are interested in inferring a distribution over the directed acyclic graph (DAG) structure of Bayesian networks, from data. Defining such a distribution is very challenging, due to the combinatorially large sample space, and approximations based on MCMC are often required. Recently, a novel class of probabilistic models, called Generative Flow Networks (GFlowNets), have been introduced as a general framework for generative modeling of discrete and composite objects, such as graphs. In this work, we propose to use a GFlowNet as an alternative to MCMC for approximating the posterior distribution over the structure of Bayesian networks, given a dataset of observations. Generating a sample DAG from this approximate distribution is viewed as a sequential decision problem, where the graph is constructed one edge at a time, based on learned transition probabilities. Through evaluation on both simulated and real data, we show that our approach, called DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs, and it compares favorably against other methods based on MCMC or variational inference.

openalex-author · arXiv (Cornell University)

Combining Modular Skills in Multitask Learning

A modular design encourages neural models to disentangle and recombine different facets of knowledge to generalise more systematically to new tasks. In this work, we assume that each task is associated with a subset of latent discrete skills from a (potentially small) inventory. In turn, skills correspond to parameter-efficient (sparse / low-rank) model parameterisations. By jointly learning these and a task-skill allocation matrix, the network for each task is instantiated as the average of the parameters of active skills. To favour non-trivial soft partitions of skills across tasks, we experiment with a series of inductive biases, such as an Indian Buffet Process prior and a two-speed learning rate. We evaluate our latent-skill model on two main settings: 1) multitask reinforcement learning for grounded instruction following on 8 levels of the BabyAI platform; and 2) few-shot adaptation of pre-trained text-to-text generative models on CrossFit, a benchmark comprising 160 NLP tasks. We find that the modular design of a network significantly increases sample efficiency in reinforcement learning and few-shot generalisation in supervised learning, compared to baselines with fully shared, task-specific, or conditionally generated parameters where knowledge is entangled across tasks. In addition, we show how discrete skills help interpretability, as they yield an explicit hierarchy of tasks.

openalex-author · Nature Reviews Chemistry

CACHE (Critical Assessment of Computational Hit-finding Experiments): A public–private partnership benchmarking initiative to enable the development of computational methods for hit-finding

One aspirational goal of computational chemistry is to predict potent and drug-like binders for any protein, such that only those that bind are synthesized. In this Roadmap, we describe the launch of Critical Assessment of Computational Hit-finding Experiments (CACHE), a public benchmarking project to compare and improve small molecule hit-finding algorithms through cycles of prediction and experimental testing. Participants will predict small molecule binders for new and biologically relevant protein targets representing different prediction scenarios. Predicted compounds will be tested rigorously in an experimental hub, and all predicted binders as well as all experimental screening data, including the chemical structures of experimentally tested compounds, will be made publicly available, and not subject to any intellectual property restrictions. The ability of a range of computational approaches to find novel binders will be evaluated, compared, and openly published. CACHE will launch 3 new benchmarking exercises every year. The outcomes will be better prediction methods, new small molecule binders for target proteins of importance for fundamental biology or drug discovery, and a major technological step towards achieving the goal of Target 2035, a global initiative to identify pharmacological probes for all human proteins.

openalex-author · OPUS 4 (Zuse Institute Berlin)

Tackling Climate Change with Machine Learning

Climate change is one of the greatest challenges facing humanity, and we, as machine learning experts, may wonder how we can help. Here we describe how machine learning can be a powerful tool in reducing greenhouse gas emissions and helping society adapt to a changing climate. From smart grids to disaster management, we identify high impact problems where existing gaps can be filled by machine learning, in collaboration with other fields. Our recommendations encompass exciting research questions as well as promising business opportunities. We call on the machine learning community to join the global effort against climate change.

openalex-author · arXiv (Cornell University)

RECOVER: sequential model optimization platform for combination drug repurposing identifies novel synergistic compounds in vitro

For large libraries of small molecules, exhaustive combinatorial chemical screens become infeasible to perform when considering a range of disease models, assay conditions, and dose ranges. Deep learning models have achieved state of the art results in silico for the prediction of synergy scores. However, databases of drug combinations are biased towards synergistic agents and these results do not necessarily generalise out of distribution. We employ a sequential model optimization search utilising a deep learning model to quickly discover synergistic drug combinations active against a cancer cell line, requiring substantially less screening than an exhaustive evaluation. Our small scale wet lab experiments only account for evaluation of ~5% of the total search space. After only 3 rounds of ML-guided in vitro experimentation (including a calibration round), we find that the set of drug pairs queried is enriched for highly synergistic combinations; two additional rounds of ML-guided experiments were performed to ensure reproducibility of trends. Remarkably, we rediscover drug combinations later confirmed to be under study within clinical trials. Moreover, we find that drug embeddings generated using only structural information begin to reflect mechanisms of action. Prior in silico benchmarking suggests we can enrich search queries by a factor of ~5-10x for highly synergistic drug combinations by using sequential rounds of evaluation when compared to random selection, or by a factor of >3x when using a pretrained model selecting all drug combinations at a single time point.

openalex-author · arXiv (Cornell University)

Generative Flow Networks for Discrete Probabilistic Modeling

We present energy-based generative flow networks (EB-GFN), a novel probabilistic modeling algorithm for high-dimensional discrete data. Building upon the theory of generative flow networks (GFlowNets), we model the generation process by a stochastic data construction policy and thus amortize expensive MCMC exploration into a fixed number of actions sampled from a GFlowNet. We show how GFlowNets can approximately perform large-block Gibbs sampling to mix between modes. We propose a framework to jointly train a GFlowNet with an energy function, so that the GFlowNet learns to sample from the energy distribution, while the energy learns with an approximate MLE objective with negative samples from the GFlowNet. We demonstrate EB-GFN's effectiveness on various probabilistic modeling tasks. Code is publicly available at https://github.com/zdhNarsil/EB_GFN.

openalex-author · arXiv (Cornell University)

Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization

Vector Quantization (VQ) is a method for discretizing latent representations and has become a major part of the deep learning toolkit. It has been theoretically and empirically shown that discretization of representations leads to improved generalization, including in reinforcement learning where discretization can be used to bottleneck multi-agent communication to promote agent specialization and robustness. The discretization tightness of most VQ-based methods is defined by the number of discrete codes in the representation vector and the codebook size, which are fixed as hyperparameters. In this work, we propose learning to dynamically select discretization tightness conditioned on inputs, based on the hypothesis that data naturally contains variations in complexity that call for different levels of representational coarseness. We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks.

openalex-author · arXiv (Cornell University)

Towards Scaling Difference Target Propagation by Learning Backprop Targets

The development of biologically-plausible learning algorithms is important for understanding learning in the brain, but most of them fail to scale-up to real-world tasks, limiting their potential as explanations for learning by real brains. As such, it is important to explore learning algorithms that come with strong theoretical guarantees and can match the performance of backpropagation (BP) on complex tasks. One such algorithm is Difference Target Propagation (DTP), a biologically-plausible learning algorithm whose close relation with Gauss-Newton (GN) optimization has been recently established. However, the conditions under which this connection rigorously holds preclude layer-wise training of the feedback pathway synaptic weights (which is more biologically plausible). Moreover, good alignment between DTP weight updates and loss gradients is only loosely guaranteed and under very specific conditions for the architecture being trained. In this paper, we propose a novel feedback weight training scheme that ensures both that DTP approximates BP and that layer-wise feedback weight training can be restored without sacrificing any theoretical guarantees. Our theory is corroborated by experimental results and we report the best performance ever achieved by DTP on CIFAR-10 and ImageNet 32$\times$32

openalex-author · arXiv (Cornell University)

Trajectory balance: Improved credit assignment in GFlowNets

Generative flow networks (GFlowNets) are a method for learning a stochastic policy for generating compositional objects, such as graphs or strings, from a given unnormalized density by sequences of actions, where many possible action sequences may lead to the same object. We find previously proposed learning objectives for GFlowNets, flow matching and detailed balance, which are analogous to temporal difference learning, to be prone to inefficient credit propagation across long action sequences. We thus propose a new learning objective for GFlowNets, trajectory balance, as a more efficient alternative to previously used objectives. We prove that any global minimizer of the trajectory balance objective can define a policy that samples exactly from the target distribution. In experiments on four distinct domains, we empirically demonstrate the benefits of the trajectory balance objective for GFlowNet convergence, diversity of generated samples, and robustness to long action sequences and large action spaces.

openalex-author · arXiv (Cornell University)

Boosting Exploration in Multi-Task Reinforcement Learning using Adversarial Networks

Advancements in reinforcement learning (RL) have been remarkable in recent years. However, the limitations of traditional training methods have become increasingly evident, particularly in meta-RL settings where agents face new, unseen tasks. Conventional training approaches are susceptible to failure in such situations as they need more robustness to adversity. Our proposed adversarial training regime for Multi-Task Reinforcement Learning (MT-RL) addresses the limitations of conventional training methods in RL, especially in meta-RL environments where the agent faces new tasks. The adversarial component challenges the agent, forcing it to improve its decision-making abilities in dynamic and unpredictable situations. This component operates without relying on manual intervention or domain-specific knowledge, making it a highly versatile solution. Experiments conducted in multiple MT-RL environments demonstrate that adversarial training leads to better exploration and a deeper understanding of the environment. The adversarial training regime for MT-RL presents a new perspective on training and development for RL agents and is a valuable contribution to the field.

openalex-author · arXiv (Cornell University)

Predicting Tactical Solutions to Operational Planning Problems under Imperfect Information

This paper offers a methodological contribution at the intersection of machine learning and operations research. Namely, we propose a methodology to quickly predict expected tactical descriptions of operational solutions (TDOSs). The problem we address occurs in the context of two-stage stochastic programming, where the second stage is demanding computationally. We aim to predict at a high speed the expected TDOS associated with the second-stage problem, conditionally on the first-stage variables. This may be used in support of the solution to the overall two-stage problem by avoiding the online generation of multiple second-stage scenarios and solutions. We formulate the tactical prediction problem as a stochastic optimal prediction program, whose solution we approximate with supervised machine learning. The training data set consists of a large number of deterministic operational problems generated by controlled probabilistic sampling. The labels are computed based on solutions to these problems (solved independently and offline), employing appropriate aggregation and subselection methods to address uncertainty. Results on our motivating application on load planning for rail transportation show that deep learning models produce accurate predictions in very short computing time (milliseconds or less). The predictive accuracy is close to the lower bounds calculated based on sample average approximation of the stochastic prediction programs.

openalex-author · SSRN Electronic Journal

(Private)-Retroactive Carbon Pricing [(P)Recap]: A Market-Based Approach for Climate Finance and Risk Assessment

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Multi-Domain Balanced Sampling Improves Out-of-Distribution Generalization of Chest X-ray Pathology Prediction Models

Learning models that generalize under different distribution shifts in medical imaging has been a long-standing research challenge. There have been several proposals for efficient and robust visual representation learning among vision research practitioners, especially in the sensitive and critical biomedical domain. In this paper, we propose an idea for out-of-distribution generalization of chest X-ray pathologies that uses a simple balanced batch sampling technique. We observed that balanced sampling between the multiple training datasets improves the performance over baseline models trained without balancing.

openalex-author · arXiv (Cornell University)

Multi-scale Feature Learning Dynamics: Insights for Double Descent

A key challenge in building theoretical foundations for deep learning is the complex optimization dynamics of neural networks, resulting from the high-dimensional interactions between the large number of network parameters. Such non-trivial dynamics lead to intriguing behaviors such as the phenomenon of "double descent" of the generalization error. The more commonly studied aspect of this phenomenon corresponds to model-wise double descent where the test error exhibits a second descent with increasing model complexity, beyond the classical U-shaped error curve. In this work, we investigate the origins of the less studied epoch-wise double descent in which the test error undergoes two non-monotonous transitions, or descents as the training time increases. By leveraging tools from statistical physics, we study a linear teacher-student setup exhibiting epoch-wise double descent similar to that in deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of generalization error over training. We find that double descent can be attributed to distinct features being learned at different scales: as fast-learning features overfit, slower-learning features start to fit, resulting in a second descent in test error. We validate our findings through numerical experiments where our theory accurately predicts empirical findings and remains consistent with observations in deep neural networks.

openalex-author · arXiv (Cornell University)

GFlowNet Foundations

Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates in an active learning context, with a training objective that makes them approximately sample in proportion to a given reward function. In this paper, we show a number of additional theoretical properties of GFlowNets. They can be used to estimate joint probability distributions and the corresponding marginal distributions where some variables are unspecified and, of particular interest, can represent distributions over composite objects like sets and graphs. GFlowNets amortize the work typically done by computationally expensive MCMC methods in a single but trained generative pass. They could also be used to estimate partition functions and free energies, conditional probabilities of supersets (supergraphs) given a subset (subgraph), as well as marginal distributions over all supersets (supergraphs) of a given set (graph). We introduce variations enabling the estimation of entropy and mutual information, sampling from a Pareto frontier, connections to reward-maximizing policies, and extensions to stochastic environments, continuous actions and modular energy functions.

openalex-author · Canadian Medical Association Journal

Problèmes associés au déploiement des modèles fondés sur l’apprentissage machine en santé

Points clés Dans un article connexe, Verma et ses collègues s’intéressent à la manière dont des solutions fondées sur l’apprentissage machine peuvent être élaborées et mises en place pour appuyer la prise de décision médicale[1][1]. Les systèmes d’aide à la prise de décision et

openalex-author · Bulletin 1024

Une nouvelle approche au traçage numérique des contacts (application COVI au Québec)

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Properties from Mechanisms: An Equivariance Perspective on Identifiable Representation Learning

A key goal of unsupervised representation learning is "inverting" a data generating process to recover its latent properties. Existing work that provably achieves this goal relies on strong assumptions on relationships between the latent variables (e.g., independence conditional on auxiliary information). In this paper, we take a very different perspective on the problem and ask, "Can we instead identify latent properties by leveraging knowledge of the mechanisms that govern their evolution?" We provide a complete characterization of the sources of non-identifiability as we vary knowledge about a set of possible mechanisms. In particular, we prove that if we know the exact mechanisms under which the latent properties evolve, then identification can be achieved up to any equivariances that are shared by the underlying mechanisms. We generalize this characterization to settings where we only know some hypothesis class over possible mechanisms, as well as settings where the mechanisms are stochastic. We demonstrate the power of this mechanism-based perspective by showing that we can leverage our results to generalize existing identifiable representation learning results. These results suggest that by exploiting inductive biases on mechanisms, it is possible to design a range of new identifiable representation learning approaches.

openalex-author · arXiv (Cornell University)

From Machine Learning to Robotics: Challenges and Opportunities for Embodied Intelligence

Machine learning has long since become a keystone technology, accelerating science and applications in a broad range of domains. Consequently, the notion of applying learning methods to a particular problem set has become an established and valuable modus operandi to advance a particular field. In this article we argue that such an approach does not straightforwardly extended to robotics -- or to embodied intelligence more generally: systems which engage in a purposeful exchange of energy and information with a physical environment. In particular, the purview of embodied intelligent agents extends significantly beyond the typical considerations of main-stream machine learning approaches, which typically (i) do not consider operation under conditions significantly different from those encountered during training; (ii) do not consider the often substantial, long-lasting and potentially safety-critical nature of interactions during learning and deployment; (iii) do not require ready adaptation to novel tasks while at the same time (iv) effectively and efficiently curating and extending their models of the world through targeted and deliberate actions. In reality, therefore, these limitations result in learning-based systems which suffer from many of the same operational shortcomings as more traditional, engineering-based approaches when deployed on a robot outside a well defined, and often narrow operating envelope. Contrary to viewing embodied intelligence as another application domain for machine learning, here we argue that it is in fact a key driver for the advancement of machine learning technology. In this article our goal is to highlight challenges and opportunities that are specific to embodied intelligence and to propose research directions which may significantly advance the state-of-the-art in robot learning.

openalex-author · arXiv (Cornell University)

Chunked Autoregressive GAN for Conditional Waveform Synthesis

Conditional waveform synthesis models learn a distribution of audio waveforms given conditioning such as text, mel-spectrograms, or MIDI. These systems employ deep generative models that model the waveform via either sequential (autoregressive) or parallel (non-autoregressive) sampling. Generative adversarial networks (GANs) have become a common choice for non-autoregressive waveform synthesis. However, state-of-the-art GAN-based models produce artifacts when performing mel-spectrogram inversion. In this paper, we demonstrate that these artifacts correspond with an inability for the generator to learn accurate pitch and periodicity. We show that simple pitch and periodicity conditioning is insufficient for reducing this error relative to using autoregression. We discuss the inductive bias that autoregression provides for learning the relationship between instantaneous frequency and phase, and show that this inductive bias holds even when autoregressively sampling large chunks of the waveform during each forward pass. Relative to prior state-of-the-art GAN-based models, our proposed model, Chunked Autoregressive GAN (CARGAN) reduces pitch error by 40-60%, reduces training time by 58%, maintains a fast generation speed suitable for real-time or interactive applications, and maintains or improves subjective quality.

openalex-author · arXiv (Cornell University)

Compositional Attention: Disentangling Search and Retrieval

Multi-head, key-value attention is the backbone of the widely successful Transformer model and its variants. This attention mechanism uses multiple parallel key-value attention blocks (called heads), each performing two fundamental computations: (1) search - selection of a relevant entity from a set via query-key interactions, and (2) retrieval - extraction of relevant features from the selected entity via a value matrix. Importantly, standard attention heads learn a rigid mapping between search and retrieval. In this work, we first highlight how this static nature of the pairing can potentially: (a) lead to learning of redundant parameters in certain tasks, and (b) hinder generalization. To alleviate this problem, we propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure. The proposed mechanism disentangles search and retrieval and composes them in a dynamic, flexible and context-dependent manner through an additional soft competition stage between the query-key combination and value pairing. Through a series of numerical experiments, we show that it outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings. Through our qualitative analysis, we demonstrate that Compositional Attention leads to dynamic specialization based on the type of retrieval needed. Our proposed mechanism generalizes multi-head attention, allows independent scaling of search and retrieval, and can easily be implemented in lieu of standard attention heads in any network architecture.

openalex-author · arXiv (Cornell University)

Graph Neural Networks with Learnable Structural and Positional Representations

Graph neural networks (GNNs) have become the standard learning architectures for graphs. GNNs have been applied to numerous domains ranging from quantum chemistry, recommender systems to knowledge graphs and natural language processing. A major issue with arbitrary graphs is the absence of canonical positional information of nodes, which decreases the representation power of GNNs to distinguish e.g. isomorphic nodes and other graph symmetries. An approach to tackle this issue is to introduce Positional Encoding (PE) of nodes, and inject it into the input layer, like in Transformers. Possible graph PE are Laplacian eigenvectors. In this work, we propose to decouple structural and positional representations to make easy for the network to learn these two essential properties. We introduce a novel generic architecture which we call LSPE (Learnable Structural and Positional Encodings). We investigate several sparse and fully-connected (Transformer-like) GNNs, and observe a performance increase for molecular datasets, from 1.79% up to 64.14% when considering learnable PE for both GNN classes.

openalex-author · Neural Information Processing Systems

Dynamic Inference with Neural Interpreters

Modern neural network architectures can leverage large amounts of data to generalize well within the training distribution. However, they are less capable of systematic generalization to data drawn from unseen but related distributions, a feat that is hypothesized to require compositional reasoning and reuse of knowledge. In this work, we present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules, which we call \emph{functions}. Inputs to the model are routed through a sequence of functions in a way that is end-to-end learned. The proposed architecture can flexibly compose computation along width and depth, and lends itself well to capacity extension after training. To demonstrate the versatility of Neural Interpreters, we evaluate it in two distinct settings: image classification and visual abstract reasoning on Raven Progressive Matrices. In the former, we show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferrable to a new task in a sample efficient manner. In the latter, we find that Neural Interpreters are competitive with respect to the state-of-the-art in terms of systematic generalization

openalex-author · arXiv (Cornell University)

Unifying Likelihood-free Inference with Black-box Sequence Design and Beyond.

Black-box optimization formulations for biological sequence design have drawn recent attention due to their promising potential impact on the pharmaceutical industry. In this work, we propose to unify two seemingly distinct worlds: likelihood-free inference and black-box sequence design, under one probabilistic framework. In tandem, we provide a recipe for constructing various sequence design methods based on this framework. We show how previous drug discovery approaches can be reinvented in our framework, and further propose new probabilistic sequence design algorithms. Extensive experiments illustrate the benefits of the proposed methodology.

openalex-author · arXiv (Cornell University)

ClimateGAN: Raising Climate Change Awareness by Generating Images of\n Floods

Climate change is a major threat to humanity, and the actions required to\nprevent its catastrophic consequences include changes in both policy-making and\nindividual behaviour. However, taking action requires understanding the effects\nof climate change, even though they may seem abstract and distant. Projecting\nthe potential consequences of extreme climate events such as flooding in\nfamiliar places can help make the abstract impacts of climate change more\nconcrete and encourage action. As part of a larger initiative to build a\nwebsite that projects extreme climate events onto user-chosen photos, we\npresent our solution to simulate photo-realistic floods on authentic images. To\naddress this complex task in the absence of suitable training data, we propose\nClimateGAN, a model that leverages both simulated and real data for\nunsupervised domain adaptation and conditional image generation. In this paper,\nwe describe the details of our framework, thoroughly evaluate components of our\narchitecture and demonstrate that our model is capable of robustly generating\nphoto-realistic flooding.\n

openalex-author · 2021 IEEE/CVF International Conference on Computer Vision (ICCV)

FloW: A Dataset and Benchmark for Floating Waste Detection in Inland Waters

Marine debris is severely threatening the marine lives and causing sustained pollution to the whole ecosystem. To prevent the wastes from getting into the ocean, it is helpful to clean up the floating wastes in inland waters using the autonomous cleaning devices like unmanned surface vehicles. The cleaning efficiency relies on a high-accurate and robust object detection system. However, the small size of the target, the strong light reflection over water surface, and the reflection of other objects on bank-side all bring challenges to the vision-based object detection system. To promote the practical application for autonomous floating wastes cleaning, we present FloW † , the first dataset for floating waste detection in inland water areas. The dataset consists of an image sub-dataset FloW-Img and a multimodal sub-dataset FloW-RI which contains synchronized millimeter wave radar data and images. Accurate annotations for images and radar data are provided, supporting floating waste detection strategies based on image, radar data, and the fusion of two sensors. We perform several baseline experiments on our dataset, including vision-based and radar-based detection methods. The results show that, the detection accuracy is relatively low and floating waste detection still remains a challenging task.

openalex-author · PLoS Computational Biology, Vol 17, Iss 10 (2021)

CAMAP: Artificial neural networks unveil the role of codon arrangement in modulating MHC-I peptides presentation

MHC-I associated peptides (MAPs) play a central role in the elimination of virus-infected and neoplastic cells by CD8 T cells. However, accurately predicting the MAP repertoire remains difficult, because only a fraction of the transcriptome generates MAPs. In this study, we investigated whether codon arrangement (usage and placement) regulates MAP biogenesis. We developed an artificial neural network called Codon Arrangement MAP Predictor (CAMAP), predicting MAP presentation solely from mRNA sequences flanking the MAP-coding codons (MCCs), while excluding the MCC per se. CAMAP predictions were significantly more accurate when using original codon sequences than shuffled codon sequences which reflect amino acid usage. Furthermore, predictions were independent of mRNA expression and MAP binding affinity to MHC-I molecules and applied to several cell types and species. Combining MAP ligand scores, transcript expression level and CAMAP scores was particularly useful to increase MAP prediction accuracy. Using an in vitro assay, we showed that varying the synonymous codons in the regions flanking the MCCs (without changing the amino acid sequence) resulted in significant modulation of MAP presentation at the cell surface. Taken together, our results demonstrate the role of codon arrangement in the regulation of MAP presentation and support integration of both translational and post-translational events in predictive algorithms to ameliorate modeling of the immunopeptidome. Author summary MHC-I associated peptides (MAPs) are small fragments of intracellular proteins presented at the surface of cells and used by the immune system to detect and eliminate cancerous or virus-infected cells. While it is theoretically possible to predict which portions of the intracellular proteins will be naturally processed by the cells to ultimately reach the surface, current methodologies have prohibitively high false discovery rates. Here we introduce an artificial neural network called Codon Arrangement MAP Predictor (CAMAP) which integrates information from mRNA-to-protein translation to other factors regulating MAP biogenesis (e.g. MAP ligand score and transcript expression levels) to improve MAP prediction accuracy. While most MAP predictive approaches focus on MAP sequences per se, CAMAP’s novelty is to analyze the MAP-flanking mRNA sequences, thereby providing completely independent information for MAP prediction. We show on several datasets that the integration of CAMAP scores with other known factors involved in MAP presentation (i.e. MAP ligand score and mRNA expression) significantly improves MAP prediction accuracy, and further validate CAMAP learned features using an in-vitro assay. These findings may have major implications for the design of vaccines against cancers and viruses, and in times of pandemics could accelerate the identification of relevant MAPs of viral origins.

openalex-author · arXiv (Cornell University)

Learning Neural Causal Models with Active Interventions

Discovering causal structures from data is a challenging inference problem of fundamental importance in all areas of science. The appealing properties of neural networks have recently led to a surge of interest in differentiable neural network-based methods for learning causal structures from data. So far, differentiable causal discovery has focused on static datasets of observational or fixed interventional origin. In this work, we introduce an active intervention targeting (AIT) method which enables a quick identification of the underlying causal structure of the data-generating process. Our method significantly reduces the required number of interactions compared with random intervention targeting and is applicable for both discrete and continuous optimization formulations of learning the underlying directed acyclic graph (DAG) from data. We examine the proposed method across multiple frameworks in a wide range of settings and demonstrate superior performance on multiple benchmarks from simulated to real-world data.

openalex-author · Canadian Medical Association Journal

Problems in the deployment of machine-learned models in health care

[See related articles at www.cmaj.ca/lookup/doi/10.1503/cmaj.202434][1] and [www.cmaj.ca/lookup/doi/10.1503/cmaj.210036][2] KEY POINTS In a companion article, Verma and colleagues discuss how machine-learned solutions can be developed and implemented to support medical decision-making.[1][3] Both

openalex-author · 2021 International Joint Conference on Neural Networks (IJCNN)

Combating False Negatives in Adversarial Imitation Learning

In adversarial imitation learning, a discriminator is trained to differentiate agent episodes from expert demonstrations representing the desired behavior. However, as the trained policy learns to be more successful, the negative examples (the ones produced by the agent) become increasingly similar to expert ones. Despite the fact that the task is successfully accomplished in some of the agent's trajectories, the discriminator is trained to output low values for them. We hypothesize that this inconsistent training signal for the discriminator can impede its learning, and consequently leads to worse overall performance of the agent. We show experimental evidence for this hypothesis and that the 'False Negatives' (i.e. successful agent episodes) significantly hinder adversarial imitation learning, which is the first contribution of this paper. Then, we propose a method to alleviate the impact of false negatives and test it on the BabyAI environment. This method consistently improves sample efficiency over the baselines by at least an order of magnitude.

openalex-author · arXiv (Cornell University)

Discrete-Valued Neural Communication

Deep learning has advanced from fully connected architectures to structured models organized into components, e.g., the transformer composed of positional elements, modular architectures divided into slots, and graph neural nets made up of nodes. In structured models, an interesting question is how to conduct dynamic and possibly sparse communication among the separate components. Here, we explore the hypothesis that restricting the transmitted information among components to discrete representations is a beneficial bottleneck. The motivating intuition is human language in which communication occurs through discrete symbols. Even though individuals have different understandings of what a "cat" is based on their specific experiences, the shared discrete token makes it possible for communication among individuals to be unimpeded by individual differences in internal representation. To discretize the values of concepts dynamically communicated among specialist components, we extend the quantization mechanism from the Vector-Quantized Variational Autoencoder to multi-headed discretization with shared codebooks and use it for discrete-valued neural communication (DVNC). Our experiments show that DVNC substantially improves systematic generalization in a variety of architectures -- transformers, modular architectures, and graph neural networks. We also show that the DVNC is robust to the choice of hyperparameters, making the method very useful in practice. Moreover, we establish a theoretical justification of our discretization process, proving that it has the ability to increase noise robustness and reduce the underlying dimensionality of the model.

openalex-author · arXiv (Cornell University)

The Causal-Neural Connection: Expressiveness, Learnability, and Inference

One of the central elements of any causal inference is an object called structural causal model (SCM), which represents a collection of mechanisms and exogenous sources of random variation of the system under investigation (Pearl, 2000). An important property of many kinds of neural networks is universal approximability: the ability to approximate any function to arbitrary precision. Given this property, one may be tempted to surmise that a collection of neural nets is capable of learning any SCM by training on data generated by that SCM. In this paper, we show this is not the case by disentangling the notions of expressivity and learnability. Specifically, we show that the causal hierarchy theorem (Thm. 1, Bareinboim et al., 2020), which describes the limits of what can be learned from data, still holds for neural models. For instance, an arbitrarily complex and expressive neural net is unable to predict the effects of interventions given observational data alone. Given this result, we introduce a special type of SCM called a neural causal model (NCM), and formalize a new type of inductive bias to encode structural constraints necessary for performing causal inferences. Building on this new class of models, we focus on solving two canonical tasks found in the literature known as causal identification and estimation. Leveraging the neural toolbox, we develop an algorithm that is both sufficient and necessary to determine whether a causal effect can be learned from data (i.e., causal identifiability); it then estimates the effect whenever identifiability holds (causal estimation). Simulations corroborate the proposed approach.

openalex-author · arXiv (Cornell University)

Systematic Evaluation of Causal Discovery in Visual Model Based\n Reinforcement Learning

Inducing causal relationships from observations is a classic problem in\nmachine learning. Most work in causality starts from the premise that the\ncausal variables themselves are observed. However, for AI agents such as robots\ntrying to make sense of their environment, the only observables are low-level\nvariables like pixels in images. To generalize well, an agent must induce\nhigh-level variables, particularly those which are causal or are affected by\ncausal variables. A central goal for AI and causality is thus the joint\ndiscovery of abstract representations and causal structure. However, we note\nthat existing environments for studying causal induction are poorly suited for\nthis objective because they have complicated task-specific causal graphs which\nare impossible to manipulate parametrically (e.g., number of nodes, sparsity,\ncausal chain length, etc.). In this work, our goal is to facilitate research in\nlearning representations of high-level variables as well as causal structures\namong them. In order to systematically probe the ability of methods to identify\nthese variables and structures, we design a suite of benchmarking RL\nenvironments. We evaluate various representation learning algorithms from the\nliterature and find that explicitly incorporating structure and modularity in\nmodels can help causal induction in model-based reinforcement learning.\n

openalex-author · Communications of the ACM

Deep learning for AI

How can neural networks learn the rich internal representations required for difficult tasks such as recognizing objects or understanding language?

openalex-author · arXiv (Cornell University)

Predicting Unreliable Predictions by Shattering a Neural Network

Piecewise linear neural networks can be split into subfunctions, each with its own activation pattern, domain, and empirical error. Empirical error for the full network can be written as an expectation over empirical error of subfunctions. Constructing a generalization bound on subfunction empirical error indicates that the more densely a subfunction is surrounded by training samples in representation space, the more reliable its predictions are. Further, it suggests that models with fewer activation regions generalize better, and models that abstract knowledge to a greater degree generalize better, all else equal. We propose not only a theoretical framework to reason about subfunction error bounds but also a pragmatic way of approximately evaluating it, which we apply to predicting which samples the network will not successfully generalize to. We test our method on detection of misclassification and out-of-distribution samples, finding that it performs competitively in both cases. In short, some network activation patterns are associated with higher reliability than others, and these can be identified using subfunction error bounds.

openalex-author · arXiv (Cornell University)

Variational Causal Networks: Approximate Bayesian Inference over Causal Structures

Learning the causal structure that underlies data is a crucial step towards robust real-world decision making. The majority of existing work in causal inference focuses on determining a single directed acyclic graph (DAG) or a Markov equivalence class thereof. However, a crucial aspect to acting intelligently upon the knowledge about causal structure which has been inferred from finite data demands reasoning about its uncertainty. For instance, planning interventions to find out more about the causal mechanisms that govern our data requires quantifying epistemic uncertainty over DAGs. While Bayesian causal inference allows to do so, the posterior over DAGs becomes intractable even for a small number of variables. Aiming to overcome this issue, we propose a form of variational inference over the graphs of Structural Causal Models (SCMs). To this end, we introduce a parametric variational family modelled by an autoregressive distribution over the space of discrete DAGs. Its number of parameters does not grow exponentially with the number of variables and can be tractably learned by maximising an Evidence Lower Bound (ELBO). In our experiments, we demonstrate that the proposed variational posterior is able to provide a good approximation of the true posterior.

openalex-author · International Conference on Machine Learning

Exploration-Driven Representation Learning in Reinforcement Learning

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Invariance Principle Meets Information Bottleneck for Out-of-Distribution Generalization

The invariance principle from causality is at the heart of notable approaches such as invariant risk minimization (IRM) that seek to address out-of-distribution (OOD) generalization failures. Despite the promising theory, invariance principle-based approaches fail in common classification tasks, where invariant (causal) features capture all the information about the label. Are these failures due to the methods failing to capture the invariance? Or is the invariance principle itself insufficient? To answer these questions, we revisit the fundamental assumptions in linear regression tasks, where invariance-based approaches were shown to provably generalize OOD. In contrast to the linear regression tasks, we show that for linear classification tasks we need much stronger restrictions on the distribution shifts, or otherwise OOD generalization is impossible. Furthermore, even with appropriate restrictions on distribution shifts in place, we show that the invariance principle alone is insufficient. We prove that a form of the information bottleneck constraint along with invariance helps address key failures when invariant features capture all the information about the label and also retains the existing success when they do not. We propose an approach that incorporates both of these principles and demonstrate its effectiveness in several experiments.

openalex-author · arXiv (Cornell University)

Flow Network based Generative Models for Non-Iterative Diverse Candidate\n Generation

This paper is about the problem of learning a stochastic policy for\ngenerating an object (like a molecular graph) from a sequence of actions, such\nthat the probability of generating an object is proportional to a given\npositive reward for that object. Whereas standard return maximization tends to\nconverge to a single return-maximizing sequence, there are cases where we would\nlike to sample a diverse set of high-return solutions. These arise, for\nexample, in black-box function optimization when few rounds are possible, each\nwith large batches of queries, where the batches should be diverse, e.g., in\nthe design of new molecules. One can also see this as a problem of\napproximately converting an energy function to a generative distribution. While\nMCMC methods can achieve that, they are expensive and generally only perform\nlocal exploration. Instead, training a generative policy amortizes the cost of\nsearch during training and yields to fast generation. Using insights from\nTemporal Difference learning, we propose GFlowNet, based on a view of the\ngenerative process as a flow network, making it possible to handle the tricky\ncase where different trajectories can yield the same final state, e.g., there\nare many ways to sequentially add atoms to generate some molecular graph. We\ncast the set of trajectories as a flow and convert the flow consistency\nequations into a learning objective, akin to the casting of the Bellman\nequations into Temporal Difference methods. We prove that any global minimum of\nthe proposed objectives yields a policy which samples from the desired\ndistribution, and demonstrate the improved performance and diversity of\nGFlowNet on a simple domain where there are many modes to the reward function,\nand on a molecule synthesis task.\n

openalex-author · arXiv (Cornell University)

SpeechBrain: A General-Purpose Speech Toolkit

SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.

openalex-author · arXiv (Cornell University)

A Consciousness-Inspired Planning Agent for Model-Based Reinforcement\n Learning

We present an end-to-end, model-based deep reinforcement learning agent which\ndynamically attends to relevant parts of its state during planning. The agent\nuses a bottleneck mechanism over a set-based representation to force the number\nof entities to which the agent attends at each planning step to be small. In\nexperiments, we investigate the bottleneck mechanism with several sets of\ncustomized environments featuring different challenges. We consistently observe\nthat the design allows the planning agents to generalize their learned\ntask-solving abilities in compatible unseen environments by attending to the\nrelevant objects, leading to better out-of-distribution generalization\nperformance.\n

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

GraphMix: Improved Training of GNNs for Semi-Supervised Learning

We present GraphMix, a regularization method for Graph Neural Network based semi-supervised object classification, whereby we propose to train a fully-connected network jointly with the graph neural network via parameter sharing and interpolation-based regularization. Further, we provide a theoretical analysis of how GraphMix improves the generalization bounds of the underlying graph neural network, without making any assumptions about the "aggregation" layer or the depth of the graph neural networks. We experimentally validate this analysis by applying GraphMix to various architectures such as Graph Convolutional Networks, Graph Attention Networks and Graph-U-Net. Despite its simplicity, we demonstrate that GraphMix can consistently improve or closely match state-of-the-art performance using even simpler architectures such as Graph Convolutional Networks, across three established graph benchmarks: Cora, Citeseer and Pubmed citation network datasets, as well as three newly proposed datasets: Cora-Full, Co-author-CS and Co-author-Physics.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Deep Verifier Networks: Verification of Deep Discriminative Models with Deep Generative Models

AI Safety is a major concern in many deep learning applications such as autonomous driving. Given a trained deep learning model, an important natural problem is how to reliably verify the model's prediction. In this paper, we propose a novel framework --- deep verifier networks (DVN) to detect unreliable inputs or predictions of deep discriminative models, using separately trained deep generative models. Our proposed model is based on conditional variational auto-encoders with disentanglement constraints to separate the label information from the latent representation. We give both intuitive and theoretical justifications for the model. Our verifier network is trained independently with the prediction model, which eliminates the need of retraining the verifier network for a new model. We test the verifier network on both out-of-distribution detection and adversarial example detection problems, as well as anomaly detection problems in structured prediction tasks such as image caption generation. We achieve state-of-the-art results in all of these problems.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Visual Concept Reasoning Networks

A split-transform-merge strategy has been broadly used as an architectural constraint in convolutional neural networks for visual recognition tasks. It approximates sparsely connected networks by explicitly defining multiple branches to simultaneously learn representations with different visual concepts or properties. Dependencies or interactions between these representations are typically defined by dense and local operations, however, without any adaptiveness or high-level reasoning. In this work, we propose to exploit this strategy and combine it with our Visual Concept Reasoning Networks (VCRNet) to enable reasoning between high-level visual concepts. We associate each branch with a visual concept and derive a compact concept state by selecting a few local descriptors through an attention module. These concept states are then updated by graph-based interaction and used to adaptively modulate the local descriptors. We describe our proposed model by split-transform-attend-interact-modulate-merge stages, which are implemented by opting for a highly modularized architecture. Extensive experiments on visual recognition tasks such as image classification, semantic segmentation, object detection, scene recognition, and action recognition show that our proposed model, VCRNet, consistently improves the performance by increasing the number of parameters by less than 1%.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Parameterizing Branch-and-Bound Search Trees to Learn Branching Policies

Branch and Bound (B&B) is the exact tree search method typically used to solve Mixed-Integer Linear Programming problems (MILPs). Learning branching policies for MILP has become an active research area, with most works proposing to imitate the strong branching rule and specialize it to distinct classes of problems. We aim instead at learning a policy that generalizes across heterogeneous MILPs: our main hypothesis is that parameterizing the state of the B&B search tree can aid this type of generalization. We propose a novel imitation learning framework, and introduce new input features and architectures to represent branching. Experiments on MILP benchmark instances clearly show the advantages of incorporating an explicit parameterization of the state of the search tree to modulate the branching decisions, in terms of both higher accuracy and smaller B&B trees. The resulting policies significantly outperform the current state-of-the-art method for "learning to branch" by effectively allowing generalization to generic unseen instances.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Meta-Learning Framework with Applications to Zero-Shot Time-Series Forecasting

Can meta-learning discover generic ways of processing time series (TS) from a diverse dataset so as to greatly improve generalization on new TS coming from different datasets? This work provides positive evidence to this using a broad meta-learning framework which we show subsumes many existing meta-learning algorithms. Our theoretical analysis suggests that residual connections act as a meta-learning adaptation mechanism, generating a subset of task-specific parameters based on a given TS input, thus gradually expanding the expressive power of the architecture on-the-fly. The same mechanism is shown via linearization analysis to have the interpretation of a sequential update of the final linear layer. Our empirical results on a wide range of data emphasize the importance of the identified meta-learning mechanisms for successful zero-shot univariate forecasting, suggesting that it is viable to train a neural network on a source TS dataset and deploy it on a different target TS dataset without retraining, resulting in performance that is at least as good as that of state-of-practice univariate forecasting models.

openalex-author · PolyPublie (École Polytechnique de Montréal)

Fast and Slow Learning of Recurrent Independent Mechanisms

Decomposing knowledge into interchangeable pieces promises a generalization advantage when there are changes in distribution. A learning agent interacting with its environment is likely to be faced with situations requiring novel combinations of existing pieces of knowledge. We hypothesize that such a decomposition of knowledge is particularly relevant for being able to generalize in a systematic manner to out-of-distribution changes. To study these ideas, we propose a particular training framework in which we assume that the pieces of knowledge an agent needs and its reward function are stationary and can be re-used across tasks. An attention mechanism dynamically selects which modules can be adapted to the current task, and the parameters of the selected modules are allowed to change quickly as the learner is confronted with variations in what it experiences, while the parameters of the attention mechanisms act as stable, slowly changing, meta-parameters. We focus on pieces of knowledge captured by an ensemble of modules sparsely communicating with each other via a bottleneck of attention. We find that meta-learning the modular aspects of the proposed system greatly helps in achieving faster adaptation in a reinforcement learning setup involving navigation in a partially observed grid world with image-level input. We also find that reversing the role of parameters and meta-parameters does not work nearly as well, suggesting a particular role for fast adaptation of the dynamically selected modules.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Object-Centric Image Generation from Layouts

We begin with the hypothesis that a model must be able to understand individual objects and relationships between objects in order to generate complex scenes with multiple objects well. Our layout-to-image-generation method, which we call Object-Centric Generative Adversarial Network (or OC-GAN), relies on a novel Scene-Graph Similarity Module (SGSM). The SGSM learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity. We also propose changes to the conditioning mechanism of the generator that enhance its object instance-awareness. Apart from improving image quality, our contributions mitigate two failure modes in previous approaches: (1) spurious objects being generated without corresponding bounding boxes in the layout, and (2) overlapping bounding boxes in the layout leading to merged objects in images. Extensive quantitative evaluation and ablation studies demonstrate the impact of our contributions, with our model outperforming previous state-of-the-art approaches on both the COCO-Stuff and Visual Genome datasets. Finally, we address an important limitation of evaluation metrics used in previous works by introducing SceneFID -- an object-centric adaptation of the popular Fréchet Inception Distance metric, that is better suited for multi-object images.

openalex-author · Neuron

How does hemispheric specialization contribute to human-defining cognition?

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

An End-to-End Framework for Molecular Conformation Generation via\n Bilevel Programming

Predicting molecular conformations (or 3D structures) from molecular graphs\nis a fundamental problem in many applications. Most existing approaches are\nusually divided into two steps by first predicting the distances between atoms\nand then generating a 3D structure through optimizing a distance geometry\nproblem. However, the distances predicted with such two-stage approaches may\nnot be able to consistently preserve the geometry of local atomic\nneighborhoods, making the generated structures unsatisfying. In this paper, we\npropose an end-to-end solution for molecular conformation prediction called\nConfVAE based on the conditional variational autoencoder framework.\nSpecifically, the molecular graph is first encoded in a latent space, and then\nthe 3D structures are generated by solving a principled bilevel optimization\nprogram. Extensive experiments on several benchmark data sets prove the\neffectiveness of our proposed approach over existing state-of-the-art\napproaches. Code is available at https://github.com/MinkaiXu/ConfVAE-ICML21\n

openalex-author · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

CMIM: Cross-Modal Information Maximization For Medical Imaging

In hospitals, data are siloed to specific information systems that make the same information available under different modalities such as the different medical imaging exams the patient undergoes (CT scans, MRI, PET, Ultrasound, etc.) and their associated radiology reports. This offers unique opportunities to obtain and use at train-time those multiple views of the same information that might not always be available at test-time.In this paper, we propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test-time, using recent advances in mutual information maximization. By maximizing cross-modal information at train time, we are able to outperform several state-of-the-art baselines in two different settings, medical image classification, and segmentation. In particular, our method is shown to have a strong impact on the inference-time performance of weaker modalities.

openalex-author · IEEE Transactions on Neural Networks and Learning Systems

A Two-Stream Continual Learning System With Variational Domain-Agnostic Feature Replay

Learning in nonstationary environments is one of the biggest challenges in machine learning. Nonstationarity can be caused by either task drift, i.e., the drift in the conditional distribution of labels given the input data, or the domain drift, i.e., the drift in the marginal distribution of the input data. This article aims to tackle this challenge with a modularized two-stream continual learning (CL) system, where the model is required to learn new tasks from a support stream and adapted to new domains in the query stream while maintaining previously learned knowledge. To deal with both drifts within and across the two streams, we propose a variational domain-agnostic feature replay-based approach that decouples the system into three modules: an inference module that filters the input data from the two streams into domain-agnostic representations, a generative module that facilitates the high-level knowledge transfer, and a solver module that applies the filtered and transferable knowledge to solve the queries. We demonstrate the effectiveness of our proposed approach in addressing the two fundamental scenarios and complex scenarios in two-stream CL.

openalex-author · arXiv (Cornell University)

Neural Production Systems

Visual environments are structured, consisting of distinct objects or entities. These entities have properties -- both visible and latent -- that determine the manner in which they interact with one another. To partition images into entities, deep-learning researchers have proposed structural inductive biases such as slot-based architectures. To model interactions among entities, equivariant graph neural nets (GNNs) are used, but these are not particularly well suited to the task for two reasons. First, GNNs do not predispose interactions to be sparse, as relationships among independent entities are likely to be. Second, GNNs do not factorize knowledge about interactions in an entity-conditional manner. As an alternative, we take inspiration from cognitive science and resurrect a classic approach, production systems, which consist of a set of rule templates that are applied by binding placeholder variables in the rules to specific entities. Rules are scored on their match to entities, and the best fitting rules are applied to update entity properties. In a series of experiments, we demonstrate that this architecture achieves a flexible, dynamic flow of control and serves to factorize entity-specific and rule-based information. This disentangling of knowledge achieves robust future-state prediction in rich visual environments, outperforming state-of-the-art methods using GNNs, and allows for the extrapolation from simple (few object) environments to more complex environments.

openalex-author · arXiv (Cornell University)

Coordination Among Neural Modules Through a Shared Global Workspace

Deep learning has seen a movement away from representing examples with a monolithic hidden state towards a richly structured state. For example, Transformers segment by position, and object-centric architectures decompose images into entities. In all these architectures, interactions between different elements are modeled via pairwise interactions: Transformers make use of self-attention to incorporate information from other positions; object-centric architectures make use of graph neural networks to model interactions among entities. However, pairwise interactions may not achieve global coordination or a coherent, integrated representation that can be used for downstream tasks. In cognitive science, a global workspace architecture has been proposed in which functionally specialized components share information through a common, bandwidth-limited communication channel. We explore the use of such a communication channel in the context of deep learning for modeling the structure of complex environments. The proposed method includes a shared workspace through which communication among different specialist modules takes place but due to limits on the communication bandwidth, specialist modules must compete for access. We show that capacity limitations have a rational basis in that (1) they encourage specialization and compositionality and (2) they facilitate the synchronization of otherwise independent specialists.

openalex-author · arXiv (Cornell University)

Transformers with Competitive Ensembles of Independent Mechanisms

An important development in deep learning from the earliest MLPs has been a move towards architectures with structural inductive biases which enable the model to keep distinct sources of information and routes of processing well-separated. This structure is linked to the notion of independent mechanisms from the causality literature, in which a mechanism is able to retain the same processing as irrelevant aspects of the world are changed. For example, convnets enable separation over positions, while attention-based architectures (especially Transformers) learn which combination of positions to process dynamically. In this work we explore a way in which the Transformer architecture is deficient: it represents each position with a large monolithic hidden representation and a single set of parameters which are applied over the entire hidden representation. This potentially throws unrelated sources of information together, and limits the Transformer's ability to capture independent mechanisms. To address this, we propose Transformers with Independent Mechanisms (TIM), a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention. Additionally, we propose a competition mechanism which encourages these mechanisms to specialize over time steps, and thus be more independent. We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.

openalex-author · Proceedings of the IEEE

Toward Causal Representation Learning

The two fields of machine learning and graphical causality arose and are developed separately. However, there is, now, cross-pollination and increasing interest in both fields to benefit from the advances of the other. In this article, we review fundamental concepts of causal inference and relate them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research. This also applies in the opposite direction: we note that most work in causality starts from the premise that the causal variables are given. A central problem for AI and causality is, thus, causal representation learning, that is, the discovery of high-level causal variables from low-level observations. Finally, we delineate some implications of causality for machine learning and propose key research areas at the intersection of both communities.

openalex-author · arXiv (Cornell University)

Towards Causal Representation Learning

The two fields of machine learning and graphical causality arose and developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In the present paper, we review fundamental concepts of causal inference and relate them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research. This also applies in the opposite direction: we note that most work in causality starts from the premise that the causal variables are given. A central problem for AI and causality is, thus, causal representation learning, the discovery of high-level causal variables from low-level observations. Finally, we delineate some implications of causality for machine learning and propose key research areas at the intersection of both communities.

openalex-author · arXiv (Cornell University)

Learning Neural Generative Dynamics for Molecular Conformation Generation

We study how to generate molecule conformations (i.e., 3D structures) from a molecular graph. Traditional methods, such as molecular dynamics, sample conformations via computationally expensive simulations. Recently, machine learning methods have shown great potential by training on a large collection of conformation data. Challenges arise from the limited model capacity for capturing complex distributions of conformations and the difficulty in modeling long-range dependencies between atoms. Inspired by the recent progress in deep generative models, in this paper, we propose a novel probabilistic framework to generate valid and diverse conformations given a molecular graph. We propose a method combining the advantages of both flow-based and energy-based models, enjoying: (1) a high model capacity to estimate the multimodal conformation distribution; (2) explicitly capturing the complex long-range dependencies between atoms in the observation space. Extensive experiments demonstrate the superior performance of the proposed method on several benchmarks, including conformation generation and distance modeling tasks, with a significant improvement over existing generative models for molecular conformation sampling.

openalex-author · Frontiers in Neuroscience

Scaling Equilibrium Propagation to Deep ConvNets by Drastically Reducing Its Gradient Estimator Bias

Equilibrium Propagation is a biologically-inspired algorithm that trains convergent recurrent neural networks with a local learning rule. This approach constitutes a major lead to allow learning-capable neuromophic systems and comes with strong theoretical guarantees. Equilibrium propagation operates in two phases, during which the network is let to evolve freely and then "nudged" toward a target; the weights of the network are then updated based solely on the states of the neurons that they connect. The weight updates of Equilibrium Propagation have been shown mathematically to approach those provided by Backpropagation Through Time (BPTT), the mainstream approach to train recurrent neural networks, when nudging is performed with infinitely small strength. In practice, however, the standard implementation of Equilibrium Propagation does not scale to visual tasks harder than MNIST. In this work, we show that a bias in the gradient estimate of equilibrium propagation, inherent in the use of finite nudging, is responsible for this phenomenon and that canceling it allows training deep convolutional neural networks. We show that this bias can be greatly reduced by using symmetric nudging (a positive nudging and a negative one). We also generalize Equilibrium Propagation to the case of cross-entropy loss (by opposition to squared error). As a result of these advances, we are able to achieve a test error of 11.7% on CIFAR-10, which approaches the one achieved by BPTT and provides a major improvement with respect to the standard Equilibrium Propagation that gives 86% test error. We also apply these techniques to train an architecture with unidirectional forward and backward connections, yielding a 13.2% test error. These results highlight equilibrium propagation as a compelling biologically-plausible approach to compute error gradients in deep neuromorphic systems.

openalex-author · arXiv (Cornell University)

DEUP: Direct Epistemic Uncertainty Prediction

Epistemic Uncertainty is a measure of the lack of knowledge of a learner which diminishes with more evidence. While existing work focuses on using the variance of the Bayesian posterior due to parameter uncertainty as a measure of epistemic uncertainty, we argue that this does not capture the part of lack of knowledge induced by model misspecification. We discuss how the excess risk, which is the gap between the generalization error of a predictor and the Bayes predictor, is a sound measure of epistemic uncertainty which captures the effect of model misspecification. We thus propose a principled framework for directly estimating the excess risk by learning a secondary predictor for the generalization error and subtracting an estimate of aleatoric uncertainty, i.e., intrinsic unpredictability. We discuss the merits of this novel measure of epistemic uncertainty, and highlight how it differs from variance-based measures of epistemic uncertainty and addresses its major pitfall. Our framework, Direct Epistemic Uncertainty Prediction (DEUP) is particularly interesting in interactive learning environments, where the learner is allowed to acquire novel examples in each round. Through a wide set of experiments, we illustrate how existing methods in sequential model optimization can be improved with epistemic uncertainty estimates from DEUP, and how DEUP can be used to drive exploration in reinforcement learning. We also evaluate the quality of uncertainty estimates from DEUP for probabilistic image classification and predicting synergies of drug combinations.

openalex-author · arXiv (Cornell University)

Structured Sparsity Inducing Adaptive Optimizers for Deep Learning

The parameters of a neural network are naturally organized in groups, some of which might not contribute to its overall performance. To prune out unimportant groups of parameters, we can include some non-differentiable penalty to the objective function, and minimize it using proximal gradient methods. In this paper, we derive the weighted proximal operator, which is a necessary component of these proximal methods, of two structured sparsity inducing penalties. Moreover, they can be approximated efficiently with a numerical solver, and despite this approximation, we prove that existing convergence guarantees are preserved when these operators are integrated as part of a generic adaptive proximal method. Finally, we show that this adaptive method, together with the weighted proximal operators derived here, is indeed capable of finding solutions with structure in their sparsity patterns, on representative examples from computer vision and natural language processing.

openalex-author · Harvard Dataverse

BiasCorp

The dataset for the Bias Project @ Mila

openalex-author · bioRxiv

Learning from unexpected events in the neocortical microcircuit

Abstract Scientists have long conjectured that the neocortex learns the structure of the environment in a predictive, hierarchical manner. According to this conjecture, expected, predictable features are differentiated from unexpected ones by comparing bottom-up and top-down streams of information. It is theorized that the neocortex then changes the representation of incoming stimuli, guided by differences in the responses to expected and unexpected events. In line with this conjecture, different responses to expected and unexpected sensory features have been observed in spiking and somatic calcium events. However, it remains unknown whether these unexpected event signals occur in the distal apical dendrites where many top-down signals are received, and whether these signals govern subsequent changes in the brain’s stimulus representations. Here, we show that both somata and distal apical dendrites of cortical pyramidal neurons exhibit distinct unexpected event signals that systematically change over days. These findings were obtained by tracking the responses of individual somata and dendritic branches of layer 2/3 and layer 5 pyramidal neurons over multiple days in primary visual cortex of awake, behaving mice using two-photon calcium imaging. Many neurons in both layers 2/3 and 5 showed large differences between their responses to expected and unexpected events. Interestingly, these responses evolved in opposite directions in the somata and distal apical dendrites. These differences between the somata and distal apical dendrites may be important for hierarchical computation, given that these two compartments tend to receive bottom-up and top-down information, respectively.

openalex-author · 2020 25th International Conference on Pattern Recognition (ICPR)

Attention Based Pruning for Shift Networks

In many application domains such as computer vision, Convolutional Layers (CLs) are key to the accuracy of deep learning methods. However, it is often required to assemble a large number of CLs, each containing thousands of parameters, in order to reach state-of-the-art accuracy, thus resulting in complex and demanding systems that are poorly fitted to resource-limited devices. Recently, methods have been proposed to replace the generic convolution operator by the combination of a shift operation and a simpler 1×1 convolution. The resulting block, called Shift Layer (SL), is an efficient alternative to CLs in the sense it allows to reach similar accuracies on various tasks with faster computations and fewer parameters. In this contribution, we introduce Shift Attention Layers (SALs), which extend SLs by using an attention mechanism that learns which shifts are the best at the same time the network function is trained. We demonstrate SALs are able to outperform vanilla SLs (and CLs) on various object recognition benchmarks while significantly reducing the number of float operations and parameters for the inference.

openalex-author · Figshare

JEDI Billion Molecules against Covid-19: compounds synthesized

On May 4th 2020 a GrandChallenge entitled "Billion molecules against Covid-19" was launched by the Joint European Disruptive Initiative (JEDI). Teams from all over the world provided lists of compounds with suspected inhibitory properties versus specific SARS-CoV-2 and human host protein targets. Here we disclose 878 compounds, which were selected by the 20 top-ranked teams ('finalists'). Some finalists teams did not have compounds produced, because they were either unstable or synthetically unfeasible. The compounds were synthesized thanks to JEDI fundraising, and are in the process of being tested in biophysical assays, cell-based assays, etc. Purity data for each compound by LC-MS (liquid chromatography mass spectrometry) is provided. The data file contains SMILES information, molecular weight, purity, and solubility information (in DMSO dimethylsulfoxide). The compounds should be active against either one of the following SARS-CoV-2 protein targets: nsp3, nsp5, nucleocapsid (N), Spike (S), or RdRp (Nsp12), or against human host target TMPRSS2. Authors are listed in random order. Some co-authors are members of the Scientific Committee or organizing team of the GrandChallenge. All work has been performed with an open science mindset, and all data will be shared in the public domain. A publication is in preparation to disclose the experimental results. In version 1 of this data set it is likely that some co-author data is missing, which will be corrected in following versions. The data has been made public by the program manager of the GrandChallenge, Prof. Thomas M. Hermans. Author contributions are also available in spreadsheet format. Please contact [email protected] for more information, or check https://www.jedi.foundation/ About JEDI - The Joint European Disruptive Initiative (JEDI) seeks to bring Europe in a leadership position in breakthrough technologies by launching Scientific and Technology GrandChallenges to push the frontiers of innovation. It is the European advanced research projects agency (ARPA) with a radical method based on excellence, speed, high expectations, bold risk-taking & no geographical return to solve major societal issues in environment, healthcare, digital, education, oceans & space. Driven by humanistic values, JEDI is a not-for-profit organization financed by foundations, governments, corporates and powered by the contribution of 4.100 technology and scientific leaders from academia, corporations and deeptech startups in 29 countries in Europe.

openalex-author · Archivio istituzionale della ricerca (Alma Mater Studiorum Università di Bologna)

Machine learning for combinatorial optimization: A methodological tour d'horizon

This paper surveys the recent attempts, both from the machine learning and operations research communities, at leveraging machine learning to solve combinatorial optimization problems. Given the hard nature of these problems, state-of-the-art algorithms rely on handcrafted heuristics for making decisions that are otherwise too expensive to compute or mathematically not well defined. Thus, machine learning looks like a natural candidate to make such decisions in a more principled and optimized way. We advocate for pushing further the integration of machine learning and combinatorial optimization and detail a methodology to do so. A main point of the paper is seeing generic optimization problems as data points and inquiring what is the relevant distribution of problems to use for learning on a given task.

openalex-author · IEEE Computer Graphics and Applications

Using Artificial Intelligence to Visualize the Impacts of Climate Change

Public awareness and concern about climate change often do not match the magnitude of its threat to humans and our environment. One reason for this disagreement is that it is difficult to mentally simulate the effects of a process as complex as climate change and to have a concrete representation of the impact that our individual actions will have on our own future, especially if the consequences are long term and abstract. To overcome these challenges, we propose to use cutting-edge artificial intelligence (AI) approaches to develop an interactive personalized visualization tool, the AI climate impact visualizer. It will allow a user to enter an address-be it their house, their school, or their workplace--and it will provide them with an AI-imagined possible visualization of the future of this location in 2050 following the detrimental effects of climate change such as floods, storms, and wildfires. This image will be accompanied by accessible information regarding the science behind climate change, i.e., why extreme weather events are becoming more frequent and what kinds of changes are happening on a local and global scale.

openalex-author · Lecture Notes in Computer Science

A Comparative Study of Learning Outcomes for Online Learning Platforms

No abstract available from the OpenAlex source record.

openalex-author · Paper

Discrete-Valued Neural Communication in Structured Architectures Enhances Generalization

No abstract available from the OpenAlex source record.

openalex-author · Paper

Machine Learning for Glacier Monitoring in the Hindu Kush Himalaya.

Glacier mapping is key to ecological monitoring in the hkh region. Climate change poses a risk to individuals whose livelihoods depend on the health of glacier ecosystems. In this work, we present a machine learning based approach to support ecological monitoring, with a focus on glaciers. Our approach is based on semi-automated mapping from satellite images. We utilize readily available remote sensing data to create a model to identify and outline both clean ice and debris-covered glaciers from satellite imagery. We also release data and develop a web tool that allows experts to visualize and correct model predictions, with the ultimate aim of accelerating the glacier mapping process.

openalex-author · Informatica : Journal of Applied Machines Electrical Electronics Computer Science and Communication Systems

Establishing an evaluation metric to quantify climate change image realism

With success on controlled tasks, deep generative models are being increasingly applied to humanitarian applications . In this paper, we focus on the evaluation of a conditional generative model that illustrates the consequences of climate change-induced flooding to encourage public interest and awareness on the issue. Because metrics for comparing the realism of different modes in a conditional generative model do not exist, we propose several automated and human-based methods for evaluation. To do this, we adapt several existing metrics and assess the automated metrics against gold standard human evaluation. We find that using Fréchet Inception Distance with embeddings from an intermediary Inception-v3 layer that precedes the auxiliary classifier produces results most correlated with human realism. While insufficient alone to establish a human-correlated automatic evaluation metric, we believe this work begins to bridge the gap between human and automated generative evaluation procedures, and to generate more realistic images of the future consequences of climate change.

openalex-author · arXiv (Cornell University)

RetroGNN: Approximating Retrosynthesis by Graph Neural Networks for De Novo Drug Design

De novo molecule generation often results in chemically unfeasible molecules. A natural idea to mitigate this problem is to bias the search process towards more easily synthesizable molecules using a proxy for synthetic accessibility. However, using currently available proxies still results in highly unrealistic compounds. We investigate the feasibility of training deep graph neural networks to approximate the outputs of a retrosynthesis planning software, and their use to bias the search process. We evaluate our method on a benchmark involving searching for drug-like molecules with antibiotic properties. Compared to enumerating over five million existing molecules from the ZINC database, our approach finds molecules predicted to be more likely to be antibiotics while maintaining good drug-like properties and being easily synthesizable. Importantly, our deep neural network can successfully filter out hard to synthesize molecules while achieving a $10^5$ times speed-up over using the retrosynthesis planning software.

openalex-author · arXiv (Cornell University)

Gradient Starvation: A Learning Proclivity in Neural Networks

We identify and formalize a fundamental gradient descent phenomenon resulting in a learning proclivity in over-parameterized neural networks. Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task, despite the presence of other predictive features that fail to be discovered. This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks. Using tools from Dynamical Systems theory, we identify simple properties of learning dynamics during gradient descent that lead to this imbalance, and prove that such a situation can be expected given certain statistical structure in training data. Based on our proposed formalism, we develop guarantees for a novel regularization method aimed at decoupling feature learning dynamics, improving accuracy and robustness in cases hindered by gradient starvation. We illustrate our findings with simple and real-world out-of-distribution (OOD) generalization experiments.

openalex-author · arXiv (Cornell University)

COVI-AgentSim: an Agent-based Model for Evaluating Methods of Digital Contact Tracing

The rapid global spread of COVID-19 has led to an unprecedented demand for effective methods to mitigate the spread of the disease, and various digital contact tracing (DCT) methods have emerged as a component of the solution. In order to make informed public health choices, there is a need for tools which allow evaluation and comparison of DCT methods. We introduce an agent-based compartmental simulator we call COVI-AgentSim, integrating detailed consideration of virology, disease progression, social contact networks, and mobility patterns, based on parameters derived from empirical research. We verify by comparing to real data that COVI-AgentSim is able to reproduce realistic COVID-19 spread dynamics, and perform a sensitivity analysis to verify that the relative performance of contact tracing methods are consistent across a range of settings. We use COVI-AgentSim to perform cost-benefit analyses comparing no DCT to: 1) standard binary contact tracing (BCT) that assigns binary recommendations based on binary test results; and 2) a rule-based method for feature-based contact tracing (FCT) that assigns a graded level of recommendation based on diverse individual features. We find all DCT methods consistently reduce the spread of the disease, and that the advantage of FCT over BCT is maintained over a wide range of adoption rates. Feature-based methods of contact tracing avert more disability-adjusted life years (DALYs) per socioeconomic cost (measured by productive hours lost). Our results suggest any DCT method can help save lives, support re-opening of economies, and prevent second-wave outbreaks, and that FCT methods are a promising direction for enriching BCT using self-reported symptoms, yielding earlier warning signals and a significantly reduced spread of the virus per socioeconomic cost.

openalex-author · Journal of Artificial Intelligence Research

The Bottleneck Simulator: A Model-Based Deep Reinforcement Learning Approach

Deep reinforcement learning has recently shown many impressive successes. However, one major obstacle towards applying such methods to real-world problems is their lack of data-efficiency. To this end, we propose the Bottleneck Simulator: a model-based reinforcement learning method which combines a learned, factorized transition model of the environment with rollout simulations to learn an effective policy from few examples. The learned transition model employs an abstract, discrete (bottleneck) state, which increases sample efficiency by reducing the number of model parameters and by exploiting structural properties of the environment. We provide a mathematical analysis of the Bottleneck Simulator in terms of fixed points of the learned policy, which reveals how performance is affected by four distinct sources of error: an error related to the abstract space structure, an error related to the transition model estimation variance, an error related to the transition model estimation bias, and an error related to the transition model class bias. Finally, we evaluate the Bottleneck Simulator on two natural language processing tasks: a text adventure game and a real-world, complex dialogue response selection task. On both tasks, the Bottleneck Simulator yields excellent performance beating competing approaches.

openalex-author · PolyPublie (École Polytechnique de Montréal)

Predicting Infectiousness for Proactive Contact Tracing

The COVID-19 pandemic has spread rapidly worldwide, overwhelming manual contact tracing in many countries and resulting in widespread lockdowns for emergency containment. Large-scale digital contact tracing (DCT) has emerged as a potential solution to resume economic and social activity while minimizing spread of the virus. Various DCT methods have been proposed, each making trade-offs between privacy, mobility restrictions, and public health. The most common approach, binary contact tracing (BCT), models infection as a binary event, informed only by an individual's test results, with corresponding binary recommendations that either all or none of the individual's contacts quarantine. BCT ignores the inherent uncertainty in contacts and the infection process, which could be used to tailor messaging to high-risk individuals, and prompt proactive testing or earlier warnings. It also does not make use of observations such as symptoms or pre-existing medical conditions, which could be used to make more accurate infectiousness predictions. In this paper, we use a recently-proposed COVID-19 epidemiological simulator to develop and test methods that can be deployed to a smartphone to locally and proactively predict an individual's infectiousness (risk of infecting others) based on their contact history and other information, while respecting strong privacy constraints. Predictions are used to provide personalized recommendations to the individual via an app, as well as to send anonymized messages to the individual's contacts, who use this information to better predict their own infectiousness, an approach we call proactive contact tracing (PCT). We find a deep-learning based PCT method which improves over BCT for equivalent average mobility, suggesting PCT could help in safe re-opening and second-wave prevention.

openalex-author · Communications of the ACM

Generative adversarial networks

Generative adversarial networks are a kind of artificial intelligence algorithm designed to solve the generative modeling problem. The goal of a generative model is to study a collection of training examples and learn the probability distribution that generated them. Generative Adversarial Networks (GANs) are then able to generate more examples from the estimated probability distribution. Generative models based on deep learning are common, but GANs are among the most successful generative models (especially in terms of their ability to generate realistic high-resolution images). GANs have been successfully applied to a wide variety of tasks (mostly in research settings) but continue to present unique challenges and research opportunities because they are based on game theory while most other approaches to generative modeling are based on optimization.

openalex-author · arXiv (Cornell University)

NU-GAN: High resolution neural upsampling with GAN

In this paper, we propose NU-GAN, a new method for resampling audio from lower to higher sampling rates (upsampling). Audio upsampling is an important problem since productionizing generative speech technology requires operating at high sampling rates. Such applications use audio at a resolution of 44.1 kHz or 48 kHz, whereas current speech synthesis methods are equipped to handle a maximum of 24 kHz resolution. NU-GAN takes a leap towards solving audio upsampling as a separate component in the text-to-speech (TTS) pipeline by leveraging techniques for audio generation using GANs. ABX preference tests indicate that our NU-GAN resampler is capable of resampling 22 kHz to 44.1 kHz audio that is distinguishable from original audio only 7.4% higher than random chance for single speaker dataset, and 10.8% higher than chance for multi-speaker dataset.

openalex-author · arXiv (Cornell University)

Cross-Modal Information Maximization for Medical Imaging: CMIM

In hospitals, data are siloed to specific information systems that make the same information available under different modalities such as the different medical imaging exams the patient undergoes (CT scans, MRI, PET, Ultrasound, etc.) and their associated radiology reports. This offers unique opportunities to obtain and use at train-time those multiple views of the same information that might not always be available at test-time. In this paper, we propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test-time, using recent advances in mutual information maximization. By maximizing cross-modal information at train time, we are able to outperform several state-of-the-art baselines in two different settings, medical image classification, and segmentation. In particular, our method is shown to have a strong impact on the inference-time performance of weaker modalities.

openalex-author · arXiv (Cornell University)

Neural Function Modules with Sparse Arguments: A Dynamic Approach to\n Integrating Information across Layers

Feed-forward neural networks consist of a sequence of layers, in which each\nlayer performs some processing on the information from the previous layer. A\ndownside to this approach is that each layer (or module, as multiple modules\ncan operate in parallel) is tasked with processing the entire hidden state,\nrather than a particular part of the state which is most relevant for that\nmodule. Methods which only operate on a small number of input variables are an\nessential part of most programming languages, and they allow for improved\nmodularity and code re-usability. Our proposed method, Neural Function Modules\n(NFM), aims to introduce the same structural capability into deep learning.\nMost of the work in the context of feed-forward networks combining top-down and\nbottom-up feedback is limited to classification problems. The key contribution\nof our work is to combine attention, sparsity, top-down and bottom-up feedback,\nin a flexible algorithm which, as we show, improves the results in standard\nclassification, out-of-domain generalization, generative modeling, and learning\nrepresentations in the context of reinforcement learning.\n

openalex-author · arXiv (Cornell University)

CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning

Despite recent successes of reinforcement learning (RL), it remains a challenge for agents to transfer learned skills to related environments. To facilitate research addressing this problem, we propose CausalWorld, a benchmark for causal structure and transfer learning in a robotic manipulation environment. The environment is a simulation of an open-source robotic platform, hence offering the possibility of sim-to-real transfer. Tasks consist of constructing 3D shapes from a given set of blocks - inspired by how children learn to build complex structures. The key strength of CausalWorld is that it provides a combinatorial family of such tasks with common causal structure and underlying factors (including, e.g., robot and object masses, colors, sizes). The user (or the agent) may intervene on all causal variables, which allows for fine-grained control over how similar different tasks (or task distributions) are. One can thus easily define training and evaluation distributions of a desired difficulty level, targeting a specific form of generalization (e.g., only changes in appearance or object mass). Further, this common parametrization facilitates defining curricula by interpolating between an initial and a target task. While users may define their own task distributions, we present eight meaningful distributions as concrete benchmarks, ranging from simple to very challenging, all of which require long-horizon planning as well as precise low-level motor control. Finally, we provide baseline results for a subset of these tasks on distinct training curricula and corresponding evaluation protocols, verifying the feasibility of the tasks in this benchmark.

openalex-author · arXiv (Cornell University)

RNNLogic: Learning Logic Rules for Reasoning on Knowledge Graphs

This paper studies learning logic rules for reasoning on knowledge graphs. Logic rules provide interpretable explanations when used for prediction as well as being able to generalize to other tasks, and hence are critical to learn. Existing methods either suffer from the problem of searching in a large search space (e.g., neural logic programming) or ineffective optimization due to sparse rewards (e.g., techniques based on reinforcement learning). To address these limitations, this paper proposes a probabilistic model called RNNLogic. RNNLogic treats logic rules as a latent variable, and simultaneously trains a rule generator as well as a reasoning predictor with logic rules. We develop an EM-based algorithm for optimization. In each iteration, the reasoning predictor is first updated to explore some generated logic rules for reasoning. Then in the E-step, we select a set of high-quality rules from all generated rules with both the rule generator and reasoning predictor via posterior inference; and in the M-step, the rule generator is updated with the rules selected in the E-step. Experiments on four datasets prove the effectiveness of RNNLogic.

openalex-author · The Journal of Physical Chemistry Letters

Generating Multiscale Amorphous Molecular Structures Using Deep Learning: A Study in 2D

Amorphous molecular assemblies appear in a vast array of systems: from living cells to chemical plants and from everyday items to new devices. The absence of long-range order in amorphous materials implies that precise knowledge of their underlying structures throughout is needed to rationalize and control their properties at the mesoscale. Standard computational simulations suffer from exponentially unfavorable scaling of the required compute with system size. We present a method based on deep learning that leverages the finite range of structural correlations for an autoregressive generation of disordered molecular aggregates up to arbitrary size from small-scale computational or experimental samples. We benchmark performance on self-assembled nanoparticle aggregates and proceed to simulate monolayer amorphous carbon with atomistic resolution. This method bridges the gap between the nanoscale and mesoscale simulations of amorphous molecular systems.

openalex-author · arXiv (Cornell University)

Mastering Rate based Curriculum Learning

Recent automatic curriculum learning algorithms, and in particular Teacher-Student algorithms, rely on the notion of learning progress, making the assumption that the good next tasks are the ones on which the learner is making the fastest progress or digress. In this work, we first propose a simpler and improved version of these algorithms. We then argue that the notion of learning progress itself has several shortcomings that lead to a low sample efficiency for the learner. We finally propose a new algorithm, based on the notion of mastering rate, that significantly outperforms learning progress-based algorithms.

openalex-author · arXiv (Cornell University)

Deriving Differential Target Propagation from Iterating Approximate Inverses

We show that a particular form of target propagation, i.e., relying on learned inverses of each layer, which is differential, i.e., where the target is a small perturbation of the forward propagation, gives rise to an update rule which corresponds to an approximate Gauss-Newton gradient-based optimization, without requiring the manipulation or inversion of large matrices. What is interesting is that this is more biologically plausible than back-propagation yet may turn out to implicitly provide a stronger optimization procedure. Extending difference target propagation, we consider several iterative calculations based on local auto-encoders at each layer in order to achieve more precise inversions for more accurate target propagation and we show that these iterative procedures converge exponentially fast if the auto-encoding function minus the identity function has a Lipschitz constant smaller than one, i.e., the auto-encoder is coarsely succeeding at performing an inversion. We also propose a way to normalize the changes at each layer to take into account the relative influence of each layer on the output, so that larger weight changes are done on more influential layers, like would happen in ordinary back-propagation with gradient descent.

openalex-author · Cureus

Predicting COVID-19 Pneumonia Severity on Chest X-ray With Deep Learning

Introduction The need to streamline patient management for coronavirus disease-19 (COVID-19) has become more pressing than ever. Chest X-rays (CXRs) provide a non-invasive (potentially bedside) tool to monitor the progression of the disease. In this study, we present a severity score prediction model for COVID-19 pneumonia for frontal chest X-ray images. Such a tool can gauge the severity of COVID-19 lung infections (and pneumonia in general) that can be used for escalation or de-escalation of care as well as monitoring treatment efficacy, especially in the ICU. Methods Images from a public COVID-19 database were scored retrospectively by three blinded experts in terms of the extent of lung involvement as well as the degree of opacity. A neural network model that was pre-trained on large (non-COVID-19) chest X-ray datasets is used to construct features for COVID-19 images which are predictive for our task. Results This study finds that training a regression model on a subset of the outputs from this pre-trained chest X-ray model predicts our geographic extent score (range 0-8) with 1.14 mean absolute error (MAE) and our lung opacity score (range 0-6) with 0.78 MAE. Conclusions These results indicate that our model's ability to gauge the severity of COVID-19 lung infections could be used for escalation or de-escalation of care as well as monitoring treatment efficacy, especially in the ICU. To enable follow up work, we make our code, labels, and data available online.

openalex-author · arXiv (Cornell University)

BabyAI 1.1

The BabyAI platform is designed to measure the sample efficiency of training an agent to follow grounded-language instructions. BabyAI 1.0 presents baseline results of an agent trained by deep imitation or reinforcement learning. BabyAI 1.1 improves the agent's architecture in three minor ways. This increases reinforcement learning sample efficiency by up to 3 times and improves imitation learning performance on the hardest level from 77 % to 90.4 %. We hope that these improvements increase the computational efficiency of BabyAI experiments and help users design better agents.

openalex-author · arXiv (Cornell University)

S2RMs: Spatially Structured Recurrent Modules

Capturing the structure of a data-generating process by means of appropriate inductive biases can help in learning models that generalize well and are robust to changes in the input distribution. While methods that harness spatial and temporal structures find broad application, recent work has demonstrated the potential of models that leverage sparse and modular structure using an ensemble of sparingly interacting modules. In this work, we take a step towards dynamic models that are capable of simultaneously exploiting both modular and spatiotemporal structures. We accomplish this by abstracting the modeled dynamical system as a collection of autonomous but sparsely interacting sub-systems. The sub-systems interact according to a topology that is learned, but also informed by the spatial structure of the underlying real-world system. This results in a class of models that are well suited for modeling the dynamics of systems that only offer local views into their state, along with corresponding spatial locations of those views. On the tasks of video prediction from cropped frames and multi-agent world modeling from partial observations in the challenging Starcraft2 domain, we find our models to be more robust to the number of available views and better capable of generalization to novel tasks without additional training, even when compared against strong baselines that perform equally well or better on the training distribution.

openalex-author · arXiv (Cornell University)

Revisiting Fundamentals of Experience Replay

Experience replay is central to off-policy algorithms in deep reinforcement learning (RL), but there remain significant gaps in our understanding. We therefore present a systematic and extensive analysis of experience replay in Q-learning methods, focusing on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected (replay ratio). Our additive and ablative studies upend conventional wisdom around experience replay -- greater capacity is found to substantially increase the performance of certain algorithms, while leaving others unaffected. Counterintuitively we show that theoretically ungrounded, uncorrected n-step returns are uniquely beneficial while other techniques confer limited benefit for sifting through larger memory. Separately, by directly controlling the replay ratio we contextualize previous observations in the literature and empirically measure its importance across a variety of deep RL algorithms. Finally, we conclude by testing a set of hypotheses on the nature of these performance benefits.

openalex-author · International Conference on Machine Learning

Small-GAN: Speeding up GAN Training using Core-Sets

Recent work by Brock et al. (2018) suggests that Generative Adversarial Networks (GANs) benefit disproportionately from large mini-batch sizes. Unfortunately, using large batches is slow and expensive on conventional hardware. Thus, it would be nice if we could generate batches that were effectively large though actually small. In this work, we propose a method to do this, inspired by the use of Coreset-selection in active learning. When training a GAN, we draw a large batch of samples from the prior and then compress that batch using Coreset-selection. To create effectively large batches of 'real' images, we create a cached dataset of Inception activations of each training image, randomly project them down to a smaller dimension, and then use Coreset-selection on those projected activations at training time. We conduct experiments showing that this technique substantially reduces training time and memory usage for modern GAN variants, that it reduces the fraction of dropped modes in a synthetic dataset, and that it allows GANs to reach a new state of the art in anomaly detection.

openalex-author · arXiv (Cornell University)

Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention over Modules

Robust perception relies on both bottom-up and top-down signals. Bottom-up signals consist of what's directly observed through sensation. Top-down signals consist of beliefs and expectations based on past experience and short-term memory, such as how the phrase `peanut butter and~...' will be completed. The optimal combination of bottom-up and top-down information remains an open question, but the manner of combination must be dynamic and both context and task dependent. To effectively utilize the wealth of potential top-down information available, and to prevent the cacophony of intermixed signals in a bidirectional architecture, mechanisms are needed to restrict information flow. We explore deep recurrent neural net architectures in which bottom-up and top-down signals are dynamically combined using attention. Modularity of the architecture further restricts the sharing and communication of information. Together, attention and modularity direct information flow, which leads to reliable performance improvements in perceptual and language tasks, and in particular improves robustness to distractions and noisy data. We demonstrate on a variety of benchmarks in language modeling, sequential image classification, video prediction and reinforcement learning that the \emph{bidirectional} information flow can improve results over strong baselines.

openalex-author · arXiv (Cornell University)

Object Files and Schemata: Factorizing Declarative and Procedural\n Knowledge in Dynamical Systems

Modeling a structured, dynamic environment like a video game requires keeping\ntrack of the objects and their states declarative knowledge) as well as\npredicting how objects behave (procedural knowledge). Black-box models with a\nmonolithic hidden state often fail to apply procedural knowledge consistently\nand uniformly, i.e., they lack systematicity. For example, in a video game,\ncorrect prediction of one enemy's trajectory does not ensure correct prediction\nof another's. We address this issue via an architecture that factorizes\ndeclarative and procedural knowledge and that imposes modularity within each\nform of knowledge. The architecture consists of active modules called object\nfiles that maintain the state of a single object and invoke passive external\nknowledge sources called schemata that prescribe state updates. To use a video\ngame as an illustration, two enemies of the same type will share schemata but\nwill have separate object files to encode their distinct state (e.g., health,\nposition). We propose to use attention to determine which object files to\nupdate, the selection of schemata, and the propagation of information between\nobject files. The resulting architecture is a drop-in replacement conforming to\nthe same input-output interface as normal recurrent networks (e.g., LSTM, GRU)\nyet achieves substantially better generalization on environments that have\nmultiple object tokens of the same type, including a challenging intuitive\nphysics benchmark.\n

openalex-author · Journal of the American Medical Informatics Association

Inherent privacy limitations of decentralized contact tracing apps

Recently, there have been many efforts to use mobile apps as an aid in contact tracing to control the spread of the SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) (COVID-19 [coronavirus disease 2019]) pandemic. However, although many apps aim to protect individual privacy, the very nature of contact tracing must reveal some otherwise protected personal information. Digital contact tracing has endemic privacy risks that cannot be removed by technological means, and which may require legal or economic solutions. In this brief communication, we discuss a few of these inherent privacy limitations of any decentralized automatic contact tracing system.

openalex-author · arXiv (Cornell University)

Image-to-image Mapping with Many Domains by Sparse Attribute Transfer

Unsupervised image-to-image translation consists of learning a pair of mappings between two domains without known pairwise correspondences between points. The current convention is to approach this task with cycle-consistent GANs: using a discriminator to encourage the generator to change the image to match the target domain, while training the generator to be inverted with another mapping. While ending up with paired inverse functions may be a good end result, enforcing this restriction at all times during training can be a hindrance to effective modeling. We propose an alternate approach that directly restricts the generator to performing a simple sparse transformation in a latent layer, motivated by recent work from cognitive neuroscience suggesting an architectural prior on representations corresponding to consciousness. Our biologically motivated approach leads to representations more amenable to transformation by disentangling high-level abstract concepts in the latent space. We demonstrate that image-to-image domain translation with many different domains can be learned more effectively with our architecturally constrained, simple transformation than with previous unconstrained architectures that rely on a cycle-consistency loss.

openalex-author · arXiv (Cornell University)

Untangling tradeoffs between recurrence and self-attention in neural networks

Attention and self-attention mechanisms, are now central to state-of-the-art deep learning on sequential tasks. However, most recent progress hinges on heuristic approaches with limited understanding of attention's role in model optimization and computation, and rely on considerable memory and computational resources that scale poorly. In this work, we present a formal analysis of how self-attention affects gradient propagation in recurrent networks, and prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies by establishing concrete bounds for gradient norms. Building on these results, we propose a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence. While providing guarantees to avoid vanishing gradients, we use simple numerical experiments to demonstrate the tradeoffs in performance and computational resources by efficiently balancing attention and recurrence. Based on our results, we propose a concrete direction of research to improve scalability of attentive networks.

openalex-author · arXiv (Cornell University)

Learning Causal Models Online

Predictive models -- learned from observational data not covering the complete data distribution -- can rely on spurious correlations in the data for making predictions. These correlations make the models brittle and hinder generalization. One solution for achieving strong generalization is to incorporate causal structures in the models; such structures constrain learning by ignoring correlations that contradict them. However, learning these structures is a hard problem in itself. Moreover, it's not clear how to incorporate the machinery of causality with online continual learning. In this work, we take an indirect approach to discovering causal models. Instead of searching for the true causal model directly, we propose an online algorithm that continually detects and removes spurious features. Our algorithm works on the idea that the correlation of a spurious feature with a target is not constant over-time. As a result, the weight associated with that feature is constantly changing. We show that by continually removing such features, our method converges to solutions that have strong generalization. Moreover, our method combined with random search can also discover non-spurious features from raw sensory data. Finally, our work highlights that the information present in the temporal structure of the problem -- destroyed by shuffling the data -- is essential for detecting spurious features online.

openalex-author · arXiv (Cornell University)

Training End-to-End Analog Neural Networks with Equilibrium Propagation

We introduce a principled method to train end-to-end analog neural networks by stochastic gradient descent. In these analog neural networks, the weights to be adjusted are implemented by the conductances of programmable resistive devices such as memristors [Chua, 1971], and the nonlinear transfer functions (or `activation functions') are implemented by nonlinear components such as diodes. We show mathematically that a class of analog neural networks (called nonlinear resistive networks) are energy-based models: they possess an energy function as a consequence of Kirchhoff's laws governing electrical circuits. This property enables us to train them using the Equilibrium Propagation framework [Scellier and Bengio, 2017]. Our update rule for each conductance, which is local and relies solely on the voltage drop across the corresponding resistor, is shown to compute the gradient of the loss function. Our numerical simulations, which use the SPICE-based Spectre simulation framework to simulate the dynamics of electrical circuits, demonstrate training on the MNIST classification task, performing comparably or better than equivalent-size software-based neural networks. Our work can guide the development of a new generation of ultra-fast, compact and low-power neural networks supporting on-chip learning.

openalex-author · The Lancet Digital Health

The need for privacy with public digital contact tracing during the COVID-19 pandemic

Digital contact tracing applications represent a powerful yet controversial strategy to combat the COVID-19 pandemic. Manual contact tracing has important challenges, not limited to recall bias and delays in communicating with high-risk contacts.1Sun K Viboud C Impact of contact tracing on SARS-CoV-2 transmission.Lancet Infect Dis. 2020; (published online April 27.)https://doi.org/10.1016/S1473-3099(20)30357-1Summary Full Text Full Text PDF Scopus (42) Google Scholar Digital technologies are already increasingly used in the context of health-care delivery and clinical trials.2Sharma A Harrington RA McClellan MB et al.Using digital health technology to better generate evidence and deliver evidence-based care.J Am Coll Cardiol. 2018; 71: 2680-2690Crossref PubMed Scopus (159) Google Scholar Due to the considerable strain on public health institutions, digital contact tracing through mobile phones is being used or explored in a growing number of countries despite concerns raised over individual privacy and state surveillance.3Servick K Cellphone tracking could help stem the spread of coronavirus. Is privacy the price?.Science. 2020; (published online March 22.)DOI:10.1126/science.abb8296Google Scholar Mobile phone-enabled digital contact tracing colocalises individuals in time and space through the use of GPS, Bluetooth, or other such technologies. Google and Apple have promised to provide frameworks for how to use their technologies for contact tracing.4AppleGooglePrivacy-preserving contact tracing.https://www.apple.com/covid19/contacttracingDate accessed: May 11, 2020Google Scholar A digital contact trail can be created when individuals who have downloaded such applications come into physical proximity. Machine-learning strategies5Lecun Y Bengio Y Hinton G Deep learning.Nature. 2015; 521: 436-444Crossref PubMed Scopus (54998) Google Scholar can improve on simple binary contact tracing systems by providing methods to calculate quantifiable individual risk of acquiring COVID-19 depending on specific features such as distance and duration of interaction, self-reported comorbidities, demographics, and the presence of any symptoms in each individual in an interaction. As an individual's risk level for acquiring COVID-19 increases, various behavioural messages can be delivered quickly to enable the individual to take appropriate, measured action. These multiple advantages have the potential to establish rapid epidemiological control of the pandemic.6Ferretti L Wymant C Kendall M et al.Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing.Science. 2020; 368eabb6936Crossref PubMed Scopus (1683) Google Scholar Despite the potential advantages, most of the applications in use or under consideration have an impact on individual privacy that democratic societies would normally consider to be unacceptably high. In a free and democratic society, there are major concerns regarding privacy. The UK, Australia, Singapore,3Servick K Cellphone tracking could help stem the spread of coronavirus. Is privacy the price?.Science. 2020; (published online March 22.)DOI:10.1126/science.abb8296Google Scholar South Korea, and other countries have deployed such tools (using binary variables of contact, not scalar risk probabilities for risk of infection); however, these applications have come under scrutiny relating to the ability of governments and other groups to access personal information.7Lin L Martin TW How coronavirus is eroding privacy.https://www.wsj.com/articles/coronavirus-paves-way-for-new-age-of-digital-surveillance-11586963028Date: April 15, 2020Date accessed: May 11, 2020Google Scholar Public trust in the use of these applications is paramount because widespread adoption of these technologies is needed to be effective in curbing viral transmission. Indiscriminate collection of personal information, chronic privacy breaches, and lax attitudes towards individual privacy in the private sector have eroded public trust in digital technologies. Moreover, tracing applications raise the spectre of generalised state surveillance in the face of the pandemic, with potentially devastating consequences if democratic societies learn to accept such an intrusion on civil liberties.8Harari YN Yuval Noah Harari: the world after coronavirus.https://www.ft.com/content/19d90308-6858-11ea-a3c9-1fe6fedcca75Date: March 20, 2020Date accessed: May 11, 2020Google Scholar Therefore, to counteract both negative perceptions and genuine threats, a privacy-protecting approach must be central in the development of such a contact tracing application. Several strategies can be leveraged to increase and maintain the public trust with such applications (panel). Express consent at each step of data sharing is crucial and must be meaningful, not buried within lengthy privacy policies or vague language agreements, and includes express consent to anonymously share COVID-19 test results. No identifiable data should be shared with any public institution or private enterprise. Pseudonymised or aggregate data can be adequately used to develop machine-learning and epidemiological models and inform public policy. Otherwise data should be kept encrypted on users' devices and inaccessible to public authorities or private interests. The tracing application itself can propagate alerts to high-risk contacts and can recommend that users voluntarily contact health authorities where relevant, thereby assisting markedly in contact tracing while minimising the potential for state surveillance, snooping, or vigilantism.PanelRecommendations for a privacy-protecting approach to digital contact tracingConsent•Download, installation, and use of the application must be entirely voluntary, and users must be able to uninstall the application at will•There must be express consent for all collection, use, and disclosure of personal information (ie, users might choose to share some data and not others, such as official test results or to feed a machine-learning model)•Individuals must be able to opt-in or opt-out of data sharing. This includes consent to download the application, turn on location services, receive notifications, and share COVID-19 test resultsOversight•A non-partisan independent oversight committee with representatives from legal, health, machine-learning, and privacy experts should be established to oversee ongoing development of the application, its information ecosystem, and data governance•Importantly, public representatives must be included in this oversight committeeVirtual data acquisition•No identifiable information regarding digital contact trails or personal health information that an individual enters on the application should be shared with other application users or public, private, and governmental agencies•Individual geolocation data should not be stored on a central server and should pass through a rigourous obfuscation protocol to reduce their information content to the bare minimum required for epidemiological and machine-learning modelling•Pseudonymised data should be used to inform machine-learning models, and only these data should be stored centrally on a protected server•Only non-identifiable aggregated data should be shared with public health institutions•The source code of the application and the algorithms used should be made accessible for public scrutiny•Personal identifiable information should be deleted from the device once the pandemic is overInformed decision making•User preferences should drive end-to-end experience•User comprehension should be prioritised and verified rather than assumed•User psychosocial wellbeing should be promoted•User empowerment to protect themselves and others should be maximised•User inclusivity should acknowledge the diversity of user needs in dimensions such as gender, race, education, and rural vs urban location Consent •Download, installation, and use of the application must be entirely voluntary, and users must be able to uninstall the application at will•There must be express consent for all collection, use, and disclosure of personal information (ie, users might choose to share some data and not others, such as official test results or to feed a machine-learning model)•Individuals must be able to opt-in or opt-out of data sharing. This includes consent to download the application, turn on location services, receive notifications, and share COVID-19 test results Oversight •A non-partisan independent oversight committee with representatives from legal, health, machine-learning, and privacy experts should be established to oversee ongoing development of the application, its information ecosystem, and data governance•Importantly, public representatives must be included in this oversight committee Virtual data acquisition •No identifiable information regarding digital contact trails or personal health information that an individual enters on the application should be shared with other application users or public, private, and governmental agencies•Individual geolocation data should not be stored on a central server and should pass through a rigourous obfuscation protocol to reduce their information content to the bare minimum required for epidemiological and machine-learning modelling•Pseudonymised data should be used to inform machine-learning models, and only these data should be stored centrally on a protected server•Only non-identifiable aggregated data should be shared with public health institutions•The source code of the application and the algorithms used should be made accessible for public scrutiny•Personal identifiable information should be deleted from the device once the pandemic is over Informed decision making •User preferences should drive end-to-end experience•User comprehension should be prioritised and verified rather than assumed•User psychosocial wellbeing should be promoted•User empowerment to protect themselves and others should be maximised•User inclusivity should acknowledge the diversity of user needs in dimensions such as gender, race, education, and rural vs urban location The granular non-identifying information used to train machine-learning models generally contains sufficient detail to re-identify individuals when correlated with other sources of data. This is why an independent, non-partisan trust or similar fiduciary structure must be established to protect and control access to these data, and manage the application and its ongoing development. The source code for the application and the privacy protocols used should be publicly available. Individuals must be able to make independent informed choices based on recommendations released from the application rather than using coercive or penalising strategies. An application self-destruction strategy should be used so that once the pandemic is over, all application-related personal data is deleted from participants' phones and deleted from the machine-learning server, leaving for further research, only de-identified, aggregated, and statistical data, or artificial data generated from the epidemiological model. The approach presented here advocates that consent must be explicating for users to download the application, transmit COVID-19 test results, and share data for research. Recent projections suggest that at least 56% of a country's population would need to be using the application to ensure maximal chance of epidemiological control of the COVID-19 pandemic.9Hinch R Probert W Nurtay A et al.Effective configurations of a digital contact tracing app: a report to NHSX.https://cdn.theconversation.com/static_files/files/1009/Report_-_Effective_App_Configurations.pdf?1587531217Date: April 16, 2020Date accessed: May 20, 2020Google Scholar There is a tension between mandating use of the application versus having a consent-based approach that we are advocating. In the face of such tension, the trade-off between individual civil rights and the need for population-level control of the COVID-19 pandemic comes to the forefront. Trust in the application by individuals is pivotal for such applications to have population-level benefit. We would suggest that advocating an approach that emphasises consent and prevents any central public or private authority from accessing identifiable data would embolden more individuals to download the application, thereby optimising the population-level benefit. Various designs are currently in place with regard to strategies for identifying contacts, the types of notifications that are received, and the use of centralised versus decentralised approaches.4AppleGooglePrivacy-preserving contact tracing.https://www.apple.com/covid19/contacttracingDate accessed: May 11, 2020Google Scholar, 10Alsdurf H Bengio Y Deleu T et al.COVI white paper.https://arxiv.org/abs/2005.08502Date: May 18, 2020Date accessed: May 20, 2020Google Scholar One question that arises in a system that emphasises a consent-based, opt-in approach, is that among individuals who do not receive a notification, does the absence of the notification imply the absence of contacts with other individuals with a COVID-19 infection or that other users are not consenting to share data? The absence of notifications might create a false sense of security in the user of the application or can cause frustration if a user presumes that others are not sharing information. This limitation with such opt-in applications emphasises the need for broad public outreach and education to optimise the number of users who download the application and consent to share data. Leveraging digital contact tracing technologies can change the course of the COVID-19 pandemic. Such technologies must robustly support democratic principles of privacy to maintain public trust and to enable individuals to make informed choices to help combat the pandemic. We are part of a team developing a COVID-19 risk awareness application in Canada. YWY reports funding from the Toronto COVID-19 Action Initiative. DP, BS, and SK received funding support from Mila to assist in application development. AS reports grants from Fonds de la Recherche en Sante du Quebec—Junior 1 clinician scientist programme and Bristol-Myers Squibb-Pfizer, personal fees from Novartis and AstraZeneca, grants and personal fees from Roche Diagnostics and Boehringer-Ingelheim, and funding from the McGill Interdisciplinary Initiative in Infection and Immunity (Mi4), outside the submitted work. All other authors declare no competing interests.

openalex-author · 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Multi-Image Super-Resolution for Remote Sensing using Deep Recurrent Networks

High-resolution satellite imagery is critical for various earth observation applications related to environment monitoring, geoscience, forecasting, and land use analysis. However, the acquisition cost of such high-quality imagery due to the scarcity of providers and needs for high-frequency revisits restricts its accessibility in many fields. In this work, we present a data-driven, multi-image super resolution approach to alleviate these problems. Our approach is based on an end-to-end deep neural network that consists of an encoder, a fusion module, and a decoder. The encoder extracts co-registered highly efficient feature representations from low-resolution images of a scene. A Gated Re-current Unit (GRU)-based module acts as the fusion module, aggregating features into a combined representation. Finally, a decoder reconstructs the super-resolved image. The proposed model is evaluated on the PROBA-V dataset released in a recent competition held by the European Space Agency. Our results show that it performs among the top contenders and offers a new practical solution for real-world applications.

openalex-author · International Conference on Machine Learning

HNHN: Hypergraph Networks With Hyperedge Neurons

Hypergraphs provide a natural representation for many real world datasets. We propose a novel framework, HNHN, for hypergraph representation learning. HNHN is a hypergraph convolution network with nonlinear activation functions applied to both hypernodes and hyperedges, combined with a normalization scheme that can flexibly adjust the importance of high-cardinality hyperedges and high-degree vertices depending on the dataset. We demonstrate improved performance of HNHN in both classification accuracy and speed on real world datasets when compared to state of the art methods.

openalex-author · 2020 18th IEEE International New Circuits and Systems Conference (NEWCAS)

Quantized Guided Pruning for Efficient Hardware Implementations of Deep Neural Networks

Deep Neural Networks (DNNs) in general and Convolutional Neural Networks (CNNs) in particular are state-of-the-art in numerous computer vision tasks such as object classification and detection. However, the large amount of parameters they contain leads to a high computational complexity and strongly limits their usability in budget-constrained devices such as embedded devices. In this paper, we propose a combination of a pruning technique and a quantization scheme that effectively reduce the complexity and memory usage of convolutional layers of CNNs, by replacing the complex convolutional operation by a low-cost multiplexer. We perform experiments on CIFAR10, CIFAR100 and SVHN datasets and show that the proposed method achieves almost state-of-the-art accuracy, while drastically reducing the computational and memory footprints compared to the baselines. We also propose an efficient hardware architecture, implemented on Field Programmable Gate Arrays (FPGAs), to accelerate inference, which works as a pipeline and accommodates multiple layers working at the same time to speed up the inference process. In contrast with most proposed approaches which have used external memory or software defined memory controllers, our work is based on algorithmic optimization and full-hardware design, enabling a direct, on-chip memory implementation of a DNN while keeping close to state of the art accuracy.

openalex-author · arXiv (Cornell University)

COVI White Paper

The SARS-CoV-2 (Covid-19) pandemic has caused significant strain on public health institutions around the world. Contact tracing is an essential tool to change the course of the Covid-19 pandemic. Manual contact tracing of Covid-19 cases has significant challenges that limit the ability of public health authorities to minimize community infections. Personalized peer-to-peer contact tracing through the use of mobile apps has the potential to shift the paradigm. Some countries have deployed centralized tracking systems, but more privacy-protecting decentralized systems offer much of the same benefit without concentrating data in the hands of a state authority or for-profit corporations. Machine learning methods can circumvent some of the limitations of standard digital tracing by incorporating many clues and their uncertainty into a more graded and precise estimation of infection risk. The estimated risk can provide early risk awareness, personalized recommendations and relevant information to the user. Finally, non-identifying risk data can inform epidemiological models trained jointly with the machine learning predictor. These models can provide statistical evidence for the importance of factors involved in disease transmission. They can also be used to monitor, evaluate and optimize health policy and (de)confinement scenarios according to medical and economic productivity indicators. However, such a strategy based on mobile apps and machine learning should proactively mitigate potential ethical and privacy risks, which could have substantial impacts on society (not only impacts on health but also impacts such as stigmatization and abuse of personal data). Here, we present an overview of the rationale, design, ethical considerations and privacy strategy of `COVI,' a Covid-19 public peer-to-peer contact tracing and risk awareness mobile application developed in Canada.

openalex-author · arXiv (Cornell University)

An Analysis of the Adaptation Speed of Causal Models

Consider a collection of datasets generated by unknown interventions on an unknown structural causal model $G$. Recently, Bengio et al. (2020) conjectured that among all candidate models, $G$ is the fastest to adapt from one dataset to another, along with promising experiments. Indeed, intuitively $G$ has less mechanisms to adapt, but this justification is incomplete. Our contribution is a more thorough analysis of this hypothesis. We investigate the adaptation speed of cause-effect SCMs. Using convergence rates from stochastic optimization, we justify that a relevant proxy for adaptation speed is distance in parameter space after intervention. Applying this proxy to categorical and normal cause-effect models, we show two results. When the intervention is on the cause variable, the SCM with the correct causal direction is advantaged by a large factor. When the intervention is on the effect variable, we characterize the relative adaptation speed. Surprisingly, we find situations where the anticausal model is advantaged, falsifying the initial hypothesis. Code to reproduce experiments is available at https://github.com/remilepriol/causal-adaptation-speed

openalex-author · arXiv (Cornell University)

Exploiting Syntactic Structure for Better Language Modeling: A Syntactic Distance Approach

It is commonly believed that knowledge of syntactic structure should improve language modeling. However, effectively and computationally efficiently incorporating syntactic structure into neural language models has been a challenging topic. In this paper, we make use of a multi-task objective, i.e., the models simultaneously predict words as well as ground truth parse trees in a form called "syntactic distances", where information between these two separate objectives shares the same intermediate representation. Experimental results on the Penn Treebank and Chinese Treebank datasets show that when ground truth parse trees are provided as additional training signals, the model is able to achieve lower perplexity and induce trees with better quality.

openalex-author · Bioinformatics

Factorized embeddings learns rich and biologically meaningful embedding spaces using factorized tensor decomposition

Supplementary data are available at Bioinformatics online.

openalex-author · International Conference on Learning Representations

Learning the Arrow of Time for Problems in Reinforcement Learning

We humans have an innate understanding of the asymmetric progression of time, which we use to efficiently and safely perceive and manipulate our environment. Drawing inspiration from that, we approach the problem of learning an arrow of time in a Markov (Decision) Process. We illustrate how a learned arrow of time can capture salient information about the environment, which in turn can be used to measure reachability, detect side-effects and to obtain an intrinsic reward signal. Finally, we propose a simple yet effective algorithm to parameterize the problem at hand and learn an arrow of time with a function approximator (here, a deep neural network). Our empirical results span a selection of discrete and continuous environments, and demonstrate for a class of stochastic processes that the learned arrow of time agrees reasonably well with a well known notion of an arrow of time due to Jordan, Kinderlehrer and Otto (1998).

openalex-author · International Conference on Learning Representations

Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives

Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavior. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that triggers the appropriate behaviors for a given situation. However, the meta-policy must still produce appropriate decisions in all states. In this work, we propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy. Instead, each primitive can decide for themselves whether they wish to act in the current state. We use an information-theoretic mechanism for enabling this decentralized decision: each primitive chooses how much information it needs about the current state to make a decision and the primitive that requests the most information about the current state acts in the world. The primitives are regularized to use as little information as possible, which leads to natural competition and specialization. We experimentally demonstrate that this policy architecture improves over both flat and hierarchical policies in terms of generalization.

openalex-author · arXiv (Cornell University)

Continual Weight Updates and Convolutional Architectures for Equilibrium\n Propagation

Equilibrium Propagation (EP) is a biologically inspired alternative algorithm\nto backpropagation (BP) for training neural networks. It applies to RNNs fed by\na static input x that settle to a steady state, such as Hopfield networks. EP\nis similar to BP in that in the second phase of training, an error signal\npropagates backwards in the layers of the network, but contrary to BP, the\nlearning rule of EP is spatially local. Nonetheless, EP suffers from two major\nlimitations. On the one hand, due to its formulation in terms of real-time\ndynamics, EP entails long simulation times, which limits its applicability to\npractical tasks. On the other hand, the biological plausibility of EP is\nlimited by the fact that its learning rule is not local in time: the synapse\nupdate is performed after the dynamics of the second phase have converged and\nrequires information of the first phase that is no longer available physically.\nOur work addresses these two issues and aims at widening the spectrum of EP\nfrom standard machine learning models to more bio-realistic neural networks.\nFirst, we propose a discrete-time formulation of EP which enables to simplify\nequations, speed up training and extend EP to CNNs. Our CNN model achieves the\nbest performance ever reported on MNIST with EP. Using the same discrete-time\nformulation, we introduce Continual Equilibrium Propagation (C-EP): the weights\nof the network are adjusted continually in the second phase of training using\nlocal information in space and time. We show that in the limit of slow changes\nof synaptic strengths and small nudging, C-EP is equivalent to BPTT (Theorem\n1). We numerically demonstrate Theorem 1 and C-EP training on MNIST and\ngeneralize it to the bio-realistic situation of a neural network with\nasymmetric connections between neurons.\n

openalex-author · arXiv (Cornell University)

Equilibrium Propagation with Continual Weight Updates

Equilibrium Propagation (EP) is a learning algorithm that bridges Machine Learning and Neuroscience, by computing gradients closely matching those of Backpropagation Through Time (BPTT), but with a learning rule local in space. Given an input $x$ and associated target $y$, EP proceeds in two phases: in the first phase neurons evolve freely towards a first steady state; in the second phase output neurons are nudged towards $y$ until they reach a second steady state. However, in existing implementations of EP, the learning rule is not local in time: the weight update is performed after the dynamics of the second phase have converged and requires information of the first phase that is no longer available physically. In this work, we propose a version of EP named Continual Equilibrium Propagation (C-EP) where neuron and synapse dynamics occur simultaneously throughout the second phase, so that the weight update becomes local in time. Such a learning rule local both in space and time opens the possibility of an extremely energy efficient hardware implementation of EP. We prove theoretically that, provided the learning rates are sufficiently small, at each time step of the second phase the dynamics of neurons and synapses follow the gradients of the loss given by BPTT (Theorem 1). We demonstrate training with C-EP on MNIST and generalize C-EP to neural networks where neurons are connected by asymmetric connections. We show through experiments that the more the network updates follows the gradients of BPTT, the best it performs in terms of training. These results bring EP a step closer to biology by better complying with hardware constraints while maintaining its intimate link with backpropagation.

openalex-author · PolyPublie (École Polytechnique de Montréal)

Learning To Navigate The Synthetically Accessible Chemical Space Using Reinforcement Learning

Over the last decade, there has been significant progress in the field of machine learning for de novo drug design, particularly in deep generative models. However, current generative approaches exhibit a significant challenge as they do not ensure that the proposed molecular structures can be feasibly synthesized nor do they provide the synthesis routes of the proposed small molecules, thereby seriously limiting their practical applicability. In this work, we propose a novel forward synthesis framework powered by reinforcement learning (RL) for de novo drug design, Policy Gradient for Forward Synthesis (PGFS), that addresses this challenge by embedding the concept of synthetic accessibility directly into the de novo drug design system. In this setup, the agent learns to navigate through the immense synthetically accessible chemical space by subjecting commercially available small molecule building blocks to valid chemical reactions at every time step of the iterative virtual multi-step synthesis process. The proposed environment for drug discovery provides a highly challenging test-bed for RL algorithms owing to the large state space and high-dimensional continuous action space with hierarchical actions. PGFS achieves state-of-the-art performance in generating structures with high QED and penalized clogP. Moreover, we validate PGFS in an in-silico proof-of-concept associated with three HIV targets. Finally, we describe how the end-to-end training conceptualized in this study represents an important paradigm in radically expanding the synthesizable chemical space and automating the drug discovery process.

openalex-author · arXiv (Cornell University)

The Variational Bandwidth Bottleneck: Stochastic Evaluation on an\n Information Budget

In many applications, it is desirable to extract only the relevant\ninformation from complex input data, which involves making a decision about\nwhich input features are relevant. The information bottleneck method formalizes\nthis as an information-theoretic optimization problem by maintaining an optimal\ntradeoff between compression (throwing away irrelevant input information), and\npredicting the target. In many problem settings, including the reinforcement\nlearning problems we consider in this work, we might prefer to compress only\npart of the input. This is typically the case when we have a standard\nconditioning input, such as a state observation, and a "privileged" input,\nwhich might correspond to the goal of a task, the output of a costly planning\nalgorithm, or communication with another agent. In such cases, we might prefer\nto compress the privileged input, either to achieve better generalization\n(e.g., with respect to goals) or to minimize access to costly information\n(e.g., in the case of communication). Practical implementations of the\ninformation bottleneck based on variational inference require access to the\nprivileged input in order to compute the bottleneck variable, so although they\nperform compression, this compression operation itself needs unrestricted,\nlossless access. In this work, we propose the variational bandwidth bottleneck,\nwhich decides for each example on the estimated value of the privileged\ninformation before seeing it, i.e., only based on the standard input, and then\naccordingly chooses stochastically, whether to access the privileged input or\nnot. We formulate a tractable approximation to this framework and demonstrate\nin a series of reinforcement learning experiments that it can improve\ngeneralization and reduce access to computationally costly information.\n

openalex-author · arXiv.org, e-Print Archive, Mathematics, 2020:2004.07213

Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims

With the recent wave of progress in artificial intelligence (AI) has come a growing awareness of the large-scale impacts of AI systems, and recognition that existing regulations and norms in industry and academia are insufficient to ensure responsible AI development. In order for AI developers to earn trust from system users, customers, civil society, governments, and other stakeholders that they are building AI responsibly, they will need to make verifiable claims to which they can be held accountable. Those outside of a given organization also need effective means of scrutinizing such claims. This report suggests various steps that different stakeholders can take to improve the verifiability of claims made about AI systems and their associated development processes, with a focus on providing evidence about the safety, security, fairness, and privacy protection of AI systems. We analyze ten mechanisms for this purpose--spanning institutions, software, and hardware--and make recommendations aimed at implementing, exploring, or improving those mechanisms.

openalex-author · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Multi-Task Self-Supervised Learning for Robust Speech Recognition

Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), that combines a convolutional encoder followed by multiple neural networks, called workers, tasked to solve self-supervised problems (i.e., ones that do not require manual annotations as ground truth). PASE was shown to capture relevant speech information, including speaker voice-print and phonemes. This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments. To this end, we employ an online speech distortion module, that contaminates the input signals with a variety of random disturbances. We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks. Finally, we refine the set of workers used in self-supervision to encourage better cooperation. Results on TIMIT, DIRHA and CHiME-5 show that PASE+ significantly outperforms both the previous version of PASE as well as common acoustic features. Interestingly, PASE+ learns transferable representations suitable for highly mismatched acoustic conditions.

openalex-author · PLOS Biology

BigBrain 3D atlas of cortical layers: Cortical and laminar thickness gradients diverge in sensory and motor cortices

Histological atlases of the cerebral cortex, such as those made famous by Brodmann and von Economo, are invaluable for understanding human brain microstructure and its relationship with functional organization in the brain. However, these existing atlases are limited to small numbers of manually annotated samples from a single cerebral hemisphere, measured from 2D histological sections. We present the first whole-brain quantitative 3D laminar atlas of the human cerebral cortex. It was derived from a 3D histological atlas of the human brain at 20-micrometer isotropic resolution (BigBrain), using a convolutional neural network to segment, automatically, the cortical layers in both hemispheres. Our approach overcomes many of the historical challenges with measurement of histological thickness in 2D, and the resultant laminar atlas provides an unprecedented level of precision and detail. We utilized this BigBrain cortical atlas to test whether previously reported thickness gradients, as measured by MRI in sensory and motor processing cortices, were present in a histological atlas of cortical thickness and which cortical layers were contributing to these gradients. Cortical thickness increased across sensory processing hierarchies, primarily driven by layers III, V, and VI. In contrast, motor-frontal cortices showed the opposite pattern, with decreases in total and pyramidal layer thickness from motor to frontal association cortices. These findings illustrate how this laminar atlas will provide a link between single-neuron morphology, mesoscale cortical layering, macroscopic cortical thickness, and, ultimately, functional neuroanatomy.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Combating False Negatives in Adversarial Imitation Learning (Student Abstract)

We define the False Negatives problem and show that it is a significant limitation in adversarial imitation learning. We propose a method that solves the problem by leveraging the nature of goal-conditioned tasks. The method, dubbed Fake Conditioning, is tested on instruction following tasks in BabyAI environments, where it improves sample efficiency over the baselines by at least an order of magnitude.

openalex-author · Les Cahiers du GERAD

Pruning for efficient hardware implementations of deep neural networks

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Your GAN is Secretly an Energy-based Model and You Should use Discriminator Driven Latent Sampling

We show that the sum of the implicit generator log-density $\log p_g$ of a GAN with the logit score of the discriminator defines an energy function which yields the true data density when the generator is imperfect but the discriminator is optimal, thus making it possible to improve on the typical generator (with implicit density $p_g$). To make that practical, we show that sampling from this modified density can be achieved by sampling in latent space according to an energy-based model induced by the sum of the latent prior log-density and the discriminator output score. This can be achieved by running a Langevin MCMC in latent space and then applying the generator function, which we call Discriminator Driven Latent Sampling~(DDLS). We show that DDLS is highly efficient compared to previous methods which work in the high-dimensional pixel space and can be applied to improve on previously trained GANs of many types. We evaluate DDLS on both synthetic and real-world datasets qualitatively and quantitatively. On CIFAR-10, DDLS substantially improves the Inception Score of an off-the-shelf pre-trained SN-GAN~\citep{sngan} from $8.22$ to $9.09$ which is even comparable to the class-conditional BigGAN~\citep{biggan} model. This achieves a new state-of-the-art in unconditional image synthesis setting without introducing extra parameters or additional training.

openalex-author · arXiv (Cornell University)

Continuous Domain Adaptation with Variational Domain-Agnostic Feature Replay

Learning in non-stationary environments is one of the biggest challenges in machine learning. Non-stationarity can be caused by either task drift, i.e., the drift in the conditional distribution of labels given the input data, or the domain drift, i.e., the drift in the marginal distribution of the input data. This paper aims to tackle this challenge in the context of continuous domain adaptation, where the model is required to learn new tasks adapted to new domains in a non-stationary environment while maintaining previously learned knowledge. To deal with both drifts, we propose variational domain-agnostic feature replay, an approach that is composed of three components: an inference module that filters the input data into domain-agnostic representations, a generative module that facilitates knowledge transfer, and a solver module that applies the filtered and transferable knowledge to solve the queries. We address the two fundamental scenarios in continuous domain adaptation, demonstrating the effectiveness of our proposed approach for practical usage.

openalex-author · arXiv (Cornell University)

Benchmarking Graph Neural Networks

In the last few years, graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs. This emerging field has witnessed an extensive growth of promising techniques that have been applied with success to computer science, mathematics, biology, physics and chemistry. But for any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress. This led us in March 2020 to release a benchmark framework that i) comprises of a diverse collection of mathematical and real-world graphs, ii) enables fair model comparison with the same parameter budget to identify key architectures, iii) has an open-source, easy-to-use and reproducible code infrastructure, and iv) is flexible for researchers to experiment with new theoretical ideas. As of December 2022, the GitHub repository has reached 2,000 stars and 380 forks, which demonstrates the utility of the proposed open-source framework through the wide usage by the GNN community. In this paper, we present an updated version of our benchmark with a concise presentation of the aforementioned framework characteristics, an additional medium-sized molecular dataset AQSOL, similar to the popular ZINC, but with a real-world measured chemical target, and discuss how this framework can be leveraged to explore new GNN designs and insights. As a proof of value of our benchmark, we study the case of graph positional encoding (PE) in GNNs, which was introduced with this benchmark and has since spurred interest of exploring more powerful PE for Transformers and GNNs in a robust experimental setting.

openalex-author · arXiv (Cornell University)

On Catastrophic Interference in Atari 2600 Games

Model-free deep reinforcement learning is sample inefficient. One hypothesis -- speculated, but not confirmed -- is that catastrophic interference within an environment inhibits learning. We test this hypothesis through a large-scale empirical study in the Arcade Learning Environment (ALE) and, indeed, find supporting evidence. We show that interference causes performance to plateau; the network cannot train on segments beyond the plateau without degrading the policy used to reach there. By synthetically controlling for interference, we demonstrate performance boosts across architectures, learning algorithms and environments. A more refined analysis shows that learning one segment of a game often increases prediction errors elsewhere. Our study provides a clear empirical link between catastrophic interference and sample efficiency in reinforcement learning.

openalex-author · arXiv (Cornell University)

Neural Bayes: A Generic Parameterization Method for Unsupervised\n Representation Learning

We introduce a parameterization method called Neural Bayes which allows\ncomputing statistical quantities that are in general difficult to compute and\nopens avenues for formulating new objectives for unsupervised representation\nlearning. Specifically, given an observed random variable $\\mathbf{x}$ and a\nlatent discrete variable $z$, we can express $p(\\mathbf{x}|z)$,\n$p(z|\\mathbf{x})$ and $p(z)$ in closed form in terms of a sufficiently\nexpressive function (Eg. neural network) using our parameterization without\nrestricting the class of these distributions. To demonstrate its usefulness, we\ndevelop two independent use cases for this parameterization:\n 1. Mutual Information Maximization (MIM): MIM has become a popular means for\nself-supervised representation learning. Neural Bayes allows us to compute\nmutual information between observed random variables $\\mathbf{x}$ and latent\ndiscrete random variables $z$ in closed form. We use this for learning image\nrepresentations and show its usefulness on downstream classification tasks.\n 2. Disjoint Manifold Labeling: Neural Bayes allows us to formulate an\nobjective which can optimally label samples from disjoint manifolds present in\nthe support of a continuous distribution. This can be seen as a specific form\nof clustering where each disjoint manifold in the support is a separate\ncluster. We design clustering tasks that obey this formulation and empirically\nshow that the model optimally labels the disjoint manifolds. Our code is\navailable at \\url{https://github.com/salesforce/NeuralBayes}\n

openalex-author · arXiv (Cornell University)

HighRes-net: Recursive Fusion for Multi-Frame Super-Resolution of Satellite Imagery

Generative deep learning has sparked a new wave of Super-Resolution (SR) algorithms that enhance single images with impressive aesthetic results, albeit with imaginary details. Multi-frame Super-Resolution (MFSR) offers a more grounded approach to the ill-posed problem, by conditioning on multiple low-resolution views. This is important for satellite monitoring of human impact on the planet -- from deforestation, to human rights violations -- that depend on reliable imagery. To this end, we present HighRes-net, the first deep learning approach to MFSR that learns its sub-tasks in an end-to-end fashion: (i) co-registration, (ii) fusion, (iii) up-sampling, and (iv) registration-at-the-loss. Co-registration of low-resolution views is learned implicitly through a reference-frame channel, with no explicit registration mechanism. We learn a global fusion operator that is applied recursively on an arbitrary number of low-resolution pairs. We introduce a registered loss, by learning to align the SR output to a ground-truth through ShiftNet. We show that by learning deep representations of multiple views, we can super-resolve low-resolution signals and enhance Earth Observation data at scale. Our approach recently topped the European Space Agency's MFSR competition on real-world satellite imagery.

openalex-author · Machine Learning: Science and Technology

Establishing an evaluation metric to quantify climate change image realism *

Abstract With success on controlled tasks, deep generative models are being increasingly applied to humanitarian applications (Nie et al 2017 Int. Conf. on Medical Image Computing and Computer-Assisted Intervention (Berlin: Springer) pp 417–25, Yanardag et al 2017 Deep Empathy ). In this paper, we focus on the evaluation of a conditional generative model that illustrates the consequences of climate change-induced flooding to encourage public interest and awareness on the issue. Because metrics for comparing the realism of different modes in a conditional generative model do not exist, we propose several automated and human-based methods for evaluation. To do this, we adapt several existing metrics and assess the automated metrics against gold standard human evaluation. We find that using Fréchet Inception Distance with embeddings from an intermediary Inception-v3 layer that precedes the auxiliary classifier produces results most correlated with human realism. While insufficient alone to establish a human-correlated automatic evaluation metric, we believe this work begins to bridge the gap between human and automated generative evaluation procedures, and to generate more realistic images of the future consequences of climate change.

openalex-author · arXiv (Cornell University)

Modeling Cloud Reflectance Fields using Conditional Generative Adversarial Networks

We introduce a conditional Generative Adversarial Network (cGAN) approach to generate cloud reflectance fields (CRFs) conditioned on large scale meteorological variables such as sea surface temperature and relative humidity. We show that our trained model can generate realistic CRFs from the corresponding meteorological observations, which represents a step towards a data-driven framework for stochastic cloud parameterization.

openalex-author · arXiv (Cornell University)

Using Simulated Data to Generate Images of Climate Change

Generative adversarial networks (GANs) used in domain adaptation tasks have the ability to generate images that are both realistic and personalized, transforming an input image while maintaining its identifiable characteristics. However, they often require a large quantity of training data to produce high-quality images in a robust way, which limits their usability in cases when access to data is limited. In our paper, we explore the potential of using images from a simulated 3D environment to improve a domain adaptation task carried out by the MUNIT architecture, aiming to use the resulting images to raise awareness of the potential future impacts of climate change.

openalex-author · arXiv (Cornell University)

Universal Successor Features for Transfer Reinforcement Learning

Transfer in Reinforcement Learning (RL) refers to the idea of applying knowledge gained from previous tasks to solving related tasks. Learning a universal value function (Schaul et al., 2015), which generalizes over goals and states, has previously been shown to be useful for transfer. However, successor features are believed to be more suitable than values for transfer (Dayan, 1993; Barreto et al.,2017), even though they cannot directly generalize to new goals. In this paper, we propose (1) Universal Successor Features (USFs) to capture the underlying dynamics of the environment while allowing generalization to unseen goals and (2) a flexible end-to-end model of USFs that can be trained by interacting with the environment. We show that learning USFs is compatible with any RL algorithm that learns state values using a temporal difference method. Our experiments in a simple gridworld and with two MuJoCo environments show that USFs can greatly accelerate training when learning multiple tasks and can effectively transfer knowledge to new tasks.

openalex-author · Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Experience Grounds Language

Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, Joseph Turian. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.

openalex-author · IEEE Access

Joint Learning of Generative Translator and Classifier for Visually Similar Classes

In this paper, we propose a Generative Translation Classification Network (GTCN) for improving visual classification accuracy in settings where classes are visually similar and data is scarce. For this purpose, we propose joint learning from a scratch to train a classifier and a generative stochastic translation network end-to-end. The translation network is used to perform on-line data augmentation across classes, whereas previous works have mostly involved domain adaptation. To help the model further benefit from this data-augmentation, we introduce an adaptive fade-in loss and a quadruplet loss. We perform experiments on multiple datasets to demonstrate the proposed method's performance in varied settings. Of particular interest, training on 40% of the dataset is enough for our model to surpass the performance of baselines trained on the full dataset. When our architecture is trained on the full dataset, we achieve comparable performance with state-of-the-art methods despite using a light-weight architecture.

openalex-author · Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Compositional Generalization by Factorizing Alignment and Translation

Standard methods in deep learning for natural language processing fail to capture the compositional structure of human language that allows for systematic generalization outside of the training distribution. However, human learners readily generalize in this way, e.g. by applying known grammatical rules to novel words. Inspired by work in cognitive science suggesting a functional distinction between systems for syntactic and semantic processing, we implement a modification to an existing approach in neural machine translation, imposing an analogous separation between alignment and translation. The resulting architecture substantially outperforms standard recurrent networks on the SCAN dataset, a compositional generalization task, without any additional supervision. Our work suggests that learning to align and to translate in separate modules may be a useful heuristic for capturing compositional structure.

openalex-author · PolyPublie (École Polytechnique de Montréal)

Hybrid Models for Learning to Branch

A recent Graph Neural Network (GNN) approach for learning to branch has been shown to successfully reduce the running time of branch-and-bound algorithms for Mixed Integer Linear Programming (MILP). While the GNN relies on a GPU for inference, MILP solvers are purely CPU-based. This severely limits its application as many practitioners may not have access to high-end GPUs. In this work, we ask two key questions. First, in a more realistic setting where only a CPU is available, is the GNN model still competitive? Second, can we devise an alternate computationally inexpensive model that retains the predictive power of the GNN architecture? We answer the first question in the negative, and address the second question by proposing a new hybrid architecture for efficient branching on CPU machines. The proposed architecture combines the expressive power of GNNs with computationally inexpensive multi-layer perceptrons (MLP) for branching. We evaluate our methods on four classes of MILP problems, and show that they lead to up to 26% reduction in solver running time compared to state-of-the-art methods without a GPU, while extrapolating to harder problems than it was trained on. The code for this project is publicly available at this https URL.

openalex-author · Proceedings of the Annual Meeting of the Cognitive Science Society, vol 42, iss 0

Systematicity in a Recurrent Neural Network by Factorizing Syntax and Semantics.

Standard methods in deep learning fail to capture composi-tional or systematic structure in their training data, as shownby their inability to generalize outside of the training distribu-tion. However, human learners readily generalize in this way,e.g. by applying known grammatical rules to novel words. Theinductive biases that might underlie this powerful cognitive ca-pacity remain unclear. Inspired by work in cognitive sciencesuggesting a functional distinction between systems for syn-tactic and semantic processing, we implement a modificationto an existing deep learning architecture, imposing an analo-gous separation. The resulting architecture substantially out-performs standard recurrent networks on the SCAN dataset, acompositional generalization task, without any additional su-pervision. Our work suggests that separating syntactic fromsemantic learning may be a useful heuristic for capturing com-positional structure, and highlights the potential of using cog-nitive principles to inform inductive biases in deep learning.

openalex-author · Lecture Notes in Computer Science

A Large-Scale, Open-Domain, Mixed-Interface Dialogue-Based ITS for STEM

No abstract available from the OpenAlex source record.

openalex-author · Lecture Notes in Computer Science

DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning

No abstract available from the OpenAlex source record.

openalex-author · Neural Information Processing Systems

Untangling tradeoffs between recurrence and self-attention in artificial neural networks

No abstract available from the OpenAlex source record.

openalex-author · Archivio istituzionale della ricerca (Alma Mater Studiorum Università di Bologna)

A Learning-Based Algorithm to Quickly Compute Good Primal Solutions for Stochastic Integer Programs

We propose a novel approach using supervised learning to obtain near-optimal primal solutions for two-stage stochastic integer programming (2SIP) problems with constraints in the first and second stages. The goal of the algorithm is to predict a representative scenario (RS) for the problem such that, deterministically solving the 2SIP with the random realization equal to the RS, gives a near-optimal solution to the original 2SIP. Predicting an RS, instead of directly predicting a solution ensures first-stage feasibility of the solution. If the problem is known to have complete recourse, second-stage feasibility is also guaranteed. For computational testing, we learn to find an RS for a two-stage stochastic facility location problem with integer variables and linear constraints in both stages and consistently provide near-optimal solutions. Our computing times are very competitive with those of general-purpose integer programming solvers to achieve a similar solution quality.

openalex-author · arXiv (Cornell University)

Learning from Learning Machines: Optimisation, Rules, and Social Norms

There is an analogy between machine learning systems and economic entities in that they are both adaptive, and their behaviour is specified in a more-or-less explicit way. It appears that the area of AI that is most analogous to the behaviour of economic entities is that of morally good decision-making, but it is an open question as to how precisely moral behaviour can be achieved in an AI system. This paper explores the analogy between these two complex systems, and we suggest that a clearer understanding of this apparent analogy may help us forward in both the socio-economic domain and the AI domain: known results in economics may help inform feasible solutions in AI safety, but also known results in AI may inform economic policy. If this claim is correct, then the recent successes of deep learning for AI suggest that more implicit specifications work better than explicit ones for solving such problems.

openalex-author · arXiv (Cornell University)

On the Morality of Artificial Intelligence

Much of the existing research on the social and ethical impact of Artificial Intelligence has been focused on defining ethical principles and guidelines surrounding Machine Learning (ML) and other Artificial Intelligence (AI) algorithms [IEEE, 2017, Jobin et al., 2019]. While this is extremely useful for helping define the appropriate social norms of AI, we believe that it is equally important to discuss both the potential and risks of ML and to inspire the community to use ML for beneficial objectives. In the present article, which is specifically aimed at ML practitioners, we thus focus more on the latter, carrying out an overview of existing high-level ethical frameworks and guidelines, but above all proposing both conceptual and practical principles and guidelines for ML research and deployment, insisting on concrete actions that can be taken by practitioners to pursue a more ethical and moral practice of ML aimed at using AI for social good.

openalex-author · arXiv (Cornell University)

CLOSURE: Assessing Systematic Generalization of CLEVR Models

The CLEVR dataset of natural-looking questions about 3D-rendered scenes has recently received much attention from the research community. A number of models have been proposed for this task, many of which achieved very high accuracies of around 97-99%. In this work, we study how systematic the generalization of such models is, that is to which extent they are capable of handling novel combinations of known linguistic constructs. To this end, we test models' understanding of referring expressions based on matching object properties (such as e.g. "another cube that is the same size as the brown cube") in novel contexts. Our experiments on the thereby constructed CLOSURE benchmark show that state-of-the-art models often do not exhibit systematicity after being trained on CLEVR. Surprisingly, we find that an explicitly compositional Neural Module Network model also generalizes badly on CLOSURE, even when it has access to the ground-truth programs at test time. We improve the NMN's systematic generalization by developing a novel Vector-NMN module architecture with vector-valued inputs and outputs. Lastly, we investigate how much few-shot transfer learning can help models that are pretrained on CLEVR to adapt to CLOSURE. Our few-shot learning experiments contrast the adaptation behavior of the models with intermediate discrete programs with that of the end-to-end continuous models.

openalex-author · arXiv (Cornell University)

Applying Knowledge Transfer for Water Body Segmentation in Peru

In this work, we present the application of convolutional neural networks for segmenting water bodies in satellite images. We first use a variant of the U-Net model to segment rivers and lakes from very high-resolution images from Peru. To circumvent the issue of scarce labelled data, we investigate the applicability of a knowledge transfer-based model that learns the mapping from high-resolution labelled images and combines it with the very high-resolution mapping so that better segmentation can be achieved. We train this model in a single process, end-to-end. Our preliminary results show that adding the information from the available high-resolution images does not help out-of-the-box, and in fact worsen results. This leads us to infer that the high-resolution data could be from a different distribution, and its addition leads to increased variance in our results.

openalex-author · AGU Fall Meeting Abstracts

Artificial Intelligence Based Cloud Distributor (AI-CD): Probing Low Cloud Distribution with Generative Adversarial Neural Networks

No abstract available from the OpenAlex source record.

openalex-author · Neural Computation

Toward Training Recurrent Neural Networks for Lifelong Learning

Catastrophic forgetting and capacity saturation are the central challenges of any parametric lifelong learning system. In this work, we study these challenges in the context of sequential supervised learning with an emphasis on recurrent neural networks. To evaluate the models in the lifelong learning setting, we propose a curriculum-based, simple, and intuitive benchmark where the models are trained on tasks with increasing levels of difficulty. To measure the impact of catastrophic forgetting, the model is tested on all the previous tasks as it completes any task. As a step toward developing true lifelong learning systems, we unify gradient episodic memory (a catastrophic forgetting alleviation approach) and Net2Net (a capacity expansion approach). Both models are proposed in the context of feedforward networks, and we evaluate the feasibility of using them for recurrent networks. Evaluation on the proposed benchmark shows that the unified model is more suitable than the constituent models for lifelong learning setting.

openalex-author · Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security

Interpolated Adversarial Training

Adversarial robustness has become a central goal in deep learning, both in theory and in practice. However, successful methods to improve the adversarial robustness (such as adversarial training) greatly hurt generalization performance on the unperturbed data. This could have a major impact on how achieving adversarial robustness affects real world systems (i.e. many may opt to forego robustness if it can improve accuracy on the unperturbed data). We propose Interpolated Adversarial Training, which employs recently proposed interpolation based training methods in the framework of adversarial training. On CIFAR-10, adversarial training increases the standard test error (when there is no adversary) from 4.43% to 12.32%, whereas with our Interpolated adversarial training we retain adversarial robustness while achieving a standard test error of only 6.45%. With our technique, the relative increase in the standard error for the robust model is reduced from 178.1% to just 45.5%.

openalex-author · Nature Neuroscience

A deep learning framework for neuroscience

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Icentia11K: An Unsupervised Representation Learning Dataset for Arrhythmia Subtype Discovery

We release the largest public ECG dataset of continuous raw signals for representation learning containing 11 thousand patients and 2 billion labelled beats. Our goal is to enable semi-supervised ECG models to be made as well as to discover unknown subtypes of arrhythmia and anomalous ECG signal events. To this end, we propose an unsupervised representation learning task, evaluated in a semi-supervised fashion. We provide a set of baselines for different feature extractors that can be built upon. Additionally, we perform qualitative evaluations on results from PCA embeddings, where we identify some clustering of known subtypes indicating the potential for representation learning in arrhythmia sub-type discovery.

openalex-author · arXiv (Cornell University)

Predicting ice flow using machine learning

Though machine learning has achieved notable success in modeling sequential and spatial data for speech recognition and in computer vision, applications to remote sensing and climate science problems are seldom considered. In this paper, we demonstrate techniques from unsupervised learning of future video frame prediction, to increase the accuracy of ice flow tracking in multi-spectral satellite images. As the volume of cryosphere data increases in coming years, this is an interesting and important opportunity for machine learning to address a global challenge for climate change, risk management from floods, and conserving freshwater resources. Future frame prediction of ice melt and tracking the optical flow of ice dynamics presents modeling difficulties, due to uncertainties in global temperature increase, changing precipitation patterns, occlusion from cloud cover, rapid melting and glacier retreat due to black carbon aerosol deposition, from wildfires or human fossil emissions. We show the adversarial learning method helps improve the accuracy of tracking the optical flow of ice dynamics compared to existing methods in climate science. We present a dataset, IceNet, to encourage machine learning research and to help facilitate further applications in the areas of cryospheric science and climate change.

openalex-author · arXiv (Cornell University)

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Previous works (Donahue et al., 2018a; Engel et al., 2019a) have found that generating coherent raw audio waveforms with GANs is challenging. In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of architectural changes and simple training techniques. Subjective evaluation metric (Mean Opinion Score, or MOS) shows the effectiveness of the proposed approach for high quality mel-spectrogram inversion. To establish the generality of the proposed techniques, we show qualitative results of our model in speech synthesis, music domain translation and unconditional music synthesis. We evaluate the various components of the model through ablation studies and suggest a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks. Our model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion. Our pytorch implementation runs at more than 100x faster than realtime on GTX 1080Ti GPU and more than 2x faster than real-time on CPU, without any hardware specific optimization tricks.

openalex-author · arXiv (Cornell University)

Variational Temporal Abstraction

We introduce a variational approach to learning and inference of temporally hierarchical structure and representation for sequential data. We propose the Variational Temporal Abstraction (VTA), a hierarchical recurrent state space model that can infer the latent temporal structure and thus perform the stochastic state transition hierarchically. We also propose to apply this model to implement the jumpy-imagination ability in imagination-augmented agent-learning in order to improve the efficiency of the imagination. In experiments, we demonstrate that our proposed method can model 2D and 3D visual sequence datasets with interpretable temporal structure discovery and that its application to jumpy imagination enables more efficient agent-learning in a 3D navigation task.

openalex-author · 2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Learning Fixed Points in Generative Adversarial Networks: From Image-to-Image Translation to Disease Detection and Localization

Generative adversarial networks (GANs) have ushered in a revolution in image-to-image translation. The development and proliferation of GANs raises an interesting question: can we train a GAN to remove an object, if present, from an image while otherwise preserving the image? Specifically, can a GAN "virtually heal" anyone by turning his medical image, with an unknown health status (diseased or healthy), into a healthy one, so that diseased regions could be revealed by subtracting those two images? Such a task requires a GAN to identify a minimal subset of target pixels for domain translation, an ability that we call fixed-point translation, which no GAN is equipped with yet. Therefore, we propose a new GAN, called Fixed-Point GAN, trained by (1) supervising same-domain translation through a conditional identity loss, and (2) regularizing cross-domain translation through revised adversarial, domain classification, and cycle consistency loss. Based on fixed-point translation, we further derive a novel framework for disease detection and localization using only image-level annotation. Qualitative and quantitative evaluations demonstrate that the proposed method outperforms the state of the art in multi-domain image-to-image translation and that it surpasses predominant weakly-supervised localization methods in both disease detection and localization. Implementation is available at https://github.com/jlianglab/Fixed-Point-GAN.

openalex-author · Paper

GraphMix: Regularized Training of Graph Neural Networks for Semi-Supervised Learning

No abstract available from the OpenAlex source record.

openalex-author · Paper

{COMPANYNAME}11K: An Unsupervised Representation Learning Dataset for Arrhythmia Subtype Discovery

No abstract available from the OpenAlex source record.

openalex-author · Paper

HighRes-net: Multi-Frame Super-Resolution by Recursive Fusion

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Underwhelming Generalization Improvements From Controlling Feature Attribution

No abstract available from the OpenAlex source record.

openalex-author · Paper

SPECTRA: Sparse Entity-centric Transitions

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Recurrent Independent Mechanisms

Learning modular structures which reflect the dynamics of the environment can lead to better generalization and robustness to changes which only affect a few of the underlying causes. We propose Recurrent Independent Mechanisms (RIMs), a new recurrent architecture in which multiple groups of recurrent cells operate with nearly independent transition dynamics, communicate only sparingly through the bottleneck of attention, and are only updated at time steps where they are most relevant. We show that this leads to specialization amongst the RIMs, which in turn allows for dramatically improved generalization on tasks where some factors of variation differ systematically between training and evaluation.

openalex-author · arXiv (Cornell University)

Avoidance Learning Using Observational Reinforcement Learning

Imitation learning seeks to learn an expert policy from sampled demonstrations. However, in the real world, it is often difficult to find a perfect expert and avoiding dangerous behaviors becomes relevant for safety reasons. We present the idea of \textit{learning to avoid}, an objective opposite to imitation learning in some sense, where an agent learns to avoid a demonstrator policy given an environment. We define avoidance learning as the process of optimizing the agent's reward while avoiding dangerous behaviors given by a demonstrator. In this work we develop a framework of avoidance learning by defining a suitable objective function for these problems which involves the \emph{distance} of state occupancy distributions of the expert and demonstrator policies. We use density estimates for state occupancy measures and use the aforementioned distance as the reward bonus for avoiding the demonstrator. We validate our theory with experiments using a wide range of partially observable environments. Experimental results show that we are able to improve sample efficiency during training compared to state of the art policy optimization and safety methods.

openalex-author · arXiv (Cornell University)

Torchmeta: A Meta-Learning library for PyTorch

The constant introduction of standardized benchmarks in the literature has helped accelerating the recent advances in meta-learning research. They offer a way to get a fair comparison between different algorithms, and the wide range of datasets available allows full control over the complexity of this evaluation. However, for a large majority of code available online, the data pipeline is often specific to one dataset, and testing on another dataset requires significant rework. We introduce Torchmeta, a library built on top of PyTorch that enables seamless and consistent evaluation of meta-learning algorithms on multiple datasets, by providing data-loaders for most of the standard benchmarks in few-shot classification and regression, with a new meta-dataset abstraction. It also features some extensions for PyTorch to simplify the development of models compatible with meta-learning algorithms. The code is available here: https://github.com/tristandeleu/pytorch-meta

openalex-author · Interspeech 2019

Learning Problem-Agnostic Speech Representations from Multiple Self-Supervised Tasks

Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This paper proposes an improved self-supervised method, where a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks. The needed consensus across different tasks naturally imposes meaningful constraints to the encoder, contributing to discover general representations and to minimize the risk of learning superficial ones. Experiments show that the proposed approach can learn transferable, robust, and problem-agnostic features that carry on relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues. In addition, a number of design choices make the encoder easily exportable, facilitating its direct usage or adaptation to different problems.

openalex-author · Interspeech 2019

Learning Speaker Representations with Mutual Information

Learning good representations is of crucial importance in deep learning. Mutual Information (MI) or similar measures of statistical dependence are promising tools for learning these representations in an unsupervised way. Even though the mutual information between two random variables is hard to measure directly in high dimensional spaces, some recent studies have shown that an implicit optimization of MI can be achieved with an encoder-discriminator architecture similar to that of Generative Adversarial Networks (GANs). In this work, we learn representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence. The proposed encoder relies on the SincNet architecture and transforms raw speech waveform into a compact feature vector. The discriminator is fed by either positive samples (of the joint distribution of encoded chunks) or negative samples (from the product of the marginals) and is trained to separate them. We report experiments showing that this approach effectively learns useful speaker representations, leading to promising results on speaker identification and verification tasks. Our experiments consider both unsupervised and semi-supervised settings and compare the performance achieved with different objective functions.

openalex-author · Interspeech 2019

Speech Model Pre-Training for End-to-End Spoken Language Understanding

Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. We introduce a new SLU dataset, Fluent Speech Commands, and show that our method improves performance both when the full dataset is used for training and when only a small subset is used. We also describe preliminary experiments to gauge the model's ability to generalize to new phrases not heard during training.

openalex-author · arXiv (Cornell University)

Data-Driven Approach to Encoding and Decoding 3-D Crystal Structures

Generative models have achieved impressive results in many domains including image and text generation. In the natural sciences, generative models have led to rapid progress in automated drug discovery. Many of the current methods focus on either 1-D or 2-D representations of typically small, drug-like molecules. However, many molecules require 3-D descriptors and exceed the chemical complexity of commonly used dataset. We present a method to encode and decode the position of atoms in 3-D molecules from a dataset of nearly 50,000 stable crystal unit cells that vary from containing 1 to over 100 atoms. We construct a smooth and continuous 3-D density representation of each crystal based on the positions of different atoms. Two different neural networks were trained on a dataset of over 120,000 three-dimensional samples of single and repeating crystal structures, made by rotating the single unit cells. The first, an Encoder-Decoder pair, constructs a compressed latent space representation of each molecule and then decodes this description into an accurate reconstruction of the input. The second network segments the resulting output into atoms and assigns each atom an atomic number. By generating compressed, continuous latent spaces representations of molecules we are able to decode random samples, interpolate between two molecules, and alter known molecules.

openalex-author · Neural Networks

Interpolation Consistency Training for Semi-supervised Learning

We introduce Interpolation Consistency Training (ICT), a simple and computation efficient algorithm for training Deep Neural Networks in the semi-supervised learning paradigm. ICT encourages the prediction at an interpolation of unlabeled points to be consistent with the interpolation of the predictions at those points. In classification problems, ICT moves the decision boundary to low-density regions of the data distribution. Our experiments show that ICT achieves state-of-the-art performance when applied to standard neural network architectures on the CIFAR-10 and SVHN benchmark datasets. Our theoretical analysis shows that ICT corresponds to a certain type of data-adaptive regularization with unlabeled points which reduces overfitting to labeled points under high confidence values.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Combined Reinforcement Learning via Abstract Representations

In the quest for efficient and robust reinforcement learning methods, both model-free and model-based approaches offer advantages. In this paper we propose a new way of explicitly bridging both approaches via a shared low-dimensional learned encoding of the environment, meant to capture summarizing abstractions. We show that the modularity brought by this approach leads to good generalization while being computationally efficient, with planning happening in a smaller latent state space. In addition, this approach recovers a sufficient low-dimensional representation of the environment, which opens up new strategies for interpretable AI, exploration and transfer learning.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Towards Non-Saturating Recurrent Units for Modelling Long-Term Dependencies

Modelling long-term dependencies is a challenge for recurrent neural networks. This is primarily due to the fact that gradients vanish during training, as the sequence length increases. Gradients can be attenuated by transition operators and are attenuated or dropped by activation functions. Canonical architectures like LSTM alleviate this issue by skipping information through a memory mechanism. We propose a new recurrent architecture (Non-saturating Recurrent Unit; NRU) that relies on a memory mechanism but forgoes both saturating activation functions and saturating gates, in order to further alleviate vanishing gradients. In a series of synthetic and real world tasks, we demonstrate that the proposed model is the only model that performs among the top 2 models across all tasks with and without long-term dependencies, when compared against a range of other architectures.

openalex-author · arXiv (Cornell University)

Weakly-supervised Knowledge Graph Alignment with Adversarial Learning

This paper studies aligning knowledge graphs from different sources or languages. Most existing methods train supervised methods for the alignment, which usually require a large number of aligned knowledge triplets. However, such a large number of aligned knowledge triplets may not be available or are expensive to obtain in many domains. Therefore, in this paper we propose to study aligning knowledge graphs in fully-unsupervised or weakly-supervised fashion, i.e., without or with only a few aligned triplets. We propose an unsupervised framework to align the entity and relation embddings of different knowledge graphs with an adversarial learning framework. Moreover, a regularization term which maximizes the mutual information between the embeddings of different knowledge graphs is used to mitigate the problem of mode collapse when learning the alignment functions. Such a framework can be further seamlessly integrated with existing supervised methods by utilizing a limited number of aligned triples as guidance. Experimental results on multiple datasets prove the effectiveness of our proposed approach in both the unsupervised and the weakly-supervised settings.

openalex-author · Neural Networks

Depth with nonlinearity creates no bad local minima in ResNets

In this paper, we prove that depth with nonlinearity creates no bad local minima in a type of arbitrarily deep ResNets with arbitrary nonlinear activation functions, in the sense that the values of all local minima are no worse than the global minimum value of corresponding classical machine-learning models, and are guaranteed to further improve via residual representations. As a result, this paper provides an affirmative answer to an open question stated in a paper in the conference on Neural Information Processing Systems 2018. This paper advances the optimization theory of deep learning only for ResNets and not for other network architectures.

openalex-author · arXiv (Cornell University)

Unsupervised State Representation Learning in Atari

State representation learning, or the ability to capture latent generative factors of an environment, is crucial for building intelligent agents that can perform a wide variety of tasks. Learning such representations without supervision from rewards is a challenging open problem. We introduce a method that learns state representations by maximizing mutual information across spatially and temporally distinct features of a neural encoder of the observations. We also introduce a new benchmark based on Atari 2600 games where we evaluate representations based on how well they capture the ground truth state variables. We believe this new framework for evaluating representation learning models will be crucial for future representation learning research. Finally, we compare our technique with other state-of-the-art generative and contrastive representation learning methods. The code associated with this work is available at https://github.com/mila-iqia/atari-representation-learning

openalex-author · arXiv (Cornell University)

Information matrices and generalization

This work revisits the use of information criteria to characterize the generalization of deep learning models. In particular, we empirically demonstrate the effectiveness of the Takeuchi information criterion (TIC), an extension of the Akaike information criterion (AIC) for misspecified models, in estimating the generalization gap, shedding light on why quantities such as the number of parameters cannot quantify generalization. The TIC depends on both the Hessian of the loss H and the covariance of the gradients C. By exploring the similarities and differences between these two matrices as well as the Fisher information matrix F, we study the interplay between noise and curvature in deep models. We also address the question of whether C is a reasonable approximation to F, as is commonly assumed.

openalex-author · arXiv (Cornell University)

Conditional Computation for Continual Learning

Catastrophic forgetting of connectionist neural networks is caused by the global sharing of parameters among all training examples. In this study, we analyze parameter sharing under the conditional computation framework where the parameters of a neural network are conditioned on each input example. At one extreme, if each input example uses a disjoint set of parameters, there is no sharing of parameters thus no catastrophic forgetting. At the other extreme, if the parameters are the same for every example, it reduces to the conventional neural network. We then introduce a clipped version of maxout networks which lies in the middle, i.e. parameters are shared partially among examples. Based on the parameter sharing analysis, we can locate a limited set of examples that are interfered when learning a new example. We propose to perform rehearsal on this set to prevent forgetting, which is termed as conditional rehearsal. Finally, we demonstrate the effectiveness of the proposed method in an online non-stationary setup, where updates are made after each new example and the distribution of the received example shifts over time.

openalex-author · arXiv (Cornell University)

Learning Powerful Policies by Using Consistent Dynamics Model

Model-based Reinforcement Learning approaches have the promise of being sample efficient. Much of the progress in learning dynamics models in RL has been made by learning models via supervised learning. But traditional model-based approaches lead to `compounding errors' when the model is unrolled step by step. Essentially, the state transitions that the learner predicts (by unrolling the model for multiple steps) and the state transitions that the learner experiences (by acting in the environment) may not be consistent. There is enough evidence that humans build a model of the environment, not only by observing the environment but also by interacting with the environment. Interaction with the environment allows humans to carry out experiments: taking actions that help uncover true causal relationships which can be used for building better dynamics models. Analogously, we would expect such interactions to be helpful for a learning agent while learning to model the environment dynamics. In this paper, we build upon this intuition by using an auxiliary cost function to ensure consistency between what the agent observes (by acting in the real world) and what it imagines (by acting in the `learned' world). We consider several tasks - Mujoco based control tasks and Atari games - and show that the proposed approach helps to train powerful policies and better dynamics models.

openalex-author · arXiv (Cornell University)

How to Initialize your Network? Robust Initialization for WeightNorm & ResNets

Residual networks (ResNet) and weight normalization play an important role in various deep learning applications. However, parameter initialization strategies have not been studied previously for weight normalized networks and, in practice, initialization methods designed for un-normalized networks are used as a proxy. Similarly, initialization for ResNets have also been studied for un-normalized networks and often under simplified settings ignoring the shortcut connection. To address these issues, we propose a novel parameter initialization strategy that avoids explosion/vanishment of information across layers for weight normalized networks with and without residual connections. The proposed strategy is based on a theoretical analysis using mean field approximation. We run over 2,500 experiments and evaluate our proposal on image datasets showing that the proposed initialization outperforms existing initialization methods in terms of generalization performance, robustness to hyper-parameter values and variance between seeds, especially when networks get deeper in which case existing methods fail to even start training. Finally, we show that using our initialization in conjunction with learning rate warmup is able to reduce the gap between the performance of weight normalized and batch normalized networks.

openalex-author · arXiv (Cornell University)

Updates of Equilibrium Prop Match Gradients of Backprop Through Time in\n an RNN with Static Input

Equilibrium Propagation (EP) is a biologically inspired learning algorithm\nfor convergent recurrent neural networks, i.e. RNNs that are fed by a static\ninput x and settle to a steady state. Training convergent RNNs consists in\nadjusting the weights until the steady state of output neurons coincides with a\ntarget y. Convergent RNNs can also be trained with the more conventional\nBackpropagation Through Time (BPTT) algorithm. In its original formulation EP\nwas described in the case of real-time neuronal dynamics, which is\ncomputationally costly. In this work, we introduce a discrete-time version of\nEP with simplified equations and with reduced simulation time, bringing EP\ncloser to practical machine learning tasks. We first prove theoretically, as\nwell as numerically that the neural and weight updates of EP, computed by\nforward-time dynamics, are step-by-step equal to the ones obtained by BPTT,\nwith gradients computed backward in time. The equality is strict when the\ntransition function of the dynamics derives from a primitive function and the\nsteady state is maintained long enough. We then show for more standard\ndiscrete-time neural network dynamics that the same property is approximately\nrespected and we subsequently demonstrate training with EP with equivalent\nperformance to BPTT. In particular, we define the first convolutional\narchitecture trained with EP achieving ~ 1% test error on MNIST, which is the\nlowest error reported with EP. These results can guide the development of deep\nneural networks trained with EP.\n

openalex-author · PolyPublie (École Polytechnique de Montréal)

Non-normal Recurrent Neural Network (nnRNN): learning long time dependencies while improving expressivity with transient dynamics

A recent strategy to circumvent the exploding and vanishing gradient problem in RNNs, and to allow the stable propagation of signals over long time scales, is to constrain recurrent connectivity matrices to be orthogonal or unitary. This ensures eigenvalues with unit norm and thus stable dynamics and training. However this comes at the cost of reduced expressivity due to the limited variety of orthogonal transformations. We propose a novel connectivity structure based on the Schur decomposition and a splitting of the Schur form into normal and non-normal parts. This allows to parametrize matrices with unit-norm eigenspectra without orthogonality constraints on eigenbases. The resulting architecture ensures access to a larger space of spectrally constrained matrices, of which orthogonal matrices are a subset. This crucial difference retains the stability advantages and training speed of orthogonal RNNs while enhancing expressivity, especially on tasks that require computations over ongoing input sequences.

openalex-author · arXiv (Cornell University)

N-BEATS: Neural basis expansion analysis for interpretable time series forecasting

We focus on solving the univariate times series point forecasting problem using deep learning. We propose a deep neural architecture based on backward and forward residual links and a very deep stack of fully-connected layers. The architecture has a number of desirable properties, being interpretable, applicable without modification to a wide array of target domains, and fast to train. We test the proposed architecture on several well-known datasets, including M3, M4 and TOURISM competition datasets containing time series from diverse domains. We demonstrate state-of-the-art performance for two configurations of N-BEATS for all the datasets, improving forecast accuracy by 11% over a statistical benchmark and by 3% over last year's winner of the M4 competition, a domain-adjusted hand-crafted hybrid between neural network and statistical time series models. The first configuration of our model does not employ any time-series-specific components and its performance on heterogeneous datasets strongly suggests that, contrarily to received wisdom, deep learning primitives such as residual blocks are by themselves sufficient to solve a wide range of forecasting problems. Finally, we demonstrate how the proposed architecture can be augmented to provide outputs that are interpretable without considerable loss in accuracy.

openalex-author · International Conference on Machine Learning

State-Reification Networks: Improving Generalization by Modeling the Distribution of Hidden Representations

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

The Journey is the Reward: Unsupervised Learning of Influential Trajectories

Unsupervised exploration and representation learning become increasingly important when learning in diverse and sparse environments. The information-theoretic principle of empowerment formalizes an unsupervised exploration objective through an agent trying to maximize its influence on the future states of its environment. Previous approaches carry certain limitations in that they either do not employ closed-loop feedback or do not have an internal state. As a consequence, a privileged final state is taken as an influence measure, rather than the full trajectory. We provide a model-free method which takes into account the whole trajectory while still offering the benefits of option-based approaches. We successfully apply our approach to settings with large action spaces, where discovery of meaningful action sequences is particularly difficult.

openalex-author · arXiv (Cornell University)

GMNN: Graph Markov Neural Networks

This paper studies semi-supervised object classification in relational data, which is a fundamental problem in relational data modeling. The problem has been extensively studied in the literature of both statistical relational learning (e.g. relational Markov networks) and graph neural networks (e.g. graph convolutional networks). Statistical relational learning methods can effectively model the dependency of object labels through conditional random fields for collective classification, whereas graph neural networks learn effective object representations for classification through end-to-end training. In this paper, we propose the Graph Markov Neural Network (GMNN) that combines the advantages of both worlds. A GMNN models the joint distribution of object labels with a conditional random field, which can be effectively trained with the variational EM algorithm. In the E-step, one graph neural network learns effective object representations for approximating the posterior distributions of object labels. In the M-step, another graph neural network is used to model the local label dependency. Experiments on object classification, link classification, and unsupervised node representation learning show that GMNN achieves state-of-the-art results.

openalex-author · arXiv (Cornell University)

Visualizing the Consequences of Climate Change Using Cycle-Consistent Adversarial Networks

We present a project that aims to generate images that depict accurate, vivid, and personalized outcomes of climate change using Cycle-Consistent Adversarial Networks (CycleGANs). By training our CycleGAN model on street-view images of houses before and after extreme weather events (e.g. floods, forest fires, etc.), we learn a mapping that can then be applied to images of locations that have not yet experienced these events. This visual transformation is paired with climate model predictions to assess likelihood and type of climate-related events in the long term (50 years) in order to bring the future closer in the viewers mind. The eventual goal of our project is to enable individuals to make more informed choices about their climate future by creating a more visceral understanding of the effects of climate change, while maintaining scientific credibility by drawing on climate model projections.

openalex-author · 2019 International Conference on Robotics and Automation (ICRA)

A Data-Efficient Framework for Training and Sim-to-Real Transfer of Navigation Policies

Learning effective visuomotor policies for robots purely from data is challenging, but also appealing since a learning-based system should not require manual tuning or calibration. In the case of a robot operating in a real environment the training process can be costly, time-consuming, and even dangerous since failures are common at the start of training. For this reason, it is desirable to be able to leverage simulation and off-policy data to the extent possible to train the robot. In this work, we introduce a robust framework that plans in simulation and transfers well to the real environment. Our model incorporates a gradient-descent based planning module, which, given the initial image and goal image, encodes the images to a lower dimensional latent state and plans a trajectory to reach the goal. The model, consisting of the encoder and planner modules, is first trained through a meta-learning strategy in simulation. We subsequently perform adversarial domain transfer on the encoder by using a bank of unlabelled but random images from the simulation and real environments to enable the encoder to map images from the real and simulated environments to a similarly distributed latent representation. By fine tuning the entire model (encoder + planner) with only a few real world expert demonstrations, we show successful planning performances in different navigation tasks.

openalex-author · arXiv (Cornell University)

Compositional generalization in a deep seq2seq model by separating syntax and semantics

Standard methods in deep learning for natural language processing fail to capture the compositional structure of human language that allows for systematic generalization outside of the training distribution. However, human learners readily generalize in this way, e.g. by applying known grammatical rules to novel words. Inspired by work in neuroscience suggesting separate brain systems for syntactic and semantic processing, we implement a modification to standard approaches in neural machine translation, imposing an analogous separation. The novel model, which we call Syntactic Attention, substantially outperforms standard methods in deep learning on the SCAN dataset, a compositional generalization task, without any hand-engineered features or additional supervision. Our work suggests that separating syntactic from semantic learning may be a useful heuristic for capturing compositional structure.

openalex-author · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

The Pytorch-kaldi Speech Recognition Toolkit

The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawn tremendous interest within the machine learning community thanks to its simplicity and flexibility. The PyTorch-Kaldi project aims to bridge the gap between these popular toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these software, but it embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed to naturally plug-in user-defined acoustic models. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly-released along with a rich documentation and is designed to properly work locally or on HPC clusters. Experiments, that are conducted on several datasets and tasks, show that PyTorch-Kaldi can effectively be used to develop modern state-of-the-art speech recognizers.

openalex-author · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

How Transferable Are Features in Convolutional Neural Network Acoustic Models across Languages?

Characterization of the representations learned in intermediate layers of deep networks can provide valuable insight into the nature of a task and can guide the development of well-tailored learning strategies. Here we study convolutional neural network (CNN)-based acoustic models in the context of automatic speech recognition. Adapting a method proposed by [1], we measure the transferability of each layer between English, Dutch and German to assess their language-specificity. We observed three distinct regions of transferability: (1) the first two layers were entirely transferable between languages, (2) layers 2-8 were also highly transferable but we found some evidence of language specificity, (3) the subsequent fully connected layers were more language specific but could be successfully finetuned to the target language. To further probe the effect of weight freezing, we performed follow-up experiments using freeze-training [2]. Our results are consistent with the observation that CNNs converge `bottom up' during training and demonstrate the benefit of freeze training, especially for transfer learning.

openalex-author · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

A Highly Adaptive Acoustic Model for Accurate Multi-dialect Speech Recognition

Despite the success of deep learning in speech recognition, multi-dialect speech recognition remains a difficult problem. Although dialect-specific acoustic models are known to perform well in general, they are not easy to maintain when dialect-specific data is scarce and the number of dialects for each language is large. Therefore, a single unified acoustic model (AM) that generalizes well for many dialects has been in demand. In this paper, we propose a novel acoustic modeling technique for accurate multi-dialect speech recognition with a single AM. Our proposed AM is dynamically adapted based on both dialect information and its internal representation, which results in a highly adaptive AM for handling multiple dialects simultaneously. We also propose a simple but effective training method to deal with unseen dialects. The experimental results on large scale speech datasets show that the proposed AM outperforms all the previous ones, reducing word error rates (WERs) by 8.11% relative compared to a single all-dialects AM and by 7.31% relative compared to dialect-specific AMs.

openalex-author · Paper

GradMask: Reduce Overfitting by Regularizing Saliency.

With too few samples or too many model parameters, overfitting can inhibit the ability to generalise predictions to new data. Within medical imaging, this can occur when features are incorrectly assigned importance such as distinct hospital specific artifacts, leading to poor performance on a new dataset from a different institution without those features, which is undesirable. Most regularization methods do not explicitly penalize the incorrect association of these features to the target class and hence fail to address this issue. We propose a regularization method, GradMask, which penalizes saliency maps inferred from the classifier gradients when they are not consistent with the lesion segmentation. This prevents non-tumor related features to contribute to the classification of unhealthy samples. We demonstrate that this method can improve test accuracy between 1-3% compared to the baseline without GradMask, showing that it has an impact on reducing overfitting.

openalex-author · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Representation Mixing for TTS Synthesis

Recent character and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases. We demonstrate a simple method for combining multiple types of linguistic information in a single encoder, named representation mixing, enabling flexible choice between character, phoneme, or mixed representations during inference. Experiments and user studies on a public audiobook corpus show the efficacy of our approach.

openalex-author · arXiv (Cornell University)

Reinforced Imitation in Heterogeneous Action Space

Imitation learning is an effective alternative approach to learn a policy when the reward function is sparse. In this paper, we consider a challenging setting where an agent and an expert use different actions from each other. We assume that the agent has access to a sparse reward function and state-only expert observations. We propose a method which gradually balances between the imitation learning cost and the reinforcement learning objective. In addition, this method adapts the agent's policy based on either mimicking expert behavior or maximizing sparse reward. We show, through navigation scenarios, that (i) an agent is able to efficiently leverage sparse rewards to outperform standard state-only imitation learning, (ii) it can learn a policy even when its actions are different from the expert, and (iii) the performance of the agent is not bounded by that of the expert, due to the optimized usage of sparse rewards.

openalex-author · arXiv (Cornell University)

Wasserstein Dependency Measure for Representation Learning

Mutual information maximization has emerged as a powerful learning objective for unsupervised representation learning obtaining state-of-the-art performance in applications such as object recognition, speech recognition, and reinforcement learning. However, such approaches are fundamentally limited since a tight lower bound of mutual information requires sample size exponential in the mutual information. This limits the applicability of these approaches for prediction tasks with high mutual information, such as in video understanding or reinforcement learning. In these settings, such techniques are prone to overfit, both in theory and in practice, and capture only a few of the relevant factors of variation. This leads to incomplete representations that are not optimal for downstream tasks. In this work, we empirically demonstrate that mutual information-based representation learning approaches do fail to learn complete representations on a number of designed and real-world tasks. To mitigate these problems we introduce the Wasserstein dependency measure, which learns more complete representations by using the Wasserstein distance instead of the KL divergence in the mutual information estimator. We show that a practical approximation to this theoretically motivated solution, constructed using Lipschitz constraint techniques from the GAN literature, achieves substantially improved results on tasks where incomplete representations are a major challenge.

openalex-author · arXiv (Cornell University)

InfoMask: Masked Variational Latent Representation to Localize Chest\n Disease

The scarcity of richly annotated medical images is limiting supervised deep\nlearning based solutions to medical image analysis tasks, such as localizing\ndiscriminatory radiomic disease signatures. Therefore, it is desirable to\nleverage unsupervised and weakly supervised models. Most recent weakly\nsupervised localization methods apply attention maps or region proposals in a\nmultiple instance learning formulation. While attention maps can be noisy,\nleading to erroneously highlighted regions, it is not simple to decide on an\noptimal window/bag size for multiple instance learning approaches. In this\npaper, we propose a learned spatial masking mechanism to filter out irrelevant\nbackground signals from attention maps. The proposed method minimizes mutual\ninformation between a masked variational representation and the input while\nmaximizing the information between the masked representation and class labels.\nThis results in more accurate localization of discriminatory regions. We tested\nthe proposed model on the ChestX-ray8 dataset to localize pneumonia from chest\nX-ray images without using any pixel-level or bounding-box annotations.\n

openalex-author · Paper

Multi-Class Few Shot Learning Task and Controllable Environment

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Towards Standardization of Data Licenses: The Montreal Data License

This paper provides a taxonomy for the licensing of data in the fields of artificial intelligence and machine learning. The paper's goal is to build towards a common framework for data licensing akin to the licensing of open source software. Increased transparency and resolving conceptual ambiguities in existing licensing language are two noted benefits of the approach proposed in the paper. In parallel, such benefits may help foster fairer and more efficient markets for data through bringing about clearer tools and concepts that better define how data can be used in the fields of AI and ML. The paper's approach is summarized in a new family of data license language - \textit{the Montreal Data License (MDL)}. Alongside this new license, the authors and their collaborators have developed a web-based tool to generate license language espousing the taxonomies articulated in this paper.

openalex-author · arXiv (Cornell University)

Online continual learning with no task boundaries.

No abstract available from the OpenAlex source record.

openalex-author · PolyPublie (École Polytechnique de Montréal)

On Adversarial Mixup Resynthesis

In this paper, we explore new approaches to combining information encoded within the learned representations of auto-encoders. We explore models that are capable of combining the attributes of multiple inputs such that a resynthesised output is trained to fool an adversarial discriminator for real versus synthesised data. Furthermore, we explore the use of such an architecture in the context of semi-supervised learning, where we learn a mixing function whose objective is to produce interpolations of hidden states, or masked combinations of latent representations that are consistent with a conditioned class label. We show quantitative and qualitative evidence that such a formulation is an interesting avenue of research.

openalex-author · arXiv (Cornell University)

Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future

In model-based reinforcement learning, the agent interleaves between model learning and planning. These two components are inextricably intertwined. If the model is not able to provide sensible long-term prediction, the executed planner would exploit model flaws, which can yield catastrophic failures. This paper focuses on building a model that reasons about the long-term future and demonstrates how to use this for efficient planning and exploration. To this end, we build a latent-variable autoregressive model by leveraging recent ideas in variational inference. We argue that forcing latent variables to carry future information through an auxiliary task substantially improves long-term predictions. Moreover, by planning in the latent space, the planner's solution is ensured to be within regions where the model is valid. An exploration strategy can be devised by searching for unlikely trajectories under the model. Our method achieves higher reward faster compared to baselines on a variety of tasks and environments in both the imitation learning and model-based reinforcement learning settings.

openalex-author · arXiv (Cornell University)

Hyperbolic Discounting and Learning over Multiple Horizons

Reinforcement learning (RL) typically defines a discount factor as part of the Markov Decision Process. The discount factor values future rewards by an exponential scheme that leads to theoretical convergence guarantees of the Bellman equation. However, evidence from psychology, economics and neuroscience suggests that humans and animals instead have hyperbolic time-preferences. In this work we revisit the fundamentals of discounting in RL and bridge this disconnect by implementing an RL agent that acts via hyperbolic discounting. We demonstrate that a simple approach approximates hyperbolic discount functions while still using familiar temporal-difference learning techniques in RL. Additionally, and independent of hyperbolic discounting, we make a surprising discovery that simultaneously learning value functions over multiple time-horizons is an effective auxiliary task which often improves over a strong value-based RL agent, Rainbow.

openalex-author · Neural Computation

Gated Orthogonal Recurrent Units: On Learning to Forget

We present a novel recurrent neural network (RNN)-based model that combines the remembering ability of unitary evolution RNNs with the ability of gated RNNs to effectively forget redundant or irrelevant information in its memory. We achieve this by extending restricted orthogonal evolution RNNs with a gating mechanism similar to gated recurrent unit RNNs with a reset gate and an update gate. Our model is able to outperform long short-term memory, gated recurrent units, and vanilla unitary or orthogonal RNNs on several long-term-dependency benchmark tasks. We empirically show that both orthogonal and unitary RNNs lack the ability to forget. This ability plays an important role in RNNs. We provide competitive results along with an analysis of our model on many natural sequential tasks, including question answering, speech spectrum prediction, character-level language modeling, and synthetic tasks that involve long-term dependencies such as algorithmic, denoising, and copying tasks.

openalex-author · arXiv (Cornell University)

InfoBot: Transfer and Exploration via the Information Bottleneck

A central challenge in reinforcement learning is discovering effective policies for tasks where rewards are sparsely distributed. We postulate that in the absence of useful reward signals, an effective exploration strategy should seek out {\it decision states}. These states lie at critical junctions in the state space from where the agent can transition to new, potentially unexplored regions. We propose to learn about decision states from prior experience. By training a goal-conditioned policy with an information bottleneck, we can identify decision states by examining where the model actually leverages the goal state. We find that this simple mechanism effectively identifies decision states, even in partially observed settings. In effect, the model learns the sensory cues that correlate with potential subgoals. In new environments, this model can then identify novel subgoals for further exploration, guiding the agent through a sequence of potential decision states and through new regions of the state space.

openalex-author · arXiv (Cornell University)

A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms

We propose to meta-learn causal structures based on how fast a learner adapts to new distributions arising from sparse distributional changes, e.g. due to interventions, actions of agents and other sources of non-stationarities. We show that under this assumption, the correct causal structural choices lead to faster adaptation to modified distributions because the changes are concentrated in one or just a few mechanisms when the learned knowledge is modularized appropriately. This leads to sparse expected gradients and a lower effective number of degrees of freedom needing to be relearned while adapting to the change. It motivates using the speed of adaptation to a modified distribution as a meta-learning objective. We demonstrate how this can be used to determine the cause-effect relationship between two observed variables. The distributional changes do not need to correspond to standard interventions (clamping a variable), and the learner has no direct knowledge of these interventions. We show that causal structures can be parameterized via continuous variables and learned end-to-end. We then explore how these ideas could be used to also learn an encoder that would map low-level observed variables to unobserved causal variables leading to faster adaptation out-of-distribution, learning a representation space where one can satisfy the assumptions of independent mechanisms and of small and sparse changes in these mechanisms due to actions and non-stationarities.

openalex-author · arXiv (Cornell University)

Maximum Entropy Generators for Energy-Based Models

Maximum likelihood estimation of energy-based models is a challenging problem due to the intractability of the log-likelihood gradient. In this work, we propose learning both the energy function and an amortized approximate sampling mechanism using a neural generator network, which provides an efficient approximation of the log-likelihood gradient. The resulting objective requires maximizing entropy of the generated samples, which we perform using recently proposed nonparametric mutual information estimators. Finally, to stabilize the resulting adversarial game, we use a zero-centered gradient penalty derived as a necessary condition from the score matching literature. The proposed technique can generate sharp images with Inception and FID scores competitive with recent GAN techniques, does not suffer from mode collapse, and is competitive with state-of-the-art anomaly detection techniques.

openalex-author · arXiv (Cornell University)

The Benefits of Over-parameterization at Initialization in Deep ReLU\n Networks

It has been noted in existing literature that over-parameterization in ReLU\nnetworks generally improves performance. While there could be several factors\ninvolved behind this, we prove some desirable theoretical properties at\ninitialization which may be enjoyed by ReLU networks. Specifically, it is known\nthat He initialization in deep ReLU networks asymptotically preserves variance\nof activations in the forward pass and variance of gradients in the backward\npass for infinitely wide networks, thus preserving the flow of information in\nboth directions. Our paper goes beyond these results and shows novel properties\nthat hold under He initialization: i) the norm of hidden activation of each\nlayer is equal to the norm of the input, and, ii) the norm of weight gradient\nof each layer is equal to the product of norm of the input vector and the error\nat output layer. These results are derived using the PAC analysis framework,\nand hold true for finitely sized datasets such that the width of the ReLU\nnetwork only needs to be larger than a certain finite lower bound. As we show,\nthis lower bound depends on the depth of the network and the number of samples,\nand by the virtue of being a lower bound, over-parameterized ReLU networks are\nendowed with these desirable properties. For the aforementioned hidden\nactivation norm property under He initialization, we further extend our theory\nand show that this property holds for a finite width network even when the\nnumber of data samples is infinite. Thus we overcome several limitations of\nexisting papers, and show new properties of deep ReLU networks at\ninitialization.\n

openalex-author · 2019 Conference on Cognitive Computational Neuroscience

The effect of task and training on intermediate representations in convolutional neural networks revealed with modified RV similarity analysis

Centered Kernel Alignment (CKA) was recently proposed as a similarity metric for comparing activation patterns in deep networks. Here we experiment with the modified RV-coefficient (RV2), which has very similar properties as CKA while being less sensitive to dataset size. We compare the representations of networks that received varying amounts of training on different layers: a standard trained network (all parameters updated at every step), a freeze trained network (layers gradually frozen during training), random networks (only some layers trained), and a completely untrained network. We found that RV2 was able to recover expected similarity patterns and provide interpretable similarity matrices that suggested hypotheses about how representations are affected by different training recipes. We propose that the superior performance achieved by freeze training can be attributed to representational differences in the penultimate layer. Our comparisons of random networks suggest that the inputs and targets serve as anchors on the representations in the lowest and highest layers.

openalex-author · Paper

InfoBot: Structured Exploration in ReinforcementLearning Using Information Bottleneck

No abstract available from the OpenAlex source record.

openalex-author · Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing

Interactive Language Learning by Question Answering

Xingdi Yuan, Marc-Alexandre Côté, Jie Fu, Zhouhan Lin, Chris Pal, Yoshua Bengio, Adam Trischler. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

openalex-author · Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study

Neural generative models have been become increasingly popular when building conversational agents. They offer flexibility, can be easily adapted to new domains, and require minimal domain engineering. A common criticism of these systems is that they seldom understand or use the available dialog history effectively. In this paper, we take an empirical approach to understanding how these models use the available dialog history by studying the sensitivity of the models to artificially introduced unnatural changes or perturbations to their context at test time. We experiment with 10 different types of perturbations on 4 multi-turn dialog datasets and find that commonly used neural dialog architectures like recurrent and transformer-based seq2seq models are rarely sensitive to most perturbations such as missing or reordering utterances, shuffling words, etc. Also, by open-sourcing our code, we believe that it will serve as a useful diagnostic tool for evaluating dialog systems in the future 1 .

openalex-author · Paper

Perceptual Generative Autoencoders

Modern generative models are usually designed to match target distributions directly in the data space, where the intrinsic dimension of data can be much lower than the ambient dimension. We argue that this discrepancy may contribute to the difficulties in training generative models. We therefore propose to map both the generated and target distributions to a latent space using the encoder of a standard autoencoder, and train the generator (or decoder) to match the target distribution in the latent space. Specifically, we enforce the consistency in both the data space and the latent space with theoretically justified data and latent reconstruction losses. The resulting generative model, which we call a perceptual generative autoencoder (PGA), is then trained with a maximum likelihood or variational autoencoder (VAE) objective. With maximum likelihood, PGAs generalize the idea of reversible generative models to unrestricted neural network architectures and arbitrary number of latent dimensions. When combined with VAEs, PGAs substantially improve over the baseline VAEs in terms of sample quality. Compared to other autoencoder-based generative models using simple priors, PGAs achieve state-of-the-art FID scores on CIFAR-10 and CelebA.

openalex-author · Jastrzębski, S, Kenton, Z, Ballas, N, Fischer, A, Bengio, Y & Storkey, A 2019, 'On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Len

On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length

Stochastic Gradient Descent (SGD) based training of neural networks with a large learning rate or a small batch-size typically ends in well-generalizing, flat regions of the weight space, as indicated by small eigenvalues of the Hessian of the training loss. However, the curvature along the SGD trajectory is poorly understood. An empirical investigation shows that initially SGD visits increasingly sharp regions, reaching a maximum sharpness determined by both the learning rate and the batch-size of SGD. When studying the SGD dynamics in relation to the sharpest directions in this initial phase, we find that the SGD step is large compared to the curvature and commonly fails to minimize the loss along the sharpest directions. Furthermore, using a reduced learning rate along these directions can improve training speed while leading to both sharper and better generalizing solutions compared to vanilla SGD. In summary, our analysis of the dynamics of SGD in the subspace of the sharpest directions shows that they influence the regions that SGD steers to (where larger learning rate or smaller batch size result in wider regions visited), the overall training speed, and the generalization ability of the final model.

openalex-author · Machine Learning - Medien, Infrastrukturen und Technologien der Künstlichen Intelligenz

»Deep Learning ist keine Religion«

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Quantized Guided Pruning for Efficient Hardware Implementations of\n Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are state-of-the-art in numerous\ncomputer vision tasks such as object classification and detection. However, the\nlarge amount of parameters they contain leads to a high computational\ncomplexity and strongly limits their usability in budget-constrained devices\nsuch as embedded devices. In this paper, we propose a combination of a new\npruning technique and a quantization scheme that effectively reduce the\ncomplexity and memory usage of convolutional layers of CNNs, and replace the\ncomplex convolutional operation by a low-cost multiplexer. We perform\nexperiments on the CIFAR10, CIFAR100 and SVHN and show that the proposed method\nachieves almost state-of-the-art accuracy, while drastically reducing the\ncomputational and memory footprints. We also propose an efficient hardware\narchitecture to accelerate CNN operations. The proposed hardware architecture\nis a pipeline and accommodates multiple layers working at the same time to\nspeed up the inference process.\n

openalex-author · arXiv (Cornell University)

Speech and Speaker Recognition from Raw Waveform with SincNet

Deep neural networks can learn complex and abstract representations, that are progressively obtained by combining simpler ones. A recent trend in speech and speaker recognition consists in discovering these representations starting from raw audio samples directly. Differently from standard hand-crafted features such as MFCCs or FBANK, the raw waveform can potentially help neural networks discover better and more customized representations. The high-dimensional raw inputs, however, can make training significantly more challenging. This paper summarizes our recent efforts to develop a neural architecture that efficiently processes speech from audio waveforms. In particular, we propose SincNet, a novel Convolutional Neural Network (CNN) that encourages the first layer to discover meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized front-end, that only depends on some parameters with a clear physical meaning. Our experiments, conducted on both speaker and speech recognition, show that the proposed architecture converges faster, performs better, and is more computationally efficient than standard CNNs.

openalex-author · arXiv (Cornell University)

An Empirical Study of Example Forgetting during Deep Neural Network Learning

Inspired by the phenomenon of catastrophic forgetting, we investigate the learning dynamics of neural networks as they train on single classification tasks. Our goal is to understand whether a related phenomenon occurs when data does not undergo a clear distributional shift. We define a `forgetting event' to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set's (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.

openalex-author · arXiv (Cornell University)

The effects of negative adaptation in Model-Agnostic Meta-Learning

The capacity of meta-learning algorithms to quickly adapt to a variety of tasks, including ones they did not experience during meta-training, has been a key factor in the recent success of these methods on few-shot learning problems. This particular advantage of using meta-learning over standard supervised or reinforcement learning is only well founded under the assumption that the adaptation phase does improve the performance of our model on the task of interest. However, in the classical framework of meta-learning, this constraint is only mildly enforced, if not at all, and we only see an improvement on average over a distribution of tasks. In this paper, we show that the adaptation in an algorithm like MAML can significantly decrease the performance of an agent in a meta-reinforcement learning setting, even on a range of meta-training tasks.

openalex-author · Cambridge University Engineering Department Publications Database

MetaGAN: an adversarial approach to few-shot learning

In this paper, we propose a conceptually simple and general framework called MetaGAN for few-shot learning problems. Most state-of-the-art few-shot classification models can be integrated with MetaGAN in a principled and straightforward way. By introducing an adversarial generator conditioned on tasks, we augment vanilla few-shot classification models with the ability to discriminate between real and fake data. We argue that this GAN-based approach can help few-shot classifiers to learn sharper decision boundary, which could generalize better. We show that with our MetaGAN framework, we can extend supervised few-shot learning models to naturally cope with unlabeled data. Different from previous work in semi-supervised few-shot learning, our algorithms can deal with semi-supervision at both sample-level and task-level. We give theoretical justifications of the strength of MetaGAN, and validate the effectiveness of MetaGAN on challenging few-shot image classification benchmarks.

openalex-author · 2018 IEEE Spoken Language Technology Workshop (SLT)

Speaker Recognition from Raw Waveform with SincNet

Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural Networks (CNNs) when fed by raw speech samples directly. Rather than employing standard hand-crafted features, the latter CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieve this goal.This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, that learn all elements of each filter, only low and high cutoff frequencies are directly learned from data with the proposed method. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application.Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.

openalex-author · arXiv (Cornell University)

Keep Drawing It: Iterative language-based image generation and editing.

Conditional text-to-image generation approaches commonly focus on generating a single image in a single step. One practical extension beyond one-step generation is an interactive system that generates an image iteratively, conditioned on ongoing linguistic input / feedback. This is significantly more challenging as such a system must understand and keep track of the ongoing context and history. In this work, we present a recurrent image generation model which takes into account both the generated output up to the current step as well as all past instructions for generation. We show that our model is able to generate the background, add new objects, apply simple transformations to existing objects, and correct previous mistakes. We believe our approach is an important step toward interactive generation.

openalex-author · arXiv (Cornell University)

DEFactor: Differentiable Edge Factorization-based Probabilistic Graph Generation

Generating novel molecules with optimal properties is a crucial step in many industries such as drug discovery. Recently, deep generative models have shown a promising way of performing de-novo molecular design. Although graph generative models are currently available they either have a graph size dependency in their number of parameters, limiting their use to only very small graphs or are formulated as a sequence of discrete actions needed to construct a graph, making the output graph non-differentiable w.r.t the model parameters, therefore preventing them to be used in scenarios such as conditional graph generation. In this work we propose a model for conditional graph generation that is computationally efficient and enables direct optimisation of the graph. We demonstrate favourable performance of our model on prototype-based molecular graph conditional generation tasks.

openalex-author · arXiv (Cornell University)

Interpretable Convolutional Filters with SincNet

Deep learning is currently playing a crucial role toward higher levels of artificial intelligence. This paradigm allows neural networks to learn complex and abstract representations, that are progressively obtained by combining simpler ones. Nevertheless, the internal "black-box" representations automatically discovered by current neural architectures often suffer from a lack of interpretability, making of primary interest the study of explainable machine learning techniques. This paper summarizes our recent efforts to develop a more interpretable neural model for directly processing speech from the raw waveform. In particular, we propose SincNet, a novel Convolutional Neural Network (CNN) that encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized filter-bank front-end, that only depends on some parameters with a clear physical meaning. Our experiments, conducted on both speaker and speech recognition, show that the proposed architecture converges faster, performs better, and is more interpretable than standard CNNs.

openalex-author · arXiv (Cornell University)

Towards Training Recurrent Neural Networks for Lifelong Learning

Catastrophic forgetting and capacity saturation are the central challenges of any parametric lifelong learning system. In this work, we study these challenges in the context of sequential supervised learning with an emphasis on recurrent neural networks. To evaluate the models in the lifelong learning setting, we propose a curriculum-based, simple, and intuitive benchmark where the models are trained on tasks with increasing levels of difficulty. To measure the impact of catastrophic forgetting, the model is tested on all the previous tasks as it completes any task. As a step towards developing true lifelong learning systems, we unify Gradient Episodic Memory (a catastrophic forgetting alleviation approach) and Net2Net(a capacity expansion approach). Both these models are proposed in the context of feedforward networks and we evaluate the feasibility of using them for recurrent networks. Evaluation on the proposed benchmark shows that the unified model is more suitable than the constituent models for lifelong learning setting.

openalex-author · arXiv (Cornell University)

Dendritic cortical microcircuits approximate the backpropagation\n algorithm

Deep learning has seen remarkable developments over the last years, many of\nthem inspired by neuroscience. However, the main learning mechanism behind\nthese advances - error backpropagation - appears to be at odds with\nneurobiology. Here, we introduce a multilayer neuronal network model with\nsimplified dendritic compartments in which error-driven synaptic plasticity\nadapts the network towards a global desired output. In contrast to previous\nwork our model does not require separate phases and synaptic learning is driven\nby local dendritic prediction errors continuously in time. Such errors\noriginate at apical dendrites and occur due to a mismatch between predictive\ninput from lateral interneurons and activity from actual top-down feedback.\nThrough the use of simple dendritic compartments and different cell-types our\nmodel can represent both error and normal activity within a pyramidal neuron.\nWe demonstrate the learning capabilities of the model in regression and\nclassification tasks, and show analytically that it approximates the error\nbackpropagation algorithm. Moreover, our framework is consistent with recent\nobservations of learning between brain areas and the architecture of cortical\nmicrocircuits. Overall, we introduce a novel view of learning on dendritic\ncortical circuits and on how the brain may solve the long-standing synaptic\ncredit assignment problem.\n

openalex-author · arXiv (Cornell University)

BabyAI: First Steps Towards Grounded Language Learning With a Human In the Loop.

Allowing humans to interactively train artificial agents to understand language instructions is desirable for both practical and scientific reasons, but given the poor data efficiency of the current learning methods, this goal may require substantial research efforts. Here, we introduce the BabyAI research platform to support investigations towards including humans in the loop for grounded language learning. The BabyAI platform comprises an extensible suite of 19 levels of increasing difficulty. The levels gradually lead the agent towards acquiring a combinatorially rich synthetic language which is a proper subset of English. The platform also provides a heuristic expert agent for the purpose of simulating a human teacher. We report baseline results and estimate the amount of human involvement that would be required to train a neural network-based agent on some of the BabyAI levels. We put forward strong evidence that current deep learning methods are not yet sufficiently sample efficient when it comes to learning a language with compositional properties.

openalex-author · Paper

Deep Learning. Das umfassende Handbuch

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Towards the Latent Transcriptome

In this work we propose a method to compute continuous embeddings for kmers from raw RNA-seq data, without the need for alignment to a reference genome. The approach uses an RNN to transform kmers of the RNA-seq reads into a 2 dimensional representation that is used to predict abundance of each kmer. We report that our model captures information of both DNA sequence similarity as well as DNA sequence abundance in the embedding latent space, that we call the Latent Transcriptome. We confirm the quality of these vectors by comparing them to known gene sub-structures and report that the latent space recovers exon information from raw RNA-Seq data from acute myeloid leukemia patients. Furthermore we show that this latent space allows the detection of genomic abnormalities such as translocations as well as patient-specific mutations, making this representation space both useful for visualization as well as analysis.

openalex-author · arXiv (Cornell University)

h-detach: Modifying the LSTM Gradient Towards Better Optimization

Recurrent neural networks are known for their notorious exploding and vanishing gradient problem (EVGP). This problem becomes more evident in tasks where the information needed to correctly solve them exist over long time scales, because EVGP prevents important gradient components from being back-propagated adequately over a large number of steps. We introduce a simple stochastic algorithm (\textit{h}-detach) that is specific to LSTM optimization and targeted towards addressing this problem. Specifically, we show that when the LSTM weights are large, the gradient components through the linear path (cell state) in the LSTM computational graph get suppressed. Based on the hypothesis that these components carry information about long term dependencies (which we show empirically), their suppression can prevent LSTMs from capturing them. Our algorithm\footnote{Our code is available at https://github.com/bhargav104/h-detach.} prevents gradients flowing through this path from getting suppressed, thus allowing the LSTM to capture such dependencies better. We show significant improvements over vanilla LSTM gradient based training in terms of convergence speed, robustness to seed and learning rate, and generalization using our modification of LSTM gradient on various benchmark datasets.

openalex-author · arXiv (Cornell University)

Adversarial Domain Adaptation for Stable Brain-Machine Interfaces

Brain-Machine Interfaces (BMIs) have recently emerged as a clinically viable option to restore voluntary movements after paralysis. These devices are based on the ability to extract information about movement intent from neural signals recorded using multi-electrode arrays chronically implanted in the motor cortices of the brain. However, the inherent loss and turnover of recorded neurons requires repeated recalibrations of the interface, which can potentially alter the day-to-day user experience. The resulting need for continued user adaptation interferes with the natural, subconscious use of the BMI. Here, we introduce a new computational approach that decodes movement intent from a low-dimensional latent representation of the neural data. We implement various domain adaptation methods to stabilize the interface over significantly long times. This includes Canonical Correlation Analysis used to align the latent variables across days; this method requires prior point-to-point correspondence of the time series across domains. Alternatively, we match the empirical probability distributions of the latent variables across days through the minimization of their Kullback-Leibler divergence. These two methods provide a significant and comparable improvement in the performance of the interface. However, implementation of an Adversarial Domain Adaptation Network trained to match the empirical probability distribution of the residuals of the reconstructed neural signals outperforms the two methods based on latent variables, while requiring remarkably few data points to solve the domain adaptation problem.

openalex-author · Paper

Manifold Mixup: Learning Better Representations by Interpolating Hidden States

Deep networks often perform well on the data distribution on which they are trained, yet give incorrect (and often very confident) answers when evaluated on points from off of the training distribution. This is exemplified by the adversarial examples phenomenon but can also be seen in terms of model generalization and domain shift. Ideally, a model would assign lower confidence to points unlike those from the training distribution. We propose a regularizer which addresses this issue by training with interpolated hidden states and encouraging the classifier to be less confident at these points. Because the hidden states are learned, this has an important effect of encouraging the hidden states for a class to be concentrated in such a way so that interpolations within the same class or between two different classes do not intersect with the real data points from other classes. This has a major advantage in that it avoids the underfitting which can result from interpolating in the input space. We prove that the exact condition for this problem of underfitting to be avoided by Manifold Mixup is that the dimensionality of the hidden states exceeds the number of classes, which is often the case in practice. Additionally, this concentration can be seen as making the features in earlier layers more discriminative. We show that despite requiring no significant additional computation, Manifold Mixup achieves large improvements over strong baselines in supervised learning, robustness to single-step adversarial attacks, semi-supervised learning, and Negative Log-Likelihood on held out samples.

openalex-author · Paper

Convergence Properties of Deep Neural Networks on Separable Data

No abstract available from the OpenAlex source record.

openalex-author · International Conference on Learning Representations

Modeling the Long Term Future in Model-Based Reinforcement Learning

No abstract available from the OpenAlex source record.

openalex-author · Paper

Unsupervised one-to-many image translation

No abstract available from the OpenAlex source record.

openalex-author · Paper

EnGAN: Latent Space MCMC and Maximum Entropy Generators for Energy-based Models

No abstract available from the OpenAlex source record.

openalex-author · Paper

Reinforced Imitation Learning from Observations

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

How can deep learning advance computational modeling of sensory information processing?

Deep learning, computational neuroscience, and cognitive science have overlapping goals related to understanding intelligence such that perception and behaviour can be simulated in computational systems. In neuroimaging, machine learning methods have been used to test computational models of sensory information processing. Recently, these model comparison techniques have been used to evaluate deep neural networks (DNNs) as models of sensory information processing. However, the interpretation of such model evaluations is muddied by imprecise statistical conclusions. Here, we make explicit the types of conclusions that can be drawn from these existing model comparison techniques and how these conclusions change when the model in question is a DNN. We discuss how DNNs are amenable to new model comparison techniques that allow for stronger conclusions to be made about the computational mechanisms underlying sensory information processing.

openalex-author · arXiv (Cornell University)

On the Learning Dynamics of Deep Neural Networks

While a lot of progress has been made in recent years, the dynamics of learning in deep nonlinear neural networks remain to this day largely misunderstood. In this work, we study the case of binary classification and prove various properties of learning in such networks under strong assumptions such as linear separability of the data. Extending existing results from the linear case, we confirm empirical observations by proving that the classification error also follows a sigmoidal shape in nonlinear architectures. We show that given proper initialization, learning expounds parallel independent modes and that certain regions of parameter space might lead to failed training. We also demonstrate that input norm and features' frequency in the dataset lead to distinct convergence speeds which might shed some light on the generalization capabilities of deep neural networks. We provide a comparison between the dynamics of learning with cross-entropy and hinge losses, which could prove useful to understand recent progress in the training of generative adversarial networks. Finally, we identify a phenomenon that we baptize gradient starvation where the most frequent features in a dataset prevent the learning of other less frequent but equally informative features.

openalex-author · Interspeech 2018

Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition

Recently, the connectionist temporal classification (CTC) model coupled with recurrent (RNN) or convolutional neural networks (CNN), made it easier to train speech recognition systems in an end-to-end fashion. However in real-valued models, time frame components such as mel-filter-bank energies and the cepstral coefficients obtained from them, together with their first and second order derivatives, are processed as individual elements, while a natural alternative is to process such components as composed entities. We propose to group such elements in the form of quaternions and to process these quaternions using the established quaternion algebra. Quaternion numbers and quaternion neural networks have shown their efficiency to process multidimensional inputs as entities, to encode internal dependencies, and to solve many tasks with less learning parameters than real-valued models. This paper proposes to integrate multiple feature views in quaternion-valued convolutional neural network (QCNN), to be used for sequence-to-sequence mapping with the CTC model. Promising results are reported using simple QCNNs in phoneme recognition experiments with the TIMIT corpus. More precisely, QCNNs obtain a lower phoneme error rate (PER) with less learning parameters than a competing model based on real-valued CNNs.

openalex-author · Interspeech 2018

Twin Regularization for Online Speech Recognition

Online speech recognition is crucial for developing natural human-machine interfaces. This modality, however, is significantly more challenging than off-line ASR, since real-time/low-latency constraints inevitably hinder the use of future information, that is known to be very helpful to perform robust predictions. A popular solution to mitigate this issue consists of feeding neural acoustic models with context windows that gather some future frames. This introduces a latency which depends on the number of employed look-ahead features. This paper explores a different approach, based on estimating the future rather than waiting for it. Our technique encourages the hidden representations of a unidirectional recurrent network to embed some useful information about the future. Inspired by a recently proposed technique called Twin Networks, we add a regularization term that forces forward hidden states to be as close as possible to cotemporal backward ones, computed by a "twin" neural network running backwards in time. The experiments, conducted on a number of datasets, recurrent architectures, input features, and acoustic conditions, have shown the effectiveness of this approach. One important advantage is that our method does not introduce any additional computation at test time if compared to standard unidirectional recurrent networks.

openalex-author · arXiv (Cornell University)

Learning deep representations by mutual information estimation and maximization

In this work, we perform unsupervised learning of representations by maximizing mutual information between an input and the output of a deep neural network encoder. Importantly, we show that structure matters: incorporating knowledge about locality of the input to the objective can greatly influence a representation's suitability for downstream tasks. We further control characteristics of the representation by matching to a prior distribution adversarially. Our method, which we call Deep InfoMax (DIM), outperforms a number of popular unsupervised learning methods and competes with fully-supervised learning on several classification tasks. DIM opens new avenues for unsupervised learning of representations and is an important step towards flexible formulations of representation-learning objectives for specific end-goals.

openalex-author · arXiv (Cornell University)

Generalization of Equilibrium Propagation to Vector Field Dynamics

The biological plausibility of the backpropagation algorithm has long been doubted by neuroscientists. Two major reasons are that neurons would need to send two different types of signal in the forward and backward phases, and that pairs of neurons would need to communicate through symmetric bidirectional connections. We present a simple two-phase learning procedure for fixed point recurrent networks that addresses both these issues. In our model, neurons perform leaky integration and synaptic weights are updated through a local mechanism. Our learning method generalizes Equilibrium Propagation to vector field dynamics, relaxing the requirement of an energy function. As a consequence of this generalization, the algorithm does not compute the true gradient of the objective function, but rather approximates it at a precision which is proven to be directly related to the degree of symmetry of the feedforward and feedback weights. We show experimentally that our algorithm optimizes the objective function.

openalex-author · Paper

Predicting Solution Summaries to Integer Linear Programs under Imperfect Information with Machine Learning.

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

DNN's Sharpest Directions Along the SGD Trajectory.

openalex-author · Distill

Feature-wise transformations

Many real-world problems require integrating multiple sources of information. Sometimes these problems involve multiple, distinct modalities of information — vision, language, audio, etc. — as is required to understand a scene in a movie or answer a question about an image. Other times, these problems involve multiple sources of the same kind of input, i.e. when summarizing several documents or drawing one image in the style of another.

openalex-author · PolyPublie (École Polytechnique de Montréal)

Focused Hierarchical RNNs for Conditional Sequence Processing

Recurrent Neural Networks (RNNs) with attention mechanisms have obtained state-of-the-art results for many sequence processing tasks. Most of these models use a simple form of encoder with attention that looks over the entire sequence and assigns a weight to each token independently. We present a mechanism for focusing RNN encoders for sequence modelling tasks which allows them to attend to key parts of the input as needed. We formulate this using a multi-layer conditional sequence encoder that reads in one token at a time and makes a discrete decision on whether the token is relevant to the context or question being asked. The discrete gating mechanism takes in the context embedding and the current hidden state as inputs and controls information flow into the layer above. We train it using policy gradient methods. We evaluate this method on several types of tasks with different attributes. First, we evaluate the method on synthetic tasks which allow us to evaluate the model for its generalization ability and probe the behavior of the gates in more controlled settings. We then evaluate this approach on large scale Question Answering tasks including the challenging MS MARCO and SearchQA tasks. Our models shows consistent improvements for both tasks over prior work and our baselines. It has also shown to generalize significantly better on synthetic tasks as compared to the baselines.

openalex-author · International Conference on Machine Learning

Mutual Information Neural Estimation.

No abstract available from the OpenAlex source record.

openalex-author · 2018 International Joint Conference on Neural Networks (IJCNN)

MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation

Monaural singing voice separation task focuses on the prediction of the singing voice from a single channel music mixture signal. Current state of the art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods. In this work we present a novel recurrent neural approach that learns long-term temporal patterns and structures of a musical piece. We build upon the recently proposed Masker-Denoiser (MaD) architecture and we enhance it with the Twin Networks, a technique to regularize a recurrent generative network using a backward running copy of the network. We evaluate our method using the Demixing Secret Dataset and we obtain an increment to signal-to-distortion ratio (SDR) of 0.37 dB and to signal-to-interference ratio (SIR) of 0.23 dB, compared to previous SOTA results.

openalex-author · arXiv (Cornell University)

On the Spectral Bias of Deep Neural Networks

It is well known that over-parametrized deep neural networks (DNNs) are an overly expressive class of functions that can memorize even random data with $100\%$ training accuracy. This raises the question why they do not easily overfit real data. To answer this question, we study deep networks using Fourier analysis. We show that deep networks with finite weights (or trained for finite number of steps) are inherently biased towards representing smooth functions over the input space. Specifically, the magnitude of a particular frequency component ($k$) of deep ReLU network function decays at least as fast as $\mathcal{O}(k^{-2})$, with width and depth helping polynomially and exponentially (respectively) in modeling higher frequencies. This shows for instance why DNNs cannot perfectly \textit{memorize} peaky delta-like functions. We also show that DNNs can exploit the geometry of low dimensional data manifolds to approximate complex functions that exist along the manifold with simple functions when seen with respect to the input space. As a consequence, we find that all samples (including adversarial samples) classified by a network to belong to a certain class are connected by a path such that the prediction of the network along that path does not change. Finally we find that DNN parameters corresponding to functions with higher frequency components occupy a smaller volume in the parameter.

openalex-author · arXiv (Cornell University)

Towards Gene Expression Convolutions using Gene Interaction Graphs

We study the challenges of applying deep learning to gene expression data. We find experimentally that there exists non-linear signal in the data, however is it not discovered automatically given the noise and low numbers of samples used in most research. We discuss how gene interaction graphs (same pathway, protein-protein, co-expression, or research paper text association) can be used to impose a bias on a deep model similar to the spatial bias imposed by convolutions on an image. We explore the usage of Graph Convolutional Neural Networks coupled with dropout and gene embeddings to utilize the graph information. We find this approach provides an advantage for particular tasks in a low data regime but is very dependent on the quality of the graph used. We conclude that more work should be done in this direction. We design experiments that show why existing methods fail to capture signal that is present in the data when features are added which clearly isolates the problem that needs to be addressed.

openalex-author · arXiv (Cornell University)

Modularity Matters: Learning Invariant Relational Reasoning Tasks

We focus on two supervised visual reasoning tasks whose labels encode a semantic relational rule between two or more objects in an image: the MNIST Parity task and the colorized Pentomino task. The objects in the images undergo random translation, scaling, rotation and coloring transformations. Thus these tasks involve invariant relational reasoning. We report uneven performance of various deep CNN models on these two tasks. For the MNIST Parity task, we report that the VGG19 model soundly outperforms a family of ResNet models. Moreover, the family of ResNet models exhibits a general sensitivity to random initialization for the MNIST Parity task. For the colorized Pentomino task, now both the VGG19 and ResNet models exhibit sluggish optimization and very poor test generalization, hovering around 30% test error. The CNN we tested all learn hierarchies of fully distributed features and thus encode the distributed representation prior. We are motivated by a hypothesis from cognitive neuroscience which posits that the human visual cortex is modularized, and this allows the visual cortex to learn higher order invariances. To this end, we consider a modularized variant of the ResNet model, referred to as a Residual Mixture Network (ResMixNet) which employs a mixture-of-experts architecture to interleave distributed representations with more specialized, modular representations. We show that very shallow ResMixNets are capable of learning each of the two tasks well, attaining less than 2% and 1% test error on the MNIST Parity and the colorized Pentomino tasks respectively. Most importantly, the ResMixNet models are extremely parameter efficient: generalizing better than various non-modular CNNs that have over 10x the number of parameters. These experimental results support the hypothesis that modularity is a robust prior for learning invariant relational reasoning.

openalex-author · Bioinformatics

Deep convolutional networks for quality assessment of protein folds

Supplementary data are available at Bioinformatics online.

openalex-author · arXiv (Cornell University)

Manifold Mixup: Encouraging Meaningful On-Manifold Interpolation as a Regularizer.

Deep networks often perform well on the data manifold on which they are trained, yet give incorrect (and often very confident) answers when evaluated on points from off of the training distribution. This is exemplified by the adversarial examples phenomenon but can also be seen in terms of model generalization and domain shift. We propose Manifold Mixup which encourages the network to produce more reasonable and less confident predictions at points with combinations of attributes not seen in the training set. This is accomplished by training on convex combinations of the hidden state representations of data samples. Using this method, we demonstrate improved semi-supervised learning, learning with limited labeled data, and robustness to adversarial examples. Manifold Mixup requires no (significant) additional computation. Analytical experiments on both real data and synthetic data directly support our hypothesis for why the Manifold Mixup method improves results.

openalex-author · arXiv (Cornell University)

Quaternion Recurrent Neural Networks

Recurrent neural networks (RNNs) are powerful architectures to model\nsequential data, due to their capability to learn short and long-term\ndependencies between the basic elements of a sequence. Nonetheless, popular\ntasks such as speech or images recognition, involve multi-dimensional input\nfeatures that are characterized by strong internal dependencies between the\ndimensions of the input vector. We propose a novel quaternion recurrent neural\nnetwork (QRNN), alongside with a quaternion long-short term memory neural\nnetwork (QLSTM), that take into account both the external relations and these\ninternal structural dependencies with the quaternion algebra. Similarly to\ncapsules, quaternions allow the QRNN to code internal dependencies by composing\nand processing multidimensional features as single entities, while the\nrecurrent operation reveals correlations between the elements composing the\nsequence. We show that both QRNN and QLSTM achieve better performances than RNN\nand LSTM in a realistic application of automatic speech recognition. Finally,\nwe show that QRNN and QLSTM reduce by a maximum factor of 3.3x the number of\nfree parameters needed, compared to real-valued RNNs and LSTMs to reach better\nresults, leading to a more compact representation of the relevant information.\n

openalex-author · arXiv (Cornell University)

Learning to rank for censored survival data

Survival analysis is a type of semi-supervised ranking task where the target output (the survival time) is often right-censored. Utilizing this information is a challenge because it is not obvious how to correctly incorporate these censored examples into a model. We study how three categories of loss functions, namely partial likelihood methods, rank methods, and our classification method based on a Wasserstein metric (WM) and the non-parametric Kaplan Meier estimate of the probability density to impute the labels of censored examples, can take advantage of this information. The proposed method allows us to have a model that predict the probability distribution of an event. If a clinician had access to the detailed probability of an event over time this would help in treatment planning. For example, determining if the risk of kidney graft rejection is constant or peaked after some time. Also, we demonstrate that this approach directly optimizes the expected C-index which is the most common evaluation metric for ranking survival models.

openalex-author · 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

On the Iterative Refinement of Densely Connected Representation Levels for Semantic Segmentation

State-of-the-art semantic segmentation approaches increase the receptive field of their models by using either a downsampling path composed of poolings/strided convolutions or successive dilated convolutions. However, it is not clear which operation leads to best results. In this paper, we systematically study the differences introduced by distinct receptive field enlargement methods and their impact on the performance of a novel architecture, called Fully Convolutional DenseResNet (FC-DRN). FC-DRN has a densely connected backbone composed of residual networks. Following standard image segmentation architectures, receptive field enlargement operations that change the representation level are interleaved among residual networks. This allows the model to exploit the benefits of both residual and dense connectivity patterns, namely: gradient flow, iterative refinement of representations, multi-scale feature combination and deep supervision. In order to highlight the potential of our model, we test it on the challenging CamVid urban scene understanding benchmark and make the following observations: 1) downsampling operations outperform dilations when the model is trained from scratch, 2) dilations are useful during the finetuning step of the model, 3) coarser representations require less refinement steps, and 4) ResNets (by model construction) are good regularizers, since they can reduce the model capacity when needed. Finally, we compare our architecture to alternative methods and report state-of-the-art result on the Camvid dataset, with at least twice fewer parameters.

openalex-author · arXiv (Cornell University)

Image-to-image translation for cross-domain disentanglement

Deep image translation methods have recently shown excellent results, outputting high-quality images covering multiple modes of the data distribution. There has also been increased interest in disentangling the internal representations learned by deep methods to further improve their performance and achieve a finer control. In this paper, we bridge these two objectives and introduce the concept of cross-domain disentanglement. We aim to separate the internal representation into three parts. The shared part contains information for both domains. The exclusive parts, on the other hand, contain only factors of variation that are particular to each domain. We achieve this through bidirectional image translation based on Generative Adversarial Networks and cross-domain autoencoders, a novel network component. Our model offers multiple advantages. We can output diverse samples covering multiple modes of the distributions of both domains, perform domain-specific image transfer and interpolation, and cross-domain retrieval without the need of labeled data, only paired images. We compare our model to the state-of-the-art in multi-modal image translation and achieve better results for translation on challenging datasets as well as for cross-domain retrieval on realistic datasets.

openalex-author · arXiv (Cornell University)

Low-memory convolutional neural networks through incremental depth-first processing

We introduce an incremental processing scheme for convolutional neural network (CNN) inference, targeted at embedded applications with limited memory budgets. Instead of processing layers one by one, individual input pixels are propagated through all parts of the network they can influence under the given structural constraints. This depth-first updating scheme comes with hard bounds on the memory footprint: the memory required is constant in the case of 1D input and proportional to the square root of the input dimension in the case of 2D input.

openalex-author · arXiv (Cornell University)

Commonsense mining as knowledge base completion? A study on the impact\n of novelty

Commonsense knowledge bases such as ConceptNet represent knowledge in the\nform of relational triples. Inspired by the recent work by Li et al., we\nanalyse if knowledge base completion models can be used to mine commonsense\nknowledge from raw text. We propose novelty of predicted triples with respect\nto the training set as an important factor in interpreting results. We\ncritically analyse the difficulty of mining novel commonsense knowledge, and\nshow that a simple baseline method outperforms the previous state of the art on\npredicting more novel.\n

openalex-author · Paper

Iteratively unveiling new regions of interest in Deep Learning models

No abstract available from the OpenAlex source record.

openalex-author · Paper

Convolutional neural networks for mesh-based parcellation of the cerebral cortex

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Fortified Networks: Improving the Robustness of Deep Networks by Modeling the Manifold of Hidden Representations

Deep networks have achieved impressive results across a variety of important tasks. However a known weakness is a failure to perform well when evaluated on data which differ from the training distribution, even if these differences are very small, as is the case with adversarial examples. We propose Fortified Networks, a simple transformation of existing networks, which fortifies the hidden layers in a deep network by identifying when the hidden states are off of the data manifold, and maps these hidden states back to parts of the data manifold where the network performs well. Our principal contribution is to show that fortifying these hidden states improves the robustness of deep networks and our experiments (i) demonstrate improved robustness to standard adversarial attacks in both black-box and white-box threat models; (ii) suggest that our improvements are not primarily due to the gradient masking problem and (iii) show the advantage of doing this fortification in the hidden layers instead of the input space.

openalex-author · arXiv (Cornell University)

Recall Traces: Backtracking Models for Efficient Reinforcement Learning

In many environments only a tiny subset of all states yield high reward. In these cases, few of the interactions with the environment provide a relevant learning signal. Hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them. To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state. We can train a model which, starting from a high value state (or one that is estimated to have high value), predicts and sample for which the (state, action)-tuples may have led to that high value state. These traces of (state, action) pairs, which we refer to as Recall Traces, sampled from this backtracking model starting from a high value state, are informative as they terminate in good states, and hence we can use these traces to improve a policy. We provide a variational interpretation for this idea and a practical algorithm in which the backtracking model samples from an approximate posterior distribution over trajectories which lead to large rewards. Our method improves the sample efficiency of both on- and off-policy RL algorithms across several environments and tasks.

openalex-author · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Towards End-to-end Spoken Language Understanding

Spoken language understanding system is traditionally designed as a pipeline of a number of components. First, the audio signal is processed by an automatic speech recognizer for transcription or n-best hypotheses. With the recognition results, a natural language understanding system classifies the text to structured data as domain, intent and slots for down-streaming consumers, such as dialog system, hands-free applications. These components are usually developed and optimized independently. In this paper, we present our study on an end-to-end learning system for spoken language understanding. With this unified approach, we can infer the semantic meaning directly from audio features without the intermediate text representation. This study showed that the trained model can achieve reasonable good result and demonstrated that the model can capture the semantic attention directly from the audio features.

openalex-author · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Dynamic Frame Skipping for Fast Speech Recognition in Recurrent Neural Network Based Acoustic Models

A recurrent neural network is a powerful tool for modeling sequential data such as text and speech. While recurrent neural networks have achieved record-breaking results in speech recognition, one remaining challenge is their slow processing speed. The main cause comes from the nature of recurrent neural networks that read only one frame at each time step. Therefore, reducing the number of reads is an effective approach to reducing processing time. In this paper, we propose a novel recurrent neural network architecture called Skip-RNN, which dynamically skips speech frames that are less important. The Skip-RNN consists of an acoustic model network and skip-policy network that are jointly trained to classify speech frames and determine how many frames to skip. We evaluate our proposed approach on the Wall Street Journal corpus and show that it can accelerate acoustic model computation by up to 2.4 times without any noticeable degradation in transcription accuracy.

openalex-author · PolyPublie (École Polytechnique de Montréal)

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

A lot of the recent success in natural language processing (NLP) has been driven by distributed vector representations of words trained on large amounts of text in an unsupervised manner. These representations are typically used as general purpose features for words across a range of NLP problems. However, extending this success to learning representations of sequences of words, such as sentences, remains an open problem. Recent work has explored unsupervised as well as supervised learning techniques with different training objectives to learn general purpose fixed-length sentence representations. In this work, we present a simple, effective multi-task learning framework for sentence representations that combines the inductive biases of diverse training objectives in a single model. We train this model on several data sources with multiple training objectives on over 100 million sentences. Extensive experiments demonstrate that sharing a single recurrent sentence encoder across weakly related tasks leads to consistent improvements over previous methods. We present substantial improvements in the context of transfer learning and low-resource settings using our learned general-purpose representations.

openalex-author · IEEE Transactions on Emerging Topics in Computational Intelligence

Light Gated Recurrent Units for Speech Recognition

A field that has directly benefited from the recent advances in deep learning is automatic speech recognition (ASR). Despite the great achievements of the past decades, however, a natural and robust human-machine speech interaction still appears to be out of reach, especially in challenging environments characterized by significant noise and reverberation. To improve robustness, modern speech recognizers often employ acoustic models based on recurrent neural networks (RNNs) that are naturally able to exploit large time contexts and long-term speech modulations. It is thus of great interest to continue the study of proper techniques for improving the effectiveness of RNNs in processing speech signals. In this paper, we revise one of the most popular RNN models, namely, gated recurrent units (GRUs), and propose a simplified architecture that turned out to be very effective for ASR. The contribution of this work is twofold: First, we analyze the role played by the reset gate, showing that a significant redundancy with the update gate occurs. As a result, we propose to remove the former from the GRU design, leading to a more efficient and compact single-gate model. Second, we propose to replace hyperbolic tangent with rectified linear unit activations. This variation couples well with batch normalization and could help the model learn long-term dependencies without numerical issues. Results show that the proposed architecture, called light GRU, not only reduces the per-epoch training time by more than 30% over a standard GRU, but also consistently improves the recognition accuracy across different tasks, input features, noisy conditions, as well as across different ASR paradigms, ranging from standard DNN-HMM speech recognizers to end-to-end connectionist temporal classification models.

openalex-author · arXiv (Cornell University)

Disentangling the independently controllable factors of variation by interacting with the world

It has been postulated that a good representation is one that disentangles the underlying explanatory factors of variation. However, it remains an open question what kind of training framework could potentially achieve that. Whereas most previous work focuses on the static setting (e.g., with images), we postulate that some of the causal factors could be discovered if the learner is allowed to interact with its environment. The agent can experiment with different actions and observe their effects. More specifically, we hypothesize that some of these factors correspond to aspects of the environment which are independently controllable, i.e., that there exists a policy and a learnable feature for each such aspect of the environment, such that this policy can yield changes in that feature with minimal changes to other features that explain the statistical variations in the observed data. We propose a specific objective function to find such factors, and verify experimentally that it can indeed disentangle independently controllable aspects of the environment without any extrinsic reward signal.

openalex-author · arXiv (Cornell University)

Learning Anonymized Representations with Adversarial Neural Networks

Statistical methods protecting sensitive information or the identity of the data owner have become critical to ensure privacy of individuals as well as of organizations. This paper investigates anonymization methods based on representation learning and deep neural networks, and motivated by novel information theoretical bounds. We introduce a novel training objective for simultaneously training a predictor over target variables of interest (the regular labels) while preventing an intermediate representation to be predictive of the private labels. The architecture is based on three sub-networks: one going from input to representation, one from representation to predicted regular labels, and one from representation to predicted private labels. The training procedure aims at learning representations that preserve the relevant part of the information (about regular labels) while dismissing information about the private labels which correspond to the identity of a person. We demonstrate the success of this approach for two distinct classification versus anonymization tasks (handwritten digits and sentiment analysis).

openalex-author · arXiv (Cornell University)

A Walk with SGD.

We present novel empirical observations regarding how stochastic gradient descent (SGD) navigates the loss landscape of over-parametrized deep neural networks (DNNs). These observations expose the qualitatively different roles of learning rate and batch-size in DNN optimization and generalization. Specifically we study the DNN loss surface along the trajectory of SGD by interpolating the loss surface between parameters from consecutive \textit{iterations} and tracking various metrics during training. We find that the loss interpolation between parameters before and after each training iteration's update is roughly convex with a minimum (\textit{valley floor}) in between for most of the training. Based on this and other metrics, we deduce that for most of the training update steps, SGD moves in valley like regions of the loss surface by jumping from one valley wall to another at a height above the valley floor. This 'bouncing between walls at a height' mechanism helps SGD traverse larger distance for small batch sizes and large learning rates which we find play qualitatively different roles in the dynamics. While a large learning rate maintains a large height from the valley floor, a small batch size injects noise facilitating exploration. We find this mechanism is crucial for generalization because the valley floor has barriers and this exploration above the valley floor allows SGD to quickly travel far away from the initialization point (without being affected by barriers) and find flatter regions, corresponding to better generalization.

openalex-author · arXiv (Cornell University)

ChatPainter: Improving Text to Image Generation using Dialogue

Synthesizing realistic images from text descriptions on a dataset like Microsoft Common Objects in Context (MS COCO), where each image can contain several objects, is a challenging task. Prior work has used text captions to generate images. However, captions might not be informative enough to capture the entire image and insufficient for the model to be able to understand which objects in the images correspond to which words in the captions. We show that adding a dialogue that further describes the scene leads to significant improvement in the inception score and in the quality of generated images on the MS COCO dataset.

openalex-author · arXiv (Cornell University)

Generalization in Machine Learning via Analytical Learning Theory

This paper introduces a novel measure-theoretic theory for machine learning that does not require statistical assumptions. Based on this theory, a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN. Moreover, the proposed theory provides a theoretical basis for a family of practically successful regularization methods in deep learning. We discuss several consequences of our results on one-shot learning, representation learning, deep learning, and curriculum learning. Unlike statistical learning theory, the proposed learning theory analyzes each problem instance individually via measure theory, rather than a set of problem instances via statistics. As a result, it provides different types of results and insights when compared to statistical learning theory.

openalex-author · PolyPublie (École Polytechnique de Montréal)

Twin Networks: Matching the Future for Sequence Generation

We propose a simple technique for encouraging generative RNNs to plan ahead. We train a “backward” recurrent network to generate a given sequence in reverse order, and we encourage states of the forward model to predict cotemporal states of the backward model. The backward network is used only during training, and plays no role during sampling or inference. We hypothesize that our approach eases modeling of long-term dependencies by implicitly forcing the forward states to hold information about the longer-term future (as contained in the backward states). We show empirically that our approach achieves 9% relative improvement for a speech recognition task, and achieves significant improvement on a COCO caption generation task.

openalex-author · Paper

Combining Model-based and Model-free RL via Multi-step Control Variates

Model-free deep reinforcement learning algorithms are able to successfully solve a wide range of continuous control tasks, but typically require many on-policy samples to achieve good performance. Model-based RL algorithms are sample-efficient on the other hand, while learning accurate global models of complex dynamic environments has turned out to be tricky in practice, which leads to the unsatisfactory performance of the learned policies. In this work, we combine the sample-efficiency of model-based algorithms and the accuracy of model-free algorithms. We leverage multi-step neural network based predictive models by embedding real trajectories into imaginary rollouts of the model, and use the imaginary cumulative rewards as control variates for model-free algorithms. In this way, we achieved the strengths of both sides and derived an estimator which is not only sample-efficient, but also unbiased and of very low variance. We present our evaluation on the MuJoCo and OpenAI Gym benchmarks.

openalex-author · International Conference on Learning Representations

Extending the Framework of Equilibrium Propagation to General Dynamics

The biological plausibility of the backpropagation algorithm has long been doubted by neuroscientists. Two major reasons are that neurons would need to send two different types of signal in the forward and backward phases, and that pairs of neurons would need to communicate through symmetric bidirectional connections. We present a simple two-phase learning procedure for fixed point recurrent networks that addresses both these issues. In our model, neurons perform leaky integration and synaptic weights are updated through a local mechanism. Our learning method extends the framework of Equilibrium Propagation to general dynamics, relaxing the requirement of an energy function. As a consequence of this generalization, the algorithm does not compute the true gradient of the objective function, but rather approximates it at a precision which is proven to be directly related to the degree of symmetry of the feedforward and feedback weights. We show experimentally that the intrinsic properties of the system lead to alignment of the feedforward and feedback weights, and that our algorithm optimizes the objective function.

openalex-author · Paper

Learning Generative Models with Locally Disentangled Latent Factors

One of the most successful techniques in generative models has been decomposing a complicated generation task into a series of simpler generation tasks. For example, generating an image at a low resolution and then learning to refine that into a high resolution image often improves results substantially. Here we explore a novel strategy for decomposing generation for complicated objects in which we first generate latent variables which describe a subset of the observed variables, and then map from these latent variables to the observed space. We show that this allows us to achieve decoupled training of complicated generative models and present both theoretical and experimental results supporting the benefit of such an approach.

openalex-author · International Conference on Learning Representations

Boundary Seeking GANs

Generative adversarial networks are a learning framework that rely on training a discriminator to estimate a measure of difference between a target and generated distributions. GANs, as normally formulated, rely on the generated samples being completely differentiable w.r.t. the generative parameters, and thus do not work for discrete data. We introduce a method for training GANs with discrete data that uses the estimated difference measure from the discriminator to compute importance weights for generated samples, thus providing a policy gradient for training the generator. The importance weights have a strong connection to the decision boundary of the discriminator, and we call our method boundary-seeking GANs (BGANs). We demonstrate the effectiveness of the proposed algorithm with discrete image and character-based natural language generation. In addition, the boundary-seeking objective extends to continuous data, which can be used to improve stability of training, and we demonstrate this on Celeba, Large-scale Scene Understanding (LSUN) bedrooms, and Imagenet without conditioning.

openalex-author · International Conference on Learning Representations

Finding Flatter Minima with SGD

No abstract available from the OpenAlex source record.

openalex-author · International Conference on Learning Representations

Universal Successor Representations for Transfer Reinforcement Learning

The objective of transfer reinforcement learning is to generalize from a set of previous tasks to unseen new tasks. In this work, we focus on the transfer scenario where the dynamics among tasks are the same, but their goals differ. Although general value function (Sutton et al., 2011) has been shown to be useful for knowledge transfer, learning a universal value function can be challenging in practice. To attack this, we propose (1) to use universal successor representations (USR) to represent the transferable knowledge and (2) a USR approximator (USRA) that can be trained by interacting with the environment. Our experiments show that USR can be effectively applied to new tasks, and the agent initialized by the trained USRA can achieve the goal considerably faster than random initialization.

openalex-author · Paper

Graph Priors for Deep Neural Networks

No abstract available from the OpenAlex source record.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

MaD TwinNet Pre-trained Weights

This .zip file contains the pre-trained weights for the MaD TwinNet. \n\nIn order to use them, you need the code of MaD TwinNet (available from here). Code of MaD TwinNet is based on PyTorch framework. \n\nAll weights are serialized (i.e. saved to hard disk) to the disk using Python 3.6v pickle package and\n\nprotocol=2\n\nNote bold: Software license of the MaD TwinNet code is also applied to these pre-trained weights.

openalex-author · Neurocomputing

Fine-grained attention mechanism for neural machine translation

No abstract available from the OpenAlex source record.

openalex-author · Neural Computation

Dynamic Neural Turing Machine with Continuous and Discrete Addressing Schemes

We extend the neural Turing machine (NTM) model into a dynamic neural Turing machine (D-NTM) by introducing trainable address vectors. This addressing scheme maintains for each memory cell two separate vectors, content and address vectors. This allows the D-NTM to learn a wide variety of location-based addressing strategies, including both linear and nonlinear ones. We implement the D-NTM with both continuous and discrete read and write mechanisms. We investigate the mechanisms and effects of learning to read and write into a memory through experiments on Facebook bAbI tasks using both a feedforward and GRU controller. We provide extensive analysis of our model and compare different variations of neural Turing machines on this task. We show that our model outperforms long short-term memory and NTM variants. We provide further experimental results on the sequential [Formula: see text]MNIST, Stanford Natural Language Inference, associative recall, and copy tasks.

openalex-author · arXiv (Cornell University)

A Deep Reinforcement Learning Chatbot (Short Version)

We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including neural network and template-based models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than other systems. The results highlight the potential of coupling ensemble systems with deep reinforcement learning as a fruitful path for developing real-world, open-domain conversational agents.

openalex-author · arXiv (Cornell University)

MINE: Mutual Information Neural Estimation

We argue that the estimation of mutual information between high dimensional continuous random variables can be achieved by gradient descent over neural networks. We present a Mutual Information Neural Estimator (MINE) that is linearly scalable in dimensionality as well as in sample size, trainable through back-prop, and strongly consistent. We present a handful of applications on which MINE can be used to minimize or maximize mutual information. We apply MINE to improve adversarially trained generative models. We also use MINE to implement Information Bottleneck, applying it to supervised classification; our results demonstrate substantial improvement in flexibility and performance in these settings.

openalex-author · arXiv (Cornell University)

A3T: Adversarially Augmented Adversarial Training

Recent research showed that deep neural networks are highly sensitive to so-called adversarial perturbations, which are tiny perturbations of the input data purposely designed to fool a machine learning classifier. Most classification models, including deep learning models, are highly vulnerable to adversarial attacks. In this work, we investigate a procedure to improve adversarial robustness of deep neural networks through enforcing representation invariance. The idea is to train the classifier jointly with a discriminator attached to one of its hidden layer and trained to filter the adversarial noise. We perform preliminary experiments to test the viability of the approach and to compare it to other standard adversarial training methods.

openalex-author · Neural Computation

Equivalence of Equilibrium Propagation and Recurrent Backpropagation

Recurrent backpropagation and equilibrium propagation are supervised learning algorithms for fixed-point recurrent neural networks, which differ in their second phase. In the first phase, both algorithms converge to a fixed point that corresponds to the configuration where the prediction is made. In the second phase, equilibrium propagation relaxes to another nearby fixed point corresponding to smaller prediction error, whereas recurrent backpropagation uses a side network to compute error derivatives iteratively. In this work, we establish a close connection between these two algorithms. We show that at every moment in the second phase, the temporal derivatives of the neural activities in equilibrium propagation are equal to the error derivatives computed iteratively by recurrent backpropagation in the side network. This work shows that it is not required to have a side network for the computation of error derivatives and supports the hypothesis that in biological neural networks, temporal derivatives of neural activities may code for error signals.

openalex-author · 2018 Annual Meeting of the Organization of Human Brain Mapping (OHBM), Singapore, Singapore, 2018-06-18 - 2018-06-21

BigBrain: 1D convolutional neural networks for automated sementation of cortical layers

No abstract available from the OpenAlex source record.

openalex-author · The Springer Series on Challenges in Machine Learning

The First Conversational Intelligence Challenge

No abstract available from the OpenAlex source record.

openalex-author · Lecture Notes in Computer Science

Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio

No abstract available from the OpenAlex source record.

openalex-author · The Springer Series on Challenges in Machine Learning

Introduction to NIPS 2017 Competition Track

No abstract available from the OpenAlex source record.

openalex-author · Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Straight to the Tree: Constituency Parsing with Neural Syntactic Distance

Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessandro Sordoni, Aaron Courville, Yoshua Bengio. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.

openalex-author · Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, Christopher D. Manning. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018.

openalex-author · Apollo (University of Cambridge)

Deep Graph Infomax

We present Deep Graph Infomax (DGI), a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs---both derived using established graph convolutional network architectures. The learnt patch representations summarize subgraphs centered around nodes of interest, and can thus be reused for downstream node-wise learning tasks. In contrast to most prior approaches to unsupervised learning with GCNs, DGI does not rely on random walk objectives, and is readily applicable to both transductive and inductive learning setups. We demonstrate competitive performance on a variety of node classification benchmarks, which at times even exceeds the performance of supervised learning.

openalex-author · PolyPublie (École Polytechnique de Montréal)

Sparse Attentive Backtracking: Temporal Credit Assignment Through Reminding

Learning long-term dependencies in extended temporal sequences requires credit assignment to events far back in the past. The most common method for training recurrent neural networks, back-propagation through time (BPTT), requires credit information to be propagated backwards through every single step of the forward computation, potentially over thousands or millions of time steps. This becomes computationally expensive or even infeasible when used with long sequences. Importantly, biological brains are unlikely to perform such detailed reverse replay over very long sequences of internal states (consider days, months, or years.) However, humans are often reminded of past memories or mental states which are associated with the current mental state. We consider the hypothesis that such memory associations between past and present could be used for credit assignment through arbitrarily long sequences, propagating the credit assigned to the current state to the associated past state. Based on this principle, we study a novel algorithm which only back-propagates through a few of these temporal skip connections, realized by a learned attention mechanism that associates current states with relevant past states. We demonstrate in experiments that our method matches or outperforms regular BPTT and truncated BPTT in tasks involving particularly long-term dependencies, but without requiring the biologically implausible backward replay through the whole history of states. Additionally, we demonstrate that the proposed method transfers to longer sequences significantly better than LSTMs trained with BPTT and LSTMs trained with full self-attention.

openalex-author · 2018 Conference on Cognitive Computational Neuroscience

Ghost Units Yield Biologically Plausible Backprop in Deep Neural Networks

In the past few years, deep learning has transformed artificial intelligence research and led to impressive performance in various difficult tasks. However, it is still unclear how the brain can perform credit assignment across many areas as efficiently as backpropagation does in deep neural networks. In this paper, we introduce a model that relies on a new role for a neuronal inhibitory machinery, referred to as ghost units. By cancelling the feedback coming from the upper layer when no target signal is provided to the top layer, the ghost units enables the network to backpropagate errors and do efficient credit assignment in deep structures. While considering one-compartment neurons and requiring very few biological assumptions, it is able to approximate the error gradient and achieve good performance on classification tasks. Error backpropagation occurs through the recurrent dynamics of the network and thanks to biologically plausible local learning rules. In particular, it does not require separate feedforward and feedback circuits. Different mechanisms for cancelling the feedback were studied, ranging from complete duplication of the connectivity by long term processes to online replication of the feedback activity. This reduced system combines the essential elements to have a working biologically abstracted analogue of backpropagation with a simple formulation and proofs of the associated results. Therefore, this model is a step towards understanding how learning and memory are implemented in cortical multilayer structures, but it also raises interesting perspectives for neuromorphic hardware.

openalex-author · Paper

Proceedings of the Workshop on Generalization in the Age of Deep Learning

In this paper, we investigate the tendency of end-to-end neural Machine Reading Comprehension (MRC) models to match shallow patterns rather than perform inference-oriented reasoning on RC benchmarks. We aim to test the ability of these systems to answer questions which focus on referential inference. We propose ParallelQA, a strategy to formulate such questions using parallel passages. We also demonstrate that existing neural models fail to generalize well to this setting.

openalex-author · Proceedings of the Workshop on Machine Reading for Question Answering

Neural Models for Key Phrase Extraction and Question Generation

We propose a two-stage neural model to tackle question generation from documents. First, our model estimates the probability that word sequences in a document are ones that a human would pick when selecting candidate answers by training a neural key-phrase extractor on the answers in a question-answering corpus. Predicted key phrases then act as target answers and condition a sequence-tosequence question-generation model with a copy mechanism. Empirically, our keyphrase extraction model significantly outperforms an entity-tagging baseline and existing rule-based approaches. We further demonstrate that our question generation system formulates fluent, answerable questions from key phrases. This twostage system could be used to augment or generate reading comprehension datasets, which may be leveraged to improve machine reading systems or in educational settings.

openalex-author · Proceedings of The Third Workshop on Representation Learning for NLP

Learning Hierarchical Structures On-The-Fly with a Recurrent-Recursive Model for Sequences

We propose a hierarchical model for sequential data that learns a tree on-thefly, i.e. while reading the sequence. In the model, a recurrent network adapts its structure and reuses recurrent weights in a recursive manner. This creates adaptive skip-connections that ease the learning of long-term dependencies. The tree structure can either be inferred without supervision through reinforcement learning, or learned in a supervised manner. We provide preliminary experiments in a novel Math Expression Evaluation (MEE) task, which is explicitly crafted to have a hierarchical tree structure that can be used to study the effectiveness of our model. Additionally, we test our model in a wellknown propositional logic and language modelling tasks. Experimental results show the potential of our approach.

openalex-author · arXiv (Cornell University)

Dendritic error backpropagation in deep cortical microcircuits

Animal behaviour depends on learning to associate sensory stimuli with the desired motor command. Understanding how the brain orchestrates the necessary synaptic modifications across different brain areas has remained a longstanding puzzle. Here, we introduce a multi-area neuronal network model in which synaptic plasticity continuously adapts the network towards a global desired output. In this model synaptic learning is driven by a local dendritic prediction error that arises from a failure to predict the top-down input given the bottom-up activities. Such errors occur at apical dendrites of pyramidal neurons where both long-range excitatory feedback and local inhibitory predictions are integrated. When local inhibition fails to match excitatory feedback an error occurs which triggers plasticity at bottom-up synapses at basal dendrites of the same pyramidal neurons. We demonstrate the learning capabilities of the model in a number of tasks and show that it approximates the classical error backpropagation algorithm. Finally, complementing this cortical circuit with a disinhibitory mechanism enables attention-like stimulus denoising and generation. Our framework makes several experimental predictions on the function of dendritic integration and cortical microcircuits, is consistent with recent observations of cross-area learning, and suggests a biological implementation of deep learning.

openalex-author · Nature

Use machine learning to find energy materials

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

ObamaNet: Photo-realistic lip-sync from text

We present ObamaNet, the first architecture that generates both audio and synchronized photo-realistic lip-sync videos from any new text. Contrary to other published lip-sync approaches, ours is only composed of fully trainable neural modules and does not rely on any traditional computer graphics methods. More precisely, we use three main modules: a text-to-speech network based on Char2Wav, a time-delayed LSTM to generate mouth-keypoints synced to the audio, and a network based on Pix2Pix to generate the video frames conditioned on the keypoints.

openalex-author · International Conference on Learning Representations

FigureQA: An Annotated Figure Dataset for Visual Reasoning

We introduce FigureQA, a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts. We formulate our reasoning task by generating questions from 15 templates; questions concern various relationships between plot elements and examine characteristics like the maximum, the minimum, area-under-the-curve, smoothness, and intersection. To resolve, such questions often require reference to multiple plot elements and synthesis of information distributed spatially throughout a figure. To facilitate the training of machine learning systems, the corpus also includes side data that can be used to formulate auxiliary objectives. In particular, we provide the numerical data used to generate each figure as well as bounding-box annotations for all plot elements. We study the proposed visual reasoning task by training several models, including the recently proposed Relation Network as strong baseline. Preliminary results indicate that the task poses a significant machine learning challenge. We envision FigureQA as a first step towards developing models that can intuitively recognize patterns from visual representations of data.

openalex-author · arXiv (Cornell University)

Measuring the tendency of CNNs to Learn Surface Statistical Regularities

Deep CNNs are known to exhibit the following peculiarity: on the one hand they generalize extremely well to a test set, while on the other hand they are extremely sensitive to so-called adversarial perturbations. The extreme sensitivity of high performance CNNs to adversarial examples casts serious doubt that these networks are learning high level abstractions in the dataset. We are concerned with the following question: How can a deep CNN that does not learn any high level semantics of the dataset manage to generalize so well? The goal of this article is to measure the tendency of CNNs to learn surface statistical regularities of the dataset. To this end, we use Fourier filtering to construct datasets which share the exact same high level abstractions but exhibit qualitatively different surface statistical regularities. For the SVHN and CIFAR-10 datasets, we present two Fourier filtered variants: a low frequency variant and a randomly filtered variant. Each of the Fourier filtering schemes is tuned to preserve the recognizability of the objects. Our main finding is that CNNs exhibit a tendency to latch onto the Fourier image statistics of the training dataset, sometimes exhibiting up to a 28% generalization gap across the various test sets. Moreover, we observe that significantly increasing the depth of a network has a very marginal impact on closing the aforementioned generalization gap. Thus we provide quantitative evidence supporting the hypothesis that deep CNNs tend to learn surface statistical regularities in the dataset rather than higher-level abstract concepts.

openalex-author · Neural Information Processing Systems

Plan, Attend, Generate: Planning for Sequence-to-Sequence Models

We investigate the integration of a planning mechanism into sequence-to-sequence models using attention. We develop a model which can plan ahead in the future when it computes its alignments between input and output sequences, constructing a matrix of proposed future alignments and a commitment vector that governs whether to follow or recompute the plan. This mechanism is inspired by the recently proposed strategic attentive reader and writer (STRAW) model for Reinforcement Learning. Our proposed model is end-to-end trainable using primarily differentiable operations. We show that it outperforms a strong baseline on character-level translation tasks from WMT'15, the algorithmic task of finding Eulerian circuits of graphs, and question generation from the text. Our analysis demonstrates that the model computes qualitatively intuitive alignments, converges faster than the baselines, and achieves superior performance with fewer parameters.

openalex-author · arXiv (Cornell University)

Trained Models for "Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask"

Support material (binary files) for the following work: S.I. Mimilakis, K. Drossos, J.F. Santos, G. Schuller, T. Virtanen, Y. Bengio , "Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask", in arXiv:1711.01437 [cs.SD], Nov. 2017. To be used here: https://github.com/Js-Mim/mss_pytorch

openalex-author · arXiv (Cornell University)

Variational Bi-LSTMs

Recurrent neural networks like long short-term memory (LSTM) are important architectures for sequential prediction tasks. LSTMs (and RNNs in general) model sequences along the forward time direction. Bidirectional LSTMs (Bi-LSTMs) on the other hand model sequences along both forward and backward directions and are generally known to perform better at such tasks because they capture a richer representation of the data. In the training of Bi-LSTMs, the forward and backward paths are learned independently. We propose a variant of the Bi-LSTM architecture, which we call Variational Bi-LSTM, that creates a channel between the two paths (during training, but which may be omitted during inference); thus optimizing the two paths jointly. We arrive at this joint objective for our model by minimizing a variational lower bound of the joint likelihood of the data sequence. Our model acts as a regularizer and encourages the two networks to inform each other in making their respective predictions using distinct information. We perform ablation studies to better understand the different components of our model and evaluate the method on various benchmarks, showing state-of-the-art performance.

openalex-author · PolyPublie (École Polytechnique de Montréal)

Z-Forcing: Training Stochastic Recurrent Networks

Many efforts have been devoted to training generative latent variable models with autoregressive decoders, such as recurrent neural networks (RNN). Stochastic recurrent models have been successful in capturing the variability observed in natural sequential data such as speech. We unify successful ideas from recently proposed architectures into a stochastic recurrent model: each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps. Training is performed with amortized variational inference where the approximate posterior is augmented with a RNN that runs backward through the sequence. In addition to maximizing the variational lower bound, we ease training of the latent variables by adding an auxiliary cost which forces them to reconstruct the state of the backward recurrent network. This provides the latent variables with a task-independent objective that enhances the performance of the overall model. We found this strategy to perform better than alternative approaches such as KL annealing. Although being conceptually simple, our model achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard and competitive performance on sequential MNIST. Finally, we apply our model to language modeling on the IMDB dataset where the auxiliary cost helps in learning interpretable latent variables. Source Code: \url{https://github.com/anirudh9119/zforcing_nips17}

openalex-author · Medical Image Analysis

Learning normalized inputs for iterative estimation in medical image segmentation

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Three Factors Influencing Minima in SGD

We investigate the dynamical and convergent properties of stochastic gradient descent (SGD) applied to Deep Neural Networks (DNNs). Characterizing the relation between learning rate, batch size and the properties of the final minima, such as width or generalization, remains an open question. In order to tackle this problem we investigate the previously proposed approximation of SGD by a stochastic differential equation (SDE). We theoretically argue that three factors - learning rate, batch size and gradient covariance - influence the minima found by SGD. In particular we find that the ratio of learning rate to batch size is a key determinant of SGD dynamics and of the width of the final minima, and that higher values of the ratio lead to wider minima and often better generalization. We confirm these findings experimentally. Further, we include experiments which show that learning rate schedules can be replaced with batch size schedules and that the ratio of learning rate to batch size is an important factor influencing the memorization process.

openalex-author · arXiv (Cornell University)

ACtuAL: Actor-Critic Under Adversarial Learning

Generative Adversarial Networks (GANs) are a powerful framework for deep generative modeling. Posed as a two-player minimax problem, GANs are typically trained end-to-end on real-valued data and can be used to train a generator of high-dimensional and realistic images. However, a major limitation of GANs is that training relies on passing gradients from the discriminator through the generator via back-propagation. This makes it fundamentally difficult to train GANs with discrete data, as generation in this case typically involves a non-differentiable function. These difficulties extend to the reinforcement learning setting when the action space is composed of discrete decisions. We address these issues by reframing the GAN framework so that the generator is no longer trained using gradients through the discriminator, but is instead trained using a learned critic in the actor-critic framework with a Temporal Difference (TD) objective. This is a natural fit for sequence modeling and we use it to achieve improvements on language modeling tasks over the standard Teacher-Forcing methods.

openalex-author · PolyPublie (École Polytechnique de Montréal)

Variational Walkback: Learning a Transition Operator as a Stochastic Recurrent Net

We propose a novel method to directly learn a stochastic transition operator whose repeated application provides generated samples. Traditional undirected graphical models approach this problem indirectly by learning a Markov chain model whose stationary distribution obeys detailed balance with respect to a parameterized energy function. The energy function is then modified so the model and data distributions match, with no guarantee on the number of steps required for the Markov chain to converge. Moreover, the detailed balance condition is highly restrictive: energy based models corresponding to neural networks must have symmetric weights, unlike biological neural circuits. In contrast, we develop a method for directly learning arbitrarily parameterized transition operators capable of expressing non-equilibrium stationary distributions that violate detailed balance, thereby enabling us to learn more biologically plausible asymmetric neural networks and more general non-energy based dynamical systems. The proposed training objective, which we derive via principled variational methods, encourages the transition operator to "walk back" in multi-step trajectories that start at data-points, as quickly as possible back to the original data points. We present a series of experimental results illustrating the soundness of the proposed approach, Variational Walkback (VW), on the MNIST, CIFAR-10, SVHN and CelebA datasets, demonstrating superior samples compared to earlier attempts to learn a transition operator. We also show that although each rapid training trajectory is limited to a finite but variable number of steps, our transition operator continues to generate good samples well past the length of such trajectories, thereby demonstrating the match of its non-equilibrium stationary distribution to the data distribution. Source Code: http://github.com/anirudh9119/walkback_nips17

openalex-author · arXiv (Cornell University)

Sparse Attentive Backtracking: Long-Range Credit Assignment in Recurrent Networks

A major drawback of backpropagation through time (BPTT) is the difficulty of learning long-term dependencies, coming from having to propagate credit information backwards through every single step of the forward computation. This makes BPTT both computationally impractical and biologically implausible. For this reason, full backpropagation through time is rarely used on long sequences, and truncated backpropagation through time is used as a heuristic. However, this usually leads to biased estimates of the gradient in which longer term dependencies are ignored. Addressing this issue, we propose an alternative algorithm, Sparse Attentive Backtracking, which might also be related to principles used by brains to learn long-term dependencies. Sparse Attentive Backtracking learns an attention mechanism over the hidden states of the past and selectively backpropagates through paths with high attention weights. This allows the model to learn long term dependencies while only backtracking for a small number of time steps, not just from the recent past but also from attended relevant past states.

openalex-author · arXiv (Cornell University)

Fraternal Dropout

Recurrent neural networks (RNNs) are important class of architectures among neural networks useful for language modeling and sequential prediction. However, optimizing RNNs is known to be harder compared to feed-forward neural networks. A number of techniques have been proposed in literature to address this problem. In this paper we propose a simple technique called fraternal dropout that takes advantage of dropout to achieve this goal. Specifically, we propose to train two identical copies of an RNN (that share parameters) with different dropout masks while minimizing the difference between their (pre-softmax) predictions. In this way our regularization encourages the representations of RNNs to be invariant to dropout mask, thus being robust. We show that our regularization term is upper bounded by the expectation-linear dropout objective which has been shown to address the gap due to the difference between the train and inference phases of dropout. We evaluate our model and achieve state-of-the-art results in sequence modeling tasks on two benchmark datasets - Penn Treebank and Wikitext-2. We also show that our approach leads to performance improvement by a significant margin in image captioning (Microsoft COCO) and semi-supervised (CIFAR-10) tasks.

openalex-author · arXiv (Cornell University)

Graph Attention Networks

We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key challenges of spectral-based graph neural networks simultaneously, and make our model readily applicable to inductive as well as transductive problems. Our GAT models have achieved or matched state-of-the-art results across four established transductive and inductive graph benchmarks: the Cora, Citeseer and Pubmed citation network datasets, as well as a protein-protein interaction dataset (wherein test graphs remain unseen during training).

openalex-author · arXiv (Cornell University)

Residual Connections Encourage Iterative Inference

Residual networks (Resnets) have become a prominent architecture in deep learning. However, a comprehensive understanding of Resnets is still a topic of ongoing research. A recent view argues that Resnets perform iterative refinement of features. We attempt to further expose properties of this aspect. To this end, we study Resnets both analytically and empirically. We formalize the notion of iterative refinement in Resnets by showing that residual connections naturally encourage features of residual blocks to move along the negative gradient of loss as we go from one block to the next. In addition, our empirical analysis suggests that Resnets are able to perform both representation learning and iterative refinement. In general, a Resnet block tends to concentrate representation learning behavior in the first few layers while higher layers perform iterative refinement of features. Finally we observe that sharing residual layers naively leads to representation explosion and counterintuitively, overfitting, and we show that simple existing strategies can help alleviating this problem.

openalex-author · arXiv (Cornell University)

Learning Independent Features with Adversarial Nets for Non-linear ICA

Reliable measures of statistical dependence could be useful tools for learning independent features and performing tasks like source separation using Independent Component Analysis (ICA). Unfortunately, many of such measures, like the mutual information, are hard to estimate and optimize directly. We propose to learn independent features with adversarial objectives which optimize such measures implicitly. These objectives compare samples from the joint distribution and the product of the marginals without the need to compute any probability densities. We also propose two methods for obtaining samples from the product of the marginals using either a simple resampling trick or a separate parametric distribution. Our experiments show that this strategy can easily be applied to different types of model architectures and solve both linear and non-linear ICA problems.

openalex-author · 2017 IEEE International Conference on Computer Vision Workshops (ICCVW)

Count-ception: Counting by Fully Convolutional Redundant Counting

Counting objects in digital images is a process that should be replaced by machines. This tedious task is time consuming and prone to errors due to fatigue of human annotators. The goal is to have a system that takes as input an image and returns a count of the objects inside and justification for the prediction in the form of object localization. We repose a problem, originally posed by Lempitsky and Zisserman, to instead predict a count map which contains redundant counts based on the receptive field of a smaller regression network. The regression network predicts a count of the objects that exist inside this frame. By processing the image in a fully convolutional way each pixel is going to be accounted for some number of times, the number of windows which include it, which is the size of each window, (i.e., 32x32 = 1024). To recover the true count we take the average over the redundant predictions. Our contribution is redundant counting instead of predicting a density map in order to average over errors. We also propose a novel deep neural network architecture adapted from the Inception family of networks called the Count-ception network. Together our approach results in a 20% relative improvement (2.9 to 2.3 MAE) over the state of the art method by Xie, Noble, and Zisserman in 2016.

openalex-author · arXiv (Cornell University)

The Consciousness Prior

A new prior is proposed for learning representations of high-level concepts of the kind we manipulate with language. This prior can be combined with other priors in order to help disentangling abstract factors from each other. It is inspired by cognitive neuroscience theories of consciousness, seen as a bottleneck through which just a few elements, after having been selected by attention from a broader pool, are then broadcast and condition further processing, both in perception and decision-making. The set of recently selected elements one becomes aware of is seen as forming a low-dimensional conscious state. This conscious state is combining the few concepts constituting a conscious thought, i.e., what one is immediately conscious of at a particular moment. We claim that this architectural and information-processing constraint corresponds to assumptions about the joint distribution between high-level concepts. To the extent that these assumptions are generally true (and the form of natural language seems consistent with them), they can form a useful prior for representation learning. A low-dimensional thought or conscious state is analogous to a sentence: it involves only a few variables and yet can make a statement with very high probability of being true. This is consistent with a joint distribution (over high-level concepts) which has the form of a sparse factor graph, i.e., where the dependencies captured by each factor of the factor graph involve only very few variables while creating a strong dip in the overall energy function. The consciousness prior also makes it natural to map conscious states to natural language utterances or to express classical AI knowledge in a form similar to facts and rules, albeit capturing uncertainty as well as efficient search mechanisms implemented by attention mechanisms.

openalex-author · arXiv (Cornell University)

A Deep Reinforcement Learning Chatbot

We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including template-based models, bag-of-words models, sequence-to-sequence neural network and latent variable neural network models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than many competing systems. Due to its machine learning architecture, the system is likely to improve with additional data.

openalex-author · arXiv (Cornell University)

Towards an Automatic Turing Test: Learning to Evaluate Dialogue\n Responses

Automatically evaluating the quality of dialogue responses for unstructured\ndomains is a challenging problem. Unfortunately, existing automatic evaluation\nmetrics are biased and correlate very poorly with human judgements of response\nquality. Yet having an accurate automatic evaluation procedure is crucial for\ndialogue research, as it allows rapid prototyping and testing of new models\nwith fewer expensive human evaluations. In response to this challenge, we\nformulate automatic dialogue evaluation as a learning problem. We present an\nevaluation model (ADEM) that learns to predict human-like scores to input\nresponses, using a new dataset of human response scores. We show that the ADEM\nmodel's predictions correlate significantly, and at a level much higher than\nword-overlap metrics such as BLEU, with human judgements at both the utterance\nand system-level. We also show that ADEM can generalize to evaluating dialogue\nmodels unseen during training, an important step for automatic dialogue\nevaluation.\n

openalex-author · arXiv (Cornell University)

Twin Networks: Using the Future as a Regularizer.

Being able to model long-term dependencies in sequential data, such as text, has been among the long-standing challenges of recurrent neural networks (RNNs). This issue is strictly related to the absence of explicit planning in current RNN architectures. More explicitly, the RNNs are trained to predict only the next token given previous ones. In this paper, we introduce a simple way of encouraging the RNNs to plan for the future. In order to accomplish this, we introduce an additional neural network which is trained to generate the sequence in reverse order, and we require closeness between the states of the forward RNN and backward RNN that predict the same token. At each step, the states of the forward RNN are required to match the future information contained in the backward states. We hypothesize that the approach eases modeling of long-term dependencies thus helping in generating more globally consistent samples. The model trained with conditional generation for a speech recognition task achieved 12\% relative improvement (CER of 6.7 compared to a baseline of 7.6).

openalex-author · Interspeech 2017

Improving Speech Recognition by Revising Gated Recurrent Units

Speech recognition is largely taking advantage of deep learning, showing that substantial benefits can be obtained by modern Recurrent Neural Networks (RNNs). The most popular RNNs are Long Short-Term Memory (LSTMs), which typically reach state-of-the-art performance in many tasks thanks to their ability to learn long-term dependencies and robustness to vanishing gradients. Nevertheless, LSTMs have a rather complex design with three multiplicative gates, that might impair their efficient implementation. An attempt to simplify LSTMs has recently led to Gated Recurrent Units (GRUs), which are based on just two multiplicative gates. This paper builds on these efforts by further revising GRUs and proposing a simplified architecture potentially more suitable for speech recognition. The contribution of this work is two-fold. First, we suggest to remove the reset gate in the GRU design, resulting in a more efficient single-gate architecture. Second, we propose to replace tanh with ReLU activations in the state update equations. Results show that, in our implementation, the revised architecture reduces the per-epoch training time with more than 30% and consistently improves recognition performance across different tasks, input features, and noisy conditions when compared to a standard GRU.

openalex-author · Interspeech 2017

Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition

Layer normalization is a recently introduced technique for normalizing the activities of neurons in deep neural networks to improve the training speed and stability.In this paper, we introduce a new layer normalization technique called Dynamic Layer Normalization (DLN) for adaptive neural acoustic modeling in speech recognition.By dynamically generating the scaling and shifting parameters in layer normalization, DLN adapts neural acoustic models to the acoustic variability arising from various factors such as speakers, channel noises, and environments.Unlike other adaptive acoustic models, our proposed approach does not require additional adaptation data or speaker information such as i-vectors.Moreover, the model size is fixed as it dynamically generates adaptation parameters.We apply our proposed DLN to deep bidirectional LSTM acoustic models and evaluate them on two benchmark datasets for large vocabulary ASR experiments: WSJ and TED-LIUM release 2. The experimental results show that our DLN improves neural acoustic models in terms of transcription accuracy by dynamically adapting to various speakers and environments.

openalex-author · Jagiellonian University Repository (Jagiellonian University)

A closer look at memorization in deep networks

We examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness. While deep networks are capable of memorizing noise data, our results suggest that they tend to prioritize learning simple patterns first. In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data. Our analysis suggests that the notions of effective capacity which are dataset independent are unlikely to explain the generalization performance of deep networks when trained with gradient based methods because training data itself plays an important role in determining the degree of memorization.

openalex-author · arXiv (Cornell University)

Independently Controllable Factors

It has been postulated that a good representation is one that disentangles the underlying explanatory factors of variation. However, it remains an open question what kind of training framework could potentially achieve that. Whereas most previous work focuses on the static setting (e.g., with images), we postulate that some of the causal factors could be discovered if the learner is allowed to interact with its environment. The agent can experiment with different actions and observe their effects. More specifically, we hypothesize that some of these factors correspond to aspects of the environment which are independently controllable, i.e., that there exists a policy and a learnable feature for each such aspect of the environment, such that this policy can yield changes in that feature with minimal changes to other features that explain the statistical variations in the observed data. We propose a specific objective function to find such factors and verify experimentally that it can indeed disentangle independently controllable aspects of the environment without any extrinsic reward signal.

openalex-author · arXiv (Cornell University)

Multiscale sequence modeling with a learned dictionary

We propose a generalization of neural network sequence models. Instead of predicting one symbol at a time, our multi-scale model makes predictions over multiple, potentially overlapping multi-symbol tokens. A variation of the byte-pair encoding (BPE) compression algorithm is used to learn the dictionary of tokens that the model is trained with. When applied to language modelling, our model has the flexibility of character-level models while maintaining many of the performance benefits of word-level models. Our experiments show that this model performs better than a regular LSTM on language modeling tasks, especially for smaller models.

openalex-author · arXiv (Cornell University)

Variance Regularizing Adversarial Learning

We introduce a novel approach for training adversarial models by replacing the discriminator score with a bi-modal Gaussian distribution over the real/fake indicator variables. In order to do this, we train the Gaussian classifier to match the target bi-modal distribution implicitly through meta-adversarial training. We hypothesize that this approach ensures a non-zero gradient to the generator, even in the limit of a perfect classifier. We test our method against standard benchmark image datasets as well as show the classifier output distribution is smooth and has overlap between the real and fake modes.

openalex-author · 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation

State-of-the-art approaches for semantic image segmentation are built on Convolutional Neural Networks (CNNs). The typical segmentation architecture is composed of (a) a downsampling path responsible for extracting coarse semantic features, followed by (b) an upsampling path trained to recover the input image resolution at the output of the model and, optionally, (c) a post-processing module (e.g. Conditional Random Fields) to refine the model predictions. Recently, a new CNN architecture, Densely Connected Convolutional Networks (DenseNets), has shown excellent results on image classification tasks. The idea of DenseNets is based on the observation that if each layer is directly connected to every other layer in a feed-forward fashion then the network will be more accurate and easier to train. In this paper, we extend DenseNets to deal with the problem of semantic segmentation. We achieve state-of-the-art results on urban scene benchmark datasets such as CamVid and Gatech, without any further post-processing module nor pretraining. Moreover, due to smart construction of the model, our approach has much less parameters than currently published best entries for these datasets.

openalex-author · 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

Generating high-resolution, photo-realistic images has been a long-standing goal in machine learning. Recently, Nguyen et al. [37] showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at higher resolutions (227 × 227) than previous generative models, and does so for all 1000 ImageNet categories. In addition, we provide a unified probabilistic interpretation of related activation maximization methods and call the general class of models Plug and Play Generative Networks. PPGNs are composed of 1) a generator network G that is capable of drawing a wide range of image types and 2) a replaceable condition network C that tells the generator what to draw. We demonstrate the generation of images conditioned on a class (when C is an ImageNet or MIT Places classification network) and also conditioned on a caption (when C is an image captioning network). Our method also improves the state of the art of Multifaceted Feature Visualization [40], which generates the set of synthetic inputs that activate a neuron in order to better understand how deep neural networks operate. Finally, we show that our model performs reasonably well at the task of image inpainting. While image models are used in this paper, the approach is modality-agnostic and can be applied to many types of data.

openalex-author · Proceedings of the Workshop on Trends in Machine-Learning (and impact on computer architecture)

Towards more hardware-friendly deep learning

No abstract available.

openalex-author · Paper

Neural Models for Key Phrase Detection and Question Generation

We propose a two-stage neural model to tackle question generation from documents. Our model first estimates the probability that word sequences in a document compose “interesting” answers using a neural model trained on a question-answering corpus. We thus take a data-driven approach to interestingness. Predicted key phrases then act as target answers that condition a sequence-to-sequence question generation model with a copy mechanism. Empirically, our neural key phrase detection model significantly outperforms an entity-tagging baseline system and existing rule-based approaches. We demonstrate that the question generator formulates good quality natural language questions from extracted key phrases, and a human study indicates that our system’s generated question-answer pairs are competitive with those of an earlier approach. We foresee our system being used in an educational setting to assess reading comprehension and also as a data augmentation technique for semi-supervised learning.

openalex-author · arXiv (Cornell University)

Plan, Attend, Generate: Character-level Neural Machine Translation with Planning in the Decoder

We investigate the integration of a planning mechanism into an encoder-decoder architecture with an explicit alignment for character-level machine translation. We develop a model that plans ahead when it computes alignments between the source and target sequences, constructing a matrix of proposed future alignments and a commitment vector that governs whether to follow or recompute the plan. This mechanism is inspired by the strategic attentive reader and writer (STRAW) model. Our proposed model is end-to-end trainable with fully differentiable operations. We show that it outperforms a strong baseline on three character-level decoder neural machine translation on WMT'15 corpus. Our analysis demonstrates that our model can compute qualitatively intuitive alignments and achieves superior performance with fewer parameters.

openalex-author · arXiv (Cornell University)

Learning to Compute Word Embeddings On the Fly

Words in natural language follow a Zipfian distribution whereby some words are frequent but most are rare. Learning representations for words in the "long tail" of this distribution requires enormous amounts of data. Representations of rare words trained directly on end tasks are usually poor, requiring us to pre-train embeddings on external data, or treat all rare words as out-of-vocabulary words with a unique representation. We provide a method for predicting embeddings of rare words on the fly from small amounts of auxiliary data with a network trained end-to-end for the downstream task. We show that this improves results against baselines where embeddings are trained on the end task for reading comprehension, recognizing textual entailment and language modeling.

openalex-author · PolyPublie (École Polytechnique de Montréal)

Deep Learning for Patient-Specific Kidney Graft Survival Analysis

An accurate model of patient-specific kidney graft survival distributions can help to improve shared-decision making in the treatment and care of patients. In this paper, we propose a deep learning method that directly models the survival function instead of estimating the hazard function to predict survival times for graft patients based on the principle of multi-task learning. By learning to jointly predict the time of the event, and its rank in the cox partial log likelihood framework, our deep learning approach outperforms, in terms of survival time prediction quality and concordance index, other common methods for survival analysis, including the Cox Proportional Hazards model and a network trained on the cox partial log-likelihood.

openalex-author · PolyPublie (École Polytechnique de Montréal)

Deep Complex Networks

At present, the vast majority of building blocks, techniques, and architectures for deep learning are based on real-valued operations and representations. However, recent work on recurrent neural networks and older fundamental theoretical analysis suggests that complex numbers could have a richer representational capacity and could also facilitate noise-robust memory retrieval mechanisms. Despite their attractive properties and potential for opening up entirely new neural architectures, complex-valued deep neural networks have been marginalized due to the absence of the building blocks required to design such models. In this work, we provide the key atomic components for complex-valued deep neural networks and apply them to convolutional feed-forward networks and convolutional LSTMs. More precisely, we rely on complex convolutions and present algorithms for complex batch-normalization, complex weight initialization strategies for complex-valued neural nets and we use them in experiments with end-to-end training schemes. We demonstrate that such complex-valued models are competitive with their real-valued counterparts. We test deep complex models on several computer vision tasks, on music transcription using the MusicNet dataset and on Speech Spectrum Prediction using the TIMIT dataset. We achieve state-of-the-art performance on these audio-related tasks.

openalex-author · arXiv (Cornell University)

Image Segmentation by Iterative Inference from Conditional Score Estimation

Inspired by the combination of feedforward and iterative computations in the virtual cortex, and taking advantage of the ability of denoising autoencoders to estimate the score of a joint distribution, we propose a novel approach to iterative inference for capturing and exploiting the complex joint distribution of output variables conditioned on some input variables. This approach is applied to image pixel-wise segmentation, with the estimated conditional score used to perform gradient ascent towards a mode of the estimated conditional distribution. This extends previous work on score estimation by denoising autoencoders to the case of a conditional distribution, with a novel use of a corrupted feedforward predictor replacing Gaussian corruption. An advantage of this approach over more classical ways to perform iterative inference for structured outputs, like conditional random fields (CRFs), is that it is not any more necessary to define an explicit energy function linking the output variables. To keep computations tractable, such energy function parametrizations are typically fairly constrained, involving only a few neighbors of each of the output variables in each clique. We experimentally find that the proposed iterative inference from conditional score estimation by conditional denoising autoencoders performs better than comparable models based on CRFs or those not using any explicit modeling of the conditional joint distribution of outputs.

openalex-author · Frontiers in Computational Neuroscience

Equilibrium Propagation: Bridging the Gap between Energy-Based Models and Backpropagation

We introduce Equilibrium Propagation, a learning framework for energy-based models. It involves only one kind of neural computation, performed in both the first phase (when the prediction is made) and the second phase of training (after the target or prediction error is revealed). Although this algorithm computes the gradient of an objective function just like Backpropagation, it does not need a special computation or circuit for the second phase, where errors are implicitly propagated. Equilibrium Propagation shares similarities with Contrastive Hebbian Learning and Contrastive Divergence while solving the theoretical issues of both algorithms: our algorithm computes the gradient of a well-defined objective function. Because the objective function is defined in terms of local perturbations, the second phase of Equilibrium Propagation corresponds to only nudging the prediction (fixed point or stationary distribution) toward a configuration that reduces prediction error. In the case of a recurrent multi-layer supervised network, the output units are slightly nudged toward their target in the second phase, and the perturbation introduced at the output layer propagates backward in the hidden layers. We show that the signal "back-propagated" during this second phase corresponds to the propagation of error derivatives and encodes the gradient of the objective function, when the synaptic update corresponds to a standard form of spike-timing dependent plasticity. This work makes it more plausible that a mechanism similar to Backpropagation could be implemented by brains, since leaky integrator neural computation performs both inference and error back-propagation in our model. The only local difference between the two phases is whether synaptic changes are allowed or not. We also show experimentally that multi-layer recurrently connected networks with 1, 2, and 3 hidden layers can be trained by Equilibrium Propagation on the permutation-invariant MNIST task.

openalex-author · 2017 International Joint Conference on Neural Networks (IJCNN)

A robust adaptive stochastic gradient method for deep learning

Stochastic gradient algorithms are the main focus of large-scale optimization problems and led to important successes in the recent advancement of the deep learning algorithms. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose an adaptive learning rate algorithm, which utilizes stochastic curvature information of the loss function for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.

openalex-author · Machine Translation

The representational geometry of word meanings acquired by neural machine translation models

No abstract available from the OpenAlex source record.

openalex-author · Paper

The Variational Walkback Algorithm

A recognized obstacle to training undirected graphical models with latent variables such as Boltzmann machines is that the maximum likelihood training procedure requires sampling from Monte-Carlo Markov chains which may not mix well, in the inner loop of training, for each example. We first propose the idea that it is sufficient to locally carve the energy function everywhere so that its gradient points in the right direction (i.e., towards generating the data). Following on previous work on contrastive divergence, denoising autoencoders, generative stochastic networks and unsupervised learning using non-equilibrium dynamics, we propose a variational bound on the marginal log-likelihood of the data which corresponds to a new learning procedure that first walks away from data points by following the model transition operator and then trains that operator to walk backwards for each of these steps, back towards the training example. The tightness of the variational bound relies on gradually increasing temperature as we walk away from the data, at each step providing a gradient on the parameters to maximize the probability that the transition operator returns to its previous state. Interestingly, this algorithm admits a variant where there is no explicit energy function, i.e., the parameters are used to directly define the transition operator. This also eliminates the explicit need for symmetric weights which previous Boltzmann machine or Hopfield net models require, and which makes these models less biologically plausible.

openalex-author · International Conference on Learning Representations

Improving Generative Adversarial Networks with Denoising Feature Matching

We propose an augmented training procedure for generative adversarial networks designed to address shortcomings of the original by directing the generator towards probable configurations of abstract discriminator features. We estimate and track the distribution of these features, as computed from data, with a denoising auto-encoder, and use it to propose high-level targets for the generator. We combine this new loss with the original and evaluate the hybrid criterion on the task of unsupervised image synthesis from datasets comprising a diverse set of visual categories, noting a qualitative and quantitative improvement in the ``objectness'' of the resulting samples.

openalex-author · IEEE Transactions on Pattern Analysis and Machine Intelligence

Drawing and Recognizing Chinese Characters with Recurrent Neural Network

Recent deep learning based approaches have achieved great success on handwriting recognition. Chinese characters are among the most widely adopted writing systems in the world. Previous research has mainly focused on recognizing handwritten Chinese characters. However, recognition is only one aspect for understanding a language, another challenging and interesting task is to teach a machine to automatically write (pictographic) Chinese characters. In this paper, we propose a framework by using the recurrent neural network (RNN) as both a discriminative model for recognizing Chinese characters and a generative model for drawing (generating) Chinese characters. To recognize Chinese characters, previous methods usually adopt the convolutional neural network (CNN) models which require transforming the online handwriting trajectory into image-like representations. Instead, our RNN based approach is an end-to-end system which directly deals with the sequential structure and does not require any domain-specific knowledge. With the RNN system (combining an LSTM and GRU), state-of-the-art performance can be achieved on the ICDAR-2013 competition database. Furthermore, under the RNN framework, a conditional generative model with character embedding is proposed for automatically drawing recognizable Chinese characters. The generated characters (in vector format) are human-readable and also can be recognized by the discriminative RNN model with high accuracy. Experimental results verify the effectiveness of using RNNs as both generative and discriminative models for the tasks of drawing and recognizing Chinese characters.

openalex-author · arXiv (Cornell University)

Independently Controllable Features

Finding features that disentangle the different causes of variation in real data is a difficult task, that has nonetheless received considerable attention in static domains like natural images. Interactive environments, in which an agent can deliberately take actions, offer an opportunity to tackle this task better, because the agent can experiment with different actions and observe their effects. We introduce the idea that in interactive environments, latent factors that control the variation in observed data can be identified by figuring out what the agent can control. We propose a naive method to find factors that explain or measure the effect of the actions of a learner, and test it in illustrative experiments.

openalex-author · Computer Speech & Language

On integrating a language model into neural machine translation

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Sharp Minima Can Generalize For Deep Nets

Despite their overwhelming capacity to overfit, deep learning architectures tend to generalize relatively well to unseen data, allowing them to be deployed in practice. However, explaining why this is the case is still an open area of research. One standing hypothesis that is gaining popularity, e.g. Hochreiter & Schmidhuber (1997); Keskar et al. (2017), is that the flatness of minima of the loss function found by stochastic gradient based methods results in good generalization. This paper argues that most notions of flatness are problematic for deep models and can not be directly applied to explain generalization. Specifically, when focusing on deep networks with rectifier units, we can exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima. Furthermore, if we allow to reparametrize a function, the geometry of its parameters can change drastically without affecting its generalization properties.

openalex-author · International Conference on Learning Representations

A Structured Self-Attentive Sentence Embedding.

This paper proposes a new model for extracting an interpretable sentence embedding by introducing self-attention. Instead of using a vector, we use a 2-D matrix to represent the embedding, with each row of the matrix attending on a different part of the sentence. We also propose a self-attention mechanism and a special regularization term for the model. As a side effect, the embedding comes with an easy way of visualizing what specific parts of the sentence are encoded into the embedding. We evaluate our model on 3 different tasks: author profiling, sentiment classification, and textual entailment. Results show that our model yields a significant performance gain compared to other sentence embedding methods in all of the 3 tasks.

openalex-author · Gestion

Les données au service du savoir

L’analyse de grands ensembles de données pour les traduire en connaissances et en décisions constitue aujourd’hui un défi majeur dans tous les secteurs de l’activité humaine. De concert avec l’industrie et ses institutions partenaires, Campus Montréal, par le truchement d’un projet phare élaboré avec l’Institut de valorisation des données, occupe une position idéale pour faire des percées d’envergure mondiale dans le domaine de l’innovation guidée par les données.

openalex-author · 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

On random weights for texture generation in one layer CNNS

Recent work in the literature has shown experimentally that one can use the lower layers of a trained convolutional neural network (CNN) to model natural textures. More interestingly, it has also been experimentally shown that only one layer with random filters can also model textures although with less variability. In this paper we ask the question as to why one layer CNNs with random filters are so effective in generating textures? We theoretically show that one layer convolutional architectures (without a non-linearity) paired with the an energy function used in previous literature, can in fact preserve and modulate frequency coefficients in a manner so that random weights and pretrained weights will generate the same type of images. Based on the results of this analysis we question whether similar properties hold in the case where one uses one convolution layer with a non-linearity. We show that in the case of ReLu non-linearity there are situations where only one input will give the minimum possible energy whereas in the case of no nonlinearity, there are always infinite solutions that will give the minimum possible energy. Thus we can show that in certain situations adding a ReLu non-linearity generates less variable images.

openalex-author · 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

A network of deep neural networks for Distant Speech Recognition

Despite the remarkable progress recently made in distant speech recognition, state-of-the-art technology still suffers from a lack of robustness, especially when adverse acoustic conditions characterized by non-stationary noises and reverberation are met. A prominent limitation of current systems lies in the lack of matching and communication between the various technologies involved in the distant speech recognition process. The speech enhancement and speech recognition modules are, for instance, often trained independently. Moreover, the speech enhancement normally helps the speech recognizer, but the output of the latter is not commonly used, in turn, to improve the speech enhancement. To address both concerns, we propose a novel architecture based on a network of deep neural networks, where all the components are jointly trained and better cooperate with each other thanks to a full communication scheme between them. Experiments, conducted using different datasets, tasks and acoustic conditions, revealed that the proposed framework can overtake other competitive solutions, including recent joint training approaches.

openalex-author · arXiv (Cornell University)

Boundary-Seeking Generative Adversarial Networks

Generative adversarial networks (GANs) are a learning framework that rely on training a discriminator to estimate a measure of difference between a target and generated distributions. GANs, as normally formulated, rely on the generated samples being completely differentiable w.r.t. the generative parameters, and thus do not work for discrete data. We introduce a method for training GANs with discrete data that uses the estimated difference measure from the discriminator to compute importance weights for generated samples, thus providing a policy gradient for training the generator. The importance weights have a strong connection to the decision boundary of the discriminator, and we call our method boundary-seeking GANs (BGANs). We demonstrate the effectiveness of the proposed algorithm with discrete image and character-based natural language generation. In addition, the boundary-seeking objective extends to continuous data, which can be used to improve stability of training, and we demonstrate this on Celeba, Large-scale Scene Understanding (LSUN) bedrooms, and Imagenet without conditioning.

openalex-author · arXiv (Cornell University)

Maximum-Likelihood Augmented Discrete Generative Adversarial Networks

Despite the successes in capturing continuous distributions, the application of generative adversarial networks (GANs) to discrete settings, like natural language tasks, is rather restricted. The fundamental reason is the difficulty of back-propagation through discrete random variables combined with the inherent instability of the GAN training objective. To address these problems, we propose Maximum-Likelihood Augmented Discrete Generative Adversarial Networks. Instead of directly optimizing the GAN objective, we derive a novel and low-variance objective using the discriminator's output that follows corresponds to the log-likelihood. Compared with the original, the new objective is proved to be consistent in theory and beneficial in practice. The experimental results on various discrete datasets demonstrate the effectiveness of the proposed approach.

openalex-author · Computer Speech & Language

Context-dependent word representation for neural machine translation

No abstract available from the OpenAlex source record.

openalex-author · International Conference on Learning Representations

Char2Wav: End-to-End Speech Synthesis

No abstract available from the OpenAlex source record.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Denoising Criterion for Variational Auto-Encoding Framework

Denoising autoencoders (DAE) are trained to reconstruct their clean inputs with noise injected at the input level, while variational autoencoders (VAE) are trained with noise injected in their stochastic hidden layer, with a regularizer that encourages this noise injection. In this paper, we show that injecting noise both in input and in the stochastic hidden layer can be advantageous and we propose a modified variational lower bound as an improved objective function in this setup. When input is corrupted, then the standard VAE lower bound involves marginalizing the encoder conditional distribution over the input noise, which makes the training criterion intractable. Instead, we propose a modified training criterion which corresponds to a tractable bound when input is corrupted. Experimentally, we find that the proposed denoising variational autoencoder (DVAE) yields better average log-likelihood than the VAE and the importance weighted autoencoder on the MNIST and Frey Face datasets.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation

We introduce a new class of models called multiresolution recurrent neural networks, which explicitly model natural language generation at multiple levels of abstraction. The models extend the sequence-to-sequence framework to generate two parallel stochastic processes: a sequence of high-level coarse tokens, and a sequence of natural language words (e.g. sentences). The coarse sequences follow a latent stochastic process with a factorial representation, which helps the models generalize to new examples. The coarse sequences can also incorporate task-specific knowledge, when available. In our experiments, the coarse sequences are extracted using automatic procedures, which are designed to capture compositional structure and semantics. These procedures enable training the multiresolution recurrent neural networks by maximizing the exact joint log-likelihood over both sequences. We apply the models to dialogue response generation in the technical support domain and compare them with several competing models. The multiresolution recurrent neural networks outperform competing models by a substantial margin, achieving state-of-the-art results according to both a human evaluation study and automatic evaluation metrics. Furthermore, experiments show the proposed models generate more fluent, relevant and goal-oriented responses.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues

Sequential data often possesses hierarchical structures with complex dependencies between sub-sequences, such as found between the utterances in a dialogue. To model these dependencies in a generative framework, we propose a neural network-based generative architecture, with stochastic latent variables that span a variable number of time steps. We apply the proposed model to the task of dialogue response generation and compare it with other recent neural-network architectures. We evaluate the model performance through a human evaluation study. The experiments demonstrate that our model improves upon recently proposed models and that the latent variables facilitate both the generation of meaningful, long and diverse responses and maintaining dialogue state.

openalex-author · arXiv (Cornell University)

Memory Augmented Neural Networks with Wormhole Connections

Recent empirical results on long-term dependency tasks have shown that neural networks augmented with an external memory can learn the long-term dependency tasks more easily and achieve better generalization than vanilla recurrent neural networks (RNN). We suggest that memory augmented neural networks can reduce the effects of vanishing gradients by creating shortcut (or wormhole) connections. Based on this observation, we propose a novel memory augmented neural network model called TARDIS (Temporal Automatic Relation Discovery in Sequences). The controller of TARDIS can store a selective set of embeddings of its own previous hidden states into an external memory and revisit them as and when needed. For TARDIS, memory acts as a storage for wormhole connections to the past to propagate the gradients more effectively and it helps to learn the temporal dependencies. The memory structure of TARDIS has similarities to both Neural Turing Machines (NTM) and Dynamic Neural Turing Machines (D-NTM), but both read and write operations of TARDIS are simpler and more efficient. We use discrete addressing for read/write operations which helps to substantially to reduce the vanishing gradient problem with very long sequences. Read and write operations in TARDIS are tied with a heuristic once the memory becomes full, and this makes the learning problem simpler when compared to NTM or D-NTM type of architectures. We provide a detailed analysis on the gradient propagation in general for MANNs. We evaluate our models on different long-term dependency tasks and report competitive results in all of them.

openalex-author · Neural Computation

STDP-Compatible Approximation of Backpropagation in an Energy-Based Model

We show that Langevin Markov chain Monte Carlo inference in an energy-based model with latent variables has the property that the early steps of inference, starting from a stationary point, correspond to propagating error gradients into internal layers, similar to backpropagation. The backpropagated error is with respect to output units that have received an outside driving force pushing them away from the stationary point. Backpropagated error gradients correspond to temporal derivatives with respect to the activation of hidden units. These lead to a weight update proportional to the product of the presynaptic firing rate and the temporal rate of change of the postsynaptic firing rate. Simulations and a theoretical argument suggest that this rate-based update rule is consistent with those associated with spike-timing-dependent plasticity. The ideas presented in this article could be an element of a theory for explaining how brains perform credit assignment in deep hierarchies as efficiently as backpropagation does, with neural computation corresponding to both approximate inference in continuous-valued latent variables and error backpropagation, at the same time.

openalex-author · Neural Information Processing Systems

GibbsNet: Iterative Adversarial Inference for Deep Graphical Models

Directed latent variable models that formulate the joint distribution as $p(x,z) = p(z) p(x \mid z)$ have the advantage of fast and exact sampling. However, these models have the weakness of needing to specify $p(z)$, often with a simple fixed prior that limits the expressiveness of the model. Undirected latent variable models discard the requirement that $p(z)$ be specified with a prior, yet sampling from them generally requires an iterative procedure such as blocked Gibbs-sampling that may require many steps to draw samples from the joint distribution $p(x, z)$. We propose a novel approach to learning the joint distribution between the data and a latent code which uses an adversarially learned iterative procedure to gradually refine the joint distribution, $p(x, z)$, to better match with the data distribution on each step. GibbsNet is the best of both worlds both in theory and in practice. Achieving the speed and simplicity of a directed latent variable model, it is guaranteed (assuming the adversarial game reaches the virtual training criteria global minimum) to produce samples from $p(x, z)$ with only a few sampling iterations. Achieving the expressiveness and flexibility of an undirected latent variable model, GibbsNet does away with the need for an explicit $p(z)$ and has the ability to do attribute prediction, class-conditional generation, and joint image-attribute modeling in a single model which is not trained for any of these specific tasks. We show empirically that GibbsNet is able to learn a more complex $p(z)$ and show that this leads to improved inpainting and iterative refinement of $p(x, z)$ for dozens of steps and stable generation without collapse for thousands of steps, despite being trained on only a few steps.

openalex-author · Proceedings of the 2nd Workshop on Representation Learning for NLP

Plan, Attend, Generate: Character-Level Neural Machine Translation with Planning

We investigate the integration of a planning mechanism into an encoder-decoder architecture with attention. We develop a model that can plan ahead when it computes alignments between the source and target sequences not only for a single time-step, but for the next k timesteps as well by constructing a matrix of proposed future alignments and a commitment vector that governs whether to follow or recompute the plan. This mechanism is inspired by strategic attentive reader and writer (STRAW) model, a recent neural architecture for planning with hierarchical reinforcement learning that can also learn higher level temporal abstractions. Our proposed model is end-to-end trainable with differentiable operations. We show that our model outperforms strong baselines on character-level translation task from WMT'15 with less parameters and computes alignments that are qualitatively intuitive.

openalex-author · IEEE Transactions on Human-Machine Systems

End-to-End Online Writer Identification With Recurrent Neural Network

Writer identification is an important topic for pattern recognition and artificial intelligence. Traditional methods rely heavily on sophisticated hand-crafted features to represent the characteristics of different writers. In this paper, we propose an end-to-end framework for online text-independent writer identification by using a recurrent neural network (RNN). Specifically, the handwriting data of a particular writer are represented by a set of random hybrid strokes (RHSs). Each RHS is a randomly sampled short sequence representing pen tip movements ($xy$-coordinates) and pen-down or pen-up states. RHS is independent of the content and language involved in handwriting; therefore, writer identification at the RHS level is more general and convenient than the character level or the word level, which also requires character/word segmentation. The RNN model with bidirectional long short-term memory is used to encode each RHS into a fixed-length vector for final classification. All the RHSs of a writer are classified independently, and then, the posterior probabilities are averaged to make the final decision. The proposed framework is end-to-end and does not require any domain knowledge for handwriting data analysis. Experiments on both English (133 writers) and Chinese (186 writers) databases verify the advantages of our method compared with other state-of-the-art approaches.

openalex-author · arXiv (Cornell University)

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time. We show that our model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure is able to capture underlying sources of variations in the temporal sequences over very long time spans, on three datasets of different nature. Human evaluation on the generated samples indicate that our model is preferred over competing models. We also show how each component of the model contributes to the exhibited performance.

openalex-author · arXiv (Cornell University)

On Random Weights for Texture Generation in One Layer Neural Networks

openalex-author · arXiv (Cornell University)

Generalizable Features From Unsupervised Learning

Humans learn a predictive model of the world and use this model to reason about future events and the consequences of actions. In contrast to most machine predictors, we exhibit an impressive ability to generalize to unseen scenarios and reason intelligently in these settings. One important aspect of this ability is physical intuition(Lake et al., 2016). In this work, we explore the potential of unsupervised learning to find features that promote better generalization to settings outside the supervised training distribution. Our task is predicting the stability of towers of square blocks. We demonstrate that an unsupervised model, trained to predict future frames of a video sequence of stable and unstable block configurations, can yield features that support extrapolating stability prediction to blocks configurations outside the training set distribution

openalex-author · arXiv (Cornell University)

Mode Regularized Generative Adversarial Networks

Although Generative Adversarial Networks achieve state-of-the-art results on a variety of generative tasks, they are regarded as highly unstable and prone to miss modes. We argue that these bad behaviors of GANs are due to the very particular functional shape of the trained discriminators in high dimensional spaces, which can easily make training stuck or push probability mass in the wrong direction, towards that of higher concentration than that of the data generating distribution. We introduce several ways of regularizing the objective, which can dramatically stabilize the training of GAN models. We also show that our regularizers can help the fair distribution of probability mass across the modes of the data generating distribution, during the early phases of training and thus providing a unified solution to the missing modes problem.

openalex-author · 2016 IEEE Spoken Language Technology Workshop (SLT)

Batch-normalized joint training for DNN-based distant speech recognition

Improving distant speech recognition is a crucial step towards flexible human-machine interfaces. Current technology, however, still exhibits a lack of robustness, especially when adverse acoustic conditions are met. Despite the significant progress made in the last years on both speech enhancement and speech recognition, one potential limitation of state-of-the-art technology lies in composing modules that are not well matched because they are not trained jointly.

openalex-author · Transactions of the Association for Computational Linguistics

Learning to Understand Phrases by Embedding the Dictionary

Distributional models that learn rich semantic word representations are a success story of recent NLP research. However, developing models that learn useful representations of phrases and sentences has proved far harder. We propose using the definitions found in everyday dictionaries as a means of bridging this gap between lexical and phrasal semantics. Neural language embedding models can be effectively trained to map dictionary definitions (phrases) to (lexical) representations of the words defined by those definitions. We present two applications of these architectures: reverse dictionaries that return the name of a concept given a definition or description and general-knowledge crossword question answerers. On both tasks, neural language embedding models trained on definitions from a handful of freely-available lexical resources perform as well or better than existing commercial systems that rely on significant task-specific engineering. The results highlight the effectiveness of both neural embedding architectures and definition-based training for developing models that understand phrases and sentences.

openalex-author · arXiv (Cornell University)

Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

Generating high-resolution, photo-realistic images has been a long-standing goal in machine learning. Recently, Nguyen et al. (2016) showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at higher resolutions (227x227) than previous generative models, and does so for all 1000 ImageNet categories. In addition, we provide a unified probabilistic interpretation of related activation maximization methods and call the general class of models "Plug and Play Generative Networks". PPGNs are composed of 1) a generator network G that is capable of drawing a wide range of image types and 2) a replaceable "condition" network C that tells the generator what to draw. We demonstrate the generation of images conditioned on a class (when C is an ImageNet or MIT Places classification network) and also conditioned on a caption (when C is an image captioning network). Our method also improves the state of the art of Multifaceted Feature Visualization, which generates the set of synthetic inputs that activate a neuron in order to better understand how deep neural networks operate. Finally, we show that our model performs reasonably well at the task of image inpainting. While image models are used in this paper, the approach is modality-agnostic and can be applied to many types of data.

openalex-author · arXiv (Cornell University)

Diet Networks: Thin Parameters for Fat Genomics

Learning tasks such as those involving genomic data often poses a serious challenge: the number of input features can be orders of magnitude larger than the number of training examples, making it difficult to avoid overfitting, even when using the known regularization techniques. We focus here on tasks in which the input is a description of the genetic variation specific to a patient, the single nucleotide polymorphisms (SNPs), yielding millions of ternary inputs. Improving the ability of deep learning to handle such datasets could have an important impact in precision medicine, where high-dimensional data regarding a particular patient is used to make predictions of interest. Even though the amount of data for such tasks is increasing, this mismatch between the number of examples and the number of inputs remains a concern. Naive implementations of classifier neural networks involve a huge number of free parameters in their first layer: each input feature is associated with as many parameters as there are hidden units. We propose a novel neural network parametrization which considerably reduces the number of free parameters. It is based on the idea that we can first learn or provide a distributed representation for each input feature (e.g. for each position in the genome where variations are observed), and then learn (with another neural network called the parameter prediction network) how to map a feature's distributed representation to the vector of parameters specific to that feature in the classifier neural network (the weights which link the value of the feature to each of the hidden units). We show experimentally on a population stratification task of interest to medical studies that the proposed approach can significantly reduce both the number of parameters and the error rate of the classifier.

openalex-author · arXiv (Cornell University)

Invariant Representations for Noisy Speech Recognition

Modern automatic speech recognition (ASR) systems need to be robust under acoustic variability arising from environmental, speaker, channel, and recording conditions. Ensuring such robustness to variability is a challenge in modern day neural network-based ASR systems, especially when all types of variability are not seen during training. We attempt to address this problem by encouraging the neural network acoustic model to learn invariant feature representations. We use ideas from recent research on image generation using Generative Adversarial Networks and domain adaptation ideas extending adversarial gradient-based training. A recent work from Ganin et al. proposes to use adversarial training for image domain adaptation by using an intermediate representation from the main target classification network to deteriorate the domain classifier performance through a separate neural network. Our work focuses on investigating neural architectures which produce representations invariant to noise conditions for ASR. We evaluate the proposed architecture on the Aurora-4 task, a popular benchmark for noise robust ASR. We show that our method generalizes better than the standard multi-condition training especially when only a few noise categories are seen during training.

openalex-author · arXiv (Cornell University)

Recurrent Neural Networks With Limited Numerical Precision

Recurrent Neural Networks (RNNs) produce state-of-art performance on many machine learning tasks but their demand on resources in terms of memory and computational power are often high. Therefore, there is a great interest in optimizing the computations performed with these models especially when considering development of specialized low-power hardware for deep networks. One way of reducing the computational needs is to limit the numerical precision of the network weights and biases, and this will be addressed for the case of RNNs. We present results from the use of different stochastic and deterministic reduced precision training methods applied to two major RNN types, which are then tested on three datasets. The results show that the stochastic and deterministic ternarization, pow2- ternarization, and exponential quantization methods gave rise to low-precision RNNs that produce similar and even higher accuracy on certain datasets, therefore providing a path towards training more efficient implementations of RNNs in specialized hardware.

openalex-author · MIT Press eBooks

Deep Learning

Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

openalex-author · Computer Speech & Language

Multi-way, multilingual neural machine translation

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Professor Forcing: A New Algorithm for Training Recurrent Networks

The Teacher Forcing algorithm trains recurrent networks by supplying observed sequence values as inputs during training and using the network's own one-step-ahead predictions to do multi-step sampling. We introduce the Professor Forcing algorithm, which uses adversarial domain adaptation to encourage the dynamics of the recurrent network to be the same when training the network and when sampling from the network over multiple time steps. We apply Professor Forcing to language modeling, vocal synthesis on raw waveforms, handwriting generation, and image generation. Empirically we find that Professor Forcing acts as a regularizer, improving test likelihood on character level Penn Treebank and sequential MNIST. We also find that the model qualitatively improves samples, especially when sampling for a large number of time steps. This is supported by human evaluation of sample quality. Trade-offs between Professor Forcing and Scheduled Sampling are discussed. We produce T-SNEs showing that Professor Forcing successfully makes the dynamics of the network during training and sampling more similar.

openalex-author · arXiv (Cornell University)

Understanding intermediate layers using linear classifier probes

Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer to as "probes", trained entirely independently of the model itself. This helps us better understand the roles and dynamics of the intermediate layers. We demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems. We apply this technique to the popular models Inception v3 and Resnet-50. Among other things, we observe experimentally that the linear separability of features increase monotonically along the depth of the model.

openalex-author · arXiv (Cornell University)

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

We introduce a method to train Quantized Neural Networks (QNNs) --- neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At train-time the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves $51\%$ top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.

openalex-author · arXiv (Cornell University)

Hierarchical Multiscale Recurrent Neural Networks

Learning both hierarchical and temporal representation has been among the long-standing challenges of recurrent neural networks. Multiscale recurrent neural networks have been considered as a promising approach to resolve this issue, yet there has been a lack of empirical evidence showing that this type of models can actually capture the temporal dependencies by discovering the latent hierarchical structure of the sequence. In this paper, we propose a novel multiscale approach, called the hierarchical multiscale recurrent neural networks, which can capture the latent hierarchical structure in the sequence by encoding the temporal dependencies with different timescales using a novel update mechanism. We show some evidence that our proposed multiscale architecture can discover underlying hierarchical structure in the sequences without using explicit boundary information. We evaluate our proposed model on character-level language modelling and handwriting sequence modelling.

openalex-author · Interspeech 2016

Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition (ASR).Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models (HMMs/GMMs) have achieved the state-of-the-art in various benchmarks.Meanwhile, Connectionist Temporal Classification (CTC) with Recurrent Neural Networks (RNNs), which is proposed for labeling unsegmented sequences, makes it feasible to train an 'end-to-end' speech recognition system instead of hybrid settings.However, RNNs are computationally expensive and sometimes difficult to train.In this paper, inspired by the advantages of both CNNs and the CTC approach, we propose an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections.By evaluating the approach on the TIMIT phoneme recognition task, we show that the proposed model is not only computationally efficient, but also competitive with the existing baseline systems.Moreover, we argue that CNNs have the capability to model temporal correlations with appropriate context information.

openalex-author · arXiv (Cornell University)

Mollifying Networks

The optimization of deep neural networks can be more challenging than traditional convex optimization problems due to the highly non-convex nature of the loss function, e.g. it can involve pathological landscapes such as saddle-surfaces that can be difficult to escape for algorithms based on simple gradient descent. In this paper, we attack the problem of optimization of highly non-convex neural networks by starting with a smoothed -- or \textit{mollified} -- objective function that gradually has a more non-convex energy landscape during the training. Our proposition is inspired by the recent studies in continuation methods: similar to curriculum methods, we begin learning an easier (possibly convex) objective function and let it evolve during the training, until it eventually goes back to being the original, difficult to optimize, objective function. The complexity of the mollified networks is controlled by a single hyperparameter which is annealed during the training. We show improvements on various difficult optimization tasks and establish a relationship with recent works on continuation methods for neural networks and mollifiers.

openalex-author · Pattern Recognition

Online and offline handwritten Chinese character recognition: A comprehensive study and new benchmark

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

A Neural Knowledge Language Model

Current language models have a significant limitation in the ability to encode and decode factual knowledge. This is mainly because they acquire such knowledge from statistical co-occurrences although most of the knowledge words are rarely observed. In this paper, we propose a Neural Knowledge Language Model (NKLM) which combines symbolic knowledge provided by the knowledge graph with the RNN language model. By predicting whether the word to generate has an underlying fact or not, the model can generate such knowledge-related words by copying from the description of the predicted fact. In experiments, we show that the NKLM significantly improves the performance while generating a much smaller number of unknown words.

openalex-author · arXiv (Cornell University)

An Actor-Critic Algorithm for Structured Prediction

We present an approach to training neural networks to generate sequences using actor-critic methods from reinforcement learning (RL). Current log-likelihood training methods are limited by the discrepancy between their training and testing modes, as models must generate tokens conditioned on their previous guesses rather than the ground-truth tokens. We address this problem by introducing a \textit{critic} network that is trained to predict the value of an output token, given the policy of an \textit{actor} network. This results in a training procedure that is much closer to the test phase, and allows us to directly optimize for a task-specific score such as BLEU. Crucially, since we leverage these techniques in the supervised learning setting rather than the traditional RL setting, we condition the critic network on the ground-truth output. We show that our method leads to improved performance on both a synthetic task, and for German-English machine translation. Our analysis paves the way for such methods to be applied in natural language generation tasks, such as machine translation, caption generation, and dialogue modelling.

openalex-author · arXiv (Cornell University)

Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes

We extend neural Turing machine (NTM) model into a dynamic neural Turing machine (D-NTM) by introducing a trainable memory addressing scheme. This addressing scheme maintains for each memory cell two separate vectors, content and address vectors. This allows the D-NTM to learn a wide variety of location-based addressing strategies including both linear and nonlinear ones. We implement the D-NTM with both continuous, differentiable and discrete, non-differentiable read/write mechanisms. We investigate the mechanisms and effects of learning to read and write into a memory through experiments on Facebook bAbI tasks using both a feedforward and GRUcontroller. The D-NTM is evaluated on a set of Facebook bAbI tasks and shown to outperform NTM and LSTM baselines. We have done extensive analysis of our model and different variations of NTM on bAbI task. We also provide further experimental results on sequential pMNIST, Stanford Natural Language Inference, associative recall and copy tasks.

openalex-author · arXiv (Cornell University)

On Multiplicative Integration with Recurrent Neural Networks

We introduce a general and simple structural design called Multiplicative Integration (MI) to improve recurrent neural networks (RNNs). MI changes the way in which information from difference sources flows and is integrated in the computational building block of an RNN, while introducing almost no extra parameters. The new structure can be easily embedded into many popular RNN models, including LSTMs and GRUs. We empirically analyze its learning behaviour and conduct evaluations on several tasks using different RNN models. Our experimental results demonstrate that Multiplicative Integration can provide a substantial performance boost over many of the existing RNN models.

openalex-author · International Conference on Machine Learning

Bidirectional Helmholtz machines

Efficient unsupervised training and inference in deep generative models remains a challenging problem. One basic approach, called Helmholtz machine or Variational Autoencoder, involves training a top-down directed generative model together with a bottom-up auxiliary model used for approximate inference. Recent results indicate that better generative models can be obtained with better approximate inference procedures. Instead of improving the inference procedure, we here propose a new model, the bidirectional Helmholtz machine, which guarantees that the top-down and bottom-up distributions can efficiently invert each other. We achieve this by interpreting both the top-down and the bottom-up directed models as approximate inference distributions and by defining the model distribution to be the geometric mean of these two. We present a lower-bound for the likelihood of this model and we show that optimizing this bound regularizes the model so that the Bhattacharyya distance between the bottom-up and top-down approximate distributions is minimized. This approach results in state of the art generative models which prefer significantly deeper architectures while it allows for orders of magnitude more efficient likelihood estimation.

openalex-author · arXiv (Cornell University)

Deep Directed Generative Models with Energy-Based Probability Estimation

Training energy-based probabilistic models is confronted with apparently intractable sums, whose Monte Carlo estimation requires sampling from the estimated probability distribution in the inner loop of training. This can be approximately achieved by Markov chain Monte Carlo methods, but may still face a formidable obstacle that is the difficulty of mixing between modes with sharp concentrations of probability. Whereas an MCMC process is usually derived from a given energy function based on mathematical considerations and requires an arbitrarily long time to obtain good and varied samples, we propose to train a deep directed generative model (not a Markov chain) so that its sampling distribution approximately matches the energy function that is being trained. Inspired by generative adversarial networks, the proposed framework involves training of two models that represent dual views of the estimated probability distribution: the energy function (mapping an input configuration to a scalar energy value) and the generator (mapping a noise vector to a generated configuration), both represented by deep neural networks.

openalex-author · arXiv (Cornell University)

Iterative Alternating Neural Attention for Machine Reading

We propose a novel neural attention architecture to tackle machine comprehension tasks, such as answering Cloze-style queries with respect to a document. Unlike previous models, we do not collapse the query into a single vector, instead we deploy an iterative alternating attention mechanism that allows a fine-grained exploration of both the query and the document. Our model outperforms state-of-the-art baselines in standard machine comprehension benchmarks such as CNN news articles and the Children's Book Test (CBT) dataset.

openalex-author · arXiv (Cornell University)

Feedforward Initialization for Fast Inference of Deep Generative Networks is biologically plausible

We consider deep multi-layered generative models such as Boltzmann machines or Hopfield nets in which computation (which implements inference) is both recurrent and stochastic, but where the recurrence is not to model sequential structure, only to perform computation. We find conditions under which a simple feedforward computation is a very good initialization for inference, after the input units are clamped to observed values. It means that after the feedforward initialization, the recurrent network is very close to a fixed point of the network dynamics, where the energy gradient is 0. The main condition is that consecutive layers form a good auto-encoder, or more generally that different groups of inputs into the unit (in particular, bottom-up inputs on one hand, top-down inputs on the other hand) are consistent with each other, producing the same contribution into the total weighted sum of inputs. In biological terms, this would correspond to having each dendritic branch correctly predicting the aggregate input from all the dendritic branches, i.e., the soma potential. This is consistent with the prediction that the synaptic weights into dendritic branches such as those of the apical and basal dendrites of pyramidal cells are trained to minimize the prediction error made by the dendritic branch when the target is the somatic activity. Whereas previous work has shown how to achieve fast negative phase inference (when the model is unclamped) in a predictive recurrent model, this contribution helps to achieve fast positive phase inference (when the target output is clamped) in such recurrent neural models.

openalex-author · PolyPublie (École Polytechnique de Montréal)

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks. We perform an empirical investigation of various RNN regularizers, and find that zoneout gives significant performance improvements across tasks. We achieve competitive results with relatively simple models in character- and word-level language modelling on the Penn Treebank and Text8 datasets, and combining with recurrent batch normalization yields state-of-the-art results on permuted sequential MNIST.

openalex-author · arXiv (Cornell University)

Hierarchical Memory Networks

Memory networks are neural networks with an explicit memory component that can be both read and written to by the network. The memory is often addressed in a soft way using a softmax function, making end-to-end training with backpropagation possible. However, this is not computationally scalable for applications which require the network to read from extremely large memories. On the other hand, it is well known that hard attention mechanisms based on reinforcement learning are challenging to train successfully. In this paper, we explore a form of hierarchical memory network, which can be considered as a hybrid between hard and soft attention memory networks. The memory is organized in a hierarchical structure such that reading from it is done with less computation than soft attention over a flat memory, while also being easier to train than hard attention over a flat memory. Specifically, we propose to incorporate Maximum Inner Product Search (MIPS) in the training and inference procedures for our hierarchical memory network. We explore the use of various state-of-the art approximate MIPS techniques and report results on SimpleQuestions, a challenging large scale factoid question answering task.

openalex-author · Medical Image Analysis

Brain tumor segmentation with Deep Neural Networks

No abstract available from the OpenAlex source record.

openalex-author · Scientific American

Machines Who Learn

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Generating Factoid Questions With Recurrent Neural Networks: The 30M\n Factoid Question-Answer Corpus

Over the past decade, large-scale supervised learning corpora have enabled\nmachine learning researchers to make substantial advances. However, to this\ndate, there are no large-scale question-answer corpora available. In this paper\nwe present the 30M Factoid Question-Answer Corpus, an enormous question answer\npair corpus produced by applying a novel neural network architecture on the\nknowledge base Freebase to transduce facts into natural language questions. The\nproduced question answer pairs are evaluated both by human evaluators and using\nautomatic evaluation metrics, including well-established machine translation\nand sentence similarity metrics. Across all evaluation criteria the\nquestion-generation model outperforms the competing template-based baseline.\nFurthermore, when presented to human evaluators, the generated questions appear\ncomparable in quality to real human-generated questions.\n

openalex-author · arXiv (Cornell University)

A Character-Level Decoder without Explicit Segmentation for Neural\n Machine Translation

The existing machine translation systems, whether phrase-based or neural,\nhave relied almost exclusively on word-level modelling with explicit\nsegmentation. In this paper, we ask a fundamental question: can neural machine\ntranslation generate a character sequence without any explicit segmentation? To\nanswer this question, we evaluate an attention-based encoder-decoder with a\nsubword-level encoder and a character-level decoder on four language\npairs--En-Cs, En-De, En-Ru and En-Fi-- using the parallel corpora from WMT'15.\nOur experiments show that the models with a character-level decoder outperform\nthe ones with a subword-level decoder on all of the four language pairs.\nFurthermore, the ensembles of neural models with a character-level decoder\noutperform the state-of-the-art non-neural machine translation systems on\nEn-Cs, En-De and En-Fi and perform comparably on En-Ru.\n

openalex-author · Information and Inference

GSNs: generative stochastic networks

We introduce a novel training principle for generative probabilistic models that is an alternative to maximum likelihood. The proposed Generative Stochastic Networks (GSNs) framework generalizes Denoising Auto-Encoders (DAEs), and is based on learning the transition operator of a Markov chain whose stationary distribution estimates the data distribution. The transition distribution is a conditional distribution that generally involves a small move, so it has fewer dominant modes and is unimodal in the limit of small moves. This simplifies the learning problem, making it less like density estimation and more akin to supervised function approximation, with gradients that can be obtained by backprop. The theorems provided here provide a probabilistic interpretation for DAEs and generalize them; seen in the context of this framework, auto-encoders that learn with injected noise are a special case of GSNs and can be interpreted as generative models. The theorems also provide an interesting justification for dependency networks and generalized pseudolikelihood, and define an appropriate joint distribution and sampling mechanism, even when the conditionals are not consistent. GSNs can be used with missing inputs and can be used to sample subsets of variables given the others. Experiments validating these theoretical results are conducted on both synthetic datasets and image datasets. The experiments employ a particular architecture that mimics the Deep Boltzmann Machine Gibbs sampler, but that allows training to proceed with backprop through a recurrent neural network with noise injected inside and without the need for layerwise pretraining.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models

We investigate the task of building open domain, conversational dialogue systems based on large dialogue corpora using generative models. Generative models produce system responses that are autonomously generated word-by-word, opening up the possibility for realistic, flexible interactions. In support of this goal, we extend the recently proposed hierarchical recurrent encoder-decoder neural network to the dialogue domain, and demonstrate that this model is competitive with state-of-the-art neural language models and back-off n-gram models. We investigate the limitations of this and similar approaches, and show how its performance can be improved by bootstrapping the learning from a larger question-answer pair corpus and from pretrained word embeddings.

openalex-author · arXiv (Cornell University)

Noisy Activation Functions

Common nonlinear activation functions used in neural networks can cause training difficulties due to the saturation behavior of the activation function, which may hide dependencies that are not visible to vanilla-SGD (using first order gradients only). Gating mechanisms that use softly saturating activation functions to emulate the discrete switching of digital logic circuits are good examples of this. We propose to exploit the injection of appropriate noise so that the gradients may flow easily, even if the noiseless application of the activation function would yield zero gradient. Large noise will dominate the noise-free gradient and allow stochastic gradient descent toexplore more. By adding noise only to the problematic parts of the activation function, we allow the optimization procedure to explore the boundary between the degenerate (saturating) and the well-behaved parts of the activation function. We also establish connections to simulated annealing, when the amount of noise is annealed down, making it easier to optimize hard objective functions. We find experimentally that replacing such saturating activation functions by noisy variants helps training in many contexts, yielding state-of-the-art or competitive results on different datasets and task, especially when training seems to be the most difficult, e.g., when curriculum learning is necessary to obtain good results.

openalex-author · 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

End-to-end attention-based large vocabulary speech recognition

Many state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) Systems are hybrids of neural networks and Hidden Markov Models (HMMs). Recently, more direct end-to-end methods have been investigated, in which neural architectures were trained to model sequences of characters [1,2]. To our knowledge, all these approaches relied on Connectionist Temporal Classification [3] modules. We investigate an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels. We show how this setup can be applied to LVCSR by integrating the decoding RNN with an n-gram language model and by speeding up its operation by constraining selections made by the attention mechanism and by reducing the source sequence lengths by pooling information over time. Recognition accuracies similar to other HMM-free RNN-based approaches are reported for the Wall Street Journal corpus.

openalex-author · 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Batch normalized recurrent neural networks

Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks . In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.

openalex-author · Neural Information Processing Systems

Architectural Complexity Measures of Recurrent Neural Networks

In this paper, we systematically analyze the connecting architectures of recurrent neural networks (RNNs). Our main contribution is twofold: first, we present a rigorous graph-theoretic framework describing the connecting architectures of RNNs in general. Second, we propose three architecture complexity measures of RNNs: (a) the recurrent depth, which captures the RNN’s over-time nonlinear complexity, (b) the feedforward depth, which captures the local input-output nonlinearity (similar to the “depth” in feedforward neural networks (FNNs)), and (c) the recurrent skip coefficient which captures how rapidly the information propagates over time. We rigorously prove each measure’s existence and computability. Our experimental results show that RNNs might benefit from larger recurrent depth and feedforward depth. We further demonstrate that increasing recurrent skip coefficient offers performance boosts on long term dependency problems.

openalex-author · Paper

Task Loss Estimation for Structured Prediction

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Towards a Biologically Plausible Backprop

This work contributes several new elements to the quest for a biologically plausible implementation of backprop in brains. We introduce a very general and abstract framework for machine learning, in which the quantities of interest are defined implicitly through an energy function. In this framework, only one kind of neural computation is involved both for the first phase (when the prediction is made) and the second phase (after the target is revealed), like the contrastive Hebbian learning algorithm in the continuous Hopfield model for example. Contrary to automatic differentiation in computational graphs (i.e. standard backprop), there is no need for special computation in the second phase of our framework. One advantage of our framework over contrastive Hebbian learning is that the second phase corresponds to only nudging the first-phase fixed point towards a configuration that reduces prediction error. In the case of a multi-layer supervised neural network, the output units are slightly nudged towards their target, and the perturbation introduced at the output layer propagates backward in the network. The signal 'back-propagated' during this second phase actually contains information about the error derivatives, which we use to implement a learning rule proved to perform gradient descent with respect to an objective cost function.

openalex-author · arXiv (Cornell University)

BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

We introduce BinaryNet, a method which trains DNNs with binary weights and activations when computing parameters' gradient. We show that it is possible to train a Multi Layer Perceptron (MLP) on MNIST and ConvNets on CIFAR-10 and SVHN with BinaryNet and achieve nearly state-of-the-art results. At run-time, BinaryNet drastically reduces memory usage and replaces most multiplications by 1-bit exclusive-not-or (XNOR) operations, which might have a big impact on both general-purpose and dedicated Deep Learning hardware. We wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST MLP 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for BinaryNet is available.

openalex-author · Procedings of the British Machine Vision Conference 2016

Oracle Performance for Visual Captioning

The task of associating images and videos with a natural language description has attracted a great amount of attention recently. Rapid progress has been made in terms of both developing novel algorithms and releasing new datasets. Indeed, the state-of-the-art results on some of the standard datasets have been pushed into the regime where it has become more and more difficult to make significant improvements. Instead of proposing new models, this work investigates the possibility of empirically establishing performance upper bounds on various visual captioning datasets without extra data labelling effort or human evaluation. In particular, it is assumed that visual captioning is decomposed into two steps: from visual inputs to visual concepts, and from visual concepts to natural language descriptions. One would be able to obtain an upper bound when assuming the first step is perfect and only requiring training a conditional language model for the second step. We demonstrate the construction of such bounds on MS-COCO, YouTube2Text and LSMDC (a combination of M-VAD and MPII-MD). Surprisingly, despite of the imperfect process we used for visual concept extraction in the first step and the simplicity of the language model for the second step, we show that current state-of-the-art models fall short when being compared with the learned upper bounds. Furthermore, with such a bound, we quantify several important factors concerning image and video captioning: the number of visual concepts captured by different models, the trade-off between the amount of visual elements captured and their accuracy, and the intrinsic difficulty and blessing of different datasets.

openalex-author · Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Pointing the Unknown Words

The problem of rare and unknown words is an important issue that can potentially effect the performance of many NLP systems, including traditional count-based and deep learning models.We propose a novel way to deal with the rare and unseen words for the neural network models using attention.Our model uses two softmax layers in order to predict the next word in conditional language models: one predicts the location of a word in the source sentence, and the other predicts a word in the shortlist vocabulary.At each timestep, the decision of which softmax layer to use is adaptively made by an MLP which is conditioned on the context.We motivate this work from a psychological evidence that humans naturally have a tendency to point towards objects in the context or the environment when the name of an object is not known.Using our proposed model, we observe improvements on two tasks, neural machine translation on the Europarl English to French parallel corpora and text summarization on the Gigaword dataset.

openalex-author · Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

NYU-MILA Neural Machine Translation Systems for WMT’16

We describe the neural machine translation system of New York University (NYU) and University of Montreal (MILA) for the translation tasks of WMT'16. The main goal of NYU-MILA submission to WMT'16 is to evaluate a new character-level decoding approach in neural machine translation on various language pairs. The proposed neural machine translation system is an attention-based encoder-decoder with a subword-level encoder and a character-level decoder. The decoder of the neural machine translation system does not require explicit segmentation, when characters are used as tokens. The character-level decoding approach provides benefits especially when translating a source language into other morphologically rich languages.

openalex-author · Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism

We propose multi-way, multilingual neural machine translation. The proposed approach enables a single neural translation model to translate between multiple languages, with a number of parameters that grows only linearly with the number of languages. This is made possible by having a single attention mechanism that is shared across all language pairs. We train the proposed multiway, multilingual model on ten language pairs from WMT'15 simultaneously and observe clear performance improvements over models trained on only one language pair. In particular, we observe that the proposed model significantly improves the translation quality of low-resource language pairs.

openalex-author · IEEE Conference Proceedings

ReSeg:意味論的セグメンテーションのためのリカレントニューラルネットワークに基づくモデル【Powered by NICT】

No abstract available from the OpenAlex source record.

openalex-author · IEEE Conference Proceedings

エンドツーエンド注意に基づいた大語い音声認識【Powered by NICT】

No abstract available from the OpenAlex source record.

openalex-author · Lecture Notes in Computer Science

HeMIS: Hetero-Modal Image Segmentation

No abstract available from the OpenAlex source record.

openalex-author · Investigación y ciencia

Aprendizaje profundo. Tras años de decepciones, la inteligencia artiñcial está empezando a cumplir lo que prometia en sus comienzos gracias a esta potente técnica

La inteligencia artificial (IA) aparecio como disciplina seria en la decada de 1950. Entonces los investigadores creyeron que podrian emular la inteligencia humana en menos tiempo de lo que duraria su carrera. Las esperanzas se desvanecieron cuando quedo claro que los algoritmos y la potencia de computo no bastaban para culminar la tarea. Algunos consideraron que el intento era un acto de pura arrogancia. En los ultimos anos, el desarrollo de nuevas tecnicas computacionales inspiradas en las redes de neuronas del cerebro humano ha resucitado la esperanza de materializar las promesas originales de la IA. El aprendizaje profundo, una tecnica que se vale de redes neuronales complejas, permite que una maquina aprenda conceptos abstractos. En algunas tareas ya se aproxima a lo que pueden lograr los seres humanos.

openalex-author · arXiv (Cornell University)

ReSeg: A Recurrent Neural Network for Object Segmentation

We propose a structured prediction architecture for images centered around deep recurrent neural networks. The proposed network, called ReSeg, is based on the recently introduced ReNet model for object classification. We modify and extend it to perform object segmentation, noting that the avoidance of pooling can greatly simplify pixel-wise tasks for images. The ReSeg layer is composed of four recurrent neural networks that sweep the image horizontally and vertically in both directions, along with a final layer that expands the prediction back to the original image size. ReSeg combines multiple ReSeg layers with several possible input layers as well as a final layer which expands the prediction back to the original image size, making it suitable for a variety of structured prediction tasks. We evaluate ReSeg on the specific task of object segmentation with three widely-used image segmentation datasets, namely Weizmann Horse, Fashionista and Oxford Flower. The results suggest that ReSeg can challenge the state of the art in object segmentation, and may have further applications in structured prediction at large.

openalex-author · arXiv (Cornell University)

Variance Reduction in SGD by Distributed Importance Sampling

Humans are able to accelerate their learning by selecting training materials that are the most informative and at the appropriate level of difficulty. We propose a framework for distributing deep learning in which one set of workers search for the most informative examples in parallel while a single worker updates the model on examples selected by importance sampling. This leads the model to update using an unbiased estimate of the gradient which also has minimum variance when the sampling proposal is proportional to the L2-norm of the gradient. We show experimentally that this method reduces gradient variance even in a context where the cost of synchronization across machines cannot be ignored, and where the factors for importance sampling are not updated instantly across the training set.

openalex-author · arXiv (Cornell University)

Unitary Evolution Recurrent Neural Networks

Recurrent neural networks (RNNs) are notoriously difficult to train. When the eigenvalues of the hidden to hidden weight matrix deviate from absolute value 1, optimization becomes difficult due to the well studied issue of vanishing and exploding gradients, especially when trying to learn long-term dependencies. To circumvent this problem, we propose a new architecture that learns a unitary weight matrix, with eigenvalues of absolute value exactly 1. The challenge we address is that of parametrizing unitary matrices in a way that does not require expensive computations (such as eigendecomposition) after each weight update. We construct an expressive unitary weight matrix by composing several structured matrices that act as building blocks with parameters to be learned. Optimization with this parameterization becomes feasible only when considering hidden states in the complex domain. We demonstrate the potential of this architecture by achieving state of the art results in several hard tasks involving very long-term dependencies.

openalex-author · arXiv (Cornell University)

Task Loss Estimation for Sequence Prediction

Often, the performance on a supervised machine learning task is evaluated with a emph{task loss} function that cannot be optimized directly. Examples of such loss functions include the classification error, the edit distance and the BLEU score. A common workaround for this problem is to instead optimize a emph{surrogate loss} function, such as for instance cross-entropy or hinge loss. In order for this remedy to be effective, it is important to ensure that minimization of the surrogate loss results in minimization of the task loss, a condition that we call emph{consistency with the task loss}. In this work, we propose another method for deriving differentiable surrogate losses that provably meet this requirement. We focus on the broad class of models that define a score for every input-output pair. Our idea is that this score can be interpreted as an estimate of the task loss, and that the estimation error may be used as a consistent surrogate loss. A distinct feature of such an approach is that it defines the desirable value of the score for every input-output pair. We use this property to design specialized surrogate losses for Encoder-Decoder models often used for sequence prediction tasks. In our experiment, we benchmark on the task of speech recognition. Using a new surrogate loss instead of cross-entropy to train an Encoder-Decoder speech recognizer brings a significant ~13% relative improvement in terms of Character Error Rate (CER) in the case when no extra corpora are used for language modeling.

openalex-author · arXiv (Cornell University)

Deconstructing the Ladder Network Architecture

The Manual labeling of data is and will remain a costly endeavor. For this reason, semi-supervised learning remains a topic of practical importance. The recently proposed Ladder Network is one such approach that has proven to be very successful. In addition to the supervised objective, the Ladder Network also adds an unsupervised objective corresponding to the reconstruction costs of a stack of denoising autoencoders. Although the empirical results are impressive, the Ladder Network has many components intertwined, whose contributions are not obvious in such a complex architecture. In order to help elucidate and disentangle the different ingredients in the Ladder Network recipe, this paper presents an extensive experimental investigation of variants of the Ladder Network in which we replace or remove individual components to gain more insight into their relative importance. We find that all of the components are necessary for achieving optimal performance, but they do not contribute equally. For semi-supervised tasks, we conclude that the most important contribution is made by the lateral connection, followed by the application of noise, and finally the choice of what we refer to as the `combinator function' in the decoder path. We also find that as the number of labeled training examples increases, the lateral connections and reconstruction criterion become less important, with most of the improvement in generalization being due to the injection of noise in each layer. Furthermore, we present a new type of combinator function that outperforms the original design in both fully- and semi-supervised tasks, reducing record test error rates on Permutation-Invariant MNIST to 0.57% for the supervised setting, and to 0.97% and 1.0% for semi-supervised settings with 1000 and 100 labeled examples respectively.

openalex-author · arXiv (Cornell University)

Empirical performance upper bounds for image and video captioning

openalex-author · PolyPublie (École Polytechnique de Montréal)

BinaryConnect: Training Deep Neural Networks with binary weights during propagations

Deep Neural Networks (DNN) have achieved state-of-the-art results in a wide range of tasks, with the best results obtained with large training sets and large models. In the past, GPUs enabled these breakthroughs because of their greater computational speed. In the future, faster computation at both training and test time is likely to be crucial for further progress and for consumer applications on low-power devices. As a result, there is much interest in research and development of dedicated hardware for Deep Learning (DL). Binary weights, i.e., weights which are constrained to only two possible values (e.g. -1 or 1), would bring great benefits to specialized DL hardware by replacing many multiply-accumulate operations by simple accumulations, as multipliers are the most space and power-hungry components of the digital implementation of neural networks. We introduce BinaryConnect, a method which consists in training a DNN with binary weights during the forward and backward propagations, while retaining precision of the stored weights in which gradients are accumulated. Like other dropout schemes, we show that BinaryConnect acts as regularizer and we obtain near state-of-the-art results with BinaryConnect on the permutation-invariant MNIST, CIFAR-10 and SVHN.

openalex-author · 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR)

IAPR keynote lecture IV: Deep learning

Deep learning has arisen around 2006 as a renewal of neural networks research allowing such models to have more layers. Theoretical investigations have shown that functions obtained as deep compositions of simpler functions (which includes both deep and recurrent nets) can express highly varying functions (with many ups and downs and different input regions that can be distinguished) much more efficiently (with fewer parameters) than otherwise. Empirical work in a variety of applications has demonstrated that, when well trained, such deep architectures can be highly successful, remarkably breaking through previous state-of-the-art in many areas, including speech recognition, object recognition, language models, and transfer learning. This talk will summarize the advances that have made these breakthroughs possible, and end with questions about some major challenges still ahead of researchers in order to continue our climb towards AI-level competence.

openalex-author · Proceedings of the 24th ACM International on Conference on Information and Knowledge Management

A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion

Users may strive to formulate an adequate textual query for their information need. Search engines assist the users by presenting query suggestions. To preserve the original search intent, suggestions should be context-aware and account for the previous queries issued by the user. Achieving context awareness is challenging due to data sparsity. We present a novel hierarchical recurrent encoder-decoder architecture that makes possible to account for sequences of previous queries of arbitrary lengths. As a result, our suggestions are sensitive to the order of queries in the context while avoiding data sparsity. Additionally, our model can suggest for rare, or long-tail, queries. The produced suggestions are synthetic and are sampled one word at a time, using computationally cheap decoding techniques. This is in contrast to current synthetic suggestion models relying upon machine learning pipelines and hand-engineered feature sets. Results show that our model outperforms existing context-aware approaches in a next query prediction setting. In addition to query suggestion, our architecture is general enough to be used in a variety of other applications.

openalex-author · arXiv (Cornell University)

Neural Networks with Few Multiplications

For most deep learning algorithms training is notoriously time consuming. Since most of the computation in training neural networks is typically spent on floating point multiplications, we investigate an approach to training that eliminates the need for most of these. Our method consists of two parts: First we stochastically binarize weights to convert multiplications involved in computing hidden states to sign changes. Second, while back-propagating error derivatives, in addition to binarizing the weights, we quantize the representations at each layer to convert the remaining multiplications into binary shifts. Experimental results across 3 popular datasets (MNIST, CIFAR10, SVHN) show that this approach not only does not hurt classification performance but can result in even better performance than standard stochastic gradient descent training, paving the way to fast, hardware-friendly training of neural networks.

openalex-author · arXiv (Cornell University)

Early Inference in Energy-Based Models Approximates Back-Propagation

We show that Langevin MCMC inference in an energy-based model with latent variables has the property that the early steps of inference, starting from a stationary point, correspond to propagating error gradients into internal layers, similarly to back-propagation. The error that is back-propagated is with respect to visible units that have received an outside driving force pushing them away from the stationary point. Back-propagated error gradients correspond to temporal derivatives of the activation of hidden units. This observation could be an element of a theory for explaining how brains perform credit assignment in deep hierarchies as efficiently as back-propagation does. In this theory, the continuous-valued latent variables correspond to averaged voltage potential (across time, spikes, and possibly neurons in the same minicolumn), and neural computation corresponds to approximate inference and error back-propagation at the same time.

openalex-author · arXiv (Cornell University)

An objective function for STDP.

We introduce a predictive objective function for the rate aspect of spike-timing dependent plasticity (STDP), i.e., ignoring the effects of synchrony of spikes but looking at spiking {\em rate changes}. The proposed weight update is proportional to the presynaptic spiking (or firing) rate times the {\em temporal change} of the integrated postsynaptic activity. We present an intuitive explanation for the relationship between spike-timing and weight change that arises when the weight change follows this rule. Spike-based simulations agree with the proposed relationship between spike timing and the temporal change of postsynaptic activity. They show a strong correlation between the biologically observed STDP behavior and the behavior obtained from simulations where the weight change follows the gradient of the predictive objective function.

openalex-author · Journal on Multimodal User Interfaces

EmoNets: Multimodal deep learning approaches for emotion recognition in video

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Clustering is Efficient for Approximate Maximum Inner Product Search

Efficient Maximum Inner Product Search (MIPS) is an important task that has a wide applicability in recommendation systems and classification with a large number of classes. Solutions based on locality-sensitive hashing (LSH) as well as tree-based solutions have been investigated in the recent literature, to perform approximate MIPS in sublinear time. In this paper, we compare these to another extremely simple approach for solving approximate MIPS, based on variants of the k-means clustering algorithm. Specifically, we propose to train a spherical k-means, after having reduced the MIPS problem to a Maximum Cosine Similarity Search (MCSS). Experiments on two standard recommendation system benchmarks as well as on large vocabulary word embeddings, show that this simple approach yields much higher speedups, for the same retrieval precision, than current state-of-the-art hashing-based and tree-based methods. This simple method also yields more robust retrievals when the query is corrupted by noise.

openalex-author · arXiv (Cornell University)

Hierarchical Neural Network Generative Models for Movie Dialogues.

We consider the task of generative dialogue modeling for movie scripts. To this end, we extend the recently proposed hierarchical recurrent encoder decoder neural network and demonstrate that this model is competitive with state-of-the-art neural language models and backoff n-gram models. We show that its performance can be improved considerably by bootstrapping the learning from a larger questionanswer pair corpus and from pretrained word embeddings.

openalex-author · arXiv (Cornell University)

Attention-Based Models for Speech Recognition

Recurrent sequence generators conditioned on input data through an attention mechanism have recently shown very good performance on a range of tasks in- cluding machine translation, handwriting synthesis and image caption gen- eration. We extend the attention-mechanism with features needed for speech recognition. We show that while an adaptation of the model used for machine translation in reaches a competitive 18.7% phoneme error rate (PER) on the TIMIT phoneme recognition task, it can only be applied to utterances which are roughly as long as the ones it was trained on. We offer a qualitative explanation of this failure and propose a novel and generic method of adding location-awareness to the attention mechanism to alleviate this issue. The new method yields a model that is robust to long inputs and achieves 18% PER in single utterances and 20% in 10-times longer (repeated) utterances. Finally, we propose a change to the at- tention mechanism that prevents it from concentrating too much on single frames, which further reduces PER to 17.6% level.

openalex-author · arXiv (Cornell University)

Training opposing directed models using geometric mean matching.

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

A Recurrent Latent Variable Model for Sequential Data

In this paper, we explore the inclusion of latent random variables into the dynamic hidden state of a recurrent neural network (RNN) by combining elements of the variational autoencoder. We argue that through the use of high-level latent random variables, the variational RNN (VRNN)1 can model the kind of variability observed in highly structured sequential data such as natural speech. We empirically evaluate the proposed model against related sequential models on four speech datasets and one handwriting dataset. Our results show the important roles that latent random variables can play in the RNN dynamic hidden state.

openalex-author · arXiv (Cornell University)

Blocks and Fuel: Frameworks for deep learning

We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

Blocks and Fuel

This is our first stable release. It's intended to provide a stable version of Blocks and Fuel leading up to the NIPS 2015 deadline, and as a testbed for our new development model.

openalex-author · arXiv (Cornell University)

ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks

In this paper, we propose a deep neural network architecture for object recognition based on recurrent neural networks. The proposed network, called ReNet, replaces the ubiquitous convolution+pooling layer of the deep convolutional neural network with four recurrent neural networks that sweep horizontally and vertically in both directions across the image. We evaluate the proposed ReNet on three widely-used benchmark datasets; MNIST, CIFAR-10 and SVHN. The result suggests that ReNet is a viable alternative to the deep convolutional neural network, and that further investigation is needed.

openalex-author · arXiv (Cornell University)

On Using Monolingual Corpora in Neural Machine Translation

Recent work on end-to-end neural network-based architectures for machine translation has shown promising results for En-Fr and En-De translation. Arguably, one of the major factors behind this success has been the availability of high quality parallel corpora. In this work, we investigate how to leverage abundant monolingual corpora for neural machine translation. Compared to a phrase-based and hierarchical baseline, we obtain up to $1.96$ BLEU improvement on the low-resource language pair Turkish-English, and $1.59$ BLEU on the focused domain task of Chinese-English chat messages. While our method was initially targeted toward such tasks with less parallel data, we show that it also extends to high resource languages such as Cs-En and De-En where we obtain an improvement of $0.39$ and $0.47$ BLEU scores over the neural machine translation baselines, respectively.

openalex-author · arXiv (Cornell University)

RMSProp and equilibrated adaptive learning rates for non-convex optimization.

Parameter-specific adaptive learning rate methods are computationally efficient ways to reduce the ill-conditioning problems encountered when training large deep networks. Following recent work that strongly suggests that most of the critical points encountered when training such networks are saddle points, we find how considering the presence of negative eigenvalues of the Hessian could help us design better suited adaptive learning rate schemes, i.e., diagonal preconditioners. We show that the optimal preconditioner is based on taking the absolute value of the Hessian's eigenvalues, which is not what Newton and classical preconditioners like Jacobi's do. In this paper, we propose a novel adaptive learning rate scheme based on the equilibration preconditioner and show that RMSProp approximates it, which may explain some of its success in the presence of saddle points. Whereas RMSProp is a biased estimator of the equilibration preconditioner, the proposed stochastic estimator, ESGD, is unbiased and only adds a small percentage to computing time. We find that both schemes yield very similar step directions but that ESGD sometimes surpasses RMSProp in terms of convergence speed, always clearly improving over plain stochastic gradient descent.

openalex-author · arXiv (Cornell University)

Towards Biologically Plausible Deep Learning

Neuroscientists have long criticised deep learning algorithms as incompatible with current knowledge of neurobiology. We explore more biologically plausible versions of deep representation learning, focusing here mostly on unsupervised learning but developing a learning mechanism that could account for supervised, unsupervised and reinforcement learning. The starting point is that the basic learning rule believed to govern synaptic weight updates (Spike-Timing-Dependent Plasticity) arises out of a simple update rule that makes a lot of sense from a machine learning point of view and can be interpreted as gradient descent on some objective function so long as the neuronal dynamics push firing rates towards better values of the objective function (be it supervised, unsupervised, or reward-driven). The second main idea is that this corresponds to a form of the variational EM algorithm, i.e., with approximate rather than exact posteriors, implemented by neural dynamics. Another contribution of this paper is that the gradients required for updating the hidden states in the above variational interpretation can be estimated using an approximation that only requires propagating activations forward and backward, with pairs of layers learning to form a denoising auto-encoder. Finally, we extend the theory about the probabilistic interpretation of auto-encoders to justify improved sampling schemes based on the generative interpretation of denoising auto-encoders, and we validate all these ideas on generative learning tasks.

openalex-author · arXiv (Cornell University)

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.

openalex-author · arXiv (Cornell University)

Gated Feedback Recurrent Neural Networks

In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units, revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.

openalex-author · Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processin

On Using Very Large Target Vocabulary for Neural Machine Translation

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, Yoshua Bengio. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2015.

openalex-author · Proceedings of the Tenth Workshop on Statistical Machine Translation

Montreal Neural Machine Translation Systems for WMT’15

Neural machine translation (NMT) systems have recently achieved results comparable to the state of the art on a few translation tasks, including EnglishFrench and EnglishGerman. The main purpose of the Montreal Institute for Learning Algorithms (MILA) submission to WMT'15 is to evaluate this new approach on a greater variety of language pairs. Furthermore, the human evaluation campaign may help us and the research community to better understand the behaviour of our systems. We use the RNNsearch architecture, which adds an attention mechanism to the encoderdecoder. We also leverage some of the recent developments in NMT, including the use of large vocabularies, unknown word replacement and, to a limited degree, the inclusion of monolingual language models.

openalex-author · Lecture Notes in Computer Science

Difference Target Propagation

No abstract available from the OpenAlex source record.

openalex-author · Neural Networks

Challenges in representation learning: A report on three machine learning contests

No abstract available from the OpenAlex source record.

openalex-author · IEEE/ACM Transactions on Audio, Speech, and Language Processing

Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding

Semantic slot filling is one of the most challenging problems in spoken language understanding (SLU). In this paper, we propose to use recurrent neural networks (RNNs) for this task, and present several novel architectures designed to efficiently model past and future temporal dependencies. Specifically, we implemented and compared several important RNN architectures, including Elman, Jordan, and hybrid variants. To facilitate reproducibility, we implemented these networks with the publicly available Theano neural network toolkit and completed experiments on the well-known airline travel information system (ATIS) benchmark. In addition, we compared the approaches on two custom SLU data sets from the entertainment and movies domains. Our results show that the RNN-based models outperform the conditional random field (CRF) baseline by 2% in absolute error reduction on the ATIS benchmark. We improve the state-of-the-art by 0.5% in the Entertainment domain, and 6.7% for the movies domain.

openalex-author · arXiv (Cornell University)

Target Propagation

Back-propagation has been the workhorse of recent successes of deep learning but it relies on infinitesimal effects (partial derivatives) in order to perform credit assignment. This could become a serious issue as one considers deeper and more non-linear functions, e.g., consider the extreme case of nonlinearity where the relation between parameters and cost is actually discrete. Inspired by the biological implausibility of back-propagation, a few approaches have been proposed in the past that could play a similar credit assignment role. In this spirit, we explore a novel approach to credit assignment in deep networks that we call target propagation. The main idea is to compute targets rather than gradients, at each layer. Like gradients, they are propagated backwards. In a way that is related but different from previously proposed proxies for back-propagation which rely on a backwards network with symmetric weights, target propagation relies on auto-encoders at each layer. Unlike back-propagation, it can be applied even when units exchange stochastic bits rather than real numbers. We show that a linear correction for the imperfectness of the auto-encoders, called difference target propagation, is very effective to make target propagation actually work, leading to results comparable to back-propagation for deep networks with discrete and continuous units and denoising auto-encoders and achieving state of the art for stochastic networks.

openalex-author · arXiv (Cornell University)

Embedding Word Similarity with Neural Machine Translation

Neural language models learn word representations, or embeddings, that capture rich linguistic and conceptual information. Here we investigate the embeddings learned by neural machine translation models, a recently-developed class of neural language model. We show that embeddings from translation models outperform those learned by monolingual models at tasks that require knowledge of both conceptual similarity and lexical-syntactic role. We further show that these effects hold when translating from both English to French and English to German, and argue that the desirable properties of translation embeddings should emerge largely independently of the source and target languages. Finally, we apply a new method for training neural translation models with very large vocabularies, and show that this vocabulary expansion algorithm results in minimal degradation of embedding quality. Our embedding spaces can be queried in an online demo and downloaded from our web page. Overall, our analyses indicate that translation-based embeddings should be used in applications that require concepts to be organised according to similarity and/or lexical function, while monolingual embeddings are better suited to modelling (nonspecific) inter-word relatedness.

openalex-author · arXiv (Cornell University)

FitNets: Hints for Thin Deep Nets

While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times less parameters outperforms a larger, state-of-the-art teacher network.

openalex-author · arXiv (Cornell University)

Ensemble of Generative and Discriminative Techniques for Sentiment Analysis of Movie Reviews

Sentiment analysis is a common task in natural language processing that aims to detect polarity of a text document (typically a consumer review). In the simplest settings, we discriminate only between positive and negative sentiment, turning the task into a standard binary classification problem. We compare several ma- chine learning approaches to this problem, and combine them to achieve the best possible results. We show how to use for this task the standard generative lan- guage models, which are slightly complementary to the state of the art techniques. We achieve strong results on a well-known dataset of IMDB movie reviews. Our results are easily reproducible, as we publish also the code needed to repeat the experiments. This should simplify further advance of the state of the art, as other researchers can combine their techniques with ours with little effort.

openalex-author · Neural Networks

Editorial introduction to the Neural Networks special issue on Deep Learning of Representations

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

In this paper we compare different types of recurrent units in recurrent neural networks (RNNs). Especially, we focus on more sophisticated units that implement a gating mechanism, such as a long short-term memory (LSTM) unit and a recently proposed gated recurrent unit (GRU). We evaluate these recurrent units on the tasks of polyphonic music modeling and speech signal modeling. Our experiments revealed that these advanced recurrent units are indeed better than more traditional recurrent units such as tanh units. Also, we found GRU to be comparable to LSTM.

openalex-author · arXiv (Cornell University)

End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results

We replace the Hidden Markov Model (HMM) which is traditionally used in in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.

openalex-author · Advances in Intelligent Systems and Computing

Unsupervised Learning of Semantics of Object Detections for Scene Categorization

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

How transferable are features in deep neural networks?

Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks. Features must eventually transition from general to specific by the last layer of the network, but this transition has not been studied extensively. In this paper we experimentally quantify the generality versus specificity of neurons in each layer of a deep convolutional neural network and report a few surprising results. Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected. In an example network trained on ImageNet, we demonstrate that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also document that the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features. A final surprising result is that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.

openalex-author · arXiv (Cornell University)

NICE: Non-linear Independent Components Estimation

We propose a deep learning framework for modeling complex high-dimensional densities called Non-linear Independent Component Estimation (NICE). It is based on the idea that a good representation is one in which the data has a distribution that is easy to model. For this purpose, a non-linear deterministic transformation of the data is learned that maps it to a latent space so as to make the transformed data conform to a factorized distribution, i.e., resulting in independent latent variables. We parametrize this transformation so that computing the Jacobian determinant and inverse transform is trivial, yet we maintain the ability to learn complex non-linear transformations, via a composition of simple building blocks, each based on a deep neural network. The training criterion is simply the exact log-likelihood, which is tractable. Unbiased ancestral sampling is also easy. We show that this approach yields good generative models on four image datasets and can be used for inpainting.

openalex-author · arXiv (Cornell University)

BilBOWA: Fast Bilingual Distributed Representations without Word\n Alignments

We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple\nand computationally-efficient model for learning bilingual distributed\nrepresentations of words which can scale to large monolingual datasets and does\nnot require word-aligned parallel training data. Instead it trains directly on\nmonolingual data and extracts a bilingual signal from a smaller set of raw-text\nsentence-aligned data. This is achieved using a novel sampled bag-of-words\ncross-lingual objective, which is used to regularize two noise-contrastive\nlanguage models for efficient cross-lingual feature learning. We show that\nbilingual embeddings learned using the proposed model outperform\nstate-of-the-art methods on a cross-lingual document classification task as\nwell as a lexical translation task on WMT11 data.\n

openalex-author · arXiv (Cornell University)

Deep Directed Generative Autoencoders

For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.

openalex-author · arXiv (Cornell University)

Not All Neural Embeddings are Born Equal

Neural language models learn word representations that capture rich linguistic and conceptual information. Here we investigate the embeddings learned by neural machine translation models. We show that translation-based embeddings outperform those learned by cutting-edge monolingual models at single-language tasks requiring knowledge of conceptual similarity and/or syntactic role. The findings suggest that, while monolingual models learn information about how concepts are related, neural-translation models better capture their true ontological status.

openalex-author · arXiv (Cornell University)

Deep Tempering

Restricted Boltzmann Machines (RBMs) are one of the fundamental building blocks of deep learning. Approximate maximum likelihood training of RBMs typically necessitates sampling from these models. In many training scenarios, computationally efficient Gibbs sampling procedures are crippled by poor mixing. In this work we propose a novel method of sampling from Boltzmann machines that demonstrates a computationally efficient way to promote mixing. Our approach leverages an under-appreciated property of deep generative models such as the Deep Belief Network (DBN), where Gibbs sampling from deeper levels of the latent variable hierarchy results in dramatically increased ergodicity. Our approach is thus to train an auxiliary latent hierarchical model, based on the DBN. When used in conjunction with parallel-tempering, the method is asymptotically guaranteed to simulate samples from the target RBM. Experimental results confirm the effectiveness of this sampling strategy in the context of RBM training.

openalex-author · arXiv (Cornell University)

On the Properties of Neural Machine Translation: Encoder-Decoder Approaches

Neural machine translation is a relatively new approach to statistical machine translation based purely on neural networks. The neural machine translation models often consist of an encoder and a decoder. The encoder extracts a fixed-length representation from a variable-length input sentence, and the decoder generates a correct translation from this representation. In this paper, we focus on analyzing the properties of the neural machine translation using two models; RNN Encoder--Decoder and a newly proposed gated recursive convolutional neural network. We show that the neural machine translation performs relatively well on short sentences without unknown words, but its performance degrades rapidly as the length of the sentence and the number of unknown words increase. Furthermore, we find that the proposed gated recursive convolutional network learns a grammatical structure of a sentence automatically.

openalex-author · arXiv (Cornell University)

Overcoming the Curse of Sentence Length for Neural Machine Translation\n using Automatic Segmentation

The authors of (Cho et al., 2014a) have shown that the recently introduced\nneural network translation systems suffer from a significant drop in\ntranslation quality when translating long sentences, unlike existing\nphrase-based translation systems. In this paper, we propose a way to address\nthis issue by automatically segmenting an input sentence into phrases that can\nbe easily translated by the neural network translation model. Once each segment\nhas been independently translated by the neural machine translation model, the\ntranslated clauses are concatenated to form a final translation. Empirical\nresults show a significant improvement in translation quality for long\nsentences.\n

openalex-author · Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

Scaling up deep learning

Deep learning has rapidly moved from a marginal approach in the machine learning community less than ten years ago to one that has strong industrial impact, in particular for high-dimensional perceptual data such as speech and images, but also natural language. The demand for experts in deep learning is growing very fast (faster than we can graduate PhDs), thereby considerably increasing their market value. Deep learning is based on the idea of learning multiple levels of representation, with higher levels computed as a function of lower levels, and corresponding to more abstract concepts automatically discovered by the learner. Deep learning arose out of research on artificial neural networks and graphical models and the literature on that subject has considerably grown in recent years, culminating in the creation of a dedicated conference (ICLR). The tutorial will introduce some of the basic algorithms, both on the supervised and unsupervised sides, as well as discuss some of the guidelines for successfully using them in practice. Finally, it will introduce current research questions regarding the challenge of scaling up deep learning to much larger models that can successfully extract information from huge datasets.

openalex-author · arXiv (Cornell University)

How Auto-Encoders Could Provide Credit Assignment in Deep Networks via Target Propagation

We propose to exploit {\em reconstruction} as a layer-local training signal for deep learning. Reconstructions can be propagated in a form of target propagation playing a role similar to back-propagation but helping to reduce the reliance on derivatives in order to perform credit assignment across many levels of possibly strong non-linearities (which is difficult for back-propagation). A regularized auto-encoder tends produce a reconstruction that is a more likely version of its input, i.e., a small move in the direction of higher likelihood. By generalizing gradients, target propagation may also allow to train deep networks with discrete hidden units. If the auto-encoder takes both a representation of input and target (or of any side information) in input, then its reconstruction of input representation provides a target towards a representation that is more likely, conditioned on all the side information. A deep auto-encoder decoding path generalizes gradient propagation in a learned way that can could thus handle not just infinitesimal changes but larger, discrete changes, hopefully allowing credit assignment through a long chain of non-linear operations. In addition to each layer being a good auto-encoder, the encoder also learns to please the upper layers by transforming the data into a space where it is easier to model by them, flattening manifolds and disentangling factors. The motivations and theoretical justifications for this approach are laid down in this paper, along with conjectures that will have to be verified either mathematically or experimentally, including a hypothesis stating that such auto-encoder mediated target propagation could play in brains the role of credit assignment through many non-linear, noisy and discrete transformations.

openalex-author · Proceedings of the Companion Publication of the 2014 Annual Conference on Genetic and Evolutionary Computation

Deep learning and cultural evolution

We propose a theory and its first experimental tests, relating difficulty of learning in deep architectures to culture and language. The theory is articulated around the following hypotheses: learning in an individual human brain is hampered by the presence of effective local minima, particularly when it comes to learning higher-level abstractions, which are represented by the composition of many levels of representation, i.e., by deep architectures; a human brain can learn such high-level abstractions if guided by the signals produced by other humans, which act as hints for intermediate and higher-level abstractions; language and the recombination and optimization of mental concepts provide an efficient evolutionary recombination operator for this purpose. The theory is grounded in experimental observations of the difficulties of training deep artificial neural networks and an empirical test of the hypothesis regarding the need for guidance of intermediate concepts is demonstrated. This is done through a learning task on which all the tested machine learning algorithms failed, unless provided with hints about intermediate-level abstractions.

openalex-author · arXiv (Cornell University)

Exponentially Increasing the Capacity-to-Computation Ratio for Conditional Computation in Deep Learning

Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.

openalex-author · http://www.cse.wustl.edu/~mchen/papers/deepmsda.pdf

Marginalized Denoising Auto-encoders for Nonlinear Representations

Denoising auto-encoders (DAEs) have been suc-cessfully used to learn new representations for a wide range of machine learning tasks. During training, DAEs make many passes over the train-ing dataset and reconstruct it from partial cor-ruption generated from a pre-specified corrupting distribution. This process learns robust represen-tation, though at the expense of requiring many training epochs, in which the data is explicitly corrupted. In this paper we present the marginal-ized Denoising Auto-encoder (mDAE), which (approximately) marginalizes out the corruption during training. Effectively, the mDAE takes into account infinitely many corrupted copies of the training data in every epoch, and therefore is able to match or outperform the DAE with much fewer training epochs. We analyze our proposed algorithm and show that it can be understood as a classic auto-encoder with a special form of reg-ularization. In empirical evaluations we show that it attains 1-2 order-of-magnitude speedup in training time over other competing approaches. 1.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Learning Concept Embeddings for Query Expansion by Quantum Entropy Minimization

In web search, users queries are formulated using only few terms and term-matching retrieval functions could fail at retrieving relevant documents. Given a user query, the technique of query expansion (QE) consists in selecting related terms that could enhance the likelihood of retrieving relevant documents. Selecting such expansion terms is challenging and requires a computational framework capable of encoding complex semantic relationships. In this paper, we propose a novel method for learning, in a supervised way, semantic representations for words and phrases. By embedding queries and documents in special matrices, our model disposes of an increased representational power with respect to existing approaches adopting a vector representation. We show that our model produces high-quality query expansion terms. Our expansion increase IR measures beyond expansion from current word-embeddings models and well-established traditional QE methods.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

On the Challenges of Physical Implementations of RBMs

Restricted Boltzmann machines (RBMs) are powerful machine learning models, but learning and some kinds of inference in the model require sampling-based approximations, which, in classical digital computers, are implemented using expensive MCMC. Physical computation offers the opportunity to reduce the costof sampling by building physical systems whose natural dynamics correspond to drawing samples from the desired RBM distribution. Such a system avoids the burn-in and mixing cost of a Markov chain. However, hardware implementations of this variety usually entail limitations such as low-precision and limited range of the parameters and restrictions on the size and topology of the RBM. We conduct software simulations to determine how harmful each of these restrictions is. Our simulations are based on the D-Wave Two computer, but the issues we investigate arise in most forms of physical computation.Our findings suggest that designers of new physical computing hardware and algorithms for physical computers should focus their efforts on overcoming the limitations imposed by the topology restrictions of currently existing physical computers.

openalex-author · arXiv (Cornell University)

Reweighted Wake-Sleep

Training deep directed graphical models with many hidden variables and performing inference remains a major challenge. Helmholtz machines and deep belief networks are such models, and the wake-sleep algorithm has been proposed to train them. The wake-sleep algorithm relies on training not just the directed generative model but also a conditional generative model (the inference network) that runs backward from visible to latent, estimating the posterior distribution of latent given visible. We propose a novel interpretation of the wake-sleep algorithm which suggests that better estimators of the gradient can be obtained by sampling latent variables multiple times from the inference network. This view is based on importance sampling as an estimator of the likelihood, with the approximate inference network as a proposal distribution. This interpretation is confirmed experimentally, showing that better likelihood can be achieved with this reweighted wake-sleep procedure. Based on this interpretation, we propose that a sigmoidal belief network is not sufficiently powerful for the layers of the inference network in order to recover a good estimator of the posterior distribution of latent variables. Our experiments show that using a more powerful layer model, such as NADE, yields substantially better generative models.

openalex-author · arXiv (Cornell University)

Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

A central challenge to many fields of science and engineering involves minimizing non-convex error functions over continuous, high dimensional spaces. Gradient descent or quasi-Newton methods are almost ubiquitously used to perform such minimizations, and it is often thought that a main source of difficulty for these local methods to find the global minimum is the proliferation of local minima with much higher error than the global minimum. Here we argue, based on results from statistical physics, random matrix theory, neural network theory, and empirical evidence, that a deeper and more profound difficulty originates from the proliferation of saddle points, not local minima, especially in high dimensional problems of practical interest. Such saddle points are surrounded by high error plateaus that can dramatically slow down learning, and give the illusory impression of the existence of a local minimum. Motivated by these arguments, we propose a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods. We apply this algorithm to deep or recurrent neural network training, and provide numerical evidence for its superior optimization performance.

openalex-author · arXiv (Cornell University)

Iterative Neural Autoregressive Distribution Estimator (NADE-k)

Training of the neural autoregressive density estimator (NADE) can be viewed as doing one step of probabilistic inference on missing values in data. We propose a new model that extends this inference scheme to multiple steps, arguing that it is easier to learn to improve a reconstruction in $k$ steps rather than to learn to reconstruct in a single inference step. The proposed model is an unsupervised building block for deep learning that combines the desirable properties of NADE and multi-predictive training: (1) Its test likelihood can be computed analytically, (2) it is easy to generate independent samples from it, and (3) it uses an inference engine that is a superset of variational inference for Boltzmann machines. The proposed NADE-k is competitive with the state-of-the-art in density estimation on the two datasets tested.

openalex-author · arXiv (Cornell University)

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

openalex-author · arXiv (Cornell University)

On the saddle point problem for non-convex optimization

A central challenge to many fields of science and engineering involves minimizing non-convex error functions over continuous, high dimensional spaces. Gradient descent or quasi-Newton methods are almost ubiquitously used to perform such minimizations, and it is often thought that a main source of difficulty for the ability of these local methods to find the global minimum is the proliferation of local minima with much higher error than the global minimum. Here we argue, based on results from statistical physics, random matrix theory, and neural network theory, that a deeper and more profound difficulty originates from the proliferation of saddle points, not local minima, especially in high dimensional problems of practical interest. Such saddle points are surrounded by high error plateaus that can dramatically slow down learning, and give the illusory impression of the existence of a local minimum. Motivated by these arguments, we propose a new algorithm, the saddle-free Newton method, that can rapidly escape high dimensional saddle points, unlike gradient descent and quasi-Newton methods. We apply this algorithm to deep neural network training, and provide preliminary numerical evidence for its superior performance.

openalex-author · arXiv (Cornell University)

On the Number of Linear Regions of Deep Neural Networks

We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.

openalex-author · IEEE Transactions on Pattern Analysis and Machine Intelligence

The Spike-and-Slab RBM and Extensions to Discrete and Sparse Data Distributions

The spike-and-slab restricted Boltzmann machine (ssRBM) is defined to have both a real-valued "slab" variable and a binary "spike" variable associated with each unit in the hidden layer. The model uses its slab variables to model the conditional covariance of the observation-thought to be important in capturing the statistical properties of natural images. In this paper, we present the canonical ssRBM framework together with some extensions. These extensions highlight the flexibility of the spike-and-slab RBM as a platform for exploring more sophisticated probabilistic models of high dimensional data in general and natural image data in particular. Here, we introduce the subspace-ssRBM focused on the task of learning invariant features. We highlight the behaviour of the ssRBM and its extensions through experiments with the MNIST digit recognition task and the CIFAR-10 object classification task.

openalex-author · Lecture Notes in Computer Science

On the Equivalence between Deep NADE and Generative Stochastic Networks

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Low precision arithmetic for deep learning

We simulate the training of a set of state of the art neural networks, the Maxout networks (Goodfellow et al., 2013a), on three benchmark datasets: the MNIST, CIFAR10 and SVHN, with three distinct arithmetics: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those arithmetics, we assess the impact of the precision of the computations on the final error of the training. We find that very low precision computation is sufficient not just for running trained networks but also for training them. For example, almost state-of-the-art results were obtained on most datasets with 10 bits for computing activations and gradients, and 12 bits for storing updated parameters.

openalex-author · Studies in Computational Intelligence

Evolving Culture Versus Local Minima

No abstract available from the OpenAlex source record.

openalex-author · International Conference on Learning Representations

How to Construct Deep Recurrent Neural Networks

Abstract: In this paper, we explore different ways to extend a recurrent neural network (RNN) to a \textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.

openalex-author · Lecture Notes in Computer Science

Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

An empirical analysis of dropout in piecewise linear networks

The recently introduced dropout training criterion for neural networks has been the subject of much attention due to its simplicity and remarkable effectiveness as a regularizer, as well as its interpretation as a training procedure for an exponentially large ensemble of networks that share parameters. In this work we empirically investigate several questions related to the efficacy of dropout, specifically as it concerns networks employing the popular rectified linear activation function. We investigate the quality of the test time weight-scaling inference procedure by evaluating the geometric average exactly in small models, as well as compare the performance of the geometric mean to the arithmetic mean more commonly employed by ensemble techniques. We explore the effect of tied weights on the ensemble interpretation by training ensembles of masked networks without tied weights. Finally, we investigate an alternative criterion based on a biased estimator of the maximum likelihood ensemble gradient.

openalex-author · International Conference on Learning Representations

Multimodal Transitions for Generative Stochastic Networks

Generative Stochastic Networks (GSNs) have been recently introduced as an alternative to traditional probabilistic modeling: instead of parametrizing the data distribution directly, one parametrizes a transition operator for a Markov chain whose stationary distribution is an estimator of the data generating distribution. The result of training is therefore a machine that generates samples through this Markov chain. However, the previously introduced GSN consistency theorems suggest that in order to capture a wide class of distributions, the transition operator in general should be multimodal, something that has not been done before this paper. We introduce for the first time multimodal transition distributions for GSNs, in particular using models in the NADE family (Neural Autoregressive Density Estimator) as output distributions of the transition operator. A NADE model is related to an RBM (and can thus model multimodal distributions) but its likelihood (and likelihood gradient) can be computed easily. The parameters of the NADE are obtained as a learned function of the previous state of the learned Markov chain. Experiments clearly illustrate the advantage of such multimodal transition distributions over unimodal GSNs.

openalex-author · arXiv (Cornell University)

An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

Catastrophic forgetting is a problem faced by many machine learning models and algorithms. When trained on one task, then trained on a second task, many machine learning models "forget" how to perform the first task. This is widely believed to be a serious problem for neural networks. Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions. We also examine the effect of the relationship between the first task and the second task on catastrophic forgetting. We find that it is always best to train using the dropout algorithm--the dropout algorithm is consistently best at adapting to the new task, remembering the old task, and has the best tradeoff curve between these two extremes. We find that different tasks and relationships between tasks result in very different rankings of activation function performance. This suggests the choice of activation function should always be cross-validated.

openalex-author · arXiv (Cornell University)

On the number of response regions of deep feed forward networks with piece-wise linear activations

This paper explores the complexity of deep feedforward networks with linear pre-synaptic couplings and rectified linear activations. This is a contribution to the growing body of work contrasting the representational power of deep and shallow network architectures. In particular, we offer a framework for comparing deep and shallow models that belong to the family of piecewise linear functions based on computational geometry. We look at a deep rectifier multi-layer perceptron (MLP) with linear outputs units and compare it with a single layer version of the model. In the asymptotic regime, when the number of inputs stays constant, if the shallow model has $kn$ hidden units and $n_0$ inputs, then the number of linear regions is $O(k^{n_0}n^{n_0})$. For a $k$ layer model with $n$ hidden units on each layer it is $Ω(\left\lfloor {n}/{n_0}\right\rfloor^{k-1}n^{n_0})$. The number $\left\lfloor{n}/{n_0}\right\rfloor^{k-1}$ grows faster than $k^{n_0}$ when $n$ tends to infinity or when $k$ tends to infinity and $n \geq 2n_0$. Additionally, even when $k$ is small, if we restrict $n$ to be $2n_0$, we can show that a deep model has considerably more linear regions that a shallow one. We consider this as a first step towards understanding the complexity of these models and specifically towards providing suitable mathematical tools for future analysis.

openalex-author · Neural Information Processing Systems

Multi-Prediction Deep Boltzmann Machines

We introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1

openalex-author · Neural Information Processing Systems

Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs

Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros.

openalex-author · Proceedings of the 15th ACM on International conference on multimodal interaction

Combining modality specific deep neural networks for emotion recognition in video

In this paper we present the techniques used for the University of Montréal's team submissions to the 2013 Emotion Recognition in the Wild Challenge. The challenge is to classify the emotions expressed by the primary human subject in short video clips extracted from feature length movies. This involves the analysis of video clips of acted scenes lasting approximately one-two seconds, including the audio track which may contain human voices as well as background music. Our approach combines multiple deep neural networks for different data modalities, including: (1) a deep convolutional neural network for the analysis of facial expressions within video frames; (2) a deep belief net to capture audio information; (3) a deep autoencoder to model the spatio-temporal information produced by the human actions depicted within the entire scene; and (4) a shallow network architecture focused on extracted features of the mouth of the primary human subject in the scene. We discuss each of these techniques, their performance characteristics and different strategies to aggregate their predictions. Our best single model was a convolutional neural network trained to predict emotions from static frames using two large data sets, the Toronto Face Database and our own set of faces images harvested from Google image search, followed by a per frame aggregation strategy that used the challenge training data. This yielded a test set accuracy of 35.58%. Using our best strategy for aggregating our top performing models into a single predictor we were able to produce an accuracy of 41.03% on the challenge test set. These compare favorably to the challenge baseline test set accuracy of 27.56%.

openalex-author · arXiv (Cornell University)

Bounding the Test Log-Likelihood of Generative Models

Several interesting generative learning algorithms involve a complex probability distribution over many random variables, involving intractable normalization constants or latent variable normalization. Some of them may even not have an analytic expression for the unnormalized probability function and no tractable approximation. This makes it difficult to estimate the quality of these models, once they have been trained, or to monitor their quality (e.g. for early stopping) while training. A previously proposed method is based on constructing a non-parametric density estimator of the model's probability function from samples generated by the model. We revisit this idea, propose a more efficient estimator, and prove that it provides a lower bound on the true test log-likelihood, and an unbiased estimator as the number of generated samples goes to infinity, although one that incorporates the effect of poor mixing. We further propose a biased variant of the estimator that can be used reliably with a finite number of samples for the purpose of model comparison.

openalex-author · Biological Cybernetics

Conditioning and time representation in long short-term memory networks

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Learned-norm pooling for deep neural networks.

No abstract available from the OpenAlex source record.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

Audio Chord Recognition With Recurrent Neural Networks.

[TODO] Add abstract here.

openalex-author · Interspeech 2013

Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding

One of the key problems in spoken language understanding (SLU) is the task of slot filling. In light of the recent success of applying deep neural network technologies in domain detection and intent identification, we carried out an in-depth investigation on the use of recurrent neural networks for the more difficult task of slot filling involving sequence discrimination. In this work, we implemented and compared several important recurrent-neural-network architectures, including the Elman-type and Jordan-type recurrent networks and their variants. To make the results easy to reproduce and compare, we implemented these networks on the common Theano neural network toolkit, and evaluated them on the ATIS benchmark. We also compared our results to a conditional random fields (CRF) baseline. Our results show that on this task, both types of recurrent networks outperform the CRF baseline substantially, and a bi-directional Jordantype network that takes into account both past and future dependencies among slots works best, outperforming a CRFbased baseline by 14% in relative error reduction.

openalex-author · arXiv (Cornell University)

Pylearn2: a machine learning research library

Pylearn2 is a machine learning research library. This does not just mean that it is a collection of machine learning algorithms that share a common API; it means that it has been designed for flexibility and extensibility in order to facilitate research projects that involve new or unusual use cases. In this paper we give a brief history of the library, an overview of its basic philosophy, a summary of the library's architecture, and a description of how the Pylearn2 community functions socially.

openalex-author · arXiv (Cornell University)

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Stochastic neurons and hard non-linearities can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic or non-smooth neurons? I.e., can we "back-propagate" through these stochastic neurons? We examine this question, existing approaches, and compare four families of solutions, applicable in different settings. One of them is the minimum variance unbiased gradient estimator for stochatic binary neurons (a special case of the REINFORCE algorithm). A second approach, introduced here, decomposes the operation of a binary stochastic neuron into a stochastic binary part and a smooth differentiable part, which approximates the expected effect of the pure stochatic binary neuron to first order. A third approach involves the injection of additive or multiplicative noise in a computational graph that is otherwise differentiable. A fourth approach heuristically copies the gradient with respect to the stochastic output directly as an estimator of the gradient with respect to the sigmoid argument (we call this the straight-through estimator). To explore a context where these estimators are useful, we consider a small-scale version of {\em conditional computation}, where sparse stochastic units form a distributed representation of gaters that can turn off in combinatorially many ways large chunks of the computation performed in the rest of the neural network. In this case, it is important that the gating units produce an actual 0 most of the time. The resulting sparsity can be potentially be exploited to greatly reduce the computational cost of large deep networks for which conditional computation would be useful.

openalex-author · 2013 IEEE Conference on Computational Inteligence in Games (CIG)

Stacked calibration of off-policy policy evaluation for video game matchmaking

We consider an industrial strength application of recommendation systems for video-game matchmaking in which off-policy policy evaluation is important but where standard approaches can hardly be applied. The objective of the policy is to sequentially form teams of players from those waiting to be matched, in such a way as to produce well-balanced matches. Unfortunately, the available training data comes from a policy that is not known perfectly and that is not stochastic, making it impossible to use methods based on importance weights. Furthermore, we observe that when the estimated reward function and the policy are obtained by training from the same off-policy dataset, the policy evaluation using the estimated reward function is biased. We present a simple calibration procedure that is similar to stacked regression and that removes most of the bias, in the experiments we performed. Data collected during beta tests of Ghost Recon Online, a first person shooter from Ubisoft, were used for the experiments.

openalex-author · Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Modeling term dependencies with quantum language models for IR

Traditional information retrieval (IR) models use bag-of-words as the basic representation and assume that some form of independence holds between terms. Representing term dependencies and defining a scoring function capable of integrating such additional evidence is theoretically and practically challenging. Recently, Quantum Theory (QT) has been proposed as a possible, more general framework for IR. However, only a limited number of investigations have been made and the potential of QT has not been fully explored and tested. We develop a new, generalized Language Modeling approach for IR by adopting the probabilistic framework of QT. In particular, quantum probability could account for both single and compound terms at once without having to extend the term space artificially as in previous studies. This naturally allows us to avoid the weight-normalization problem, which arises in the current practice by mixing scores from matching compound terms and from matching single terms. Our model is the first practical application of quantum probability to show significant improvements over a robust bag-of-words baseline and achieves better performance on a stronger non bag-of-words baseline.

openalex-author · IEEE Transactions on Pattern Analysis and Machine Intelligence

Scaling Up Spike-and-Slab Models for Unsupervised Feature Learning

We describe the use of two spike-and-slab models for modeling real-valued data, with an emphasis on their applications to object recognition. The first model, which we call spike-and-slab sparse coding (S3C), is a preexisting model for which we introduce a faster approximate inference algorithm. We introduce a deep variant of S3C, which we call the partially directed deep Boltzmann machine (PD-DBM) and extend our S3C inference algorithm for use on this model. We describe learning procedures for each. We demonstrate that our inference procedure for S3C enables scaling the model to unprecedented large problem sizes, and demonstrate that using S3C as a feature extractor results in very good object recognition performance, particularly when the number of labeled examples is low. We show that the PD-DBM generates better samples than its shallow counterpart, and that unlike DBMs or DBNs, the PD-DBM may be trained successfully without greedy layerwise training.

openalex-author · http://www.idiap.ch/~bengio/publications/bib2web/psgz/bengio_1997_oban.ps.gz

On the Optimization of a Synaptic Learning Rule

This paper presents a new approach to neural modeling based on the idea of using an automated method to optimize the parameters of a synaptic learning rule. The synaptic modification rule is considered as a parametric function. This function has local inputs and is the same in many neurons. We can use standard optimization methods to select appropriate parameters for a given type of task. We also present a theoretical analysis permitting to study the generalization property of such parametric learning rules. By generalization, we mean the possibility for the learning rule to learn to solve new tasks. Experiments were performed on three types of problems: a biologically inspired circuit (for conditioning in Aplysia), Boolean functions (linearly separable as well as non linearly separable) and classification tasks. The neural network architecture as well as the form and initial parameter values of the synaptic learning function can be designed using a priori knowledge.

openalex-author · arXiv (Cornell University)

Deep Generative Stochastic Networks Trainable by Backprop

We introduce a novel training principle for probabilistic models that is an alternative to maximum likelihood. The proposed Generative Stochastic Networks (GSN) framework is based on learning the transition operator of a Markov chain whose stationary distribution estimates the data distribution. The transition distribution of the Markov chain is conditional on the previous state, generally involving a small move, so this conditional distribution has fewer dominant modes, being unimodal in the limit of small moves. Thus, it is easier to learn because it is easier to approximate its partition function, more like learning to perform supervised function approximation, with gradients that can be obtained by backprop. We provide theorems that generalize recent work on the probabilistic interpretation of denoising autoencoders and obtain along the way an interesting justification for dependency networks and generalized pseudolikelihood, along with a definition of an appropriate joint distribution and sampling mechanism even when the conditionals are not consistent. GSNs can be used with missing inputs and can be used to sample subsets of variables given the rest. We validate these theoretical results with experiments on two image datasets using an architecture that mimics the Deep Boltzmann Machine Gibbs sampler but allows training to proceed with simple backprop, without the need for layerwise pretraining.

openalex-author · IEEE Transactions on Pattern Analysis and Machine Intelligence

Representation Learning: A Review and New Perspectives

The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

openalex-author · Machine Learning

A semantic matching energy function for learning with multi-relational data

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Generalized Denoising Auto-Encoders as Generative Models

Recent work has shown how denoising and contractive autoencoders implicitly capture the structure of the data-generating density, in the case where the corruption noise is Gaussian, the reconstruction error is the squared error, and the data is continuous-valued. This has led to various proposals for sampling from this implicitly learned density function, using Langevin and Metropolis-Hastings MCMC. However, it remained unclear how to connect the training procedure of regularized auto-encoders to the implicit estimation of the underlying data-generating distribution when the data are discrete, or using other forms of corruption process and reconstruction errors. Another issue is the mathematical justification which is only valid in the limit of small corruption noise. We propose here a different attack on the problem, which deals with all these issues: arbitrary (but noisy enough) corruption, arbitrary reconstruction loss (seen as a log-likelihood), handling both discrete and continuous-valued variables, and removing the bias due to non-infinitesimal corruption noise (or non-infinitesimal contractive penalty).

openalex-author · arXiv (Cornell University)

Estimating or Propagating Gradients Through Stochastic Neurons

Stochastic neurons can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic neurons, i.e., can we "back-propagate" through these stochastic neurons? We examine this question, existing approaches, and present two novel families of solutions, applicable in different settings. In particular, it is demonstrated that a simple biologically plausible formula gives rise to an an unbiased (but noisy) estimator of the gradient with respect to a binary stochastic neuron firing probability. Unlike other estimators which view the noise as a small perturbation in order to estimate gradients by finite differences, this estimator is unbiased even without assuming that the stochastic perturbation is small. This estimator is also interesting because it can be applied in very general settings which do not allow gradient back-propagation, including the estimation of the gradient with respect to future rewards, as required in reinforcement learning setups. We also propose an approach to approximating this unbiased but high-variance estimator by learning to predict it using a biased estimator. The second approach we propose assumes that an estimator of the gradient can be back-propagated and it provides an unbiased estimator of the gradient, but can only work with non-linearities unlike the hard threshold, but like the rectifier, that are not flat for all of their range. This is similar to traditional sigmoidal units but has the advantage that for many inputs, a hard decision (e.g., a 0 output) can be produced, which would be convenient for conditional computation and achieving sparse representations and sparse gradients.

openalex-author · 2013 IEEE International Conference on Acoustics, Speech and Signal Processing

High-dimensional sequence transduction

We investigate the problem of transforming an input sequence into a high-dimensional output sequence in order to transcribe polyphonic audio music into symbolic notation. We introduce a probabilistic model based on a recurrent neural network that is able to learn realistic output distributions given the input and we devise an efficient algorithm to search for the global mode of that distribution. The resulting method produces musically plausible transcriptions even under high levels of noise and drastically outperforms previous state-of- the-art approaches on five datasets of synthesized sounds and real recordings, approximately halving the test error rate.

openalex-author · 2013 IEEE International Conference on Acoustics, Speech and Signal Processing

Advances in optimizing recurrent networks

After a more than decade-long period of relatively little research activity in the area of recurrent neural networks, several new developments will be reviewed here that have allowed substantial progress both in understanding and in technical solutions towards more efficient training of recurrent networks. These advances have been motivated by and related to the optimization issues surrounding deep learning. Although recurrent networks are extremely powerful in what they can in principle represent in terms of modeling sequences, their training is plagued by two aspects of the same issue regarding the learning of long-term dependencies. Experiments reported here evaluate the use of clipping gradients, spanning longer time ranges with leaky integration, advanced momentum techniques, using more powerful output probability models, and encouraging sparser gradients to help symmetry breaking and credit assignment. The experiments are performed on text and music data and show off the combined effects of these techniques in generally improving both training and test error.

openalex-author · http://jmlr.org/proceedings/papers/v31/luo13a.pdf

Texture Modeling with Convolutional Spike-and-Slab RBMs and Deep Extensions

We apply the spike-and-slab Restricted Boltzmann Machine (ssRBM) to texture modeling. The ssRBM with tiled-convolution weight sharing (TssRBM) achieves or sur-passes the state-of-the-art on texture synthe-sis and inpainting by parametric models. We also develop a novel RBM model with a spike-and-slab visible layer and binary variables in the hidden layer. This model is designed to be stacked on top of the ssRBM. We show the resulting deep belief network (DBN) is a powerful generative model that improves on single-layer models and is capable of model-ing not only single high-resolution and chal-lenging textures but also multiple textures with fixed-size filters in the bottom layer. 1

openalex-author · Machine Learning

Learning semantic representations of objects and their parts

No abstract available from the OpenAlex source record.

openalex-author · IEEE Computational Intelligence Magazine

Learning deep physiological models of affect

Feature extraction and feature selection are crucial phases in the process of affective modeling. Both, however, incorporate substantial limitations that hinder the development of reliable and accurate models of affect. For the purpose of modeling affect manifested through physiology, this paper builds on recent advances in machine learning with deep learning (DL) approaches. The efficiency of DL algorithms that train artificial neural network models is tested and compared against standard feature extraction and selection approaches followed in the literature. Results on a game data corpus - containing players' physiological signals (i.e., skin conductance and blood volume pulse) and subjective self-reports of affect - reveal that DL outperforms manual ad-hoc feature extraction as it yields significantly more accurate affective models. Moreover, it appears that DL meets and even outperforms affective models that are boosted by automatic feature selection, for several of the scenarios examined. As the DL method is generic and applicable to any affective modeling task, the key findings of the paper suggest that ad-hoc feature extraction and selection - to a lesser degree - could be bypassed.

openalex-author · arXiv (Cornell University)

Knowledge Matters: Importance of Prior Information for Optimization

We explore the effect of introducing prior information into the intermediate level of neural networks for a learning task on which all the state-of-the-art machine learning algorithms tested failed to learn. We motivate our work from the hypothesis that humans learn such intermediate concepts from other individuals via a form of supervision or guidance using a curriculum. The experiments we have conducted provide positive evidence in favor of this hypothesis. In our experiments, a two-tiered MLP architecture is trained on a dataset with 64x64 binary inputs images, each image with three sprites. The final task is to decide whether all the sprites are the same or one of them is different. Sprites are pentomino tetris shapes and they are placed in an image with different locations using scaling and rotation transformations. The first part of the two-tiered MLP is pre-trained with intermediate-level targets being the presence of sprites at each location, while the second part takes the output of the first part as input and predicts the final task's target binary event. The two-tiered MLP architecture, with a few tens of thousand examples, was able to learn the task perfectly, whereas all other algorithms (include unsupervised pre-training, but also traditional algorithms like SVMs, decision trees and boosting) all perform no better than chance. We hypothesize that the optimization difficulty involved when the intermediate pre-training is not performed is due to the {\em composition} of two highly non-linear tasks. Our findings are also consistent with hypotheses on cultural learning inspired by the observations of optimization problems with deep learning, presumably because of effective local minima.

openalex-author · arXiv (Cornell University)

Joint Training Deep Boltzmann Machines for Classification

We introduce a new method for training deep Boltzmann machines jointly. Prior methods of training DBMs require an initial learning pass that trains the model greedily, one layer at a time, or do not perform well on classification tasks. In our approach, we train all layers of the DBM simultaneously, using a novel training procedure called multi-prediction training. The resulting model can either be interpreted as a single generative model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent networks that share parameters and may be approximately averaged together using a novel technique we call the multi-inference trick. We show that our approach performs competitively for classification and outperforms previous methods in terms of accuracy of approximate inference and classification with missing inputs.

openalex-author · arXiv (Cornell University)

Natural Gradient Revisited

We evaluate natural gradient, an algorithm originally proposed in Amari (1997), for learning deep models. The contributions of this paper are as follows. We show the connection between natural gradient and three other recently proposed methods for training deep models: Hessian-Free (Martens, 2010), Krylov Subspace Descent (Vinyals and Povey, 2012) and TONGA (Le Roux et al., 2008). We describe how one can use unlabeled data to improve the generalization error obtained by natural gradient and empirically evaluate the robustness of the algorithm to the ordering of the training set compared to stochastic gradient descent. Finally we extend natural gradient to incorporate second order information alongside the manifold information and provide a benchmark of the new algorithm using a truncated Newton approach for inverting the metric matrix instead of using a diagonal approximation of it.

openalex-author · arXiv (Cornell University)

Big Neural Networks Waste Capacity

This article exposes the failure of some big neural networks to leverage added capacity to reduce underfitting. Past research suggest diminishing returns when increasing the size of neural networks. Our experiments on ImageNet LSVRC-2010 show that this may be due to the fact there are highly diminishing returns for capacity in terms of training error, leading to underfitting. This suggests that the optimization method - first order gradient descent - fails at this regime. Directly attacking this problem, either through the optimization method or the choices of parametrization, may allow to improve the generalization error on large datasets, for which a large capacity is required.

openalex-author · arXiv (Cornell University)

Metric-Free Natural Gradient for Joint-Training of Boltzmann Machines

This paper introduces the Metric-Free Natural Gradient (MFNG) algorithm for training Boltzmann Machines. Similar in spirit to the Hessian-Free method of Martens [8], our algorithm belongs to the family of truncated Newton methods and exploits an efficient matrix-vector product to avoid explicitely storing the natural gradient metric $L$. This metric is shown to be the expected second derivative of the log-partition function (under the model distribution), or equivalently, the variance of the vector of partial derivatives of the energy function. We evaluate our method on the task of joint-training a 3-layer Deep Boltzmann Machine and show that MFNG does indeed have faster per-epoch convergence compared to Stochastic Maximum Likelihood with centering, though wall-clock performance is currently not competitive.

openalex-author · Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods

Unsupervised and Transfer Learning under Uncertainty - From Object Detections to Scene Categorization

No abstract available from the OpenAlex source record.

openalex-author · Lecture Notes in Computer Science

Deep Learning of Representations: Looking Forward

No abstract available from the OpenAlex source record.

openalex-author · Intelligent Systems Reference Library

Deep Learning of Representations

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Joint Training of Deep Boltzmann Machines

We introduce a new method for training deep Boltzmann machines jointly. Prior methods require an initial learning pass that trains the deep Boltzmann machine greedily, one layer at a time, or do not perform well on classifi- cation tasks.

openalex-author · arXiv (Cornell University)

Theano: new features and speed improvements

Theano is a linear algebra compiler that optimizes a user's symbolically-specified mathematical computations to produce efficient low-level implementations. In this paper, we present new features and efficiency improvements to Theano, and benchmarks demonstrating Theano's performance relative to Torch7, a recently introduced machine learning library, and to RNNLM, a C++ library targeted at recurrent neural networks.

openalex-author · arXiv (Cornell University)

Understanding the exploding gradient problem

Training Recurrent Neural Networks is more troublesome than feedforward ones because of the vanishing and exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to understand the fundamental issues underlying the exploding gradient problem by exploring it from an analytical, a geometric and a dynamical system perspective. Our analysis is used to justify the simple yet effective solution of norm clipping the exploded gradient. In the experimental section, the comparison between this heuristic solution and standard SGD provides empirical evidence towards our hypothesis as well as it shows that such a heuristic is required to reach state of the art results on a character prediction task and a polyphonic music prediction one.

openalex-author · arXiv (Cornell University)

Disentangling Factors of Variation via Generative Entangling

Here we propose a novel model family with the objective of learning to disentangle the factors of variation in data. Our approach is based on the spike-and-slab restricted Boltzmann machine which we generalize to include higher-order interactions among multiple latent variables. Seen from a generative perspective, the multiplicative interactions emulates the entangling of factors of variation. Inference in the model can be seen as disentangling these generative factors. Unlike previous attempts at disentangling latent factors, the proposed model is trained using no supervised information regarding the latent factors. We apply our model to the task of facial expression classification.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

Building Musically-Relevant Audio Features Through Multiple Timescale Representations.

[TODO] Add abstract here.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

Discriminative Non-Negative Matrix Factorization For Multiple Pitch Estimation.

[TODO] Add abstract here.

openalex-author · arXiv (Cornell University)

Efficient EM Training of Gaussian Mixtures with Missing Data

In data-mining applications, we are frequently faced with a large fraction of missing entries in the data matrix, which is problematic for most discriminant machine learning algorithms. A solution that we explore in this paper is the use of a generative model (a mixture of Gaussians) to compute the conditional expectation of the missing variables given the observed variables. Since training a Gaussian mixture with many different patterns of missing values can be computationally very expensive, we introduce a spanning-tree based algorithm that significantly speeds up training in these conditions. We also observe that good results can be obtained by using the generative model to fill-in the missing values for a separate discriminant learning algorithm.

openalex-author · arXiv (Cornell University)

Better Mixing via Deep Representations

It has previously been hypothesized, and supported with some experimental evidence, that deeper representations, when well trained, tend to do a better job at disentangling the underlying factors of variation. We study the following related conjecture: better representations, in the sense of better disentangling, can be exploited to produce faster-mixing Markov chains. Consequently, mixing would be more efficient at higher levels of representation. To better understand why and how this is happening, we propose a secondary conjecture: the higher-level samples fill more uniformly the space they occupy and the high-density manifolds tend to unfold when represented at higher levels. The paper discusses these hypotheses and tests them experimentally through visualization and measurements of mixing and interpolating between samples.

openalex-author · arXiv (Cornell University)

Implicit Density Estimation by Local Moment Matching to Sample from Auto-Encoders

Recent work suggests that some auto-encoder variants do a good job of capturing the local manifold structure of the unknown data generating density. This paper contributes to the mathematical understanding of this phenomenon and helps define better justified sampling algorithms for deep learning based on auto-encoder variants. We consider an MCMC where each step samples from a Gaussian whose mean and covariance matrix depend on the previous state, defines through its asymptotic distribution a target density. First, we show that good choices (in the sense of consistency) for these mean and covariance functions are the local expected value and local covariance under that target density. Then we show that an auto-encoder with a contractive penalty captures estimators of these local moments in its reconstruction function and its Jacobian. A contribution of this work is thus a novel alternative to maximum-likelihood density estimation, which we call local moment matching. It also justifies a recently proposed sampling algorithm for the Contractive Auto-Encoder and extends it to the Denoising Auto-Encoder.

openalex-author · arXiv (Cornell University)

Large-Scale Feature Learning With Spike-and-Slab Sparse Coding

We consider the problem of object recognition with a large number of classes. In order to overcome the low amount of labeled examples available in this setting, we introduce a new feature learning and extraction procedure based on a factor model we call spike-and-slab sparse coding (S3C). Prior work on S3C has not prioritized the ability to exploit parallel architectures and scale S3C to the enormous problem sizes needed for object recognition. We present a novel inference procedure for appropriate for use with GPUs which allows us to dramatically increase both the training set size and the amount of latent factors that S3C may be trained with. We demonstrate that this approach improves upon the supervised learning capabilities of both sparse coding and the spike-and-slab Restricted Boltzmann Machine (ssRBM) on the CIFAR-10 dataset. We use the CIFAR-100 dataset to demonstrate that our method scales to large numbers of classes better than previous methods. Finally, we use our method to win the NIPS 2011 Workshop on Challenges In Learning Hierarchical Models? Transfer Learning Challenge.

openalex-author · arXiv (Cornell University)

Modeling Temporal Dependencies in High-Dimensional Sequences:\n Application to Polyphonic Music Generation and Transcription

We investigate the problem of modeling symbolic sequences of polyphonic music\nin a completely general piano-roll representation. We introduce a probabilistic\nmodel based on distribution estimators conditioned on a recurrent neural\nnetwork that is able to discover temporal dependencies in high-dimensional\nsequences. Our approach outperforms many traditional models of polyphonic music\non a variety of realistic datasets. We show how our musical language model can\nserve as a symbolic prior to improve the accuracy of polyphonic transcription.\n

openalex-author · arXiv (Cornell University)

A Generative Process for Sampling Contractive Auto-Encoders

The contractive auto-encoder learns a representation of the input data that captures the local manifold structure around each data point, through the leading singular vectors of the Jacobian of the transformation from input to representation. The corresponding singular values specify how much local variation is plausible in directions associated with the corresponding singular vectors, while remaining in a high-density region of the input space. This paper proposes a procedure for generating samples that are consistent with the local structure captured by a contractive auto-encoder. The associated stochastic process defines a distribution from which one can sample, and which experimentally appears to converge quickly and mix well between modes, compared to Restricted Boltzmann Machines and Deep Belief Networks. The intuitions behind this procedure can also be used to train the second layer of contraction that pools lower-level features and learns to be invariant to the local directions of variation discovered in the first layer. We show that this can help learn and represent invariances present in the data and improve classification error.

openalex-author · Paper

Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives

The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.

openalex-author · Computational Intelligence

DETONATION CLASSIFICATION FROM ACOUSTIC SIGNATURE WITH THE RESTRICTED BOLTZMANN MACHINE: Detonation Classification from Acoustic Signature with the Restricted Boltzmann Machine

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

On Training Deep Boltzmann Machines

The deep Boltzmann machine (DBM) has been an important development in the quest for powerful "deep" probabilistic models. To date, simultaneous or joint training of all layers of the DBM has been largely unsuccessful with existing training methods. We introduce a simple regularization scheme that encourages the weight vectors associated with each hidden unit to have similar norms. We demonstrate that this regularization can be easily combined with standard stochastic maximum likelihood to yield an effective training strategy for the simultaneous training of all layers of the deep Boltzmann machine.

openalex-author · arXiv (Cornell University)

Evolving Culture vs Local Minima

We propose a theory that relates difficulty of learning in deep architectures to culture and language. It is articulated around the following hypotheses: (1) learning in an individual human brain is hampered by the presence of effective local minima; (2) this optimization difficulty is particularly important when it comes to learning higher-level abstractions, i.e., concepts that cover a vast and highly-nonlinear span of sensory configurations; (3) such high-level abstractions are best represented in brains by the composition of many levels of representation, i.e., by deep architectures; (4) a human brain can learn such high-level abstractions if guided by the signals produced by other humans, which act as hints or indirect supervision for these high-level abstractions; and (5), language and the recombination and optimization of mental concepts provide an efficient evolutionary recombination operator, and this gives rise to rapid search in the space of communicable ideas that help humans build up better high-level internal representations of their world. These hypotheses put together imply that human culture and the evolution of ideas have been crucial to counter an optimization difficulty: this optimization difficulty would otherwise make it very difficult for human brains to capture high-level knowledge of the world. The theory is grounded in experimental observations of the difficulties of training deep artificial neural networks. Plausible consequences of this theory for the efficiency of cultural evolutions are sketched.

openalex-author · http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf

Random search for hyper-parameter optimization

Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes

openalex-author · IEEE Transactions on Computational Intelligence and AI in Games

Beyond Skill Rating: Advanced Matchmaking in Ghost Recon Online

Player satisfaction is particularly difficult to ensure in online games, due to interactions with other players. In adversarial multiplayer games, matchmaking typically consists in trying to match together players of similar skill level. However, this is usually based on a single-skill value, and assumes the only factor of “fun” is the game balance. We present a more advanced matchmaking strategy developed for Ghost Recon Online, an upcoming team-focused first-person shooter (FPS) from Ubisoft (Montreal, QC, Canada). We first show how incorporating more information about players than their raw skill can lead to more balanced matches. We also argue that balance is not the only factor that matters, and present a strategy to explicitly maximize the players' fun, taking advantage of a rich player profile that includes information about player behavior and personal preferences. Ultimately, our goal is to ask players to provide direct feedback on match quality through an in-game survey. However, because such data were not available for this study, we rely here on heuristics tailored to this specific game. Experiments on data collected during Ghost Recon Online's beta tests show that neural networks can effectively be used to predict both balance and player enjoyment.

openalex-author · SpringerReference

Quantitative Structure Activity Relationship

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Spike-and-Slab Sparse Coding for Unsupervised Feature Discovery

We consider the problem of using a factor model we call {\em spike-and-slab sparse coding} (S3C) to learn features for a classification task. The S3C model resembles both the spike-and-slab RBM and sparse coding. Since exact inference in this model is intractable, we derive a structured variational inference procedure and employ a variational EM training algorithm. Prior work on approximate inference for this model has not prioritized the ability to exploit parallel architectures and scale to enormous problem sizes. We present an inference procedure appropriate for use with GPUs which allows us to dramatically increase both the training set size and the amount of latent factors. We demonstrate that this approach improves upon the supervised learning capabilities of both sparse coding and the ssRBM on the CIFAR-10 dataset. We evaluate our approach's potential for semi-supervised learning on subsets of CIFAR-10. We demonstrate state-of-the art self-taught learning performance on the STL-10 dataset and use our method to win the NIPS 2011 Workshop on Challenges In Learning Hierarchical Models' Transfer Learning Challenge.

openalex-author · http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf

Deep Sparse Rectifier Neural Networks

While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multi-layer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros, which seem remarkably suitable for naturally sparse data. Even though they can take advantage of semi-supervised setups with extra-unlabeled data, deep rectifier networks can reach their best performance without requiring any unsupervised pre-training on purely supervised tasks with large labeled datasets. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty in training deep but purely supervised neural networks, and closing the performance gap between neural networks learnt with and without unsupervised pre-training. 1

openalex-author · arXiv (Cornell University)

Regularized Auto-Encoders Estimate Local Statistics

What do auto-encoders learn about the underlying data generating distribution? Recent work suggests that some auto-encoder variants do a good job of capturing the local manifold structure of data. This paper clarifies some of these previous observations by showing that minimizing a particular form of regularized reconstruction error yields a reconstruction function that locally characterizes the shape of the data generating density. We show that the auto-encoder captures the score (derivative of the log-density with respect to the input). It contradicts previous interpretations of reconstruction error as an energy function. Unlike previous results, the theorems provided here are completely generic and do not depend on the parametrization of the auto-encoder: they show what the auto-encoder would tend to if given enough capacity and examples. These results are for a contractive training criterion we show to be similar to the denoising auto-encoder training criterion with small corruption noise, but with contraction applied on the whole reconstruction function rather than just encoder. Similarly to score matching, one can consider the proposed training criterion as a convenient alternative to maximum likelihood because it does not involve a partition function. Finally, we show how an approximate Metropolis-Hastings MCMC can be setup to recover samples from the estimated distribution, and this is confirmed in sampling experiments.

openalex-author · Paper

Deep Learning for NLP (without Magic) References

No abstract available from the OpenAlex source record.

openalex-author · International Conference on Machine Learning

A Generative Process for Contractive Auto-Encoders.

No abstract available from the OpenAlex source record.

openalex-author · Lecture Notes in Computer Science

Disentangling Factors of Variation for Facial Expression Recognition

No abstract available from the OpenAlex source record.

openalex-author · Paper

Apprentissage machine efficace: theorie et pratique

Despite constant progress in terms of available computational power, memory and amount of data, machine learning algorithms need to be efficient in how they use them. Although minimizing cost is an obvious major concern, another motivation is to attempt to design algorithms that can learn as efficiently as intelligent species. This thesis tackles the problem of efficient learning through various papers dealing with a wide range of machine learning algorithms: this topic is seen both from the point of view of computational efficiency (processing power and memory required by the algorithms) and of statistical efficiency (number of samples necessary to solve a given learning task). The first contribution of this thesis is in shedding light on various statistical inefficiencies in existing algorithms. Indeed, we show that decision trees do not generalize well on tasks with some particular properties (chapter 3), and that a similar flaw affects typical graph-based semi-supervised learning algorithms (chapter 5). This flaw is a form of curse of dimensionality that is specific to each of these algorithms. For a subclass of neural networks, called sum-product networks, we prove that using networks with a single hidden layer can be exponentially less efficient than when using deep networks (chapter 4). Our analyses help better understand some inherent flaws found in these algorithms, and steer research towards approaches that may potentially overcome them. We also exhibit computational inefficiencies in popular graph-based semi-supervised learning algorithms (chapter 5) as well as in the learning of mixtures of Gaussians with missing data (chapter 6). In both cases we propose new algorithms that make it possible to scale to much larger datasets. The last two chapters also deal with computational efficiency, but in different ways. Chapter 7 presents a new view on the contrastive divergence algorithm (which has been used for efficient training of restricted Boltzmann machines). It provides additional insight on the reasons why this algorithm has been so successful. Finally, in chapter 8 we describe an application of machine learning to video games, where computational efficiency is tied to software and hardware engineering constraints which, although often ignored in research papers, are ubiquitous in practice. Keywords: computational efficiency, statistical efficiency, curse of dimensionality, decision trees, neural networks, graph-based semi-supervised learning, contrastive divergence, mixtures of Gaussians, matchmaking.

openalex-author · http://jmlr.org/papers/volume13/larochelle12a/larochelle12a.pdf

Learning algorithms for the classification restricted Boltzmann machine

Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification performances can be reached when appropriately combining discriminative and generative training objectives. Since training according to the generative objective requires the computation of a generally intractable gradient, we also compare different approaches to estimating this gradient and address the issue of obtaining such a gradient for problems with very high dimensional inputs. Finally, we describe how to adapt the ClassRBM to two special cases of classification problems, namely semi-supervised and multitask learning.

openalex-author · http://biglearn.org/2011/files/papers/biglearn2011_submission_18.pdf

Theano: Deep Learning on GPUs with Python

In this paper, we present Theano1, a framework in the Python programming language for defining, optimizing and evaluating expressions involving high-level operations on ten-sors. Theano offers most of NumPy’s functionality, but adds automatic symbolic differen-tiation, GPU support, and faster expression evaluation. Theano is a general mathematical tool, but it was developed with the goal of facilitating research in deep learning. The Deep Learning Tutorials2 introduce recent advances in deep learning, and showcase how Theano makes such algorithms compact, elegant, and fast.

openalex-author · http://www.thespermwhale.com/jaseweston/papers/otsm_aistats11.pdf

Joint learning of words and meaning representations for open-text semantic parsing

Open-text (or open-domain) semantic parsers are designed to interpret any statement in natural language by inferring a corresponding meaning representation (MR). Unfortunately, large scale systems cannot be easily machine-learned due to lack of directly supervised data. We propose here a method that learns to assign MRs to a wide range of text (using a dictionary of more than 70,000 words, which are mapped to more than 40,000 entities) thanks to a training scheme that combines learning from knowledge bases (WordNet and ConceptNet) with learning from raw text. The model jointly learns representations of words, entities and MRs via a multi-task training process operating on these diverse sources of data. Hence, the system ends up providing methods for knowledge acquisition and word-sense disambiguation within the context of semantic parsing in a single elegant framework. Experiments on these various tasks indicate the promise of the approach. 1

openalex-author · Lecture Notes in Computer Science

Practical Recommendations for Gradient-Based Training of Deep Architectures

No abstract available from the OpenAlex source record.

openalex-author · http://papers.nips.cc/paper/4350-shallow-vs-deep-sum-product-networks.pdf

Shallow vs. Deep Sum-Product Networks

We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1 Introduction and prior work Many learning algorithms are based on searching a family of functions so as to identify one member of said family which minimizes a training criterion. The choice of this family of functions and how members of that family are parameterized can be a crucial one. Although there is no universally optimal choice of parameterization or family of functions (or “architecture”), as demonstrated by

openalex-author · http://books.nips.cc/papers/files/nips24/NIPS2011_1350.pdf

On Tracking The Partition Function

Markov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function Z. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change AZ incurred by each gradient update, (2) estimating the difference in Z over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate Z using no more temperatures than are required for learning. Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. In contrast to AIS, our method provides this estimate at each time-step, at a computational cost similar to that required for training alone.

openalex-author · http://www.iro.umontreal.ca/%7Evincentp/Publications/MTC_nips2011.pdf

The Manifold Tangent Classifier

We combine three important ideas present in previous work for building classifiers: the semi-supervised hypothesis (the input distribution contains information about the classifier), the unsupervised manifold hypothesis (data density concentrates near low-dimensional manifolds), and the manifold hypothesis for classification (different classes correspond to disjoint manifolds separated by low density). We exploit a novel algorithm for capturing manifold structure (high-order contractive auto-encoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local directions changes along the manifold. Record-breaking classification results are obtained. 1

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

Temporal Pooling And Multiscale Learning For Automatic Annotation And Ranking Of Music Audio.

[TODO] Add abstract here.

openalex-author · ACM Transactions on Multimedia Computing, Communications, and Applications

Contextual tag inference

This article examines the use of two kinds of context to improve the results of content-based music taggers: the relationships between tags and between the clips of songs that are tagged. We show that users agree more on tags applied to clips temporally “closer” to one another; that conditional restricted Boltzmann machine models of tags can more accurately predict related tags when they take context into account; and that when training data is “smoothed” using context, support vector machines can better rank these clips according to the original, unsmoothed tags and do this more accurately than three standard multi-label classifiers.

openalex-author · arXiv (Cornell University)

The Statistical Inefficiency of Sparse Coding for Images (or, One Gabor to Rule them All)

Sparse coding is a proven principle for learning compact representations of images. However, sparse coding by itself often leads to very redundant dictionaries. With images, this often takes the form of similar edge detectors which are replicated many times at various positions, scales and orientations. An immediate consequence of this observation is that the estimation of the dictionary components is not statistically efficient. We propose a factored model in which factors of variation (e.g. position, scale and orientation) are untangled from the underlying Gabor-like filters. There is so much redundancy in sparse codes for natural images that our model requires only a single dictionary element (a Gabor-like edge detector) to outperform standard sparse coding. Our model scales naturally to arbitrary-sized images while achieving much greater statistical efficiency during learning. We validate this claim with a number of experiments showing, in part, superior compression of out-of-sample data using a sparse coding dictionary learned with only a single image.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Learning Structured Embeddings of Knowledge Bases

Many Knowledge Bases (KBs) are now readily available and encompass colossal quantities of information thanks to either a long-term funding effort (e.g. WordNet, OpenCyc) or a collaborative process (e.g. Freebase, DBpedia). However, each of them is based on a different rigorous symbolic framework which makes it hard to use their data in other systems. It is unfortunate because such rich structured knowledge might lead to a huge leap forward in many other areas of AI like nat- ural language processing (word-sense disambiguation, natural language understanding, ...), vision (scene classification, image semantic annotation, ...) or collaborative filtering. In this paper, we present a learning process based on an innovative neural network architecture designed to embed any of these symbolic representations into a more flexible continuous vector space in which the original knowledge is kept and enhanced. These learnt embeddings would allow data from any KB to be easily used in recent machine learning meth- ods for prediction and information retrieval. We illustrate our method on WordNet and Freebase and also present a way to adapt it to knowledge extraction from raw text.

openalex-author · arXiv (Cornell University)

Towards Open-Text Semantic Parsing via Multi-Task Learning of Structured Embeddings

Open-text (or open-domain) semantic parsers are designed to interpret any statement in natural language by inferring a corresponding meaning representation (MR). Unfortunately, large scale systems cannot be easily machine-learned due to lack of directly supervised data. We propose here a method that learns to assign MRs to a wide range of text (using a dictionary of more than 70,000 words, which are mapped to more than 40,000 entities) thanks to a training scheme that combines learning from WordNet and ConceptNet with learning from raw text. The model learns structured embeddings of words, entities and MRs via a multi-task training process operating on these diverse sources of data that integrates all the learnt knowledge into a single system. This work ends up combining methods for knowledge acquisition, semantic parsing, and word-sense disambiguation. Experiments on various tasks indicate that our approach is indeed successful and can form a basis for future more sophisticated systems.

openalex-author · http://jmlr.csail.mit.edu/proceedings/papers/v27/bengio12a/bengio12a.pdf

Deep Learning of Representations for Unsupervised and Transfer Learning.

Deep learning algorithms seek to exploit the unknown structure in the input distribution in order to discover good representations, often at multiple levels, with higher-level learned features defined in terms of lower-level features. The objective is to make these higher-level representations more abstract, with their individual features more invariant to most of the variations that are typically present in the training distribution, while collectively preserving as much as possible of the information in the input. Ideally, we would like these representations to disentangle the unknown factors of variation that underlie the training distribution. Such unsupervised learning of representations can be exploited usefully under the hypothesis that the input distribution P (x) is structurally related to some task of interest, say predicting P (y|x). This paper focusses on why unsupervised pre-training of representations can be useful, and how it can be exploited in the transfer learning scenario, where we care about predictions on examples that are not from the same distribution as the training distribution.

openalex-author · International Conference on Machine Learning

Unsupervised and Transfer Learning Challenge: a Deep Learning Approach

Learning good representations from a large set of unlabeled data is a particularly challenging task. Recent work (see Bengio (2009) for a review) shows that training deep architectures is a good way to extract such representations, by extracting and disentangling gradually higher-level factors of variation characterizing the input distribution. In this paper, we describe different kinds of layers we trained for learning representations in the setting of the Unsupervised and Transfer Learning Challenge. The strategy of our team won the final phase of the challenge. It combined and stacked different one-layer unsupervised learning algorithms, adapted to each of the five datasets of the competition. This paper describes that strategy and the particular one-layer learning algorithms feeding a simple linear classifier with a tiny number of labeled training samples (1 to 64 per class).

openalex-author · http://www.iro.umontreal.ca/~vincentp/Publications/contractive_autoencoder_icml2011.pdf

Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

We present in this paper a novel approach for training deterministic auto-encoders. We show that by adding a well chosen penalty term to the classical reconstruction cost function, we can achieve results that equal or surpass those attained by other regularized autoencoders as well as denoising auto-encoders on a range of datasets. This penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. We show that this penalty term results in a localized space contraction which in turn yields robust features on the activation layer. Furthermore, we show how this penalty term is related to both regularized auto-encoders and denoising auto-encoders and how it can be seen as a link between deterministic and non-deterministic auto-encoders. We find empirically that this penalty helps to carve a representation that better captures the local directions of variation dictated by the data, corresponding to a lower-dimensional non-linear manifold, while being more invariant to the vast majority of directions orthogonal to the manifold. Finally, we show that by using the learned features to initialize a MLP, we achieve state of the art classification error on a range of datasets, surpassing other methods of pretraining.

openalex-author · International Conference on Machine Learning

Unsupervised Models of Images by Spike-and-Slab RBMs

The spike-and-slab Restricted Boltzmann Machine (RBM) is defined by having both a real valued slab variable and a binary variable associated with each unit in the hidden layer. In this paper we generalize and extend the spike-and-slab RBM to include non-zero means of the conditional distribution over the observed variables given the binary spike variables. We also introduce a term, quadratic in the observed data that we exploit to guarantee that all conditionals associated with the model are well defined – a guarantee that was absent in the original spike-and-slab RBM. The inclusion of these generalizations improves the performance of the spike-and-slab RBM as a feature learner and achieves competitive performance on the CIFAR-10 image classification task. The spike-and-slab model, when trained in a convolutional configuration, can generate sensible samples that demonstrate that the model has captured the broad statistical structure of natural images.

openalex-author · International Conference on Machine Learning

Large-Scale Learning of Embeddings with Reconstruction Sampling

In this paper, we present a novel method to speed up the learning of embeddings for large-scale learning tasks involving very sparse data, as is typically the case for Natural Language Processing tasks. Our speed-up method has been developed in the context of Denoising Auto-encoders, which are trained in a purely unsupervised way to capture the input distribution, and learn embeddings for words and text similar to earlier neural language models. The main contribution is a new method to approximate reconstruction error by a sampling procedure. We show how this approximation can be made to obtain an unbiased estimator of the training criterion, and we show how it can be leveraged to make learning much more computationally efficient. We demonstrate the effectiveness of this method on the Amazon and RCV1 NLP datasets. Instead of reducing vocabulary size to make learning practical, our method allows us to train using very large vocabularies. In particular, reconstruction sampling requires 22x less training time on the full Amazon dataset.

openalex-author · http://jmlr.csail.mit.edu/proceedings/papers/v15/bengio11b/bengio11b.pdf

Deep Learners Benefit More from Out-of-Distribution Examples

Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because they can be shared across tasks and examples from different but related distributions, can yield even more benefits. Comparative experiments were performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits), using both a multi-task setting and perturbed examples in order to obtain out-ofdistribution examples. The results agree with the hypothesis, and show that a deep learner did beat previously published results and reached human-level performance. 1

openalex-author · International Conference on Artificial Intelligence and Statistics

A Spike and Slab Restricted Boltzmann Machine

We introduce the spike and slab Restricted Boltzmann Machine, characterized by having both a real-valued vector, the slab, and a binary variable, the spike, associated with each unit in the hidden layer. The model possesses some practical properties such as being amenable to Block Gibbs sampling as well as being capable of generating similar latent representations of the data to the recently introduced mean and covariance Restricted Boltzmann Machine. We illustrate how the spike and slab Restricted Boltzmann Machine achieves competitive performance on the CIFAR-10 object recognition task.

openalex-author · International Conference on Artificial Intelligence and Statistics

Discussion of \The Neural Autoregressive Distribution Estimator"

The Restricted Boltzmann Machine (Smolensky, 1986; Hinton et al., 2006) has inspired much research in recent years, in particular as a building block for deep architectures (see Bengio (2009) for a review). The Restricted Boltzmann Machine (RBM) is an undirected graphical model with latent variables, exact inference, rather simple sampling procedures (block Gibbs), and several successful learning algorithms based on approximations of the log-likelihood gradient. However, when it comes to actually computing the distribution or density function, it is intractable, except when either the number of inputs or latent variables is very small (about 25 binary hidden units with current computers and about an hour of computing, on MNIST).

openalex-author · Neural Computation

Quickly Generating Representative Samples from an RBM-Derived Process

Two recently proposed learning algorithms, herding and fast persistent contrastive divergence (FPCD), share the following interesting characteristic: they exploit changes in the model parameters while sampling in order to escape modes and mix better during the sampling process that is part of the learning algorithm. We justify such approaches as ways to escape modes while keeping approximately the same asymptotic distribution of the Markov chain. In that spirit, we extend FPCD using an idea borrowed from Herding in order to obtain a pure sampling algorithm, which we call the rates-FPCD sampler. Interestingly, this sampler can improve the model as we collect more samples, since it optimizes a lower bound on the log likelihood of the training data. We provide empirical evidence that this new algorithm displays substantially better and more robust mixing than Gibbs sampling.

openalex-author · arXiv (Cornell University)

Learning invariant features through local space contraction

We present in this paper a novel approach for training deterministic auto-encoders. We show that by adding a well chosen penalty term to the classical reconstruction cost function, we can achieve results that equal or surpass those attained by other regularized auto-encoders as well as denoising auto-encoders on a range of datasets. This penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. We show that this penalty term results in a localized space contraction which in turn yields robust features on the activation layer. Furthermore, we show how this penalty term is related to both regularized auto-encoders and denoising encoders and how it can be seen as a link between deterministic and non-deterministic auto-encoders. We find empirically that this penalty helps to carve a representation that better captures the local directions of variation dictated by the data, corresponding to a lower-dimensional non-linear manifold, while being more invariant to the vast majority of directions orthogonal to the manifold. Finally, we show that by using the learned features to initialize a MLP, we achieve state of the art classification error on a range of datasets, surpassing other methods of pre-training.

openalex-author · arXiv (Cornell University)

Adding noise to the input of a model trained with a regularized objective

Regularization is a well studied problem in the context of neural networks. It is usually used to improve the generalization performance when the number of input samples is relatively small or heavily contaminated with noise. The regularization of a parametric model can be achieved in different manners some of which are early stopping (Morgan and Bourlard, 1990), weight decay, output smoothing that are used to avoid overfitting during the training of the considered model. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters (Krogh and Hertz, 1991). Using Bishop's approximation (Bishop, 1995) of the objective function when a restricted type of noise is added to the input of a parametric function, we derive the higher order terms of the Taylor expansion and analyze the coefficients of the regularization terms induced by the noisy input. In particular we study the effect of penalizing the Hessian of the mapping function with respect to the input in terms of generalization performance. We also show how we can control independently this coefficient by explicitly penalizing the Jacobian of the mapping function on corrupted inputs.

openalex-author · arXiv (Cornell University)

Autotagging music with conditional restricted Boltzmann machines

This paper describes two applications of conditional restricted Boltzmann machines (CRBMs) to the task of autotagging music. The first consists of training a CRBM to predict tags that a user would apply to a clip of a song based on tags already applied by other users. By learning the relationships between tags, this model is able to pre-process training data to significantly improve the performance of a support vector machine (SVM) autotagging. The second is the use of a discriminative RBM, a type of CRBM, to autotag music. By simultaneously exploiting the relationships among tags and between tags and audio-based features, this model is able to significantly outperform SVMs, logistic regression, and multi-layer perceptrons. In order to be applied to this problem, the discriminative RBM was generalized to the multi-label setting and four different learning algorithms for it were evaluated, the first such in-depth analysis of which we are aware.

openalex-author · arXiv (Cornell University)

Adaptive Drift-Diffusion Process to Learn Time Intervals

Animals learn the timing between consecutive events very easily. Their precision is usually proportional to the interval to time (Weber's law for timing). Most current timing models either require a central clock and unbounded accumulator or whole pre-defined populations of delay lines, decaying traces or oscillators to represent elapsing time. Current adaptive recurrent neural networks fail at learning to predict the timing of future events (the 'when') in a realistic manner. In this paper, we present a new model of interval timing, based on simple temporal integrators, derived from drift-diffusion models. We develop a simple geometric rule to learn 'when' instead of 'what'. We provide an analytical proof that the model can learn inter-event intervals in a number of trials independent of the interval size and that the temporal precision of the system is proportional to the timed interval. This new model uses no clock, no gradient, no unbounded accumulators, no delay lines, and has internal noise allowing generations of individual trials. Three interesting predictions are made.

openalex-author · Lecture Notes in Computer Science

On the Expressive Power of Deep Architectures

No abstract available from the OpenAlex source record.

openalex-author · Paper

On learning distributed representations of semantics.

No abstract available from the OpenAlex source record.

openalex-author · Lecture Notes in Computer Science

Higher Order Contractive Auto-Encoder

No abstract available from the OpenAlex source record.

openalex-author · http://hal.in2p3.fr/docs/00/64/29/98/PDF/draft1.pdf

Algorithms for hyper-parameter optimization

Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel ap-proaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it pos-sible to run more trials and we show that algorithmic approaches can find better results. We present hyper-parameter optimization results on tasks of training neu-ral networks and deep belief networks (DBNs). We optimize hyper-parameters using random search and two new greedy sequential methods based on the ex-pected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreli-able for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P (y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements. 1

openalex-author · Paper

Incorporating complex cells into neural networks for pattern classification

Computational neuroscientists have hypothesized that the visual system from the retina to at least primary visual cortex is continuously fitting a latent variable probability model to its stream of perceptions. It is not known exactly which probability model, nor exactly how the fitting takes place, but known algorithms for fitting such models require conditional estimates of the latent variables. This gives us a strong hint as to why the visual system might be fitting such a model; in the right kind of model those conditional estimates can also serve as excellent features for analyzing the semantic content of images perceived. The work presented here uses image classification performance (accurate discrimination between common classes of objects) as a basis for comparing visual system models, and algorithms for fitting those models as probability densities to images. This dissertation (a) finds that models based on visual area V1's complex cells generalize better from labeled training examples than conventional neural networks whose hidden units are more like V1's simple cells, (b) presents novel interpretations for complex-cell-based visual system models as probability distributions and novel algorithms for fitting them to data, and (c) demonstrates that these models form better features for image classification after they are first trained as probability models. Visual system models based on complex cells achieve some of the best results to date on the CIFAR-10 image classification benchmark, and samples from their probability distributions indicate that they have learnt to capture important aspects of natural images. Two auxiliary technical innovations that made this work possible are also described: a random search algorithm for selecting hyper-parameters, and an optimizing compiler for matrix-valued mathematical expressions which can target both CPU and GPU devices. Keywords: machine learning, visual area V1, hyper-parameter selection, computer vision, biological vision.

openalex-author · http://www.icml-2011.org/papers/342_icmlpaper.pdf

Domain adaptation for large-scale sentiment classification: A deep learning approach

The exponential increase in the availability of online reviews and recommendations makes sentiment classification an interesting topic in academic and industrial research. Reviews can span so many different domains that it is difficult to gather annotated training data for all of them. Hence, this paper studies the problem of domain adaptation for sentiment classifiers, hereby a system is trained on labeled reviews from one source domain but is meant to be deployed on another. We propose a deep learning approach which learns to extract a meaningful representation for each review in an unsupervised fashion. Sentiment classifiers trained with this high-level feature representation clearly outperform state-of-the-art methods on a benchmark composed of reviews of 4 types of Amazon products. Furthermore, this method scales well and allowed us to successfully perform domain adaptation on a larger industrial-strength dataset of 22 domains. 1.

openalex-author · Paper

Understanding deep architectures and the effect of unsupervised pre-training

This thesis studies a class of algorithms called deep architectures. We argue that models that are based on a shallow composition of local features are not appropriate for the set of real-world functions and datasets that are of interest to us, namely data with many factors of variation. Modelling such functions and datasets is important if we are hoping to create an intelligent agent that can learn from complicated data. Deep architectures are hypothesized to be a step in the right direction, as they are compositions of nonlinearities and can learn compact distributed representations of data with many factors of variation. Training fully-connected artificial neural networks—the most common form of a deep architecture—was not possible before Hinton (2006) showed that one can use stacks of unsupervised Restricted Boltzmann Machines to initialize or pre-train a supervised multi-layer network. This breakthrough has been influential, as the basic idea of using unsupervised learning to improve generalization in deep networks has been reproduced in a multitude of other settings and models. In this thesis, we cast the deep learning ideas and techniques as defining a special kind of inductive bias. This bias is defined not only by the kind of functions that are eventually represented by such deep models, but also by the learning process that is commonly used for them. This work is a study of the reasons for why this class of functions generalizes well, the situations where they should work well, and the qualitative statements that one could make about such functions. This thesis is thus an attempt to understand why deep architectures work. In the first of the articles presented we study the question of how well our intuitions about the need for deep models correspond to functions that they can actually model well. In the second article we perform an in-depth study of why unsupervised pre-training helps deep learning and explore a variety of hypotheses that give us an intuition for the dynamics of learning in such architectures. Finally, in the third article, we want to better understand what a deep architecture models, qualitatively speaking. Our visualization approach enables us to understand the representations and invariances modelled and learned by deeper layers. Keywords: machine learning, artificial neural networks, deep architectures, unsupervised learning, visualization.

openalex-author · Paper

A Common GPU n-Dimensional Array for Python and C

Currently there are multiple incompatible array/matrix/n-dimensional base object implementations for GPUs. This hinders the sharing of GPU code and causes duplicate development work. This paper proposes and presents a first version of a common GPU n-dimensional array (tensor) named GpuNdArray [1] that works with both CUDA and OpenCL. It will be usable from Python, C, and possibly other programming languages.

openalex-author · Neural Computation

Suitability of V1 Energy Models for Object Classification

Simulations of cortical computation have often focused on networks built from simplified neuron models similar to rate models hypothesized for V1 simple cells. However, physiological research has revealed that even V1 simple cells have surprising complexity. Our computational simulations explore the effect of this complexity on the visual system's ability to solve simple tasks, such as the categorization of shapes and digits, after learning from a limited number of examples. We use recently proposed high-throughput methodology to explore what axes of modeling complexity are useful in these categorization tasks. We find that complex cell rate models learn to categorize objects better than simple cell models, and without incurring extra computational expense. We find that the squaring of linear filter responses leads to better performance. We find that several other components of physiologically derived models do not yield better performance.

openalex-author · arXiv (Cornell University)

Adaptive Parallel Tempering for Stochastic Maximum Likelihood Learning of RBMs

Restricted Boltzmann Machines (RBM) have attracted a lot of attention of late, as one the principle building blocks of deep networks. Training RBMs remains problematic however, because of the intractibility of their partition function. The maximum likelihood gradient requires a very robust sampler which can accurately sample from the model despite the loss of ergodicity often incurred during learning. While using Parallel Tempering in the negative phase of Stochastic Maximum Likelihood (SML-PT) helps address the issue, it imposes a trade-off between computational complexity and high ergodicity, and requires careful hand-tuning of the temperatures. In this paper, we show that this trade-off is unnecessary. The choice of optimal temperatures can be automated by minimizing average return time (a concept first proposed by [Katzgraber et al., 2006]) while chains can be spawned dynamically, as needed, thus minimizing the computational overhead. We show on a synthetic dataset, that this results in better likelihood scores.

openalex-author · Computational Intelligence

DECISION TREES DO NOT GENERALIZE TO NEW VARIATIONS

The family of decision tree learning algorithms is among the most widespread and studied. Motivated by the desire to develop learning algorithms that can generalize when learning highly varying functions such as those presumably needed to achieve artificial intelligence, we study some theoretical limitations of decision trees. We demonstrate formally that they can be seriously hurt by the curse of dimensionality in a sense that is a bit different from other nonparametric statistical methods, but most importantly, that they cannot generalize to variations not seen in the training set. This is because a decision tree creates a partition of the input space and needs at least one example in each of the regions associated with a leaf to make a sensible prediction in that region. A better understanding of the fundamental reasons for this limitation suggests that one should use forests or even deeper architectures instead of trees, which provide a form of distributed representation and can generalize to variations not encountered in the training data.

openalex-author · arXiv (Cornell University)

Deep Self-Taught Learning for Handwritten Character Recognition

Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples}. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that {\em deep learners benefit more from out-of-distribution examples than a corresponding shallow learner}, at least in the area of handwritten character recognition. In fact, we show that they beat previously published results and reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

Learning Tags That Vary Within A Song.

[TODO] Add abstract here.

openalex-author · http://aclweb.org/anthology-new/P/P10/P10-1040.pdf

Word Representations: A Simple and General Method for Semi-Supervised Learning

If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih &amp; Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines. We find further improvements by combining different word representations. You can download our word features, for off-the-shelf use in existing NLP systems, as well as our code, here:

openalex-author · Neural Computation

Tractable Multivariate Binary Density Estimation and the Restricted Boltzmann Forest

We investigate the problem of estimating the density function of multivariate binary data. In particular, we focus on models for which computing the estimated probability of any data point is tractable. In such a setting, previous work has mostly concentrated on mixture modeling approaches. We argue that for the problem of tractable density estimation, the restricted Boltzmann machine (RBM) provides a competitive framework for multivariate binary density modeling. With this in mind, we also generalize the RBM framework and present the restricted Boltzmann forest (RBForest), which replaces the binary variables in the hidden layer of RBMs with groups of tree-structured binary variables. This extension allows us to obtain models that have more modeling capacity but remain tractable. In experiments on several data sets, we demonstrate the competitiveness of this approach and study some of its properties.

openalex-author · International Conference on Artificial Intelligence and Statistics

Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines

No abstract available from the OpenAlex source record.

openalex-author · Neural Computation

Deep Belief Networks Are Compact Universal Approximators

Deep belief networks (DBN) are generative models with many layers of hidden causal variables, recently introduced by Hinton, Osindero, and Teh ( 2006 ), along with a greedy layer-wise unsupervised learning algorithm. Building on Le Roux and Bengio ( 2008 ) and Sutskever and Hinton ( 2008 ), we show that deep but narrow generative networks do not require more parameters than shallow ones to achieve universal approximation. Exploiting the proof technique, we prove that deep but narrow feedforward neural networks with sigmoidal units can represent any Boolean expression.

openalex-author · http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf

Why Does Unsupervised Pre-training Help Deep Learning?

Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of autoencoder variants with impressive results being obtained in several areas, mostly on vision and language datasets. The best results obtained on supervised learning tasks often involve an unsupervised learning component, usually in an unsupervised pre-training phase. The main question investigated here is the following: why does unsupervised pre-training work so well? Through extensive experimentation, we explore several possible explanations discussed in the literature including its action as a regularizer (Erhan et al., 2009b) and as an aid to optimization (Bengio et al., 2007). Our results build on the work of Erhan et al. (2009b), showing that unsupervised pre-training appears to play predominantly a regularization role in subsequent supervised training. However our results in an online setting, with a virtually unlimited data stream, point to a somewhat more nuanced interpretation of the roles of optimization and regularization in the unsupervised pre-training effect.

openalex-author · http://jmlr.csail.mit.edu/proceedings/papers/v9/glorot10a/glorot10a.pdf

Understanding the difficulty of training deep feedforward neural networks

Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activations functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, and explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence. 1 Deep Neural Networks Deep learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features. They include

openalex-author · Proceedings of the Python in Science Conference

Theano: A CPU and GPU Math Compiler in Python

Abstract—Theano is a compiler for mathematical expressions in Python that combines the convenience of NumPy’s syntax with the speed of optimized native machine language. The user composes mathematical expressions in a high-level description that mimics NumPy’s syntax and semantics, while being statically typed and functional (as opposed to imperative). These expressions allow Theano to provide symbolic differentiation. Before performing computation, Theano optimizes the choice of expressions, translates them into C++ (or CUDA for GPU), compiles them into dynamically loaded Python modules, all automatically. Common machine learn-ing algorithms implemented with Theano are from 1.6 × to 7.5× faster than competitive alternatives (including those implemented with C/C++, NumPy/SciPy and MATLAB) when compiled for the CPU and between 6.5 × and 44 × faster when compiled for the GPU. This paper illustrates how to use Theano, outlines the scope of the compiler, provides benchmarks on both CPU and GPU processors, and explains its overall design.

openalex-author · http://jmlr.csail.mit.edu/papers/volume11/vincent10a/vincent10a.pdf

Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, denoising autoencoders are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images. This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations.

openalex-author · http://jmlr.csail.mit.edu/proceedings/papers/v9/desjardins10a/desjardins10a.pdf

Parallel Tempering for Training of Restricted Boltzmann Machines

Alternating Gibbs sampling between visible and latent units is the most common scheme used for sampling from Restricted Boltzmann Machines (RBM), a crucial component in deep architectures such as Deep Belief Networks (DBN). However, we find that it often does a very poor job of rendering the diversity of modes captured by the trained model. We suspect that this property hinders RBM training methods such as the Contrastive Divergence and Persistent Contrastive Divergence algorithm that rely on Gibbs sampling to approximate the likelihood gradient. To alleviate this problem, we explore the use of tempered Markov Chain Monte-Carlo for sampling in RBMs. We find both through visualization of samples and measures of likelihood on a toy dataset that it helps both sampling and learning.

openalex-author · Neural Information Processing Systems

Slow, Decorrelated Features for Pretraining Complex Cell-like Networks

We introduce a new type of neural network activation function based on recent physiological rate models for complex cells in visual area V1. A single-hidden-layer neural network of this kind of model achieves 1.50% error on MNIST. We also introduce an existing criterion for learning slow, decorrelated features as a pretraining strategy for image models. This pretraining strategy results in orientation-selective features, similar to the receptive fields of complex cells. With this pretraining, the same single-hidden-layer model achieves 1.34% error, even though the pretraining sample distribution is very different from the fine-tuning distribution. To implement this pretraining strategy, we derive a fast algorithm for online learning of decorrelated features such that each iteration of the algorithm runs in linear time with respect to the number of features.

openalex-author · Paper

Proceedings of the 22nd International Conference on Neural Information Processing Systems

No abstract available from the OpenAlex source record.

openalex-author · Neural Information Processing Systems

An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism

The Indian Buffet Process is a Bayesian nonparametric approach that models objects as arising from an infinite number of latent factors. Here we extend the latent factor model framework to two or more unbounded layers of latent factors. From a generative perspective, each layer defines a conditional factorial prior distribution over the binary latent variables of the layer below via a noisy-or mechanism. We explore the properties of the model with two empirical studies, one digit recognition task and one music tag data experiment.

openalex-author · http://jmlr.org/papers/volume10/dugas09a/dugas09a.pdf

Incorporating Functional Knowledge in Neural Networks

Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but (1) that has those properties, (2) is a universal approximator of Lipschitz 1 functions with these and other properties. We apply this new class of functions to the task of modelling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints.

openalex-author · http://jmlr.org/papers/volume10/larochelle09a/larochelle09a.pdf

Exploring Strategies for Training Deep Neural Networks

Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization often appears to get stuck in poor solutions. Hinton et al. recently proposed a greedy layer-wise unsupervised learning procedure relying on the training algorithm of restricted Boltzmann machines (RBM) to initialize the parameters of a deep belief network (DBN), a generative model with many layers of hidden causal variables. This was followed by the proposal of another greedy layer-wise procedure, relying on the usage of autoassociator networks. In the context of the above optimization problem, we study these algorithms empirically to better understand their success. Our experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy helps the optimization by initializing weights in a region near a good local minimum, but also implicitly acts as a sort of regularization that brings better generalization and encourages internal distributed representations that are high-level abstractions of the input. We also present a series of experiments aimed at evaluating the link between the performance of deep neural networks and practical aspects of their topology, for example, demonstrating cases where the addition of more depth helps. Finally, we empirically explore simple variants of these training algorithms, such as the use of different RBM input unit distributions, a simple way of combining gradient estimators to improve performance, as well as on-line versions of those algorithms.

openalex-author · Journal of Computational Neuroscience

Alternative time representation in dopamine models

No abstract available from the OpenAlex source record.

openalex-author · Proceedings of the 26th Annual International Conference on Machine Learning

Curriculum learning

Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. Here, we formalize such training strategies in the context of machine learning, and call them "curriculum learning". In the context of recent research studying the difficulty of training in the presence of non-convex training criteria (for deep deterministic and stochastic neural networks), we explore curriculum learning in various set-ups. The experiments show that significant improvements in generalization can be achieved. We hypothesize that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and, in the case of non-convex criteria, on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).

openalex-author · Proceedings of the 26th Annual International Conference on Machine Learning

Workshop summary: Workshop on learning feature hierarchies

No abstract available.

openalex-author · IEEE Transactions on Neural Networks

A Hybrid Pareto Mixture for Conditional Asymmetric Fat-Tailed Distributions

In many cases, we observe some variables X that contain predictive information over a scalar variable of interest Y , with (X,Y) pairs observed in a training set. We can take advantage of this information to estimate the conditional density p(Y|X = x). In this paper, we propose a conditional mixture model with hybrid Pareto components to estimate p(Y|X = x). The hybrid Pareto is a Gaussian whose upper tail has been replaced by a generalized Pareto tail. A third parameter, in addition to the location and spread parameters of the Gaussian, controls the heaviness of the upper tail. Using the hybrid Pareto in a mixture model results in a nonparametric estimator that can adapt to multimodality, asymmetry, and heavy tails. A conditional density estimator is built by modeling the parameters of the mixture estimator as functions of X. We use a neural network to implement these functions. Such conditional density estimators have important applications in many domains such as finance and insurance. We show experimentally that this novel approach better models the conditional density in terms of likelihood, compared to competing algorithms: conditional mixture models with other types of components and a classical kernel-based nonparametric model.

openalex-author · International Conference on Artificial Intelligence and Statistics

The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training

Whereas theoretical work suggests that deep architectures might be more e cient at representing highly-varying functions, training deep architectures was unsuccessful until the recent advent of algorithms based on unsupervised pretraining. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this di cult learning problem. Answering these questions is important if learning in deep architectures is to be further improved. We attempt to shed some light on these questions through extensive simulations. The experiments confirm and clarify the advantage of unsupervised pre-training. They demonstrate the robustness of the training procedure with respect to the random initialization, the positive e ect of pre-training in terms of optimization and its role as a regularizer. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.

openalex-author · http://www.apstat.com/documents/mss_ml_classif.pdf

Statistical Machine Learning Algorithms for Target Classification from Acoustic Signature

Machine learning classification algorithms are relevant to a large number of Army classification prob-lems, including the determination of a weapon class from a detonation acoustic signature. However, much such work has been focused on classification of events from small weapons used for asymmet-ric warfare, which have been of importance in recent years. In this work we consider classification of very different weapon classes, such as mortar, rockets and RPGs, which are difficult to reliably classify with standard techniques since they tend to have similar acoustic signatures. To address this problem, we compare two recently-introduced state-of-the-art machine learning algorithms, Support Vector Machines and Discriminative Restricted Boltzmann Machines, and develop how to use them to solve this difficult acoustic classification task. We obtain classification accuracy results that could make these techniques suitable for fielding on autonomous devices. Discriminative Restricted Boltzmann Machines appear to yield slightly better accuracy than Support Vector Machines, and are less sensitive to the choice of signal preprocessing and model hyperparameters. Importantly, we also address methodological issues that one faces in order to rigorously compare several classifiers on limited data collected from field trials; these questions are of significance to any application of machine learning methods to Army problems. Approved for public release; distribution is unlimited 1 1

openalex-author · now publishers, Inc. eBooks

Learning Deep Architectures for AI

Can machine learning deliver AI? Theoretical results, inspiration from the brain and cognition, as well as machine learning experiments suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one would need deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers, graphical models with many levels of latent variables, or in complicated propositional formulae re-using many sub-formulae. Each level of the architecture represents features at a different level of abstraction, defined as a composition of lower-level features. Searching the parameter space of deep architectures is a difficult task, but new algorithms have been discovered and a new sub-area has emerged in the machine learning community since 2006, following these discoveries. Learning algorithms such as those for Deep Belief Networks and other related unsupervised learning algorithms have recently been proposed to train deep architectures, yielding exciting results and beating the state-of-the-art in certain areas. Learning Deep Architectures for AI discusses the motivations for and principles of learning algorithms for deep architectures. By analyzing and comparing recent results with different learning algorithms for deep architectures, explanations for their success are proposed and discussed, highlighting challenges and suggesting avenues for future explorations in this area.

openalex-author · Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion

Quadratic features and deep architectures for chunking

We experiment with several chunking models. Deeper architectures achieve better generalization. Quadratic filters, a simplification of a theoretical model of V1 complex cells, reliably increase accuracy. In fact, logistic regression with quadratic filters outperforms a standard single hidden layer neural network. Adding quadratic filters to logistic regression is almost as effective as feature engineering. Despite predicting each output label independently, our model is competitive with ones that use previous decisions.

openalex-author · Paper

Proceedings of the 21st International Conference on Neural Information Processing Systems

No abstract available from the OpenAlex source record.

openalex-author · Neural Computation

Justifying and Generalizing Contrastive Divergence

We study an expansion of the log likelihood in undirected graphical models such as the restricted Boltzmann machine (RBM), where each term in the expansion is associated with a sample in a Gibbs chain alternating between two random variables (the visible vector and the hidden vector in RBMs). We are particularly interested in estimators of the gradient of the log likelihood obtained through this expansion. We show that its residual term converges to zero, justifying the use of a truncation--running only a short Gibbs chain, which is the main idea behind the contrastive divergence (CD) estimator of the log-likelihood gradient. By truncating even more, we obtain a stochastic reconstruction error, related through a mean-field approximation to the reconstruction error often used to train autoassociators and stacked autoassociators. The derivation is not specific to the particular parametric forms used in RBMs and requires only convergence of the Gibbs chain. We present theoretical and empirical evidence linking the number of Gibbs steps k and the magnitude of the RBM parameters to the bias in the CD estimator. These experiments also suggest that the sign of the CD estimator is correct most of the time, even when the bias is large, so that CD-k is a good descent direction even for small k.

openalex-author · Extremes

A hybrid Pareto model for asymmetric fat-tailed data: the univariate case

No abstract available from the OpenAlex source record.

openalex-author · http://www.dmi.usherb.ca/~larocheh/publications/aaai2008_zero-data.pdf

Zero-data learning of new tasks

We introduce the problem of zero-data learning, where a model must generalize to classes or tasks for which no training data are available and only a description of the classes or tasks are provided. Zero-data learning is useful for problems where the set of classes to distinguish or tasks to solve is very large and is not entirely covered by the training data. The main contributions of this work lie in the presentation of a general formalization of zero-data learning, in an experimental analysis of its properties and in empirical evidence showing that generalization is possible and significant in this context. The experimental work of this paper addresses two classification problems of character recognition and a multitask ranking problem in the context of drug discovery. Finally, we conclude by discussing how this new framework could lead to a novel perspective on how to extend machine learning towards AI, where an agent can be given a specification for a learning problem before attempting to solve it (with very few or even zero examples).

openalex-author · IEEE Transactions on Neural Networks

Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model

Previous work on statistical language modeling has shown that it is possible to train a feedforward neural network to approximate probabilities over sequences of words, resulting in significant error reduction when compared to standard baseline models based on n-grams. However, training the neural network model with the maximum-likelihood criterion requires computations proportional to the number of words in the vocabulary. In this paper, we introduce adaptive importance sampling as a way to accelerate training of the model. The idea is to use an adaptive n-gram model to track the conditional distributions produced by the neural network. We show that a very significant speedup can be obtained on standard problems.

openalex-author · Neural Computation

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks

Deep belief networks (DBN) are generative neural network models with many layers of hidden explanatory factors, recently introduced by Hinton, Osindero, and Teh (2006) along with a greedy layer-wise unsupervised learning algorithm. The building block of a DBN is a probabilistic model called a restricted Boltzmann machine (RBM), used to represent one layer of the model. Restricted Boltzmann machines are interesting because inference is easy in them and because they have been successfully used as building blocks for training deeper models. We first prove that adding hidden units yields strictly improved modeling power, while a second theorem shows that RBMs are universal approximators of discrete distributions. We then study the question of whether DBNs with more layers are strictly more powerful in terms of representational power. This suggests a new and less greedy criterion for training RBMs within DBNs.

openalex-author · Proceedings of the 25th international conference on Machine learning - ICML '08

Classification using discriminative restricted Boltzmann machines

Recently, many applications for Restricted Boltzmann Machines (RBMs) have been developed for a large variety of learning problems. However, RBMs are usually used as feature extractors for another learning algorithm or to provide a good initialization for deep feed-forward neural network classifiers, and are not considered as a standalone solution to classification problems. In this paper, we argue that RBMs provide a self-contained framework for deriving competitive non-linear classifiers. We present an evaluation of different learning algorithms for RBMs which aim at introducing a discriminative component to RBM training and improve their performance as classifiers. This approach is simple in that RBMs are used directly to build a classifier, rather than as a stepping stone. Finally, we demonstrate how discriminative RBMs can also be successfully employed in a semi-supervised setting.

openalex-author · Proceedings of the 25th international conference on Machine learning - ICML '08

Extracting and composing robust features with denoising autoencoders

Previous work has shown that the difficulties in learning deep generative or discriminative models can be overcome by an initial unsupervised learning step that maps inputs to useful intermediate representations. We introduce and motivate a new training principle for unsupervised learning of a representation based on the idea of making the learned representations robust to partial corruption of the input pattern. This approach can be used to train autoencoders, and these denoising autoencoders can be stacked to initialize deep architectures. The algorithm can be motivated from a manifold learning and information theoretic perspective or from a generative model perspective. Comparative experiments clearly show the surprising advantage of corrupting the input of autoencoders on a pattern classification benchmark suite. 1.

openalex-author · Scholarpedia

Neural net language models

No abstract available from the OpenAlex source record.

openalex-author · http://www.apstat.com/documents/gp_spreads_nips07.pdf

Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

We introduce a functional representation of time series which allows forecasts to be performed over an unspecified horizon with progressively-revealed informa-tion sets. By virtue of using Gaussian processes, a complete covariance matrix between forecasts at several time-steps is available. This information is put to use in an application to actively trade price spreads between commodity futures con-tracts. The approach delivers impressive out-of-sample risk-adjusted returns after transaction costs on a portfolio of 30 spreads. 1

openalex-author · Advances in Neural Information Processing Systems, 2007, Vancouver, Canada

Learning the 2-D Topology of Images

International audience

openalex-author · http://www.deetc.isel.ipl.pt/ftp/FicheirosSeccoes/AnaliseSinais/ccd/temp/LZRW1NoHash_DCC98_84060568.pdf

A Memory-Efficient Adaptive Huffman Coding . . .

The problem of computing the minimum redundancy codes as we observe symbols one by one has received a lot of attention. However, existing algorithms implicitly assumes that either we have a small alphabet --- quite typically 256 symbols --- or that we have an arbitrary amount of memory at our disposal for the creation of the tree. In real life applications one may need to encode symbols coming from a much larger alphabet, for e.g. coding integers. We now have to deal not with hundreds of symbols but possibly with millions of symbols. While other algorithms use a space proportional to the number of observed symbols, we here propose one that uses space proportional to the number of frequency classes, which is, quite interestingly, always smaller or equal to the size of the alphabet.

openalex-author · Advances in Neural Information Processing Systems 19

Greedy Layer-Wise Training of Deep Networks

Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

openalex-author · Large-Scale Kernel Machines

Scaling Learning Algorithms toward AI

This chapter contains sections titled: Introduction, LearningModels Toward AI, Learning Architectures, Shallow and Deep, Fundamental Limitation of Local Learning, Deep Architectures, Experiments with Visual Pattern Recognition, Conclusion

openalex-author · Proceedings of the 24th international conference on Machine learning

An empirical evaluation of deep architectures on problems with many factors of variation

Recently, several learning algorithms relying on models with deep architectures have been proposed. Though they have demonstrated impressive performance, to date, they have only been evaluated on relatively simple problems such as digit recognition in a controlled environment, for which many machine learning algorithms already report reasonable results. Here, we present a series of experiments which indicate that these models show promise in solving harder learning problems that exhibit many factors of variation. These models are compared with well-established algorithms such as Support Vector Machines and single hidden-layer feed-forward neural networks.

openalex-author · http://research.microsoft.com/en-us/people/nicolasl/continuous_nnet.pdf

Continuous Neural Networks

This article extends neural networks to the case of an uncountable number of hidden units, in several ways. In the first approach proposed, a finite parametrization is possible, allowing gradient-based learning. While having the same number of parameters as an ordinary neural network, its internal structure suggests that it can represent some smooth functions much more compactly. Under mild assumptions, we also find better error bounds than with ordinary neural networks. Furthermore, this parametrization may help reducing the problem of saturation of the neurons. In a second approach, the input-to-hidden weights are fully nonparametric, yielding a kernel machine for which we demonstrate a simple kernel formula. Interestingly, the resulting kernel machine can be made hyperparameter-free and still generalizes in spite of an absence of explicit regularization. 1

openalex-author · Journal of Computers

Noisy K Best-Paths for Approximate Dynamic Programming with Application to Portfolio Optimization

Abstract — We describe a general method to transform a non-Markovian sequential decision problem into a supervised learning problem using a K-bestpaths algorithm. We consider an application in financial portfolio management where we can train a controller to directly optimize a Sharpe Ratio (or other risk-averse non-additive) utility function. We illustrate the approach by demonstrating experimental results using a kernel-based controller architecture that would not normally be considered in traditional reinforcement learning or approximate dynamic programming. We further show that using a non-additive criterion (incremental Sharpe Ratio) yields a noisy K-best-paths extraction problem, that can give substantially improved performance. I.

openalex-author · Paper

Forecasting and Trading Commodity Contract Spreads with Gaussian Processes

Gaussian Processes are general statistical models for nonlinear regression and classification that have recently received wide attention in the machine learning community, having originally been introduced in geostatistics (where they are known under the name “Kriging”.) They differ from neural networks in that they engage in a full Bayesian treatment, supplying a complete posterior distribution of forecasts. For regression, they are also computationally relatively simple to implement, the basic model requiring only solving a system of linear equations, albeit one of size equal to the number of training examples (requiring O(N3) computation). This paper examines the use Gaussian Processes to forecast the evolution of futures contracts spreads arising on the commodities markets. Contrarily to most forecasting techniques which rely on modeling the short-term dynamics of a time series (e.g. arima and most neural-network models), an appropriate representation of the input and target variables allows the Gaussian Process to forecast the complete future trajectory of the spread. Furthermore, as a customary outcome of using Gaussian Processes, the forecast includes not only the expectation of future spread prices (across time-steps), but their joint autocovariance matrix as well. We introduce a technique to exploit this joint autocovariance matrix in order to profitably trade spreads, based on maximizing an information ratio criterion between candidate entry-exit points and constantly monitoring the position with revised forecasts as the spread realization unfolds. This approach results in a qualitatively very different methodology than a classical mean-variance portfolio construction based on short-term forecasts, yielding models that do not overtrade yet react quickly to changes in market conditions. We present simultation results on historical data to validate the approach.

openalex-author · PolyPublie (École Polytechnique de Montréal)

A Hybrid Pareto Model for Conditional Density Estimation of Asymmetric Fat-Tail Data

We propose an estimator for the conditional density p(Y |X) that can adapt for asymmetric heavy tails which might depend on X. Such estimators have important applications in nance and insurance. We draw from Extreme Value Theory the tools to build a hybrid unimodal density having a parameter controlling the heaviness of the upper tail. This hybrid is a Gaussian whose upper tail has been replaced by a generalized Pareto tail. We use this hybrid in a multi-modal mixture in order to obtain a nonparametric density estimator that can easily adapt for heavy tailed data. To obtain a conditional density estimator, the parameters of the mixture estimator can be seen as functions of X and these functions learned. We show experimentally that this approach better models the conditional density in terms of likelihood than compared competing algorithms: conditional mixture models with other types of components and multivariate nonparametric models. 1

openalex-author · Progress in Brain Research

On the challenge of learning complex functions

No abstract available from the OpenAlex source record.

openalex-author · Paper

Deep Architectures for Baby AI

No abstract available from the OpenAlex source record.

openalex-author · http://www.iro.umontreal.ca/~pift6266/A06/refs/bengio+lecun_chapter2007.pdf

Scaling learning algorithms towards AI

One long-term goal of machine learning research is to produce methods that are applicable to highly complex tasks, such as perception (vision, audition), rea-soning, intelligent control, and other artificially intelligent behaviors. We argue that in order to progress toward this goal, the Machine Learning community must endeavor to discover algorithms that can learn highly complex functions, with min-imal need for prior knowledge, and with minimal human intervention. We present mathematical and empirical evidence suggesting that many popular approaches to non-parametric learning, particularly kernel methods, are fundamentally lim-ited in their ability to learn complex high-dimensional functions. Our analysis focuses on two problems. First, kernel machines are shallow architectures, in which one large layer of simple template matchers is followed by a single layer of trainable coefficients. We argue that shallow architectures can be very ineffi-cient in terms of required number of computational elements and examples. Sec-ond, we analyze a limitation of kernel machines with a local kernel, linked to the curse of dimensionality, that applies to supervised, unsupervised (manifold learn-ing) and semi-supervised kernel machines. Using empirical results on invariant image recognition tasks, kernel methods are compared with deep architectures, in which lower-level features or concepts are progressively combined into more ab-stract and higher-level representations. We argue that deep architectures have the potential to generalize in non-local ways, i.e., beyond immediate neighbors, and that this is crucial in order to make progress on the kind of complex tasks required for artificial intelligence. 1 1

openalex-author · Semi-Supervised Learning

Large-Scale Algorithms

No abstract available from the OpenAlex source record.

openalex-author · Semi-Supervised Learning

Label Propagation and Quadratic Criterion

Abstract This chapter shows how the different graph-based algorithms for semi-supervised learning can be cast into a common framework where one minimizes a quadratic cost criterion whose closed-form solution is found by solving a linear system of size n (total number of data points). The cost criterion naturally leads to an extension of such algorithms to the inductive setting, where one obtains test samples one at a time: the derived induction formula can be evaluated in O(n) time, which is much more efficient than solving again exactly the linear system (which in general costs O(kn2) time for a sparse graph where each data point has k neighbors). This inductive formula is also used to show that when the similarity between points satisfies a locality property, then the algorithms are plagued by the curse of dimensionality, with respect to the dimensionality of an underlying manifold.

openalex-author · Semi-Supervised Learning

Entropy Regularization

This chapter promotes the use of entropy regularization as a means to benefit from unlabeled data in the framework of maximum a posteriori estimation. The learning criterion is derived from clearly stated assumptions and can be applied to any smoothly parameterized model of posterior probabilities. The regularization scheme favors low-density separation, without any modeling of the density of input features. The contribution of unlabeled data to the learning criterion induces local optima, but this problem can be alleviated by deterministic annealing. For well-behaved models of posterior probabilities, deterministic annealing expectation-maximization (EM) provides a decomposition of the learning problem in a series of concave subproblems. Other approaches to the semi-supervised problem are shown to be close relatives or limiting cases of entropy regularization. A series of experiments illustrates the good behavior of the algorithm in terms of performance and robustness with respect to the violation of the postulated low-density separation assumption.

openalex-author · Neural Computation

Nonlocal Estimation of Manifold Structure

We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation invites an exploration of nonlocal manifold learning algorithms that attempt to discover shared structure in the tangent planes at different positions. A training criterion for such an algorithm is proposed, and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where local nonparametric methods fail.

openalex-author · Studies in Fuzziness and Soft Computing

Neural Probabilistic Language Models

No abstract available from the OpenAlex source record.

openalex-author · Journal of Chemical Information and Modeling

Collaborative Filtering on a Family of Biological Targets

Building a QSAR model of a new biological target for which few screening data are available is a statistical challenge. However, the new target may be part of a bigger family, for which we have more screening data. Collaborative filtering or, more generally, multi-task learning, is a machine learning approach that improves the generalization performance of an algorithm by using information from related tasks as an inductive bias. We use collaborative filtering techniques for building predictive models that link multiple targets to multiple examples. The more commonalities between the targets, the better the multi-target model that can be built. We show an example of a multi-target neural network that can use family information to produce a predictive model of an undersampled target. We evaluate JRank, a kernel-based method designed for collaborative filtering. We show their performance on compound prioritization for an HTS campaign and the underlying shared representation between targets. JRank outperformed the neural network both in the single- and multi-target models.

openalex-author · Lecture Notes in Computer Science

The K Best-Paths Approach to Approximate Dynamic Programming with Application to Portfolio Optimization

No abstract available from the OpenAlex source record.

openalex-author · http://www.cs.toronto.edu/~larocheh/publications/dist_rep_pred_tr1284.pdf

Distributed Representation Prediction for Generalization to New Words

Learning distributed representations of symbols (e.g. words) has been used in several Natural Language Processing systems. Such representations can capture semantic or syntactic similarities between words, which permit to fight the curse of dimensionality when considering sequences of such words. Unfortunately, because these representations are learned only for a previously determined vocabulary of words, it is not clear how to obtain representations for new words. We present here an approach which gets around this problem by considering the distributed representations as predictions from low-level or domain-knowledge features of words. We report experiments on a Part Of Speech tagging task, which demonstrates the success of this approach in learning meaningful representations and in providing improved accuracy, especially for new words. 1

openalex-author · Studies in Fuzziness and Soft Computing

Spectral Dimensionality Reduction

No abstract available from the OpenAlex source record.

openalex-author · http://papers.nips.cc/paper/2800-convex-neural-networks.pdf

Convex Neural Networks

Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disad-vantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1

openalex-author · http://www.iro.umontreal.ca/~lisa/pointeurs/local_failure-nips-submission.pdf

The Curse of Highly Variable Functions for Local Kernel Machines

We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms that rely solely on the smooth-ness prior – with similarity between examples expressed with a local kernel – are sensitive to the curse of dimensionality, or more precisely to the variability of the target. Our discussion covers supervised, semi-supervised and unsupervised learning algorithms. These algorithms are found to be local in the sense that crucial properties of the learned func-tion at x depend mostly on the neighbors of x in the training set. This makes them sensitive to the curse of dimensionality, well studied for classical non-parametric statistical learning. We show in the case of the Gaussian kernel that when the function to be learned has many variations, these algorithms require a number of training examples proportional to the number of variations, which could be large even though there may ex-ist short descriptions of the target function, i.e. their Kolmogorov com-plexity may be low. This suggests that there exist non-local learning algorithms that at least have the potential to learn about such structured but apparently complex functions (because locally they have many vari-ations), while not using very specific prior domain knowledge. 1

openalex-author · http://www.iro.umontreal.ca/~lisa/pointeurs/nonlocal_manifold_parzen-nips-submission.pdf

Non-Local Manifold Parzen Windows

In order to escape from the curse of dimensionality, we claim that one can learn non-local functions, in the sense that the value and shape of the learned function at x must be inferred using examples that may be far from x. With this objective, we present a non-local non-parametric density estimator. It builds upon previously proposed Gaussian mixture models with regularized covariance matrices to take into account the local shape of the manifold. It also builds upon recent work on non-local estimators of the tangent plane of a manifold, which are able to generalize in places with little training data, unlike traditional, local, non-parametric models. 1

openalex-author · Statistical Modeling and Analysis for Complex Data Problems

Bias in Estimating the Variance of K-Fold Cross-Validation

No abstract available from the OpenAlex source record.

openalex-author · http://books.nips.cc/papers/files/nips17/NIPS2004_0691.pdf

Non-local manifold tangent learning

We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1

openalex-author · Paper

Graph-Based Semi-Supervised Learning

No abstract available from the OpenAlex source record.

openalex-author · Paper

Reassuring and Troubling Views on Graph-Based Semi-Supervised Learning

No abstract available from the OpenAlex source record.

openalex-author · International Conference on Artificial Intelligence and Statistics

Greedy Spectral Embedding.

No abstract available from the OpenAlex source record.

openalex-author · International Conference on Artificial Intelligence and Statistics

Hierarchical Probabilistic Neural Network Language Model.

According to the Center for Disease Control, there were more than 107,000 US drug overdose deaths in 2021, over 80,000 of which due to opioids. One of the more vulnerable populations is US military veterans. Nearly 250,000 military veterans suffer from substance-related disorders (SRD). For those seeking treatment, buprenorphine is prescribed to help treat opioid use disorder (OUD). Urinalysis is currently used to monitor buprenorphine adherence as well as to detect illicit drug use during treatment. Sometimes sample tampering occurs if patients seek to generate a false positive buprenorphine urine test or mask illicit drugs, both of which can compromise treatment. To address this problem, we have been developing a point-of-care (POC) analyzer that can rapidly measure both medications used for treatment and illicit drugs in patient saliva, ideally in the physi-cian's office. The two-step analyzer employs (1) supported liquid extraction (SLE) to isolate the drugs from the saliva and (2) surface-enhanced Raman spectroscopy (SERS) to detect the drugs. A prototype SLE-SERS-POC analyzer was used to quantify buprenorphine at ng/mL concentrations and identify illicit drugs in less than 1 mL of saliva collected from 20 SRD veterans in less than 20 min. It correctly detected buprenorphine in 19 of 20 samples (18 true positives, 1 true negative and 1 false negative). It also identified 10 other drugs in patient samples: acetaminophen, amphetamine, cannabidiol, cocaethylene, codeine, ibuprofen, methamphetamine, methadone, nicotine, and norbuprenorphine. The prototype analyzer shows evidence of accuracy in measuring treatment medications and relapse to drug use. Further study and development of the system is warranted.

openalex-author · http://www.iro.umontreal.ca/~lisa/bib/pub_subject/kernel/pointeurs/tr1258.pdf

The Curse of Dimensionality for Local Kernel Machines

We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms based on local kernels are sensitive to the curse of dimensionality. These include local manifold learning algorithms such as Isomap and LLE, support vector classifiers with Gaussian or other local kernels, and graph-based semisupervised learning algorithms using a local similarity function. These algorithms are shown to be local in the sense that crucial properties of the learned function at x depend mostly on the neighbors of x in the training set. This makes them sensitive to the curse of dimensionality, well studied for classical non-parametric statistical learning. There is a large class of data distributions for which non-local solutions could be expressed compactly and potentially be learned with few examples, but which will require a large number of local bases and therefore a large number of training examples when using a local learning algorithm. 1

openalex-author · http://eprints.pascal-network.org/archive/00001978/01/grandvalet05.pdf

Semi-supervised Learning by Entropy Minimization

We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach in-cludes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solu-tion benefits from unlabeled data. The method challenges mixture mod-els when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1

openalex-author · http://www.iro.umontreal.ca/~lisa/pointeurs/RivestNIPS2004.pdf

Brain Inspired Reinforcement Learning

Successful application of reinforcement learning algorithms often involves considerable hand-crafting of the necessary non-linear features to reduce the complexity of the value functions and hence to promote convergence of the algorithm. In contrast, the human brain readily and autonomously finds the complex features when provided with sufficient training. Recent work in machine learning and neurophysiology has demonstrated the role of the basal ganglia and the frontal cortex in mammalian reinforcement learning. This paper develops and explores new reinforcement learning algorithms inspired by neurological evidence that provides potential new approaches to the feature construction problem. The algorithms are compared and evaluated on the Acrobot task. 1

openalex-author · Neural Computation

Learning Eigenfunctions Links Spectral Embedding and Kernel PCA

In this letter, we show a direct relation between spectral embedding methods and kernel principal components analysis and how both are special cases of a more general learning problem: learning the principal eigenfunctions of an operator defined from a kernel and the unknown data-generating density. Whereas spectral embedding methods provided only coordinates for the training points, the analysis justifies a simple extension to out-of-sample examples (the Nyström formula) for multidimensional scaling (MDS), spectral clustering, Laplacian eigenmaps, locally linear embedding (LLE), and Isomap. The analysis provides, for all such spectral embedding methods, the definition of a loss function, whose empirical average is minimized by the traditional algorithms. The asymptotic expected value of that loss defines a generalization performance and clarifies what these algorithms are trying to learn. Experiments with LLE, Isomap, spectral clustering, and MDS show that this out-of-sample embedding formula generalizes well, with a level of error comparable to the effect of small perturbations of the training set on the embedding.

openalex-author · Journal of Computer-Aided Molecular Design

Locally Linear Embedding for dimensionality reduction in QSAR

No abstract available from the OpenAlex source record.

openalex-author · Érudit documents and data repository (Érudit Consortium, University of Montreal)

Locally Weighted Full Covariance Gaussian Density Estimation

Nous décrivons une application du principe d'apprentissage local à l'estimation de densité. Le lissage pondéré localement d'une gaussienne utilisant une matrice de covariance pleine et régularisée conduit à un estimateur de densité ayant un comportement amélioré lorsque la masse de probabilité est concentrée le long d'une variété de basse dimension. Même si l'estimateur proposé n'est pas garanti d'intégrer à 1 sur un ensemble de données fini, nous prouvons la convergence asymptotique de la vraie densité. Les résultats expérimentaux illustrant les avantages de cet estimateur sur les estimateurs non paramétriques classiques sont présentés.

openalex-author · Érudit documents and data repository (Érudit Consortium, University of Montreal)

Estimation de densité conditionnelle lorsque l'hypothèse de normalité est insatisfaisante

Nous cherchons à modéliser des densités dont la distribution est inconnue mais qui est asymétrique et présente des queues lourdes. Dans ce contexte, l'hypothèse de normalité n'est pas appropriée. Afin de maintenir au minimum le nombre d'hypothèses distributionnelles, nous utilisons une méthode non paramétrique pour modéliser le centre de la distribution. La modélisation est plus difficile dans les queues de la distribution puisque peu d'observations s'y trouvent. Nous nous proposons donc d'utiliser la Pareto généralisée (GPD) pour modéliser les queues de la distribution. La GPD permet d'approximer tous les types de queues de distributions (qu'elles soient finies, exponentielles ou sous-exponentielles). L'estimation des paramètres de la GPD est uniquement basée sur les observations extrêmes. Une observation est définie comme étant extrême si elle dépasse un seuil donné. La principale difficulté de la modélisation avec la GPD réside dans le choix d'un seuil adéquat.

openalex-author · Érudit documents and data repository (Érudit Consortium, University of Montreal)

Régularisation du prix des options : Stacking

La modélisation non-paramétrique du prix des options et autres produits dérivés a connu un intérêt croissant au cours des dernières années. Ce rapport se situe dans la perspective de prédire le prix de l'option au marché à partir des mêmes informations utilisées dans la formule de Black-Scholes. Il se situe dans la continuation de travaux récents sur la modélisation de ces prix par des réseaux de neurones avec une structure inspirée des connaissances économiques sur la valorisation d'options. La contribution de la recherche présentée ici est l'utilisation avec succès de l'algorithme de Stacking pour améliorer la généralisation de ces modèles. Cet algorithme combine deux niveaux d'entraînement des modèles, le deuxième cherchant à combler les déficits hors-échantillon du premier. Les résultats obtenus sont très intéressants et portent sur des options d'achat du S&P 500 entre 1987 et 1993.

openalex-author · Paper

Approche statistique pour le repérage de mots informatifs dans les textes oraux

Nous presentons les resultats de l’approche statistique que nous avons developpee pour le reperage de mots informatifs a partir de textes oraux. Ce travail fait partie d’un projet lance par le departement de la defense canadienne pour le developpement d’un systeme d’extraction d’information dans le domaine de la Recherche et Sauvetage maritime (SAR). Il s’agit de trouver et annoter les mots pertinents avec des etiquettes semantiques qui sont les concepts d’une ontologie du domaine (SAR). Notre methode combine deux types d’information : les vecteurs de similarite generes grâce a l’ontologie du domaine et le dictionnaire-thesaurus Wordsmyth ; le contexte d’enonciation represente par le theme. L’evaluation est effectuee en comparant la sortie du systeme avec les reponses de formulaires d’extraction d’information predefinis. Les resultats obtenus sur les textes oraux sont comparables a ceux obtenus dans le cadre de MUC7 pour des textes ecrits.

openalex-author · Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics - ACL '04

Unsupervised sense disambiguation using bilingual probabilistic models

We describe two probabilistic models for unsupervised word-sense disambiguation using parallel corpora. The first model, which we call the Sense model, builds on the work of Diab and Resnik (2002) that uses both parallel text and a sense inventory for the target language, and recasts their approach in a probabilistic framework. The second model, which we call the Concept model, is a hierarchical model that uses a concept latent variable to relate different language specific sense labels. We show that both models improve performance on the word sense disambiguation task over previous unsupervised approaches, with the Concept model showing the largest improvement. Furthermore, in learning the Concept model, as a by-product, we learn a sense inventory for the parallel language.

openalex-author · Érudit documents and data repository (Érudit Consortium, University of Montreal)

Spectral Clustering and Kernel PCA are Learning Eigenfunctions

Dans cet article, on montre une équivalence directe entre la classification spectrale et l'ACP à noyau, et on montre que les deux sont des cas particuliers d'un problème plus général, celui d'apprendre les fonctions propres d'un noyau. Ces fonctions fournissent une base pour un espace de Hilbert dont le produit scalaire est défini par rapport à la densité des données. Les fonctions propres définissent une transformation de coordonnées naturelles pour de nouveaux points, alors que des méthodes comme la classification spectrale et les 'Laplacian eigenmaps' ne fournissaient un système de coordonnées que pour les exemples d'apprentissage. Cette analyse suggère aussi de nouvelles approches à l'apprentissage non-supervisé dans lesquelles on extrait des abstractions qui résument la densité des données, telles que des variétés et des classes naturelles.

openalex-author · http://www.iro.umontreal.ca/~lisa/bib/pub_subject/kernel/pointeurs/tr-tangent.pdf

Discovering Shared Structure in Manifold Learning

We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local will suffer from at least four generic problems associated with (1) noise in the data, (2) curvature of the manifold, (3) dimensionality of the manifold, and (4) the presence of many manifolds with little data per manifold. This analysis suggests non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented. The function has parameters that are shared across space rather than estimated based on the local neighborhood, as in current non-parametric manifold learning algorithms. The results show clearly the advantages of this approach with respect to local manifold learning algorithms. 1

openalex-author · Érudit documents and data repository (Érudit Consortium, University of Montreal)

Efficient Non-Parametric Function Induction in Semi-Supervised Learning

Il y a eu un regain d'intérêt récemment pour l'apprentissage semi-supervisé, à cause du grand nombre de bases de données comportant de très nombreux exemples non étiquetés et seulement quelques exemples étiquetés. Cet article poursuit le travail fait sur les algorithmes non paramétriques qui fournissent une étiquette continue estimée pour les exemples non-étiquetés. Il les étend à des algorithmes d'induction fonctionnelle qui correspondent à la minimisation d'un critère de régularisation appliqué à un exemple hors-échantillon, et qui ont la forme d'un régresseur à fenêtre de Parzen. L'avantage de cette extension est qu'elle permet de prédire l'étiquette d'un nouvel exemple sans avoir à résoudre de nouveau un système de dimension 'n' (le nombre d'exemples d'entraînement total), qui peut être de l'ordre de O(n^3). Les expériences montrent que l'extension fonctionne bien, en ce sens que l'étiquette prédite est proche de celle qui aurait été obtenue si l'exemple de test avait fait partie de l'ensemble non étiqueté. Cette procédure d'induction fonctionnelle relativement efficace peut également être utilisée, lorsque 'n' est grand, pour estimer la solution en l'écrivant seulement en fonction d'une expansion à noyau avec 'm' << 'n', et en la réduisant à un système linéaire avec 'm' équations et 'm' inconnues.

openalex-author · http://www.iro.umontreal.ca/~lisa/pointeurs/bengio_extension_nips_2003.pdf

Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data. 1

openalex-author · Intelligent and Other Computational Techniques in Insurance

Statistical Learning Algorithms Applied to Automobile Insurance Ratemaking

AbstractThe following sections are included:IntroductionConcepts of Statistical Learning TheoryHypothesis Testing: an ExampleParameter Optimization: an ExampleMathematical ObjectivesThe Precision CriterionThe Fairness CriterionMethodologyModelsConstant ModelLinear ModelTable-Based MethodsGreedy Multiplicative ModelGeneralized Linear ModelCHAID Decision TreesCombination of CHAID and Linear ModelOrdinary Neural NetworkHow Can Neural Networks Represent Nonlinear Interactions?Softplus Neural NetworkRegression Support Vector MachineMixture ModelsExperimental ResultsMean-Squared Error ComparisonsEvaluating Model FairnessComparison with Current PremiumsApplication to Risk Sharing Pool FacilitiesConclusionAppendixReferences

openalex-author · IEEE Transactions on Neural Networks

Bias learning, knowledge sharing

Biasing properly the hypothesis space of a learner has been shown to improve generalization performance. Methods for achieving this goal have been proposed, that range from designing and introducing a bias into a learner to automatically learning the bias. Multitask learning methods fall into the latter category. When several related tasks derived from the same domain are available, these methods use the domain-related knowledge coded in the training examples of all the tasks as a source of bias. We extend some of the ideas presented in this field and describe a new approach that identifies a family of hypotheses, represented by a manifold in hypothesis space, that embodies domain-related knowledge. This family is learned using training examples sampled from a group of related tasks. Learning models trained on these tasks are only allowed to select hypotheses that belong to the family. We show that the new approach encompasses a large variety of families which can be learned. A statistical analysis on a class of related tasks is performed that shows significantly improved performances when using this approach.

openalex-author · Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing

Metric-based model selection for time-series forecasting

Metric-based methods, which use unlabeled data to detect gross differences in behavior away from the training points, have recently been introduced for model selection, often yielding very significant improvements over alternatives (including cross-validation). We introduce extensions that take advantage of the particular case of time-series data in which the task involves prediction with a horizon h. The ideas are: (i) to use at t the h unlabeled examples that precede t for model selection, and (ii) take advantage of the different error distributions of cross-validation and the metric methods. Experimental results establish the effectiveness of these extensions in the context of feature subset selection.

openalex-author · Érudit documents and data repository (Érudit Consortium, University of Montreal)

No Unbiased Estimator of the Variance of K-Fold Cross-Validation

L'erreur de prédiction, donc la perte attendue sur des données futures, est la mesure standard pour la qualité des modèles d'apprentissage statistique. Quand la distribution des données est inconnue, cette erreur ne peut être calculée mais plusieurs méthodes de rééchantillonnage, comme la validation croisée, peuvent être utilisées pour obtenir un estimateur non-biaisé de l'erreur de prédiction. Cependant pour comparer des algorithmes d'apprentissage, il faut aussi estimer l'incertitude autour de cet estimateur d'erreur future, car cette incertitude peut être très grande. Cependant, les estimateurs ordinaires de variance d'une moyenne pour des échantillons indépendants ne peuvent être utilisés à cause du recoupement des ensembles d'apprentissage utilisés pour effectuer la validation croisée. Le résultat principal de cet article est qu'il n'existe pas d'estimateur non-biaisé universel (indépendant de la distribution) de la variance de la validation croisée, en se basant sur les mesures d'erreur faites durant la validation croisée. L'analyse fournit une meilleure compréhension de la difficulté d'estimer l'incertitude autour de la validation croisée. Ces résultats se généralisent à d'autres méthodes de rééchantillonnage pour lesquelles des données sont réutilisées pour l'apprentissage ou le test.

openalex-author · Érudit documents and data repository (Érudit Consortium, University of Montreal)

Stochastic Gradient Descent on a Portfolio Management Training Criterion Using the IPA Gradient Estimator

Dans cet article, nous jetons les bases pour l'apprentissage d'une stratégie de gestion d'un portefeuille de biens, de natures variées, et ne s'appuyant sur aucune supposition quant aux distributions des données financières. Ce modèle, basé sur l'utilisation d'un réseau de neurones, tente de capturer les tendances du marché. De plus, le modèle permet l'introduction d'un bruit stochastique au niveau des prix prévus par le réseau afin d'éviter les maxima locaux dans l'espace de décision. Dans ces conditions, nous démontrons que notre stratégie d'investissement suit un processus de décision markovien qui est presque sûrement lipchitzien en ses paramètres. Ainsi, l'estimateur du gradient IPA, obtenu ici par la méthode classique de rétropropagation, peut être utilisé pour approcher, par une descente de gradient, un maximum local de notre critère d'apprentissage, le Sharpe ratio.

openalex-author · International Journal of Pattern Recognition and Artificial Intelligence

SCALING LARGE LEARNING PROBLEMS WITH HARD PARALLEL MIXTURES

A challenge for statistical learning is to deal with large data sets, e.g. in data mining. The training time of ordinary Support Vector Machines is at least quadratic, which raises a serious research challenge if we want to deal with data sets of millions of examples. We propose a "hard parallelizable mixture" methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a "gater" model in such a way that it becomes easy to learn an "expert" model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allows representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to local growth linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm probably goes down in a cost function that is an upper bound on the negative log-likelihood.

openalex-author · Érudit documents and data repository (Érudit Consortium, University of Montreal)

Comment améliorer la capacité de généralisation des algorithmes d'apprentissage pour la prise de décisions financières

Ce rapport présente et propose plusieurs méthodes pour améliorer la capacité de généralisation des algorithmes d'apprentissage dans un contexte de prise de décisions financières. Globalement, ces méthodes visent à contrôler la capacité des algorithmes d'apprentissage en vue de limiter le problème du sur-apprentissage, qui est l'un des plus pernicieux en finance à cause des niveaux de bruit élevés rencontrés en pratique. Nous proposons quelques pistes de recherches afin d'améliorer les algorithmes et résultats déjà obtenus.

openalex-author · http://www.jmlr.org/papers/volume3/bengio03b/bengio03b.ps.gz

Extensions to metric based model selection

Metric-based methods have recently been introduced for model selection and regularization, often yielding very significant improvements over the alternatives tried (including cross-validation). All these methods require unlabeled data over which to compare functions and detect gross differences in behavior away from the training points. We introduce three new extensions of the metric model selection methods and apply them to feature selection. The first extension takes advantage of the particular case of time-series data in which the task involves prediction with a horizon h. The idea is to use at t the h unlabeled examples that precede t for model selection. The second extension takes advantage of the different error distributions of cross-validation and the metric methods: crossvalidation tends to have a larger variance and is unbiased. A hybrid combining the two model selection methods is rarely beaten by any of the two methods. The third extension deals with the case when unlabeled data is not available at all, using an estimated input density. Experiments are described to study these extensions in the context of capacity control and feature subset selection. Keywords: Metric-based Methods, Model Selection 1. Model Selection and Regularization

openalex-author · International Conference on Acoustics, Speech, and Signal Processing

Speech coding with multi-layer networks

Combining a structural or knowledge-based approach for describing speech units with neural networks capable of automatically learning relations between acoustic properties and speech units is investigated. The authors show how speech coding can be performed by sets of multilayer neural networks whose execution is decided by a data-driven strategy. Coding is based on phonetic properties characterizing a large population of speakers. Results on speaker-independent recognition of vowels using an ear model for preprocessing are reported.< <ETX xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">></ETX>

openalex-author · ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing

Use of neural networks for the recognition of place of articulation

The Boltzmann machine algorithm and the error back propagation algorithm were used to learn to recognize the place of articulation of vowels (front, center or back), represented by a static description of spectral lines. The error rate is shown to depend on the coding. Results are comparable or better than those obtained by us on the same data using hidden Markov models. The authors also show a fault tolerant property of the neural nets, i.e. that the error on the test set increases slowly and gradually when an increasing number of nodes fail.< <ETX xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">></ETX>

openalex-author · http://research.microsoft.com/conferences/AIStats2003/proceedings/164.ps

Quick Training of Probabilistic Neural Nets by Importance Sampling

Our previous work on statistical language modeling introduced the use of probabilistic feedforward neural networks to help dealing with the curse of dimensionality. Training this model by maximum likelihood however requires for each example to perform as many network passes as there are words in the vocabulary. Inspired by the contrastive divergence model, we propose and evaluate sampling-based methods which require network passes only for the observed &quot;positive example&quot; and a few sampled negative example words. A very significant speed-up is obtained with an adaptive importance sampling.

openalex-author · http://www.iro.umontreal.ca/~lisa/bib/pub_subject/language/pointeurs/TR1231.pdf

Extracting hidden sense probabilities from bitexts

We propose a probabilistic model that is inspired by Diab &amp; Resnik’s algorithm to extract disambiguation information from aligned bilingual texts. Like Diab &amp; Resnik’s, the proposed model uses WordNet and the fact that word ambiguities are not always the same in the two languages. The generative model introduces a dependency between two translated words through a common ancestor in WordNet’s ontology. Unlike Diab &amp; Resnik’s algorithm it does not suppose that the translation in the source language has a single meaning. 1

openalex-author · IEEE International Conference on Neural Networks

The problem of learning long-term dependencies in recurrent networks

The authors seek to train recurrent neural networks in order to map input sequences to output sequences, for applications in sequence recognition or production. Results are presented showing that learning long-term dependencies in such recurrent networks using gradient descent is a very difficult task. It is shown how this difficulty arises when robustly latching bits of information with certain attractors. The derivatives of the output at time t with respect to the unit activations at time zero tend rapidly to zero as t increases for most input values. In such a situation, simple gradient descent techniques appear inappropriate. The consideration of alternative optimization methods and architectures is suggested.< <ETX xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">></ETX>

openalex-author · Proceedings of the 12th IAPR International Conference on Pattern Recognition (Cat. No.94CH3440-5)

An EM approach to grammatical inference: input/output HMMs

Proposes a modular recurrent connectionist architecture for adaptive temporal processing. The model is given, a probabilistic interpretation and is trained using the estimation-maximisation (EM) algorithm. This model can also be seen as an input/output hidden Markov model. The focus of this paper is on sequence classification tasks. The authors demonstrate that EM supervised learning is well suited for solving grammatical inference problems. Experimental benchmark results are presented for the seven Tomita grammars, showing that these adaptive models can, attain excellent generalization.

openalex-author · Proceedings of the 12th IAPR International Conference on Pattern Recognition (Cat. No.94CH3440-5)

Word normalization for online handwritten word recognition

openalex-author · Proceedings of the First IEEE Conference on Evolutionary Computation. IEEE World Congress on Computational Intelligence

Use of genetic programming for the search of a new learning rule for neural networks

In previous work we explained how to use standard optimization methods such as simulated annealing, gradient descent and genetic algorithms to optimize a parametric function which could be used as a learning rule for neural networks. To use these methods, we had to choose a fixed number of parameters and a rigid form for the learning rule. In this article, we propose to use genetic programming to find not only the values of rule parameters but also the optimal number of parameters and the form of the rule. Experiments on classification tasks suggest genetic programming finds better learning rules than other optimization methods. Furthermore, the best rule found with genetic programming outperformed the well-known backpropagation algorithm for a given set of tasks.< <ETX xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">></ETX>

openalex-author · Proceedings of the 12th IAPR International Conference on Pattern Recognition (Cat. No.94CH3440-5)

Word-level training of a handwritten word recognizer based on convolutional neural networks

We introduce a new approach for online recognition of handwritten words written in unconstrained mixed style. Words are represented by low resolution "annotated images" where each pixel contains information about trajectory direction and curvature. The recognizer is a convolutional network which can be spatially replicated. From the network output, a hidden Markov model produces word scores. The entire system is globally trained to minimize word-level errors.

openalex-author · Proceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers

Pen-based visitor registration system (PENGUIN)

We describe a new electronic pen-based visitors registration system (PENGUIN) whose goal is to expand and modernize the visitor sign-in procedure at Bell Laboratories. The system uses a pen-interface (i.e. tablet-display) in what is essentially a form filling application. Our pen-interface is coupled with a powerful and accurate on-line handwriting recognition module. A database of AT&T employees (the visitors' hosts) and country names is used to check the recognition module outputs, in order to find the best match. The system provides assistance to the guard at one of the guard stations in routing visitors to their hosts. All the entered data are stored electronically. Initial testing shows that PENGUIN system performs reliably and with high accuracy. It retrieves the correct host name with 97% accuracy and the correct visitors citizenship with 99% accuracy. The system is robust and easy to use for both visitors and guards.< <ETX xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">></ETX>

openalex-author · IJCNN-91-Seattle International Joint Conference on Neural Networks

Global optimization of a neural network-hidden Markov model hybrid

An original method for integrating artificial neural networks (ANN) with hidden Markov models (HMM) is proposed. ANNs are suitable for performing phonetic classification, whereas HMMs have been proven successful at modeling the temporal structure of the speech signal. In the approach described, the ANN outputs constitute the sequence of observation vectors for the HMM. An algorithm is proposed for global optimization of all the parameters. Results on speaker-independent recognition experiments using this integrated ANN-HMM system on the TIMIT continuous speech database are reported.< <ETX xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">></ETX>

openalex-author · International Conference on Acoustics, Speech, and Signal Processing

A hybrid coder for hidden Markov models using a recurrent neural networks

A hybrid coder is introduced for obtaining descriptions of speech patterns. This coder uses vector quantization (VQ) techniques on mel-scale cepstral coefficients and their derivatives together with a recurrent network (RN) for describing suprasegmental features of speech. The purpose of these features is to focus the search when hidden Markov models (HMMs) are used for speech unit or word models. Preliminary experiments of speaker-independent connected digit recognition show that using a hybrid coder based on a RN improves recognition performance.< <ETX xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">></ETX>

openalex-author · Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98-

Browsing through high quality document images with DjVu

Presents a new image compression technique called "DjVu" that is specifically geared towards the compression of high-resolution, high-quality images of scanned documents in color. With DjVu, any screen connected to the Internet can access and display images of scanned pages while faithfully reproducing the font, color, drawings, pictures and paper texture. A typical magazine page in color at 300 dpi can be compressed down to between 40 to 60 KBytes, approximately 5 to 10 times better than JPEG for a similar level of subjective quality. Black-and-white documents are typically 15 to 30 KBytes at 300 dpi, or 4 to 8 times better than CCITT-G4. A real-time, memory-efficient version of the decoder was implemented, and is available as a plug-in for popular Web browsers.

openalex-author · Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225)

A memory-efficient adaptive Huffman coding algorithm for very large sets of symbols

Summary form only given. The problem of computing the minimum redundancy codes as we observe symbols one by one has received a lot of attention. However, existing algorithms implicitly assumes that either we have a small alphabet or that we have an arbitrary amount of memory at our disposal for the creation of a coding tree. In real life applications one may need to encode symbols coming from a much larger alphabet, for e.g. coding integers. We introduce a new algorithm for adaptive Huffman coding, called algorithm M, that uses space proportional to the number of frequency classes. The algorithm uses a tree with leaves that represent sets of symbols with the same frequency, rather than individual symbols. The code for each symbol is therefore composed of a prefix (specifying the set, or the leaf of the tree) and a suffix (specifying the symbol within the set of same-frequency symbols). The algorithm uses only two operations to remain as close as possible to the optimal: set migration and rebalancing. We analyze the computational complexity of algorithm M, and point to its advantages in terms of low memory complexity and fast decoding. Comparative experiments were performed with algorithm M on the Calgary corpus, with static Huffman coding as well as with another adaptive Huffman coding algorithms, algorithm /spl Lambda/ of Vitter. Experiments show that M performs comparably or better than the other algorithms but requires much less memory. Finally, we present an improved algorithm, M/sup +/, for non-stationary data, which models the distribution of the data in a fixed-size window in the data sequence.

openalex-author · Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225)

The Z-coder adaptive binary coder

We present the Z-coder, a new adaptive data compression coder for coding binary data. The Z-coder is derived from the Golomb (1966) run-length coder, and retains most of the speed and simplicity of the earlier coder. The Z-coder can also be thought of as a multiplication-free approximate arithmetic coder, showing the close relationship between run-length coding and arithmetic coding. The Z-coder improves upon existing arithmetic coders by its speed and its principled design. We present a derivation of the Z-coder as well as details of the construction of its adaptive probability estimation table.

openalex-author · Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Global training of document processing systems using graph transformer networks

We propose a new machine learning paradigm called Graph Transformer Networks that extends the applicability of gradient-based learning algorithms to systems composed of modules that take graphs as inputs and produce graphs as output. Training is performed by computing gradients of a global objective function with respect to all the parameters in the system using a kind of back-propagation procedure. A complete check reading system based on these concepts is described. The system uses convolutional neural network character recognizers, combined with global training techniques to provide record accuracy on business and personal checks. It is presently deployed commercially and reads million of checks per month.

openalex-author · 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing

Reading checks with multilayer graph transformer networks

We propose a new machine learning paradigm called multilayer graph transformer network that extends the applicability of gradient-based learning algorithms to systems composed of modules that take graphs as input and produce graphs as output. A complete check reading system based on this concept is described. The system combines convolutional neural network character recognizers with graph-based stochastic models trained cooperatively at the document level. It is deployed commercially and reads million of business and personal checks per month with record accuracy.

openalex-author · IJCNN'01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222)

Input decay: simple and effective soft variable selection

To deal with the overfitting problems that occur when there are not enough examples compared to the number of input variables in supervised learning, traditional approaches are weight decay and greedy variable selection. An alternative that has recently started to attract attention is to keep all the variables but to put more emphasis on the "most useful" ones. We introduce a new regularization method called input decay that exerts more relative penalty on the parameters associated with the inputs that contribute less to the learned function. This method, like weight decay and variable selection, still requires to perform a kind of model selection. Successful comparative experiments with this new method were performed both on a simulated regression task and a real-world financial prediction task.

openalex-author · Machine Learning

Kernel Matching Pursuit

No abstract available from the OpenAlex source record.

openalex-author · Machine Learning

Model Selection for Small Sample Regression

No abstract available from the OpenAlex source record.

openalex-author · Machine Learning

Guest Introduction: Special Issue on New Methods for Model Selection and Model Combination

No abstract available from the OpenAlex source record.

openalex-author · Paper

Segmentation en thèmes de conversations téléphoniques : traitement en amont pour l’extraction d’information

Nous presentons une approche de decoupage thematique que nous utiliserons pour faciliter l’extraction d’information a partir de conversations telephoniques transcrites. Nous experimentons avec un modele de Markov cache utilisant des informations de differents niveaux linguistiques, des marques d’extra-grammaticalites et les entites nommees comme source additionnelle d’information. Nous comparons le modele obtenu avec notre modele de base utilisant uniquement les marques linguistiques et les extra-grammaticalites. Les resultats montrent l’efficacite de l’approche utilisant les entites nommees.

openalex-author · RePEc: Research Papers in Economics

Multi-Task Learning For Option Pricing

Multi-task learning is a process used to learn domain-specific bias. It consists in simultaneously training models on different tasks derived from the same domain and forcing them to exchange domain information. This transfer of knowledge is performed by imposing constraints on the parameters defining the models and can lead to improved generalization performance. In this paper, we explore a particular multi-task learning method that forces the parameters of the models to lie on an affine manifold defined in parameter space and embedding domain information. We apply this method to the prediction of the prices of call options on the S&P index for a period of time ranging from 1987 to 1993. An analysis of variance of the results is presented that shows significant improvements of the generalization performance. L'apprentissage multi-tâches est une maniere d'apprendre des particularites d'un domaine (le biais) qui comprend plusieurs tâches possibles. On entraine simultanement plusieurs modeles, un par tâche, en imposant des contraintes sur les parametres de maniere a capturer ce qui est en commun entre les tâches, afin d'obtenir une meilleure generalisation sur chaque tâche, et pour pouvoir rapidement generaliser (avec peu d'exemples) sur une nouvelle tâche provenant du meme domaine. Ici cette commonalite est definie par une variete affine dans l'espace des parametres. Dans cet article, nous appliquons ces methodes a la prediction du prix d'options d'achat de l'indice S&P 500 entre 1987 et 1993. Une analyse de la variance des resultats est presentee, demontrant des ameliorations significatives de la prediction hors-echantillon.

openalex-author · RePEc: Research Papers in Economics

Valorisation d'Options par Optimisation du Sharpe Ratio

Prior work on option pricing falls mostly in two categories: it either relies on strong distributional or economical assumptions, or it tries to mimic the Black-Scholes formula through statistical models, trained to fit today's market price based on information available today. The work presented here is closer to the second category but its objective is different: predict the future value of the option, and establish its current value based on a trading scenario. This work thus innovates in two ways: first it proposes an empirical and hypothesis-free method to compare different option pricing systems (by having trade against each other or against the market), second it uses this criterion to train a non-parametric statistical model (here based on neural networks) to estimate a price for the option that maximizes the expected utility when trading against the market. Note that the price will depend on the utility function and current portfolio (i.e. current risks) of the trading agent. Preliminary experiments are presented on the S&P 500 options. Les travaux precedents sur la valorisation des options entraient en gros dans deux categories : ou bien ils etaient bases sur de fortes hypotheses distributionnelles ou economiques, ou bien ils essayaient d'imiter la formule de Black-Scholes par des modeles statistiques entraines a approximer les prix de marche quotidiens a l'aide d'information disponible le jour meme. Le travail presente ici se rapproche plus de la deuxieme categorie mais son objectif est different : predire les prix futurs d'une option, et etablir sa valeur courante a l'aide d'un scenario de transactions. Ce travail innove donc de deux facons : premierement, il propose une methode empirique et sans hypothese pour comparer differents systemes de valorisation d'options (en transigeant contre lui-meme ou contre le marche) et deuxiemement, il utilise ce critere pour entrainer un modele statistique non-parametrique (utilisant dans ce cas-ci des reseaux de neurones) pour estimer un prix pour l'option qui maximise l'utilite esperee lorsque l'on transige contre le marche. A noter que les prix dependront de la fonction d'utilite ainsi que du portefeuille (i.e. des risques courants) de la personne qui transige. Des resultats preliminaires sur des options d'achat du S&P 500 sont presentes.

openalex-author · RePEc: Research Papers in Economics

Étude du biais dans le prix des options

Abstract Le prix d'une option devrait refleter la valeur moyenne que l'acheteur en recoitainsi qu'une prime de risque. Ce rapport decrit une etude empirique pour analyser cesfacteurs de maniere graphique et quantitative. L'analyse se concentre sur la differencemoyenne entre le prix de l'option et sa valeur actualisee moyenne a maturite (lebiais), et tente de cerner des regularites temporelles dans les patrons de cettedifference. On y decouvre de surprenants patronsquasi-periodiques de ces variations, en particulier pour les calls de maturite elevee (moins clairement pour les puts ), qui sont etudies avec une analyse spectrale. The price of an option should reflect the average value that a buyer receives forit, and also a risk premium. This report describes an empirical study for analysingthese factors as a graphical and quantitative manner. The analysis focuses on theaverage difference between the price option and its present average value at maturity(the bias), and tries to detect some temporal regularities in the pattern of this bias.We found some very surprising almost-periodic patterns for the bias, in particular forthe long-time maturities (not so clearly for the puts), as studied by spectral analysis.

openalex-author · RePEc: Research Papers in Economics

On Out-of-Sample Statistics for Time-Series

This paper studies an out-of-sample statistic for time-series prediction that is analogous to the widely used R2 in-sample statistic. We propose and study methods to estimate the variance of this out-of-sample statistic. We suggest that the out-of-sample statistic is more robust to distributional and asymptotic assumptions behind many tests for in-sample statistics. Furthermore we argue that it may be more important in some cases to choose a model that generalizes as well as possible rather than choose the parameters that are closest to the true parameters. Comparative experiments are performed on a financial time-series (daily and monthly returns of the TSE300 index). The experiments are performed for varying prediction horizons and we study the relation between predictibility (out-of-sample R2), variability of the out-of-sample R2 statistic, and the prediction horizon. Cet article etudie une statistique hors-echantillon pour la prediction de series temporelles qui est analogue a la tres utilisee statistique R2 de l'ensemble d'entrainement (in-sample). Nous proposons et etudions une methode qui estime la variance de cette statistique hors-echantillon. Nous suggerons que la statistique hors-echantillon est plus robuste aux hypotheses distributionnelles et asymptotiques pour plusieurs tests faits pour les statistiques sur l'ensemble d'entrainement (in-sample). De plus, nous affirmons qu'il peut etre plus important, dans certains cas, de choisir un modele qui generalise le mieux possible plutot que de choisir les parametres qui sont le plus proches des vrais parametres. Des experiences comparatives furent realisees sur des series financieres (rendements journaliers et mensuels de l'indice du TSE300). Les experiences realisees pour plusieurs horizons de predictions, et nous etudions la relation entre la predictibilite (hors-echantillon), la variabilite de la statistique R2 hors-echantillon, et l'horizon de prediction.

openalex-author · Neural Computation

A Parallel Mixture of SVMs for Very Large Scale Problems

Support vector machines (SVMs) are the state-of-the-art models for many classification problems, but they suffer from the complexity of their training algorithm, which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. This article proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole data set. Experiments on a large benchmark data set (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and surprisingly, a significant improvement in generalization was observed.

openalex-author · Érudit documents and data repository (Érudit Consortium, University of Montreal)

Forecasting Non-Stationary Volatility with Hyper- Parameters

Nous considérons des données séquentielles échantillonnées à partir d'un processus inconnu, donc les données ne sont pas nécessairement iid. Nous développons une mesure de généralisation pour de telles données et nous considérons une approche récemment proposée pour optimiser les hyper-paramètres qui est basée sur le calcul du gradient d'un critère de sélection de modèle par rapport à ces hyper-paramètres. Les hyper-paramètres sont utilisés pour donner différents poids dans la séquence de données historiques. Notre approche est appliquée avec succès à la modélisation de la volatilité des rendements d'actions canadiennes sur un horizon de un mois.

openalex-author · http://www.iro.umontreal.ca/~vincentp/Publications/mparzen_nips2002.ps.gz

Manifold Parzen Windows

The similarity between objects is a fundamental element of many learn-ing algorithms. Most non-parametric methods take this similarity to be fixed, but much recent work has shown the advantages of learning it, in particular to exploit the local invariances in the data or to capture the possibly non-linear manifold on which most of the data lies. We propose a new non-parametric kernel density estimation method which captures the local structure of an underlying manifold through the leading eigen-vectors of regularized local covariance matrices. Experiments in density estimation show significant improvements with respect to Parzen density estimators. The density estimators can also be used within Bayes classi-fiers, yielding classification rates similar to SVMs and much superior to the Parzen classifier. 1

openalex-author · IEEE Transactions on Neural Networks

Cost functions and model combination for VaR-based asset allocation using neural networks

We introduce an asset-allocation framework based on the active control of the value-at-risk of the portfolio. Within this framework, we compare two paradigms for making the allocation using neural networks. The first one uses the network to make a forecast of asset behavior, in conjunction with a traditional mean-variance allocator for constructing the portfolio. The second paradigm uses the network to directly make the portfolio allocation decisions. We consider a method for performing soft input variable selection, and show its considerable utility. We use model combination (committee) methods to systematize the choice of hyperparameters during training. We show that committees using both paradigms are significantly outperforming the benchmark market performance.

openalex-author · http://www-2.cs.cmu.edu/Groups/NIPS/NIPS2001/papers/psgz/AA61.ps.gz

K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance,but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1

openalex-author · http://www-2.cs.cmu.edu/Groups/NIPS/NIPS2001/papers/psgz/AP15.ps.gz

Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference

Estimating insurance premia from data is a di#cult regression problem for several reasons: the large number of variables, many of which are discrete, and the very peculiar shape of the noise distribution, asymmetric with fat tails, with a large majority zeros and a few unreliable and very large values. We introduce a methodology for estimating insurance premia that has been applied in the car insurance industry. It is based on mixtures of specialized neural networks, in order to reduce the e#ect of outliers on the estimation. Statistical comparisons with several di#erent alternatives, including decision trees and generalized linear models show that the proposed method is significantly more precise, allowing to identify the least and most risky contracts, and reducing the median premium by charging more to the most risky customers. 1

openalex-author · ftp://ftp.idsia.ch/pub/juergen/ch7.ps.gz

Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies

Introduction Recurrent networks (crossreference Chapter 12) can, in principle, use their feedback connections to store representations of recent input events in the form of activations. The most widely used algorithms for learning what to put in short-term memory, however, take too much time to be feasible or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, they do not provide clear practical advantages over, say, backprop in feedforward networks with limited time windows (see crossreference Chapters 11 and 12). With conventional &quot;algorithms based on the computation of the complete gradient&quot;, such as &quot;Back-Propagation Through Time&quot; (BPTT, e.g., [22, 27, 26]) or &quot;Real-Time Recurrent Learning&quot; (RTRL, e.g., [21]) error signals &quot;flowing backwards in time&quot; tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of th

openalex-author · IEEE Transactions on Neural Networks

Experiments on the application of IOHMMs to model financial returns series

Input-output hidden Markov models (IOHMM) are conditional hidden Markov models in which the emission (and possibly the transition) probabilities can be conditioned on an input sequence. For example, these conditional distributions can be linear, logistic, or nonlinear (using for example multilayer neural networks). We compare the generalization performance of several models which are special cases of input-output hidden Markov models on financial time-series prediction tasks: an unconditional Gaussian, a conditional linear Gaussian, a mixture of Gaussians, a mixture of conditional linear Gaussians, a hidden Markov model, and various IOHMMs. The experiments compare these models on predicting the conditional density of returns of market and sector indices. Note that the unconditional Gaussian estimates the first moment with the historical average. The results show that, although for the first moment the historical average gives the best results, for the higher moments, the IOHMMs yielded significantly better performance, as estimated by the out-of-sample likelihood.

openalex-author · Neural Computation

Gradient-Based Optimization of Hyperparameters

Many machine learning algorithms can be formulated as the minimization of a training criterion that involves a hyperparameter. This hyperparameter is usually chosen by trial and error with a model selection criterion. In this article we present a methodology to optimize several hyperparameters, based on the computation of the gradient of a model selection criterion with respect to the hyperparameters. In the case of a quadratic training criterion, the gradient of the selection criterion with respect to the hyperparameters is efficiently computed by backpropagating through a Cholesky decomposition. In the more general case, we show that the implicit function theorem can be used to derive a formula for the hyperparameter gradient involving second derivatives of the training criterion.

openalex-author · Neural Computation

Boosting Neural Networks

Boosting is a general method for improving the performance of learning algorithms. A recently proposed boosting algorithm, AdaBoost, has been applied with great success to several benchmark machine learning problems using mainly decision trees as base classifiers. In this article we investigate whether AdaBoost also works as well with neural networks, and we discuss the advantages and drawbacks of different versions of the AdaBoost algorithm. In particular, we compare training methods based on sampling the training set and weighting the cost function. The results suggest that random resampling of the training data is not the main explanation of the success of the improvements brought by AdaBoost. This is in contrast to bagging, which directly aims at reducing variance and for which random resampling is essential to obtain the reduction in generalization error. Our system achieves about 1.4% error on a data set of on-line handwritten digits from more than 200 writers. A boosted multilayer network achieved 1.5% error on the UCI letters and 8.1% error on the UCI satellite data set, which is significantly better than boosted decision trees.

openalex-author · IEEE Transactions on Neural Networks

Taking on the curse of dimensionality in joint distributions using neural networks

The curse of dimensionality is severe when modeling high-dimensional discrete data: the number of possible combinations of the variables explodes exponentially. We propose an architecture for modeling high-dimensional data that requires resources (parameters and computations) that grow at most as the square of the number of variables, using a multilayer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables (thus reducing significantly the number of parameters). Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks and show that significant improvements can be obtained by pruning the network.

openalex-author · Neural Information Processing Systems

Incorporating Second-Order Functional Knowledge for Better Option Pricing

Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in two of its arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but (1) that has those properties, (2) is a universal approximator of continuous functions with these and other properties. We apply this new class of functions to the task of modeling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints.

openalex-author · Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New M

Probabilistic neural network models for sequential data

Artificial neural networks (ANN) can be incorporated into probabilistic models. In this paper we review some of the approaches which have been proposed to incorporate them into probabilistic models of sequential data, such as hidden Markov models (HMM). We also discuss new developments and new ideas in this area, in particular how ANN can be used to model high-dimensional discrete and continuous data to deal with the curse of dimensionality and how the ideas proposed in these models could be applied to statistical language modeling to represent longer-term context than allowed by trigram models, while keeping word-order information.

openalex-author · Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New M

Continuous optimization of hyper-parameters

Many machine learning algorithms can be formulated as the minimization of a training criterion which involves a hyper-parameter. This hyper-parameter is usually chosen by trial and error with a model selection criterion. In this paper we present a methodology to optimize several hyper-parameters, based on the computation of the gradient of a model selection criterion with respect to the hyper-parameters. In the case of a quadratic training criterion, the gradient of the selection criterion with respect to the hyper-parameters is efficiently computed by back-propagating through a Cholesky decomposition. In the more general case, we show that the implicit function theorem can be used to derive a formula for the hyper-parameter gradient involving second derivatives of the training criterion.

openalex-author · Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New M

A Neural Support Vector Network architecture with adaptive kernels

In the Support Vector Machines (SVM) framework, the positive-definite kernel can be seen as representing a fixed similarity measure between two patterns, and a discriminant function is obtained by taking a linear combination of the kernels computed at training examples called support vectors. We investigate learning architectures in which the kernel functions can be replaced by more general similarity measures that can have arbitrary internal parameters. The training criterion used in SVMs is not appropriate for this purpose so we adopt the simple criterion that is generally used when training neural networks for classification tasks. Several experiments are performed which show that such Neural Support Vector Networks perform similarly to SVMs while requiring significantly fewer support vectors, even when the similarity measure has no internal parameters.

openalex-author · http://www.cs.colorado.edu/~mozer/courses/6622/papers/BengioDucharmeVincentJauvin2003.pdf

A Neural Probabilistic Language Model

A goal of statistical language modeling is to learn the joint probability function of sequences of words. This is intrinsically difficult because of the curse of dimensionality: we propose to fight it with its own weapons. In the proposed approach one learns simultaneously (1) a distributed representation for each word (i.e. a similarity between words) along with (2) the probability function for word sequences, expressed with these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar to words forming an already seen sentence. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach very significantly improves on a state-of-the-art trigram model. 1

openalex-author · http://www.iro.umontreal.ca/~bengio/pointeurs/bb_2000_nips.ps

Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks

The curse of dimensionality is severe when modeling high-dimensional discrete data: the number of possible combinations of the variables ex-plodes exponentially. In this paper we propose a new architecture for modeling high-dimensional data that requires resources (parameters and computations) that grow only at most as the square of the number of vari-ables, using a multi-layer neural network to represent the joint distribu-tion of the variables as the product of conditional distributions. The neu-ral network can be interpreted as a graphical model without hidden ran-dom variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables. Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be ob-tained by pruning the network. 1

openalex-author · Érudit documents and data repository (Érudit Consortium, University of Montreal)

Inference for the Generalization Error

Nous considérons l'estimation par validation croisée de l'erreur de généralisation. Nous effectuons une étude théorique de la variance de ect estimateur en tenant compte de la variabilité due au choix des ensembles d'entraînement et des exemples de test. Cela nous permet de proposer deux nouveaux estimateurs de cette variance. Nous montrons, via des simulations, que ces nouvelles statistiques performent bien par rapport aux statistiques considérées dans Dietterich (1998). En particulier, ces nouvelles statistiques se démarquent des autres présentement utilisées par le fait qu'elles mènent à des tests d'hypothèses qui sont puissants sans avoir tendance à être trop libéraux.

openalex-author · Neural Computation

Stochastic Learning of Strategic Equilibria for Auctions

This article presents a new application of stochastic adaptive learning algorithms to the computation of strategic equilibria in auctions. The proposed approach addresses the problems of tracking a moving target and balancing exploration (of action space) versus exploitation (of better modeled regions of action space). Neural networks are used to represent a stochastic decision model for each bidder. Experiments confirm the correctness and usefulness of the approach.

openalex-author · Lecture Notes in Computer Science

Object Recognition with Gradient-Based Learning

No abstract available from the OpenAlex source record.

openalex-author · Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)

Binary pseudowavelets and applications to bilevel image processing

This paper shows the existence of binary pseudowavelets, bases on the binary domain that exhibit some of the properties of wavelets, such as multiresolution reconstruction and compact support. The binary pseudowavelets are defined on B/sup n/ (binary vectors of length n) and are operated upon with the binary operators logical and, and exclusive or. The forward transform, or analysis, is the decomposition of a binary vector into its constituent binary pseudowavelets. Binary pseudowavelets allow multiresolution, progressive reconstruction of binary vectors by using progressively more coefficients in the inverse transform. Binary pseudowavelets bases, being sparse matrices, also provide for fast transforms; moreover pseudowavelets rely on hardware-friendly operations for efficient software and hardware implementation.

openalex-author · Paper

Pattern recognition

No abstract available from the OpenAlex source record.

openalex-author · The handbook of brain theory and neural networks, 1998, 978-0-262-51102-5. ⟨10.5555/303568.303704⟩

Convolutional networks for images, speech, and time series

International audience

openalex-author · SPIE Proceedings

<title>Support vector machines for improving the classification of brain PET images</title>

The classification of brain PET volumes is carried out in three main steps: (1) registration, (2) feature extraction and (3) classification. The PET images were already smoothed with a 16 mm isotropic Gaussian kernel and registered within the Talairach and Tournoux reference system. To make the registration more accurate over a single reference, a method based on optical flow was applied. Feature extraction is carried out by principal component analysis (PCA). Support vector machines (SVM) are then used for classification, because they are better controlled than neural networks (NN) and well adapted to small sample size problems. SVM are constructed by a training algorithm that maximizes the margin between the training vectors and the decision boundary. The algorithm is simple quadratic programming under linear constraints, which leads to global optimum. The decision boundary is expressed as a linear combination of supporting vectors which are a subset of the training vectors closest to the decision boundary. After registration, NN and SVM were trained with the features extracted by PCA from the training set. The estimate error rate is 7.1% for SVM and 14.3% for NN.

openalex-author · Proceedings of the IEEE

Gradient-based learning applied to document recognition

Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.

openalex-author · Paper

Neural Networks for Speech Recognition

No abstract available from the OpenAlex source record.

openalex-author · http://www.iro.umontreal.ca/~bengio/pointeurs/bbis_1998_nips.ps

Shared Context Probabilistic Transducers

Recently, a model for supervised learning of probabilistic transducers represented by suffix trees was introduced. However, this algorithm tends to build very large trees, requiring very large amounts of computer memory. In this paper, we propose a new, more compact, transducer model in which one shares the parameters of distributions associated to contexts yielding similar conditional output distributions. We illustrate the advantages of the proposed algorithm with comparative experiments on inducing a noun phrase recognizer.

openalex-author · Neural Information Processing Systems

Training Methods for Adaptive Boosting of Neural Networks

Boosting is a general method for improving the performance of any learning algorithm that consistently generates classifiers which need to perform only slightly better than random guessing. A recently proposed and very promising boosting algorithm is AdaBoost [5]. It has been applied with great success to several benchmark machine learning problems using rather simple learning algorithms [4], and decision trees [1, 2, 6]. In this paper we use AdaBoost to improve the performances of neural networks. We compare training methods based on sampling the training set and weighting the cost function. Our system achieves about 1.4% error on a data base of online handwritten digits from more than 200 writers. Adaptive boosting of a multi-layer network achieved 1.5% error on the UCI Letters and 8.1 % error on the UCI satellite data set.

openalex-author · 5th European Conference on Speech Communication and Technology (Eurospeech 1997)

Discriminative feature and model design for automatic speech recognition

A system for discriminative feature and model design is presented for automatic speech recognition. Training based on minimum classification error with a single objective function is applied for designing a set of parallel networks performing feature transformation and a set of hidden Markov models performing speech recognition. This paper compares the use of linear and non-linear functional transformations when applied to conventional recognition features, such as spectrum or cepstrum. It also provides a framework for integrated feature and model training when using class-specific transformations. Experimental results on telephone-based connected digit recognition are presented. 1. INTRODUCTION Improving the performance of hidden Markov model (HMM) based automatic speech recognition (ASR) systems has been a central issue that has dominated the entire field of speech recognition during the past two decades. One effort to improving HMMs has been by extending the training paradigm beyo...

openalex-author · International Journal of Neural Systems

Using a Financial Training Criterion Rather than a Prediction Criterion

The application of this work is to decision making with financial time series, using learning algorithms. The traditional approach is to train a model using a prediction criterion, such as minimizing the squared error between predictions and actual values of a dependent variable, or maximizing the likelihood of a conditional model of the dependent variable. We find here with noisy time series that better results can be obtained when the model is directly trained in order to maximize the financial criterion of interest, here gains and losses (including those due to transactions) incurred during trading. Experiments were performed on portfolio selection with 35 Canadian stocks.

openalex-author · Lecture Notes in Computer Science

AdaBoosting neural networks: Application to on-line character recognition

No abstract available from the OpenAlex source record.

openalex-author · Paper

A Memory-Efficient Adaptive Huffman Coding Algorithm for Very Large Sets of Symbols Revisited

Abstract While algorithm M (presented in A Memory-Efficient Huffman Adaptive Coding Algorithm for Very Large Sets of Symbols, by Steven Pigeon & Yoshua Bengio, Universite de Montreal technical report #1081 [1]) converges to the entropy of the signal, it also assumes that the characteristics of the signal are stationary, that is, that they do not change over time and that successive adjustments, ever decreasing in their magnitude, will lead to a reasonable approximation of the entropy. While this is true for some data, it is clearly not true for some other. We present here a modification of the M algorithm that allows negative updates. Negative updates are used to maintain a window over the source. Symbols enter the window at its right and will leave it at its left, after w steps (the window width). The algorithm presented here allows us to update correctly the weights of the symbols in the symbol tree. Here, we will also have negative migration or demotion , while we only had positive migration or

openalex-author · http://www.iro.umontreal.ca/~lisa/pointeurs/multitask-nips97.pdf

Multi-Task Learning for Stock Selection

Artificial Neural Networks can be used to predict future returns of stocks in order to take financial decisions. Should one build a separate network for each stock or share the same network for all the stocks? In this paper we also explore other alternatives, in which some layers are shared and others are not shared. When the prediction of future returns for different stocks are viewed as different tasks, sharing some parameters across stocks is a form of multi-task learning. In a series of experiments with Canadian stocks, we obtain yearly returns that are more than 14 % above various benchmarks.

openalex-author · IEEE Transactions on Neural Networks

Input-output HMMs for sequence processing

We consider problems of sequence processing and propose a solution based on a discrete-state model in order to represent past context. We introduce a recurrent connectionist architecture having a modular structure that associates a subnetwork to each state. The model has a statistical interpretation we call input-output hidden Markov model (IOHMM). It can be trained by the estimation-maximization (EM) or generalized EM (GEM) algorithms, considering state trajectories as missing data, which decouples temporal credit assignment and actual parameter estimation. The model presents similarities to hidden Markov models (HMMs), but allows us to map input sequences to output sequences, using the same processing style as recurrent neural networks. IOHMMs are trained using a more discriminant learning paradigm than HMMs, while potentially taking advantage of the EM algorithm. We demonstrate that IOHMMs are well suited for solving grammatical inference problems on a benchmark problem. Experimental results are presented for the seven Tomita grammars, showing that these adaptive models can attain excellent generalization.

openalex-author · Paper

Neural networks for speech and sequence recognition

Connectionist models Learning theory The back-propagation algorithm Introduction to back-propagation Formal description Heuristics to improve convergence and generalization Extensions Integrating domain knowledge and learning from examples Automatic speech recognition Importance of pre-processing input data Input coding. Input invariances Importance of architecture constraints on the network Modularization Output coding Sequence analysis Introduction Time delay neural networks Recurrent networks BPS Supervision of a recurrent network does not need to be everywhere Problems with training of recurrent networks Dynamic programming post-processors Hidden Markov models Integrating ANNs with other systems Advantages and disadvantages of current algorithms for ANNs Modularization and joint optimization Radial basis functions and local representation Radial basis funtions networks Neurobiological plausibility Relation to vector quantization, clustering and semi-continuous HMMs Methodology Experiments on phoneme recognition with RBFs Density estimation with a neural network Relation between input PDF and output PDF Density estimation Conclusion Post-processors based on dynamic programming ANN/DP hybrids ANN/HMM Hybrids ANN/HMM Hybrid: Phoneme recognition experiments ANN/HMM hybrid: online handwriting recognition experiments.

openalex-author · http://www.iro.umontreal.ca/labs/neuro/pointeurs/hrnn-nips8.ps

Hierarchical Recurrent Neural Networks for Long-Term Dependencies

We have already shown that extracting long-term dependencies from sequential data is difficult, both for deterministic dynamical systems such as recurrent networks, and probabilistic models such as hidden Markov models (HMMs) or input/output hidden Markov models (IOHMMs). In practice, to avoid this problem, researchers have used domain specific a-priori knowledge to give meaning to the hidden or state variables representing past context. In this paper, we propose to use a more general type of a-priori knowledge, namely that the temporal dependencies are structured hierarchically. This implies that long-term dependencies are represented by variables with a long time scale. This principle is applied to a recurrent network which includes delays and multiple time scales. Experiments confirm the advantages of such structures. A similar approach is proposed for HMMs and IOHMMs. 1 Introduction Learning from examples basically amounts to identifying the relations between random v...

openalex-author · http://www.dcs.shef.ac.uk/~ljupco/papers/miss-nips8.ps.gz

Recurrent Neural Networks for Missing or Asynchronous Data

In this paper we propose recurrent neural networks with feedback into the input units for handling two types of data analysis problems. On the one hand, this scheme can be used for static data when some of the input variables are missing. On the other hand, it can also be used for sequential data, when some of the input variables are missing or are available at different frequencies. Unlike in the case of probabilistic models (e.g. Gaussian) of the missing variables, the network does not attempt to model the distribution of the missing variables given the observed variables. Instead it is a more &quot;discriminant&quot; approach that fills in the missing variables for the sole purpose of minimizing a learning criterion (e.g., to minimize an output error). 1 Introduction Learning from examples implies discovering certain relations between variables of interest. The most general form of learning requires to essentially capture the joint distribution between these variables. However, for many spe...

openalex-author · Neural Computation

LeRec: A NN/HMM Hybrid for On-Line Handwriting Recognition

openalex-author · Journal of Artificial Intelligence Research

Diffusion of Context and Credit Information in Markovian Models

This paper studies the problem of ergodicity of transition probability matrices in Markovian models, such as hidden Markov models (HMMs), and how it makes very difficult the task of learning to represent long-term context for sequential data. This phenomenon hurts the forward propagation of long-term context information, as well as learning a hidden state representation to represent long-term context, which depends on propagating credit information backwards in time. Using results from Markov chain theory, we show that this problem of diffusion of context and credit is reduced when the transition probabilities approach 0 or 1, i.e., the transition probability matrices are sparse and the model essentially deterministic. The results found in this paper apply to learning approaches based on continuous optimization, such as gradient descent and the Baum-Welch algorithm.

openalex-author · Neural Processing Letters

On the search for new learning rules for ANNs

No abstract available from the OpenAlex source record.

openalex-author · Paper

Pattern Recognition and Neural Networks

No abstract available from the OpenAlex source record.

openalex-author · IEEE Transactions on Neural Networks

Learning long-term dependencies with gradient descent is difficult

Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered.

openalex-author · http://www.iro.umontreal.ca/~lisa/pointeurs/kmeans-nips7.pdf

Convergence Properties of the K-Means Algorithms

This paper studies the convergence properties of the well known K-Means clustering algorithm. The K-Means algorithm can be described either as a gradient descent algorithm or by slightly extending the mathematics of the EM algorithm to this hard threshold case. We show that the K-Means algorithm actually minimizes the quantization error using the very fast Newton algorithm. 1 INTRODUCTION K-Means is a popular clustering algorithm used in many applications, including the initialization of more computationally expensive algorithms (Gaussian mixtures, Radial Basis Functions, Learning Vector Quantization and some Hidden Markov Models). The practice of this initialization procedure often gives the frustrating feeling that K-Means performs most of the task in a small fraction of the overall time. This motivated us to better understand this convergence speed. A second reason lies in the traditional debate between hard threshold (e.g. KMeans, Viterbi Training) and soft threshold (e.g. Gaussia...

openalex-author · http://www.iro.umontreal.ca/labs/neuro/pointeurs/iohmm-nips7.ps

An Input Output HMM Architecture

We introduce a recurrent architecture having a modular structure and we formulate a training procedure based on the EM algorithm. The resulting model has similarities to hidden Markov models, but supports recurrent networks processing style and allows to exploit the supervised learning paradigm while using maximum likelihood estimation. 1 INTRODUCTION Learning problems involving sequentially structured data cannot be effectively dealt with static models such as feedforward networks. Recurrent networks allow to model complex dynamical systems and can store and retrieve contextual information in a flexible way. Up until the present time, research efforts of supervised learning for recurrent networks have almost exclusively focused on error minimization by gradient descent methods. Although effective for learning short term memories, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/ou...

openalex-author · http://www-dsi.ing.unifi.it/~paolo/ps/diffuse-nips7.ps

Diffusion of Credit in Markovian Models

This paper studies the problem of diffusion in Markovian models (such as hidden Markov models) and how it makes very difficult the task of learning of long-term dependencies in sequences. 1 Introduction This paper is part of our research on the problem of learning long-term dependencies in sequences. In our previous work (Bengio, Simard &amp; Frasconi, 1994) we found theoretical reasons for the difficulty in training recurrent networks (or more generally parametric dynamical systems) to learn long-term dependencies. The main result stated that either long-term storing or gradient propagation would be harmed, depending on whether the norm of the Jacobian of the state to state function was greater or less than 1. In this paper we consider a special case in which the norm of the Jacobian of the state to state function is constrained to be exactly 1 because this matrix is a stochastic matrix. This paper thus deals with learning long-term dependencies in systems that have this property, i.e. M...

openalex-author · http://www.research.att.com/~yann/exdb/publis/./psgz/bengio-lecun-94.ps.gz

Word normalization for on-line handwritten word recognition

We introduce a new approach to normalizing words written with an electronic stylus that applies to all styles of handwriting (upper case, lower case, printed, cursive, or mixed). A geometrical model of the word spatial structure is fitted to the pen trajectory using the EM algorithm. The fitting process maximizes the likelihood of the trajectory given the model and a set a priors on its parameters. The method was evaluated and integrated to a recognition system that combines neural networks and hidden Markov models. 1 Introduction Natural handwriting can be a mixture of different &quot;styles&quot;, lower case printed, upper case, cursive, and punctuation. In order to improve the success of penbased computers, we would like a recognizer that reliably handles such handwriting, but its implementation faces major technical challenges [9]. It has been long known that, although characters taken in isolation can be very ambiguous, considerable information is available from the context of the whole wo...

openalex-author · http://www.research.att.com/~yann/exdb/publis/./psgz/bengio-lecun-henderson-94.ps.gz

Globally Trained Handwritten Word Recognizer using Spatial Representation, Convolutional Neural Networks, and Hidden Markov Models

We introduce a new approach for on-line recognition of handwritten words written in unconstrained mixed style. The preprocessor performs a word-level normalization by fitting a model of the word structure using the EM algorithm. Words are then coded into low resolution &quot;annotated images&quot; where each pixel contains information about trajectory direction and curvature. The recognizer is a convolution network which can be spatially replicated. From the network output, a hidden Markov model produces word scores. The entire system is globally trained to minimize word-level errors. 1 Introduction Natural handwriting is often a mixture of different &quot;styles&quot;, lower case printed, upper case, and cursive. A reliable recognizer for such handwriting would greatly improve interaction with pen-based devices, but its implementation presents new also, AT&amp;T Bell Labs, Holmdel NJ technical challenges. Characters taken in isolation can be very ambiguous, but considerable information is available fro...

openalex-author · ftp://ftp-dsi.ing.unifi.it/pub/tech-reports/alternatives.nips93.ps.Z

Credit Assignment through Time: Alternatives to Backpropagation

Learning to recognize or predict sequences using long-term context has many applications. However, practical and theoretical problems are found in training recurrent neural networks to perform tasks in which input/output dependencies span long intervals. Starting from a mathematical analysis of the problem, we consider and compare alternative algorithms and architectures on tasks for which the span of the input/output dependencies can be controlled. Results on the new algorithms show performance qualitatively superior to that obtained with backpropagation. 1 Introduction Recurrent neural networks have been considered to learn to map input sequences to output sequences. Machines that could efficiently learn such tasks would be useful for many applications involving sequence prediction, recognition or production. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input /output seque...

openalex-author · International Journal of Pattern Recognition and Artificial Intelligence

A CONNECTIONIST APPROACH TO SPEECH RECOGNITION

The task discussed in this paper is that of learning to map input sequences to output sequences. In particular, problems of phoneme recognition in continuous speech are considered, but most of the discussed techniques could be applied to other tasks, such as the recognition of sequences of handwritten characters. The systems considered in this paper are based on connectionist models, or artificial neural networks, sometimes combined with statistical techniques for recognition of sequences of patterns, stressing the integration of prior knowledge and learning. Different architectures for sequence and speech recognition are reviewed, including recurrent networks as well as hybrid systems involving hidden Markov models.

openalex-author · International journal of epidemiology

On-line handwriting recognition with neural networks: Spatial representation versus temporal representation

Our findings illustrated positive associations between intrauterine exposure to ASBs and birth size and risk of overweight/obesity at 7 years. Data with longer follow-up are warranted.

openalex-author · ICANN ’93

Generalization of a Parametric Learning Rule

We proposed in previous work ([1, 2]) a method to find new learning rules for neural networks, considering them as parametric functions and using any standard optimization method (such as genetic algorithms, gradient descent, and simulated annealing) to select the parameters.

openalex-author · Paper

Globally trained handwritten word recognizer using spatial representation, space displacement neural networks and hidden Markov models

No abstract available from the OpenAlex source record.

openalex-author · Pattern Recognition Letters

Learning the dynamic nature of speech with back-propagation for sequences

No abstract available from the OpenAlex source record.

openalex-author · Speech Recognition and Understanding

Radial Basis Functions for Speech Recognition

No abstract available from the OpenAlex source record.

openalex-author · neural information processing systems

Neural Network - Gaussian Mixture Hybrid for Speech Recognition or Density Estimation

The subject of this paper is the integration of multi-layered Artificial Neural Networks (ANN) with probability density functions such as Gaussian mixtures found in continuous density Hidden Markov Models (HMM). In the first part of this paper we present an ANN/HMM hybrid in which all the parameters of the system are simultaneously optimized with respect to a single criterion. In the second part of this paper, we study the relationship between the density of the inputs of the network and the density of the outputs of the networks. A few experiments are presented to explore how to perform density estimation with ANNs.

openalex-author · 2nd European Conference on Speech Communication and Technology (Eurospeech 1991)

Phonetically motivated acoustic parameters for continuous speech recognition using artificial neural networks

Dans le cadre d'un decodeur acoustique-phonetique hybride (ANN/HMM), trois reseaux de neurones (ANNs) specialises ont ete developpes et evalues. Un de ces ANNs detecte le mode d'articulation. Les deux autres ANNs decrivent le signal en termes du lieu d'articulation. Un reseau est utilise pour classifier les consonnes nasales et plosives. Un autre est utilise pour la classification des fricatives Le design de ces reseaux est inspire par des connaissances acoustiques et phonetiques. Les entrees, la topologie et le codage des sorties ont ete optimises pour chacun des reseaux

openalex-author · 2nd European Conference on Speech Communication and Technology (Eurospeech 1991)

A comparative study on hybrid acoustic phonetic decoders based on artificial neural networks

In this paper we compare two hybrid acoustic-phonetic decoders based on Artificial Neural Networks (ANN). We evaluate them on the task of recognizing stop phones in continuous speech independently from the speaker. ANNs are well suited to perform detailed phonetic distinctions. In general, techniques based on Dynamic Programming (DP), in particular Hidden Markov Models (HMMs), have proven to be successful at modeling the temporal structure of the speech signal. In the approach described here, the ANN outputs constitute the sequence of observation vectors for the HMM. An algorithm is proposed for global optimization of all the parameters of the ANN/HMM decoder. Comparative experiments using this ANN/HMM hybrid decoder and another ANN-DP hybrid are reported for the TIMIT database. 1 Introduction Artificial Neural Networks (ANNs) effectively perform phonetic classification, but have not proven yet to model the temporal structure of the speech signal reasonably well [Lip89, Rob90, Ben90b...

openalex-author · eScholarship@McGill (McGill)

Artificial neural networks and their application to sequence recognition

This thesis studies the introduction of a priori structure into the design of learning systems based on artificial neural networks applied to sequence recognition, in particular to phoneme recognition in continuous speech. Because we are interested in sequence analysis, algorithms for training recurrent networks are studied and an original algorithm for constrained recurrent networks is proposed and test results are reported. We also discuss the integration of connectionist models with other analysis tools that have been shown to be useful for sequences, such as dynamic programming and hidden Markov models. We introduce an original algorithm to perform global optimization of a neural network/hidden Markov model hybrid, and show how to perform such a global optimization on all the parameters of the system. Finally, we consider some alternatives to sigmoid networks: Radial Basis Functions, and a method for searching for better learning rules using a priori knowledge and optimization algorithms.

openalex-author · Machine Intelligence and Pattern Recognition

Connectionist Models and their Application to Automatic Speech Recognition

No abstract available from the OpenAlex source record.

openalex-author · Speech Communication

Phonetically-based multi-layered neural networks for vowel classification

No abstract available from the OpenAlex source record.

openalex-author · Bioinformatics

Efficient recognition of immunoglobulin domains from amino acid sequences using a neural network

A neural network was trained using back propagation to recognize immunoglobulin domains from amino acid sequences. The program was designed to identify proteins exhibiting such domains with minimal rates of false positives and false negatives. The National Biomedical Research Foundation NEW protein sequences database was scanned to evaluate the performance of the program in recognizing mouse immunoglobulin sequences. The program correctly recognized 55 out of 56 mouse immunoglobulin sequences, corresponding to a recognition efficiency of 98.2% with an overall false positive rate of 7.3%. These data demonstrate that neural network-based search programs are well suited to search for sequences characterized by only a few well-conserved subsequences.

openalex-author · Structural Pattern Analysis

On the Use of an Ear Model and Multi-Layered Networks for Automatic Speech Recognition

No abstract available from the OpenAlex source record.

openalex-author · Proceedings of the National Academy of Sciences of the United States of America

On the generalization capability of multi-layered networks in the extraction of speech properties

A fundamental problem in extracting scene structure is distinguishing different physical sources of image structure. Light reflected by an opaque surface covaries with local surface orientation, whereas light transported through the body of a translucent material does not. This suggests the possibility that the visual system may use the covariation of local surface orientation and intensity as a cue to the opacity of surfaces. We tested this hypothesis by manipulating the contrast of luminance gradients and the surface geometries to which they belonged and assessed how these manipulations affected the perception of surface opacity/translucency. We show that (i) identical luminance gradients can appear either translucent or opaque depending on the relationship between luminance and perceived 3D surface orientation, (ii) illusory percepts of translucency can be induced by embedding opaque surfaces in diffuse light fields that eliminate the covariation between surface orientation and intensity, and (iii) illusory percepts of opacity can be generated when transparent materials are embedded in a light field that generates images where surface orientation and intensity covary. Our results provide insight into how the visual system distinguishes opaque surfaces and light-permeable materials and why discrepancies arise between the perception and physics of opacity and translucency. These results suggest that the most significant information used to compute the perceived opacity and translucency of surfaces arise at a level of representation where 3D shape is made explicit.

openalex-author · Computational Intelligence

Use of multilayer networks for the recognition of phonetic features and phonemes

Artificial neural networks capable of doing hard learning offer a new way to undertake automatic speech recognition. The Boltzmann machine algorithm and the error back‐propagation algorithm have been used to perform speaker normalization. Spectral segments are represented by spectral lines. Speaker‐independent recognition of place of articulation for vowels is performed on lines. Performance of the networks is shown to depend on the coding of the input data. Samples were extracted from continuous speech of 38 speakers. The error rate obtained (4.2% error on test set of 72 samples with the Boltzmann machine and 6.9% error with error back‐propagation) is better than that of previous experiments, using the same data, with continuous Hidden Markov Models (7.3% error on test set and 3% error on training set). These experiments are part of an attempt to construct a data‐driven speech recognition system with multiple neural networks specialized to different tasks. Results are also reported on the recognition performance of other trained networks, such as one trained on the E‐set consonants.

openalex-author · Communications of the ACM

Programmable execution of multi-layered networks for automatic speech recognition

A set of Multi-Layered Networks allows the integration of information extracted with variable resolution in the time and frequency domains and to keep the number of links between nodes of the networks small for significant generalization during learning with a reasonable training set size.

openalex-author · Neural Information Processing Systems

A Neural Network to Detect Homologies in Proteins

In order to detect the presence and location of immunoglobulin (Ig) domains from amino acid sequences we built a system based on a neural network with one hidden layer trained with back propagation. The program was designed to efficiently identify proteins exhibiting such domains, characterized by a few localized conserved regions and a low overall homology. When the National Biomedical Research Foundation (NBRF) NEW protein sequence database was scanned to evaluate the program's performance, we obtained very low rates of false negatives coupled with a moderate rate of false positives.

openalex-author · Neural Information Processing Systems

Speaker Independent Speech Recognition with Neural Networks and Speech Knowledge

We attempt to combine neural networks with knowledge from speech science to build a speaker independent speech recognition system. This knowledge is utilized in designing the preprocessing, input coding, output coding, output supervision and architectural constraints. To handle the temporal aspect of speech we combine delays, copies of activations of hidden and output units at the input level, and Back-Propagation for Sequences (BPS), a learning algorithm for networks with local self-loops. This strategy is demonstrated in several experiments, in particular a nasal discrimination task for which the application of a speech theory hypothesis dramatically improved generalization.

openalex-author · National Conference on Artificial Intelligence

Data-driven execution of multi-layered networks for automatic speech recognition

A set of Multi-Layered Networks (MLN) for Automatic Speech Recognition (ASR) is proposed. Such a set allows the integration of information extracted with variable resolution in the time and frequency domains and to keep the number of links between nodes of the networks small in order to allow significant generalization during learning with a reasonable training set size. Subsets of networks can be executed depending on preconditions based on descriptions of the time evolution of signal energies allowing spectral properties that are significant in different acoustic situations to be learned. Preliminary experiments on speaker-independent recognition of the letters of the E-set are reported. Voices from 70 speakers were used for learning. Voices of 10 new speakers were used for test. An overall error rate of 9.5% was obtained in the test showing that results better than those previously reported can be achieved.

openalex-author · Paper

USE OF NEURAL NETWORKS FOR THE RECOGNITION OF PLACE

The Boltzmann machine algorithm and Science, McGill the error back propagation algorithm were used to learn to recognize the place of articulation of vowels (front, center or back), represented by a static description of spectral lines. The error rate is shown to depend on the coding. Results are comparable or better than those obtained by us on the same data using Hidden Markov Models. We also show a fault tolerant property of the neural nets, i.e. that the error on the test set increases slowly and gradually when an increasing number of nodes fail.

openalex-author · Neural Information Processing Systems

Use of Multi-Layered Networks for Coding Speech with Phonetic Features

Preliminary results on speaker-independant speech recognition are reported. A method that combines expertise on neural networks with expertise on speech recognition is used to build the recognition systems. For transient sounds, event-driven property extractors with variable resolution in the time and frequency domains are used. For sonorant speech, a model of the human auditory system is preferred to FFT as a front-end module.