Person

Geoffrey E. Hinton

Computer Scientist

Pioneering computer scientist known for foundational work in neural networks, deep learning, and representation learning.

Papers

openalex-author · Open MIND

Cybersecurity and Data Privacy in AI Healthcare Systems: Competitive Differentiation for Emerging Companies

No abstract available from the OpenAlex source record.

openalex-author · Open MIND

Sustainable and Affordable AI Healthcare Innovations: How Small Enterprises Shape the Future of Medical Technology

No abstract available from the OpenAlex source record.

openalex-author · SuperIntelligence - Robotics - Safety & Alignment

International AI Safety Report 2025: Second Key Update: Technical Safeguards and Risk Management

This is the Second Key Update to the 2025 International AI Safety Report. The First Key Update (1) discussed developments in the capabilities of general-purpose AI models and systems and associated risks. This Key Update covers how various actors, including researchers, companies, and governments, are approaching risk management and technical mitigations for AI. The past year has seen important developments in AI risk management, including better techniques for training safer models and monitoring their outputs. While this represents tangible progress, significant gaps remain. It is often uncertain how effective current measures are at preventing harms, and effectiveness varies across time and applications. There are many opportunities to further strengthen existing safeguard techniques and to develop new ones. This Key Update provides a concise overview of critical developments in risk management practices and technical risk mitigation since the publication of the 2025 AI Safety Report in January. It highlights where progress is being made and where gaps remain. Above all, it aims to support policymakers, researchers, and the public in navigating a rapidly changing environment, helping them to make informed and timely decisions about the governance of general-purpose AI. Professor Yoshua Bengio, Université de Montréal / LawZero / Mila – Quebec AI Institute & Chair

openalex-author · SuperIntelligence - Robotics - Safety & Alignment

International AI Safety Report: First Key Update Capabilities and Risk Implications

The field of AI is moving too quickly for a single yearly publication to keep pace. Significant changes can occur on a timescale of months, sometimes weeks. This is why we are releasing Key Updates: shorter, focused reports that highlight the most important developments between full editions of the International AI Safety Report. With these updates, we aim to provide policymakers, researchers, and the public with up-to-date information to support wise decisions about AI governance. This first Key Update focuses on areas where especially significant changes have occurred since January 2025: advances in general-purpose AI systems' capabilities, and the implications for several critical risks. New training techniques have enabled AI systems to reason step-by-step and operate autonomously for longer periods, allowing them to tackle more kinds of work. However, these same advances create new challenges across biological risks, cyber security, and oversight of AI systems themselves. The International AI Safety Report is intended to help readers assess, anticipate, and manage risks from general-purpose AI systems. These Key Updates ensure that critical developments receive timely attention as the field rapidly evolves.

openalex-author · ArXiv.org

International AI Safety Report 2025: First Key Update: Capabilities and Risk Implications

Since the publication of the first International AI Safety Report, AI capabilities have continued to improve across key domains. New training techniques that teach AI systems to reason step-by-step and inference-time enhancements have primarily driven these advances, rather than simply training larger models. As a result, general-purpose AI systems can solve more complex problems in a range of domains, from scientific research to software development. Their performance on benchmarks that measure performance in coding, mathematics, and answering expert-level science questions has continued to improve, though reliability challenges persist, with systems excelling on some tasks while failing completely on others. These capability improvements also have implications for multiple risks, including risks from biological weapons and cyber attacks. Finally, they pose new challenges for monitoring and controllability. This update examines how AI capabilities have improved since the first Report, then focuses on key risk areas where substantial new evidence warrants updated assessments.

openalex-author · ArXiv.org

International AI Safety Report

The first International AI Safety Report comprehensively synthesizes the current evidence on the capabilities, risks, and safety of advanced AI systems. The report was mandated by the nations attending the AI Safety Summit in Bletchley, UK. Thirty nations, the UN, the OECD, and the EU each nominated a representative to the report's Expert Advisory Panel. A total of 100 AI experts contributed, representing diverse perspectives and disciplines. Led by the report's Chair, these independent experts collectively had full discretion over the report's content.

openalex-author · Science

Managing extreme AI risks amid rapid progress

Preparation requires technical research and development, as well as adaptive, proactive governance.

openalex-author · Zenodo (CERN European Organization for Nuclear Research)

CIFAR-10 & CIFAR-100

No abstract available from the OpenAlex source record.

openalex-author · 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

A Generalist Framework for Panoptic Segmentation of Images and Videos

Panoptic segmentation assigns semantic and instance ID labels to every pixel of an image. As permutations of instance IDs are also valid solutions, the task requires learning of high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on inductive bias of the task. A diffusion model is proposed to model panoptic masks, with a simple architecture and generic loss function. By simply adding past predictions as a conditioning signal, our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our simple approach can perform competitively to state-of-the-art specialist methods in similar settings.

openalex-author · Nature Biomedical Engineering

Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

The Forward-Forward Algorithm: Some Preliminary Investigations

The aim of this paper is to introduce a new learning procedure for neural networks and to demonstrate that it works well enough on a few small problems to be worth further investigation. The Forward-Forward algorithm replaces the forward and backward passes of backpropagation by two forward passes, one with positive (i.e. real) data and the other with negative data which could be generated by the network itself. Each layer has its own objective function which is simply to have high goodness for positive data and low goodness for negative data. The sum of the squared activities in a layer can be used as the goodness but there are many other possibilities, including minus the sum of the squared activities. If the positive and negative passes could be separated in time, the negative passes could be done offline, which would make the learning much simpler in the positive pass and allow video to be pipelined through the network without ever storing activities or stopping to propagate derivatives.
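
The per-layer objective in this abstract can be illustrated with a minimal sketch. It assumes goodness is the sum of squared activities (one of the options the abstract names) and compares it with a threshold through a logistic loss. The threshold `theta`, the function names, and the toy activities are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np

def goodness(activities):
    """Sum of squared activities per example (rows are examples)."""
    return np.sum(activities ** 2, axis=1)

def layer_loss(pos_acts, neg_acts, theta=2.0):
    """Logistic loss that rewards goodness above theta for positive data
    and goodness below theta for negative data. theta is an assumed value."""
    # softplus(x) = log(1 + exp(x)), computed stably with logaddexp
    pos_term = np.logaddexp(0.0, theta - goodness(pos_acts))
    neg_term = np.logaddexp(0.0, goodness(neg_acts) - theta)
    return float(np.mean(pos_term) + np.mean(neg_term))

# Toy layer activities: the positive batch is much more active than the negative one.
pos = np.array([[2.0, 1.0], [1.5, 1.5]])
neg = np.array([[0.1, 0.2], [0.3, 0.1]])

print(goodness(pos), goodness(neg))
print(layer_loss(pos, neg), layer_loss(neg, pos))
```

The loss is low when the positive batch has high goodness and the negative batch has low goodness, and it grows when the two batches are swapped.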

openalex-author · Neural Computation

How to Represent Part-Whole Hierarchies in a Neural Network

This article does not describe a working system. Instead, it presents a single idea about representation that allows advances made by several different groups to be combined into an imaginary system called GLOM. The advances include transformers, neural fields, contrastive representation learning, distillation, and capsules. GLOM answers the question: How can a neural network with a fixed architecture parse an image into a part-whole hierarchy that has a different structure for each image? The idea is simply to use islands of identical vectors to represent the nodes in the parse tree. If GLOM can be made to work, it should significantly improve the interpretability of the representations produced by transformer-like systems when applied to vision or language.

openalex-author · arXiv (Cornell University)

Testing GLOM's ability to infer wholes from ambiguous parts

The GLOM architecture proposed by Hinton [2021] is a recurrent neural network for parsing an image into a hierarchy of wholes and parts. When a part is ambiguous, GLOM assumes that the ambiguity can be resolved by allowing the part to make multi-modal predictions for the pose and identity of the whole to which it belongs and then using attention to similar predictions coming from other possibly ambiguous parts to settle on a common mode that is predicted by several different parts. In this study, we describe a highly simplified version of GLOM that allows us to assess the effectiveness of this way of dealing with ambiguity. Our results show that, with supervised training, GLOM is able to successfully form islands of very similar embedding vectors for all of the locations occupied by the same object and it is also robust to strong noise injections in the input and to out-of-distribution input transformations.

openalex-author · arXiv (Cornell University)

Gaussian-Bernoulli RBMs Without Tears

We revisit the challenging problem of training Gaussian-Bernoulli restricted Boltzmann machines (GRBMs), introducing two innovations. We propose a novel Gibbs-Langevin sampling algorithm that outperforms existing methods like Gibbs sampling. We propose a modified contrastive divergence (CD) algorithm so that one can generate images with GRBMs starting from noise. This enables direct comparison of GRBMs with deep generative models, improving evaluation protocols in the RBM literature. Moreover, we show that modified CD and gradient clipping are enough to robustly train GRBMs with large learning rates, thus removing the necessity of various tricks in the literature. Experiments on Gaussian Mixtures, MNIST, FashionMNIST, and CelebA show GRBMs can generate good samples, despite their single-hidden-layer architecture. Our code is released at: https://github.com/lrjconan/GRBM.

openalex-author · arXiv (Cornell University)

Scaling Forward Gradient With Local Losses

Forward gradient learning computes a noisy directional gradient and is a biologically plausible alternative to backprop for learning deep neural networks. However, the standard forward gradient algorithm, when applied naively, suffers from high variance when the number of parameters to be learned is large. In this paper, we propose a series of architectural and algorithmic modifications that together make forward gradient learning practical for standard deep learning benchmark tasks. We show that it is possible to substantially reduce the variance of the forward gradient estimator by applying perturbations to activations rather than weights. We further improve the scalability of forward gradient by introducing a large number of local greedy loss functions, each of which involves only a small number of learnable parameters, and a new MLPMixer-inspired architecture, LocalMixer, that is more suitable for local learning. Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
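
To make "noisy directional gradient" concrete, here is a minimal sketch of the standard weight-perturbed forward gradient estimator, which is the baseline the abstract improves on rather than the paper's activity-perturbed, local-loss variant. The toy objective, the sample count, and the use of a central finite difference in place of a forward-mode Jacobian-vector product are all assumptions made for illustration.

```python
import numpy as np

TARGET = np.array([1.0, -2.0, 3.0])

def f(w):
    """Toy objective whose true gradient is w - TARGET (illustrative only)."""
    return 0.5 * np.sum((w - TARGET) ** 2)

def forward_gradient(f, w, rng, eps=1e-4):
    """One noisy gradient estimate: (directional derivative along v) * v."""
    v = rng.standard_normal(w.shape)
    # A central difference stands in for a forward-mode JVP here.
    directional = (f(w + eps * v) - f(w - eps * v)) / (2.0 * eps)
    return directional * v

rng = np.random.default_rng(0)
w = np.zeros(3)
estimates = np.stack([forward_gradient(f, w, rng) for _ in range(20000)])
mean_estimate = estimates.mean(axis=0)
true_grad = w - TARGET

print("mean of estimates:", mean_estimate)
print("true gradient:    ", true_grad)
print("per-sample std:   ", estimates.std(axis=0))
```

Averaged over many random directions the estimate approaches the true gradient, but the per-sample spread is large even in three dimensions, which is the variance problem the abstract describes for large parameter counts.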

openalex-author · arXiv (Cornell University)

Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning

We present Bit Diffusion: a simple and generic approach for generating discrete data with continuous state and continuous time diffusion models. The main idea behind our approach is to first represent the discrete data as binary bits, and then train a continuous diffusion model to model these bits as real numbers which we call analog bits. To generate samples, the model first generates the analog bits, which are then thresholded to obtain the bits that represent the discrete variables. We further propose two simple techniques, namely Self-Conditioning and Asymmetric Time Intervals, which lead to a significant improvement in sample quality. Despite its simplicity, the proposed approach can achieve strong performance in both discrete image generation and image captioning tasks. For discrete image generation, we significantly improve previous state-of-the-art on both CIFAR-10 (which has 3K discrete 8-bit tokens) and ImageNet-64x64 (which has 12K discrete 8-bit tokens), outperforming the best autoregressive model in both sample quality (measured by FID) and efficiency. For image captioning on MS-COCO dataset, our approach achieves competitive results compared to autoregressive models.
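
The encode, generate, and threshold steps described above can be sketched as follows. The diffusion model itself is replaced by additive Gaussian noise, and the scaling of bits to -1.0 and +1.0, the 8-bit width, and the function names are assumptions made for illustration.

```python
import numpy as np

def int_to_analog_bits(x, n_bits=8):
    """Encode non-negative integers as real-valued bits in {-1.0, +1.0}."""
    x = np.asarray(x)
    shifts = np.arange(n_bits - 1, -1, -1)      # most significant bit first
    bits = (x[..., None] >> shifts) & 1
    return bits.astype(np.float64) * 2.0 - 1.0

def analog_bits_to_int(analog):
    """Threshold at 0 to recover the bits, then decode them to integers."""
    bits = (np.asarray(analog) > 0.0).astype(np.int64)
    n_bits = bits.shape[-1]
    weights = 1 << np.arange(n_bits - 1, -1, -1)
    return (bits * weights).sum(axis=-1)

pixels = np.array([0, 5, 128, 255])
analog = int_to_analog_bits(pixels)

# Stand-in for a model's imperfect real-valued output (not a diffusion model).
noisy = analog + np.random.default_rng(0).normal(0.0, 0.2, analog.shape)
decoded = analog_bits_to_int(noisy)

print(analog[1])   # analog bits for the value 5
print(decoded)
```

Because decoding only looks at the sign of each analog bit, moderate errors in the generated real values do not change the recovered discrete data.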

openalex-author · arXiv (Cornell University)

A Unified Sequence Interface for Vision Tasks

While language tasks are naturally expressed in a single, unified, modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Despite that, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as task description, and the sequence output adapts to the prompt so it can produce task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models.

openalex-author · arXiv (Cornell University)

Robust and Efficient Medical Imaging with Self-Supervision

Recent progress in Medical Artificial Intelligence (AI) has delivered systems that can reach clinical expert level performance. However, such systems tend to demonstrate sub-optimal "out-of-distribution" performance when evaluated in clinical settings different from the training environment. A common mitigation strategy is to develop separate systems for each clinical setting using site-specific data [1]. However, this quickly becomes impractical as medical data is time-consuming to acquire and expensive to annotate [2]. Thus, the problem of "data-efficient generalization" presents an ongoing difficulty for Medical AI development. Although progress in representation learning shows promise, their benefits have not been rigorously studied, specifically for out-of-distribution settings. To meet these challenges, we present REMEDIS, a unified representation learning strategy to improve robustness and data-efficiency of medical imaging AI. REMEDIS uses a generic combination of large-scale supervised transfer learning with self-supervised learning and requires little task-specific customization. We study a diverse range of medical imaging tasks and simulate three realistic application scenarios using retrospective data. REMEDIS exhibits significantly improved in-distribution performance with up to 11.5% relative improvement in diagnostic accuracy over a strong supervised baseline. More importantly, our strategy leads to strong data-efficient generalization of medical imaging AI, matching strong supervised baselines using between 1% to 33% of retraining data across tasks. These results suggest that REMEDIS can significantly accelerate the life-cycle of medical imaging AI development thereby presenting an important step forward for medical imaging AI to deliver broad impact.

openalex-author · Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Meta-Learning Fast Weight Language Models

Dynamic evaluation of language models (LMs) adapts model parameters at test time using gradient information from previous tokens and substantially improves LM performance. However, it requires over 3x more compute than standard inference. We present Fast Weight Layers (FWLs), a neural component that provides the benefits of dynamic evaluation much more efficiently by expressing gradient updates as linear attention. A key improvement over dynamic evaluation is that FWLs can also be applied at training time, so the model learns to make good use of gradient updates. FWLs can easily be added on top of existing transformer models, require relatively little extra compute or memory to run, and significantly improve language modeling perplexity.

openalex-author · arXiv (Cornell University)

Pix2seq: A Language Modeling Framework for Object Detection

We present Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural network to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural network knows about where and what the objects are, we just need to teach it how to read them out. Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
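
A rough sketch of how a bounding box and class label might be written as a short sequence of discrete tokens is shown below. The bin count, the (ymin, xmin, ymax, xmax) ordering, and the vocabulary layout that places class tokens after the coordinate bins are assumptions for illustration, not details stated in the abstract.

```python
def box_to_tokens(box, label, image_size, n_bins=1000):
    """Quantize a pixel-space box (ymin, xmin, ymax, xmax) into bin indices
    and append one class token. The vocabulary layout is hypothetical."""
    height, width = image_size
    ymin, xmin, ymax, xmax = box
    normalized = [ymin / height, xmin / width, ymax / height, xmax / width]
    coord_tokens = [min(n_bins - 1, max(0, int(v * n_bins))) for v in normalized]
    class_token = n_bins + label    # class ids are placed after the coordinate bins
    return coord_tokens + [class_token]

def tokens_to_box(tokens, image_size, n_bins=1000):
    """Map tokens back to a pixel-space box (bin centers) and a class label."""
    height, width = image_size
    centers = [(t + 0.5) / n_bins for t in tokens[:4]]
    box = (centers[0] * height, centers[1] * width,
           centers[2] * height, centers[3] * width)
    return box, tokens[4] - n_bins

image_size = (480, 640)                      # (height, width)
original_box = (50.0, 70.0, 245.0, 330.0)
tokens = box_to_tokens(original_box, label=3, image_size=image_size)
recovered_box, recovered_label = tokens_to_box(tokens, image_size)

print(tokens)
print(recovered_box, recovered_label)
```

Quantization loses less than one bin width per coordinate, so the recovered box stays within a pixel of the original at this resolution.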

openalex-author · International Conference on Machine Learning

Unsupervised Part Representation by Flow Capsules

No abstract available from the OpenAlex source record.

openalex-author · Communications of the ACM

Deep learning for AI

How can neural networks learn the rich internal representations required for difficult tasks such as recognizing objects or understanding language?

openalex-author · arXiv (Cornell University)

Canonical Capsules: Unsupervised Capsules in Canonical Pose

We propose an unsupervised capsule architecture for 3D point clouds. We compute capsule decompositions of objects through permutation-equivariant attention, and self-supervise the process by training with pairs of randomly rotated objects. Our key idea is to aggregate the attention masks into semantic keypoints, and use these to supervise a decomposition that satisfies the capsule invariance/equivariance properties. This not only enables the training of a semantically consistent decomposition, but also allows us to learn a canonicalization operation that enables object-centric reasoning. In doing so, we require neither classification labels nor manually-aligned training datasets to train. Yet, by learning an object-centric representation in an unsupervised manner, our method outperforms the state-of-the-art on 3D point cloud reconstruction, registration, and unsupervised classification. We will release the code and dataset to reproduce our results as soon as the paper is published.

openalex-author · arXiv (Cornell University)

Teaching with Commentaries

Effective training of deep neural networks can be challenging, and there remain many open questions on how to best learn these models. Recently developed methods to improve neural network training examine teaching: providing learned information during the training process to improve downstream model performance. In this paper, we take steps towards extending the scope of teaching. We propose a flexible teaching framework using commentaries, learned meta-information helpful for training on a particular task. We present gradient-based methods to learn commentaries, leveraging recent work on implicit differentiation for scalability. We explore diverse applications of commentaries, from weighting training examples, to parameterising label-dependent data augmentation policies, to representing attention masks that highlight salient image regions. We find that commentaries can improve training speed and/or performance, and provide insights about the dataset and training process. We also observe that commentaries generalise: they can be reused when training new models to obtain performance benefits, suggesting a use-case where commentaries are stored with a dataset and leveraged in future for improved model training.

openalex-author · 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

CvxNet: Learnable Convex Decomposition

Any solid object can be decomposed into a collection of convex polytopes (in short, convexes). When a small number of convexes are used, such a decomposition can be thought of as a piece-wise approximation of the geometry. This decomposition is fundamental in computer graphics, where it provides one of the most common ways to approximate geometry, for example, in real-time physics simulation. A convex object also has the property of being simultaneously an explicit and implicit representation: one can interpret it explicitly as a mesh derived by computing the vertices of a convex hull, or implicitly as the collection of half-space constraints or support functions. Their implicit representation makes them particularly well suited for neural network training, as they abstract away from the topology of the geometry they need to represent. However, at testing time, convexes can also generate explicit representations - polygonal meshes - which can then be used in any downstream application. We introduce a network architecture to represent a low dimensional family of convexes. This family is automatically derived via an auto-encoding process. We investigate the applications of this architecture including automatic convex decomposition, image to 3D reconstruction, and part-based shape retrieval.

openalex-author · Nature Reviews Neuroscience

Backpropagation and the brain

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Imputer: Sequence Modelling via Imputation and Dynamic Programming

This paper presents the Imputer, a neural sequence model that generates output sequences iteratively via imputations. The Imputer is an iterative generative model, requiring only a constant number of generation steps independent of the number of input or output tokens. The Imputer can be trained to approximately marginalize over all possible alignments between the input and output sequences, and all possible generation orders. We present a tractable dynamic programming training algorithm, which yields a lower bound on the log marginal likelihood. When applied to end-to-end speech recognition, the Imputer outperforms prior non-autoregressive models and achieves competitive results to autoregressive models. On LibriSpeech test-other, the Imputer achieves 11.1 WER, outperforming CTC at 13.0 WER and seq2seq at 12.5 WER.

openalex-author · arXiv (Cornell University)

Deflecting Adversarial Attacks

There has been an ongoing cycle where stronger defenses against adversarial attacks are subsequently broken by a more advanced defense-aware attack. We present a new approach towards ending this cycle where we "deflect" adversarial attacks by causing the attacker to produce an input that semantically resembles the attack's target class. To this end, we first propose a stronger defense based on Capsule Networks that combines three detection mechanisms to achieve state-of-the-art detection performance on both standard and defense-aware attacks. We then show that undetected attacks against our defense often perceptually resemble the adversarial target class by performing a human study where participants are asked to label images produced by the attack. These attack images can no longer be called "adversarial" because our network classifies them the same way as humans do.

openalex-author · arXiv (Cornell University)

A Simple Framework for Contrastive Learning of Visual Representations

This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.
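
The abstract does not spell out the contrastive loss, so the sketch below assumes the temperature-scaled cross-entropy over cosine similarities that is commonly associated with this framework. The temperature value, the batch and embedding sizes, and the function name are illustrative assumptions.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Contrastive loss over paired embeddings.
    z_a[i] and z_b[i] are assumed to come from two augmentations of one image."""
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit vectors give cosine similarity
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)                       # an example is never compared with itself
    n = z_a.shape[0]
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Row-wise log-softmax, computed stably
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * n), positives] - log_denominator
    return float(-log_prob.mean())

rng = np.random.default_rng(0)
z_a = rng.standard_normal((8, 16))
z_b = z_a + 0.05 * rng.standard_normal((8, 16))          # near-duplicates play the role of positives

aligned = nt_xent_loss(z_a, z_b)
mismatched = nt_xent_loss(z_a, np.roll(z_b, 1, axis=0))  # pairs deliberately misaligned
print(aligned, mismatched)
```

The loss is lower when each embedding's designated positive really is its closest neighbor, which is the signal the encoder is trained on.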

openalex-author · arXiv (Cornell University)

Subclass Distillation

After a large "teacher" neural network has been trained on labeled data, the probabilities that the teacher assigns to incorrect classes reveal a lot of information about the way in which the teacher generalizes. By training a small "student" model to match these probabilities, it is possible to transfer most of the generalization ability of the teacher to the student, often producing a much better small model than directly training the student on the training data. The transfer works best when there are many possible classes because more is then revealed about the function learned by the teacher, but in cases where there are only a few possible classes we show that we can improve the transfer by forcing the teacher to divide each class into many subclasses that it invents during the supervised training. The student is then trained to match the subclass probabilities. For datasets where there are known, natural subclasses we demonstrate that the teacher learns similar subclasses and these improve distillation. For clickthrough datasets where the subclasses are unknown we demonstrate that subclass distillation allows the student to learn faster and better.

openalex-author · Lecture Notes in Computer Science

NASA Neural Articulated Shape Approximation

No abstract available from the OpenAlex source record.

openalex-author · Neural Information Processing Systems

Lookahead Optimizer: k steps forward, 1 step back

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of "fast weights" generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.
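
The "two sets of weights" idea can be sketched on a toy quadratic. Plain gradient descent serves as the inner optimizer, and the interpolation factor `alpha`, the step counts, and the objective are assumptions made for illustration rather than settings taken from the abstract.

```python
import numpy as np

TARGET = np.array([3.0, -1.0])

def grad(w):
    """Gradient of the toy objective 0.5 * ||w - TARGET||^2 (illustrative)."""
    return w - TARGET

def lookahead_sgd(w0, k=5, alpha=0.5, inner_lr=0.1, outer_steps=50):
    """Slow weights follow the fast weights produced by k inner SGD steps."""
    slow = np.array(w0, dtype=np.float64)
    for _ in range(outer_steps):
        fast = slow.copy()
        for _ in range(k):                  # k steps forward with the inner optimizer
            fast -= inner_lr * grad(fast)
        slow += alpha * (fast - slow)       # 1 step back: interpolate toward the fast weights
    return slow

w = lookahead_sgd([0.0, 0.0])
print(w)
```

Each outer iteration restarts the fast weights from the slow weights, so the slow weights move only part of the way along the direction the inner optimizer explored.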

openalex-author · arXiv (Cornell University)

Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions

Adversarial examples raise questions about whether neural network models are sensitive to the same visual features as humans. In this paper, we first detect adversarial examples or otherwise corrupted images based on a class-conditional reconstruction of the input. To specifically attack our detection mechanism, we propose the Reconstructive Attack which seeks both to cause a misclassification and a low reconstruction error. This reconstructive attack produces undetected adversarial examples but with much smaller success rate. Among all these attacks, we find that CapsNets always perform better than convolutional networks. Then, we diagnose the adversarial examples for CapsNets and find that the success of the reconstructive attack is highly related to the visual similarity between the source and target class. Additionally, the resulting perturbations can cause the input image to appear visually more like the target class and hence become non-adversarial. This suggests that CapsNets use features that are more aligned with human perception and have the potential to address the central issue raised by adversarial examples.

openalex-author · arXiv (Cornell University)

Stacked Capsule Autoencoders

Objects are composed of a set of geometrically organized parts. We introduce an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects. Since these relationships do not depend on the viewpoint, our model is robust to viewpoint changes. SCAE consists of two stages. In the first stage, the model predicts presences and poses of part templates directly from the image and tries to reconstruct the image by appropriately arranging the templates. In the second stage, SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses. Inference in this model is amortized and performed by off-the-shelf neural encoders, unlike in previous capsule networks. We find that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN (55%) and MNIST (98.7%). The code is available at https://github.com/google-research/google-research/tree/master/stacked_capsule_autoencoders

openalex-author · Neural Information Processing Systems

When does label smoothing help?

The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation, but does not hurt generalization or calibration of the model's predictions.
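
The soft targets described in the first sentence, a weighted average of the hard targets and the uniform distribution over labels, can be written in a few lines. The smoothing weight `alpha` and the function name are illustrative assumptions.

```python
import numpy as np

def smooth_labels(labels, num_classes, alpha=0.1):
    """Mix one-hot targets with the uniform distribution over labels."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - alpha) * one_hot + alpha / num_classes

targets = smooth_labels(np.array([0, 2]), num_classes=4, alpha=0.1)
print(targets)
```

With four classes and `alpha=0.1`, the correct class receives 0.925 and every other class receives 0.025, so each row still sums to one.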

openalex-author · arXiv (Cornell University)

Learning Sparse Networks Using Targeted Dropout

Neural networks are easier to optimise when they have many more weights than are required for modelling the mapping from inputs to outputs. This suggests a two-stage learning procedure that first learns a large net and then prunes away connections or hidden units. But standard training does not necessarily encourage nets to be amenable to pruning. We introduce targeted dropout, a method for training a neural network so that it is robust to subsequent pruning. Before computing the gradients for each weight update, targeted dropout stochastically selects a set of units or weights to be dropped using a simple self-reinforcing sparsity criterion and then computes the gradients for the remaining weights. The resulting network is robust to post hoc pruning of weights or units that frequently occur in the dropped sets. The method improves upon more complicated sparsifying regularisers while being simple to implement and easy to tune.
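
The abstract does not spell out the sparsity criterion, so the sketch below assumes a magnitude-based one: the fraction `gamma` of weights with the smallest absolute value are candidates, and each candidate is dropped with probability `alpha`. The names `targeted_dropout`, `gamma`, and `alpha` are hypothetical and the code is an illustration rather than the authors' implementation.

```python
import random


def targeted_dropout(weights, gamma=0.5, alpha=0.5, rng=None):
    """Zero out low-magnitude weights stochastically (assumed criterion).

    gamma: fraction of weights (lowest |w|) that are drop candidates.
    alpha: probability of dropping each candidate.
    """
    rng = rng or random.Random(0)
    n_candidates = int(gamma * len(weights))
    # Indices sorted from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    candidates = set(order[:n_candidates])
    return [
        0.0 if i in candidates and rng.random() < alpha else w
        for i, w in enumerate(weights)
    ]


if __name__ == "__main__":
    print(targeted_dropout([0.1, -2.0, 0.05, 3.0], gamma=0.5, alpha=1.0))
```

With `alpha=1.0` the procedure reduces to deterministic magnitude pruning of the candidate set, which is the post hoc pruning the trained network is meant to tolerate.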

openalex-author · arXiv (Cornell University)

Cerberus: A Multi-headed Derenderer

To generalize to novel visual scenes with new viewpoints and new object poses, a visual system needs representations of the shapes of the parts of an object that are invariant to changes in viewpoint or pose. 3D graphics representations disentangle visual factors such as viewpoints and lighting from object structure in a natural way. It is possible to learn to invert the process that converts 3D graphics representations into 2D images, provided the 3D graphics representations are available as labels. When only the unlabeled images are available, however, learning to derender is much harder. We consider a simple model which is just a set of free floating parts. Each part has its own relation to the camera and its own triangular mesh which can be deformed to model the shape of the part. At test time, a neural network looks at a single image and extracts the shapes of the parts and their relations to the camera. Each part can be viewed as one head of a multi-headed derenderer. During training, the extracted parts are used as input to a differentiable 3D renderer and the reconstruction error is backpropagated to train the neural net. We make the learning task easier by encouraging the deformations of the part meshes to be invariant to changes in viewpoint and invariant to the changes in the relative positions of the parts that occur when the pose of an articulated body changes. Cerberus, our multi-headed derenderer, outperforms previous methods for extracting 3D parts from single images without part annotations, and it does quite well at extracting natural parts of human figures.

openalex-author · arXiv (Cornell University)

Similarity of Neural Network Representations Revisited

Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network representations based on canonical correlation analysis (CCA). We show that CCA belongs to a family of statistics for measuring multivariate similarity, but that neither CCA nor any other statistic that is invariant to invertible linear transformation can measure meaningful similarities between representations of higher dimension than the number of data points. We introduce a similarity index that measures the relationship between representational similarity matrices and does not suffer from this limitation. This similarity index is equivalent to centered kernel alignment (CKA) and is also closely connected to CCA. Unlike CCA, CKA can reliably identify correspondences between representations in networks trained from different initializations.
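
As a rough illustration of the similarity index, the sketch below computes CKA with a linear kernel from two sets of activations for the same examples (rows are examples, columns are features). The linear-kernel choice and all function names are assumptions made for this example; it is written in plain Python for clarity, not speed.

```python
import math


def center_columns(x):
    """Subtract each feature's mean over the examples."""
    n = len(x)
    means = [sum(col) / n for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]


def gram(x):
    """Example-by-example matrix of inner products (linear kernel)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in x] for r1 in x]


def frob_inner(k, l):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for rk, rl in zip(k, l) for a, b in zip(rk, rl))


def linear_cka(x, y):
    """Linear CKA between representations x and y of the same examples."""
    k = gram(center_columns(x))
    l = gram(center_columns(y))
    return frob_inner(k, l) / math.sqrt(frob_inner(k, k) * frob_inner(l, l))


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
    y = [[-3.0 * b, 3.0 * a] for a, b in x]  # rotated and rescaled copy of x
    print(linear_cka(x, y))
```

Because the index compares Gram matrices, the two representations may have different numbers of features as long as they describe the same examples.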

openalex-author · arXiv (Cornell University)

Analyzing and Improving Representations with the Soft Nearest Neighbor Loss

We explore and expand the Soft Nearest Neighbor Loss to measure the entanglement of class manifolds in representation space: i.e., how close pairs of points from the same class are relative to pairs of points from different classes. We demonstrate several use cases of the loss. As an analytical tool, it provides insights into the evolution of class similarity structures during learning. Surprisingly, we find that maximizing the entanglement of representations of different classes in the hidden layers is beneficial for discrimination in the final layer, possibly because it encourages representations to identify class-independent similarity structures. Maximizing the soft nearest neighbor loss in the hidden layers leads not only to improved generalization but also to better-calibrated estimates of uncertainty on outlier data. Data that is not from the training distribution can be recognized by observing that in the hidden layers, it has fewer than the normal number of neighbors from the predicted class.
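
One way to make the entanglement measure concrete is the sketch below, which assumes squared Euclidean distance and a fixed temperature. For each point it compares the exponentiated closeness to same-class points against the closeness to all other points. The function name and the `temperature` default are illustrative, and the code assumes every point has at least one other point of its class in the batch.

```python
import math


def soft_nearest_neighbor_loss(points, labels, temperature=1.0):
    """Average of -log(same-class closeness / overall closeness).

    Low values mean same-class points sit close together relative to
    points of other classes (low entanglement).
    """
    total = 0.0
    n = len(points)
    for i in range(n):
        same, everyone = 0.0, 0.0
        for j in range(n):
            if i == j:
                continue
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w = math.exp(-sq_dist / temperature)
            everyone += w
            if labels[i] == labels[j]:
                same += w
        total += -math.log(same / everyone)
    return total / n


if __name__ == "__main__":
    pts = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
    print(soft_nearest_neighbor_loss(pts, [0, 0, 1, 1]))  # classes separated
    print(soft_nearest_neighbor_loss(pts, [0, 1, 0, 1]))  # classes entangled
```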

openalex-author · arXiv (Cornell University)

DARCCC: Detecting Adversaries by Reconstruction from Class Conditional Capsules

We present a simple technique that allows capsule models to detect adversarial images. In addition to being trained to classify images, the capsule model is trained to reconstruct the images from the pose parameters and identity of the correct top-level capsule. Adversarial images do not look like a typical member of the predicted class and they have much larger reconstruction errors when the reconstruction is produced from the top-level capsule for that class. We show that setting a threshold on the $\ell_2$ distance between the input image and its reconstruction from the winning capsule is very effective at detecting adversarial images for three different datasets. The same technique works quite well for CNNs that have been trained to reconstruct the image from all or part of the last hidden layer before the softmax. We then explore a stronger, white-box attack that takes the reconstruction error into account. This attack is able to fool our detection technique but in order to make the model change its prediction to another class, the attack must typically make the "adversarial" image resemble images of the other class.

openalex-author · JAMA

Deep Learning—A Technology With the Potential to Transform Health Care

In this Viewpoint, Geoffrey Hinton of Google’s Brain Team discusses the basics of neural networks: their underlying data structures, how they can be trained and combined to process complex health data sets, and future prospects for harnessing their unsupervised learning to clinical challenges.

openalex-author · arXiv (Cornell University)

Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures

The backpropagation of error algorithm (BP) is impossible to implement in a real brain. The recent success of deep networks in machine learning and AI, however, has inspired proposals for understanding how the brain might learn across multiple layers, and hence how it might approximate BP. As of yet, none of these proposals have been rigorously evaluated on tasks where BP-guided deep learning has proved critical, or in architectures more structured than simple fully-connected networks. Here we present results on scaling up biologically motivated models of deep learning on datasets which need deep networks with appropriate architectures to achieve good performance. We present results on the MNIST, CIFAR-10, and ImageNet datasets and explore variants of target-propagation (TP) and feedback alignment (FA) algorithms, and explore performance in both fully- and locally-connected architectures. We also introduce weight-transport-free variants of difference target propagation (DTP) modified to remove backpropagation from the penultimate layer. Many of these algorithms perform well for MNIST, but for CIFAR and ImageNet we find that TP and FA variants perform significantly worse than BP, especially for networks composed of locally connected units, opening questions about whether new architectures and algorithms are required to scale these approaches. Our results and implementation details help establish baselines for biologically motivated deep learning schemes going forward.

openalex-author · Figshare

Zero-Shot Learning with Semantic Output Codes

We consider the problem of zero-shot learning, where the goal is to learn a classifier f: X → Y that must predict novel values of Y that were omitted from the training set. To achieve this, we define the notion of a semantic output code classifier (SOC) which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes. We provide a formalism for this type of classifier and study its theoretical properties in a PAC framework, showing conditions under which the classifier can accurately predict novel classes. As a case study, we build a SOC classifier for a neural decoding task and show that it can often predict words that people are thinking about from functional magnetic resonance images (fMRI) of their neural activity, even without training examples for those words.

openalex-author · Proceedings of the AAAI Conference on Artificial Intelligence

Who Said What: Modeling Individual Labelers Improves Classification

Data are often labeled by many different experts with each expert only labeling a small fraction of the data and each data point being labeled by several experts. This reduces the workload on individual experts and also gives a better estimate of the unobserved ground truth. When experts disagree, the standard approaches are to treat the majority opinion as the correct label or to model the correct label as a distribution. These approaches, however, do not make any use of potentially valuable information about which expert produced which label. To make use of this extra information, we propose modeling the experts individually and then learning averaging weights for combining them, possibly in sample-specific ways. This allows us to give more weight to more reliable experts and take advantage of the unique strengths of individual experts at classifying certain types of data. Here we show that our approach leads to improvements in computer-aided diagnosis of diabetic retinopathy. We also show that our method performs better than competing algorithms by Welinder and Perona (2010); Mnih and Hinton (2012). Our work offers an innovative approach for dealing with the myriad real-world settings that use expert opinions to define labels for training.

openalex-author · arXiv (Cornell University)

Large scale distributed neural network training through online distillation

Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made. These predictions can come from a stale version of the other model so they can be safely computed using weights that only rarely get transmitted. Our second claim is that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible. We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing $6\times 10^{11}$ tokens and based on the Common Crawl repository of web data.

openalex-author · International Conference on Learning Representations

Matrix capsules with EM routing

A capsule is a group of neurons whose outputs represent different properties of the same entity. Each layer in a capsule network contains many capsules [a group of capsules forms a capsule layer and can be used in place of a traditional layer in a neural net]. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 matrix which could learn to represent the relationship between that entity and the viewer (the pose). A capsule in one layer votes for the pose matrix of many different capsules in the layer above by multiplying its own pose matrix by trainable viewpoint-invariant transformation matrices that could learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are iteratively updated for each image using the Expectation-Maximization algorithm such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The transformation matrices are trained discriminatively by backpropagating through the unrolled iterations of EM between each pair of adjacent capsule layers. On the smallNORB benchmark, capsules reduce the number of test errors by 45% compared to the state-of-the-art. Capsules also show far more resistance to white box adversarial attack than our baseline convolutional neural network.

openalex-author · Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search

We introduce Picturebook, a large-scale lookup operation to ground language via 'snapshots' of our physical world accessed through image search. For each word in a vocabulary, we extract the top-k images from Google image search and feed the images through a convolutional network to extract a word embedding. We introduce a multimodal gating function to fuse our Picturebook embeddings with other word representations. We also introduce Inverse Picturebook, a mechanism to map a Picturebook embedding back into words. We experiment and report results across a wide range of tasks: word similarity, natural language inference, semantic relatedness, sentiment/topic classification, image-sentence ranking and machine translation. We also show that gate activations corresponding to Picturebook embeddings are highly correlated to human judgments of concreteness ratings.

openalex-author · Computer Science Department

Distributed representations

Computer Science Department

openalex-author · Paper

Distilling a Neural Network Into a Soft Decision Tree

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Dynamic Routing Between Capsules

A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: A lower-level capsule prefers to send its output to higher level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.
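
The routing-by-agreement idea can be sketched as a small loop. The version below is a simplified illustration under stated assumptions: predictions are supplied directly rather than computed from transformation matrices, the `squash` nonlinearity (which keeps vector lengths below one) and the choice of three iterations are assumptions of this example, and all names are hypothetical.

```python
import math


def squash(v):
    """Shrink a vector so its length lies in [0, 1) without changing direction."""
    sq = sum(x * x for x in v)
    if sq == 0.0:
        return [0.0] * len(v)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in v]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(predictions, iterations=3):
    """predictions[i][j] is lower capsule i's predicted vector for upper capsule j."""
    n_lower, n_upper = len(predictions), len(predictions[0])
    dim = len(predictions[0][0])
    logits = [[0.0] * n_upper for _ in range(n_lower)]
    for _ in range(iterations):
        # Each lower capsule distributes its output over the upper capsules.
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(n_upper):
            s = [
                sum(coupling[i][j] * predictions[i][j][d] for i in range(n_lower))
                for d in range(dim)
            ]
            outputs.append(squash(s))
        # Agreement (scalar product) raises the routing logit.
        for i in range(n_lower):
            for j in range(n_upper):
                logits[i][j] += sum(
                    p * v for p, v in zip(predictions[i][j], outputs[j])
                )
    return coupling, outputs


if __name__ == "__main__":
    # Both lower capsules agree about upper capsule 0 and disagree about capsule 1.
    preds = [
        [[1.0, 0.0], [1.0, 0.0]],
        [[1.0, 0.0], [-1.0, 0.0]],
    ]
    coupling, outputs = route(preds)
    print(coupling)
    print(outputs)
```

In the usage example the coupling coefficients shift toward upper capsule 0, where the predictions agree, while the conflicting predictions for capsule 1 cancel out.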

openalex-author · Communications of the ACM

ImageNet classification with deep convolutional neural networks

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully connected layers we employed a recently developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

openalex-author · arXiv (Cornell University)

Regularizing Neural Networks by Penalizing Confident Output Distributions

We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence. We exhaustively evaluate the proposed confidence penalty and label smoothing on 6 common benchmarks: image classification (MNIST and CIFAR-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). We find that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.
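
A confidence penalty of this kind can be sketched as the usual negative log-likelihood minus a weighted entropy term. The weight `beta` and the function names are assumptions made for this illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_penalty_loss(logits, label, beta=0.1):
    """Negative log-likelihood minus beta times the output entropy.

    Subtracting the entropy rewards less peaked (less confident) outputs.
    """
    probs = softmax(logits)
    nll = -math.log(probs[label])
    return nll - beta * entropy(probs)


if __name__ == "__main__":
    print(confidence_penalty_loss([2.0, 0.5, -1.0], label=0, beta=0.1))
```

Setting `beta=0.0` recovers the plain cross-entropy loss.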

openalex-author · arXiv (Cornell University)

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
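
The sparse combination of experts can be illustrated with a top-k gate. In this sketch the gate logits are passed in directly, whereas the abstract describes a trainable gating network that computes them from the input; the value `k=2`, the absence of any noise or load-balancing terms, and all names are assumptions of the example.

```python
import math


def top_k_gates(gate_logits, k=2):
    """Softmax over the k largest gate logits; other experts get no weight."""
    top = sorted(
        range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True
    )[:k]
    m = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sparse_moe(x, experts, gate_logits, k=2):
    """Evaluate only the selected experts and mix their outputs."""
    gates = top_k_gates(gate_logits, k)
    return sum(g * experts[i](x) for i, g in gates.items())


if __name__ == "__main__":
    # Toy scalar "experts" standing in for feed-forward sub-networks.
    experts = [lambda x: x, lambda x: 2 * x, lambda x: 100 * x]
    print(sparse_moe(2.0, experts, gate_logits=[1.0, 1.0, -5.0], k=2))
```

Only the selected experts are called, which is what lets capacity grow without a proportional increase in computation per example.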

openalex-author · Encyclopedia of Machine Learning and Data Mining

Deep Belief Nets.

Deep belief nets are probabilistic generative models that are composed of multiple layers of stochastic latent variables (also called “feature detectors” or “hidden units”). The top two layers have undirected, symmetric connections between them and form an associative memory. The lower layers receive top-down, directed connections from the layer above. Deep belief nets have two important computational properties. First, there is an efficient procedure for learning the top-down, generative weights that specify how the variables in one layer determine the probabilities of variables in the layer below. This procedure learns one layer of latent variables at a time. Second, after learning multiple layers, the values of the latent variables in every layer can be inferred by a single, bottom-up pass that starts with an observed data vector in the bottom layer and uses the generative weights in the reverse direction.

openalex-author · Encyclopedia of Machine Learning and Data Mining

Boltzmann Machines

No abstract available from the OpenAlex source record.

openalex-author · Paper

Who Said What: Modelling Individual Labels Improves Classification

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

Using Fast Weights to Attend to the Recent Past

Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: Neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These "fast weights" can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.

openalex-author · arXiv (Cornell University)

Layer Normalization

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
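
The computation described here can be sketched for a single training case as follows. The small constant `eps` added for numerical stability is an assumption of this example, as are the function and argument names.

```python
import math


def layer_norm(summed_inputs, gain, bias, eps=1e-5):
    """Normalize one layer's summed inputs for a single training case.

    The mean and variance are taken over the neurons in the layer, then each
    neuron applies its own gain and bias before the non-linearity.
    """
    n = len(summed_inputs)
    mean = sum(summed_inputs) / n
    var = sum((a - mean) ** 2 for a in summed_inputs) / n
    std = math.sqrt(var + eps)  # eps guards against division by zero
    return [
        g * (a - mean) / std + b
        for a, g, b in zip(summed_inputs, gain, bias)
    ]


if __name__ == "__main__":
    print(layer_norm([1.0, 2.0, 3.0, 4.0], gain=[1.0] * 4, bias=[0.0] * 4))
```

Because the statistics come from one case only, the same function applies unchanged at training and test time and at every time step of a recurrent network.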

openalex-author · Nature

Deep learning

No abstract available from the OpenAlex source record.

openalex-author · International Journal of Computer Vision

Guest Editorial: Deep Learning

No abstract available from the OpenAlex source record.

openalex-author · arXiv (Cornell University)

A Simple Way to Initialize Recurrent Networks of Rectified Linear Units

Learning long term dependencies in recurrent networks is difficult due to vanishing and exploding gradients. To overcome this difficulty, researchers have developed sophisticated optimization techniques and network architectures. In this paper, we propose a simpler solution that uses recurrent neural networks composed of rectified linear units. Key to our solution is the use of the identity matrix or its scaled version to initialize the recurrent weight matrix. We find that our solution is comparable to LSTM on our four benchmarks: two toy problems involving long-range temporal structures, a large language modeling problem and a benchmark speech recognition problem.
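
A minimal sketch of the initialization, assuming a plain recurrent step of the form h_t = relu(W_hh h_{t-1} + W_xh x_t + b). The function names and the list-of-lists matrix layout are choices made for this example.

```python
def identity_init(size, scale=1.0):
    """Recurrent weight matrix initialized to scale * identity."""
    return [[scale if i == j else 0.0 for j in range(size)] for i in range(size)]


def relu_rnn_step(h_prev, x, w_hh, w_xh, bias):
    """One step of a recurrent network of rectified linear units."""
    h_new = []
    for i in range(len(h_prev)):
        total = bias[i]
        total += sum(w * h for w, h in zip(w_hh[i], h_prev))
        total += sum(w * v for w, v in zip(w_xh[i], x))
        h_new.append(max(0.0, total))  # rectified linear unit
    return h_new


if __name__ == "__main__":
    w_hh = identity_init(3)
    w_xh = [[0.0] for _ in range(3)]  # no input influence in this demo
    h = [0.5, 0.0, 2.0]
    print(relu_rnn_step(h, [0.0], w_hh, w_xh, [0.0] * 3))
```

With the identity recurrence and no input, a non-negative hidden state is copied forward unchanged, which is the property that helps information persist over long time spans at the start of training.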

openalex-author · arXiv (Cornell University)

Distilling the Knowledge in a Neural Network

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.

openalex-author · arXiv (Cornell University)

Grammar as a Foreign Language

Syntactic constituency parsing is a fundamental problem in natural language processing and has been the subject of intensive research and engineering for decades. As a result, the most accurate parsers are domain specific, complex, and inefficient. In this paper we show that the domain agnostic attention-enhanced sequence-to-sequence model achieves state-of-the-art results on the most widely used syntactic constituency parsing dataset, when trained on a large synthetic corpus that was annotated using existing parsers. It also matches the performance of standard parsers when trained only on a small human-annotated dataset, which shows that this model is highly data-efficient, in contrast to sequence-to-sequence models without the attention mechanism. Our parser is also fast, processing over a hundred sentences per second with an unoptimized CPU implementation.

openalex-author · Interspeech 2014

Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models

We describe a simple but effective way of using multi-frame targets to improve the accuracy of Artificial Neural Network-Hidden Markov Model (ANN-HMM) hybrid systems. In this approach a Deep Neural Network (DNN) is trained to predict the forced-alignment state of multiple frames using a separate softmax unit for each of the frames. This is in contrast to the usual method of training a DNN to predict only the state of the central frame. By itself this is not sufficient to improve accuracy of the system significantly. However, if we average the predictions for each frame from the different contexts it is associated with we achieve state-of-the-art results on TIMIT using a fully connected Deep Neural Network without convolutional architectures or dropout training. On a 14 hour subset of Wall Street Journal (WSJ) using a context dependent DNN-HMM system it leads to a relative improvement of 6.4% on the dev set (testdev93) and 9.3% on test set (test-eval92).

openalex-author · IEEE/ACM Transactions on Audio, Speech, and Language Processing

Application of Deep Belief Networks for Natural Language Understanding

Applications of Deep Belief Nets (DBN) to various problems have been the subject of a number of recent studies ranging from image classification and speech recognition to audio classification. In this study we apply DBNs to a natural language understanding problem. The recent surge of activity in this area was largely spurred by the development of a greedy layer-wise pretraining method that uses an efficient learning algorithm called Contrastive Divergence (CD). CD allows DBNs to learn a multi-layer generative model from unlabeled data and the features discovered by this model are then used to initialize a feed-forward neural network which is fine-tuned with backpropagation. We compare a DBN-initialized neural network to three widely used text classification algorithms: Support Vector Machines (SVM), boosting and Maximum Entropy (MaxEnt). The plain DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models. However, using additional unlabeled data for DBN pre-training and combining DBN-based learned features with the original features provides significant gains over SVMs, which, in turn, performed better than both MaxEnt and Boosting.

openalex-author · http://www.cs.toronto.edu/%7Ersalakhu/papers/srivastava14a.pdf

Dropout: a simple way to prevent neural networks from overfitting

Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different “thinned” networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
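
The train-time and test-time behaviour can be sketched as below. The sketch assumes each unit is retained independently with probability `retain_prob`, and it scales activations at test time, which has the same effect on the next layer as shrinking the outgoing weights. The function names are hypothetical.

```python
import random


def dropout_train(activations, retain_prob, rng):
    """Sample a thinned network: keep each unit with probability retain_prob."""
    return [a if rng.random() < retain_prob else 0.0 for a in activations]


def dropout_test(activations, retain_prob):
    """Use every unit, scaled down to match the expected train-time output."""
    return [retain_prob * a for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)
    print(dropout_train([1.0, 2.0, 3.0, 4.0], retain_prob=0.5, rng=rng))
    print(dropout_test([1.0, 2.0, 3.0, 4.0], retain_prob=0.5))
```

Averaged over many sampled masks, the train-time output of a unit approaches the scaled test-time output, which is the sense in which the single unthinned network approximates the ensemble of thinned ones.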

openalex-author · arXiv (Cornell University)

Modeling Documents with Deep Boltzmann Machines

We introduce a Deep Boltzmann Machine model suitable for modeling and extracting latent semantic representations from a large unstructured collection of documents. We overcome the apparent difficulty of training a DBM with judicious parameter tying. This parameter tying enables an efficient pretraining algorithm and a state initialization scheme that aids inference. The model can be trained just as efficiently as a standard Restricted Boltzmann Machine. Our experiments show that the model assigns better log probability to unseen data than the Replicated Softmax model. Features extracted from our model outperform LDA, Replicated Softmax, and DocNADE models on document retrieval and document classification tasks.

openalex-author · Interspeech 2013

Using an autoencoder with deformable templates to discover features for automated speech recognition

In this paper we show how we can discover non-linear features of frames of spectrograms using a novel autoencoder. The autoencoder uses a neural network encoder that predicts how a set of prototypes called templates need to be transformed to reconstruct the data, and a decoder that is a function that performs this operation of transforming prototypes and reconstructing the input. We demonstrate this method on spectrograms from the TIMIT database. The features are used in a Deep Neural Network Hidden Markov Model (DNN-HMM) hybrid system for automatic speech recognition. On the TIMIT monophone recognition task we were able to achieve gains of 0.5% over Mel log spectra, by augmenting the traditional spectra with the predicted transformation parameters. Further, using the recently discovered ‘dropout’ training, we were able to achieve a phone error rate (PER) of 17.9% on the dev set and 19.5% on the test set, which, to our knowledge, is the best reported number on this task using a hybrid system.

openalex-author · http://www.cs.toronto.edu/~rsalakhu/papers/uai13.pdf

Modeling documents with a Deep Boltzmann Machine

We introduce a type of Deep Boltzmann Machine (DBM) that is suitable for extracting distributed semantic representations from a large unstructured collection of documents. We overcome the apparent difficulty of training a DBM with judicious parameter tying. This enables an efficient pretraining algorithm and a state initialization scheme for fast inference. The model can be trained just as efficiently as a standard Restricted Boltzmann Machine. Our experiments show that the model assigns better log probability to unseen data than the Replicated Softmax model. Features extracted from our model outperform LDA, Replicated Softmax, and DocNADE models on document retrieval and document classification tasks.

openalex-author · Cognitive Science

Where Do Features Come From?

It is possible to learn multiple layers of non-linear features by backpropagating error derivatives through a feedforward neural network. This is a very effective learning procedure when there is a huge amount of labeled training data, but for many learning tasks very few labeled examples are available. In an effort to overcome the need for labeled data, several different generative models were developed that learned interesting features by modeling the higher order statistical structure of a set of input vectors. One of these generative models, the restricted Boltzmann machine (RBM), has no connections between its hidden units and this makes perceptual inference and learning much simpler. More significantly, after a layer of hidden features has been learned, the activities of these features can be used as training data for another RBM. By applying this idea recursively, it is possible to learn a deep hierarchy of progressively more complicated features without requiring any labeled data. This deep hierarchy can then be treated as a feedforward neural network which can be discriminatively fine-tuned using backpropagation. Using a stack of RBMs to initialize the weights of a feedforward neural network allows backpropagation to work effectively in much deeper networks and it leads to much better generalization. A stack of RBMs can also be used to initialize a deep Boltzmann machine that has many hidden layers. Combining this initialization method with a new method for fine-tuning the weights finally leads to the first efficient way of training Boltzmann machines with many hidden layers and millions of weights.

openalex-author · International Conference on Machine Learning

Tensor Analyzers

Factor Analysis is a statistical method that seeks to explain linear variations in data by using unobserved latent variables. Due to its additive nature, it is not suitable for modeling data that is generated by multiple groups of latent factors which interact multiplicatively. In this paper, we introduce Tensor Analyzers which are a multilinear generalization of Factor Analyzers. We describe an efficient way of sampling from the posterior distribution over factor values and we demonstrate that these samples can be used in the EM algorithm for learning interesting mixture models of natural image patches. Tensor Analyzers can also accurately recognize a face under significant pose and illumination variations when given only one previous image of that face. We also show that Tensor Analyzers can be trained in unsupervised, semi-supervised, or fully supervised settings.

openalex-author · http://www.cs.toronto.edu/~gdahl/papers/momentumNesterovDeepLearning.pdf

On the importance of initialization and momentum in deep learning

Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned. Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.
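
The idea of momentum with a slowly increasing schedule can be sketched on a toy problem. This is an illustration under stated assumptions, not the paper's experimental setup: the schedule in `momentum_at`, the cap `mu_max`, the ramp length `ramp`, and the quadratic objective are all hypothetical choices made for the example.

```python
def momentum_at(step, mu_max=0.99, ramp=250):
    """Hypothetical schedule: 0.5, 0.75, 0.833, ... rising every `ramp` steps,
    capped at mu_max."""
    return min(mu_max, 1.0 - 0.5 / (1 + step // ramp))

def sgd_momentum(grad, w0, lr, steps):
    """Classical momentum: v <- mu * v - lr * grad(w); w <- w + v."""
    w, v = w0, 0.0
    for t in range(steps):
        mu = momentum_at(t)
        v = mu * v - lr * grad(w)
        w = w + v
    return w

if __name__ == "__main__":
    # Toy objective f(w) = 0.5 * w**2, so grad(w) = w and the minimum is w = 0.
    print(sgd_momentum(lambda w: w, w0=5.0, lr=0.01, steps=5000))
```

Starting with a low momentum and raising it gradually keeps early updates cautious while letting later updates accumulate velocity along directions of low curvature.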

openalex-author · 2013 IEEE International Conference on Acoustics, Speech and Signal Processing

On rectified linear units for speech processing

Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units. These units are linear when their input is positive and zero otherwise. In a supervised setting, we can successfully train very deep nets from random initialization on a large vocabulary speech recognition task achieving lower word error rates than using a logistic network with the same topology. Similarly in an unsupervised setting, we show how we can learn sparse features that can be useful for discriminative tasks. All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data.
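
The computational unit described in this abstract, a linear projection followed by a point-wise non-linearity, can be written out in a few lines. The helper names `unit`, `relu`, and `logistic` and the example weights are assumptions made for illustration.

```python
import math

def logistic(z):
    """Logistic (sigmoid) non-linearity: squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit: linear for positive input, zero otherwise."""
    return z if z > 0 else 0.0

def unit(inputs, weights, bias, nonlinearity):
    """Linear projection of the inputs followed by a point-wise non-linearity."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return nonlinearity(z)

if __name__ == "__main__":
    x = [1.0, 2.0]       # hypothetical input frame
    w = [0.5, -1.0]      # hypothetical weights
    print(unit(x, w, 0.0, logistic))  # always strictly between 0 and 1
    print(unit(x, w, 0.0, relu))      # exactly zero when the projection is negative
```

Swapping the non-linearity is a one-argument change, which is why the two kinds of network can share the same topology.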

openalex-author · 2013 IEEE International Conference on Acoustics, Speech and Signal Processing

New types of deep neural network learning for speech recognition and related applications: an overview

In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP-2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications,” as organized by the authors. We also describe the historical context in which acoustic models based on deep neural networks have been developed. The technical overview of the papers presented in our special session is organized into five ways of improving deep learning methods: (1) better optimization; (2) better types of neural activation function and better network architectures; (3) better ways to determine the myriad hyper-parameters of deep neural networks; (4) more appropriate ways to preprocess speech for deep neural networks; and (5) ways of leveraging multiple languages or dialects that are more easily achieved with deep neural networks than with Gaussian mixture models.

openalex-author · 2013 IEEE International Conference on Acoustics, Speech and Signal Processing

Speech recognition with deep recurrent neural networks

Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However, RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

openalex-author · 2013 IEEE International Conference on Acoustics, Speech and Signal Processing

Improving deep neural networks for LVCSR using rectified linear units and dropout

Recently, pre-trained deep neural networks (DNNs) have outperformed traditional acoustic models based on Gaussian mixture models (GMMs) on a variety of large vocabulary speech recognition benchmarks. Deep neural nets have also achieved excellent results on various computer vision tasks using a random “dropout” procedure that drastically improves generalization error by randomly omitting a fraction of the hidden units in all layers. Since dropout helps avoid over-fitting, it has also been successful on a small-scale phone recognition task using larger neural nets. However, training deep neural net acoustic models for large vocabulary speech recognition takes a very long time and dropout is likely to only increase training time. Neural networks with rectified linear unit (ReLU) non-linearities have been highly successful for computer vision tasks and proved faster to train than standard sigmoid units, sometimes also improving discriminative performance. In this work, we show on a 50-hour English Broadcast News task that modified deep neural networks using ReLUs trained with dropout during frame level training provide a 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improvement over a strong GMM/HMM system. We were able to obtain our results with minimal human hyper-parameter tuning using publicly available Bayesian optimization code.

openalex-author · http://openreview.net/file/fbfec4a2-5503-45ce-ba9d-78f290c2f044.pdf

Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine

We introduce a type of Deep Boltzmann Machine (DBM) that is suitable for extracting distributed semantic representations from a large unstructured collection of documents. We propose an approximate inference method that interacts with learning in a way that makes it possible to train the DBM more efficiently than previously proposed methods. Even though the model has two hidden layers, it can be trained just as efficiently as a standard Restricted Boltzmann Machine. Our experiments show that the model assigns better log probability to unseen data than the Replicated Softmax model. Features extracted from our model outperform LDA, Replicated Softmax, and DocNADE models on document retrieval and document classification tasks.

openalex-author · IEEE Transactions on Pattern Analysis and Machine Intelligence

Modeling Natural Images Using Gated MRFs

This paper describes a Markov Random Field for real-valued image modeling that has two sets of latent variables. One set is used to gate the interactions between all pairs of pixels, while the second set determines the mean intensities of each pixel. This is a powerful model with a conditional distribution over the input that is Gaussian, with both mean and covariance determined by the configuration of latent variables, which is unlike previous models that were restricted to using Gaussians with either a fixed mean or a diagonal covariance matrix. Thanks to the increased flexibility, this gated MRF can generate more realistic samples after training on an unconstrained distribution of high-resolution natural images. Furthermore, the latent variables of the model can be inferred efficiently and can be used as very effective descriptors in recognition tasks. Both generation and discrimination drastically improve as layers of binary latent variables are added to the model, yielding a hierarchical model called a Deep Belief Network.

openalex-author · arXiv (Cornell University)

Discovering Multiple Constraints that are Frequently Approximately Satisfied

Some high-dimensional data sets can be modelled by assuming that there are many different linear constraints, each of which is Frequently Approximately Satisfied (FAS) by the data. The probability of a data vector under the model is then proportional to the product of the probabilities of its constraint violations. We describe three methods of learning products of constraints using a heavy-tailed probability distribution for the violations.
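
The product form stated in this abstract can be written compactly. The notation below is an assumption made for illustration and is not taken from the paper: $\mathbf{x}$ is a data vector, constraint $i$ has direction $\mathbf{w}_i$ and offset $b_i$, its violation is $v_i(\mathbf{x})$, and $f$ is a heavy-tailed density over violations.

```latex
v_i(\mathbf{x}) = \mathbf{w}_i^{\top}\mathbf{x} - b_i ,
\qquad
p(\mathbf{x}) \;\propto\; \prod_{i} f\bigl(v_i(\mathbf{x})\bigr)
```

A heavy-tailed $f$ assigns high density to violations near zero without penalizing the occasional large violation too harshly, which is what allows a constraint to be satisfied frequently rather than always.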

openalex-author · Paper

Machine learning for aerial image labeling

Information extracted from aerial photographs has found applications in a wide range of areas including urban planning, crop and forest management, disaster relief, and climate modeling. At present, much of the extraction is still performed by human experts, making the process slow, costly, and error prone. The goal of this thesis is to develop methods for automatically extracting the locations of objects such as roads, buildings, and trees directly from aerial images. We investigate the use of machine learning methods trained on aligned aerial images and possibly outdated maps for labeling the pixels of an aerial image with semantic labels. We show how deep neural networks implemented on modern GPUs can be used to efficiently learn highly discriminative image features. We then introduce new loss functions for training neural networks that are partially robust to incomplete and poorly registered target maps. Finally, we propose two ways of improving the predictions of our system by introducing structure into the outputs of the neural networks. We evaluate our system on the largest and most-challenging road and building detection datasets considered in the literature and show that it works reliably under a wide variety of conditions. Furthermore, we are releasing the first large-scale road and building detection datasets to the public in order to facilitate future comparisons with other methods.

openalex-author · Paper

Training recurrent neural networks

Recurrent Neural Networks (RNNs) are powerful sequence models that were believed to be difficult to train, and as a result they were rarely used in machine learning applications. This thesis presents methods that overcome the difficulty of training RNNs, and applications of RNNs to challenging problems. We first describe a new probabilistic sequence model that combines Restricted Boltzmann Machines and RNNs. The new model is more powerful than similar models while being less difficult to train. Next, we present a new variant of the Hessian-free (HF) optimizer and show that it can train RNNs on tasks that have extreme long-range temporal dependencies, which were previously considered to be impossibly hard. We then apply HF to character-level language modelling and get excellent results. We also apply HF to optimal control and obtain RNN control laws that can successfully operate under conditions of delayed feedback and unknown disturbances. Finally, we describe a random parameter initialization scheme that allows gradient descent with momentum to train RNNs on problems with long-term dependencies. This directly contradicts widespread beliefs about the inability of first-order methods to do so, and suggests that previous attempts at training RNNs failed partly due to flaws in the random initialization.

openalex-author · Neural Information Processing Systems

A Better Way to Pretrain Deep Boltzmann Machines

We describe how the pretraining algorithm for Deep Boltzmann Machines (DBMs) is related to the pretraining algorithm for Deep Belief Networks and we show that under certain conditions, the pretraining procedure improves the variational lower bound of a two-hidden-layer DBM. Based on this analysis, we develop a different method of pretraining DBMs that distributes the modelling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn better generative models.

openalex-author · http://research.google.com/pubs/archive/38131.pdf

Deep Neural Networks for Acoustic Modeling in Speech Recognition

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks with many hidden layers that are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.

openalex-author · arXiv (Cornell University)

Efficient Parametric Projection Pursuit Density Estimation

Product models of low dimensional experts are a powerful way to avoid the curse of dimensionality. We present the "under-complete product of experts" (UPoE), where each expert models a one dimensional projection of the data. The UPoE is fully tractable and may be interpreted as a parametric probabilistic model for projection pursuit. Its ML learning rules are identical to the approximate learning rules proposed before for under-complete ICA. We also derive an efficient sequential learning algorithm and discuss its relationship to projection pursuit density estimation and feature induction algorithms for additive random field models.

openalex-author · IEEE Signal Processing Magazine

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

openalex-author · arXiv (Cornell University)

Improving neural networks by preventing co-adaptation of feature detectors

When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.

openalex-author · arXiv (Cornell University)

Deep Lambertian Networks

Visual perception is a challenging problem in part due to illumination variations. A possible solution is to first estimate an illumination invariant representation before using it for recognition. The object albedo and surface normals are examples of such representations. In this paper, we introduce a multilayer generative model where the latent variables include the albedo, surface normals, and the light source. Combining Deep Belief Nets with the Lambertian reflectance assumption, our model can learn good priors over the albedo from 2D images. Illumination variations can be explained by changing only the lighting latent variable in our model. By transferring learned knowledge from similar objects, albedo and surface normals estimation from a single image is possible in our model. Experiments demonstrate that our model is able to generalize as well as improve over standard baselines in one-shot face recognition.

openalex-author · http://www.cs.utoronto.ca/%7Ehinton/absps/noisy_maps.pdf

Learning to Label Aerial Images from Noisy Data

When training a system to label images, the amount of labeled training data tends to be a limiting factor. We consider the task of learning to label aerial images from existing maps. These provide abundant labels, but the labels are often incomplete and sometimes poorly registered. We propose two robust loss functions for dealing with these kinds of label noise and use the loss functions to train a deep neural network on two challenging aerial image datasets. The robust loss functions lead to big improvements in performance and our best system substantially outperforms the best published results on the task we consider.

openalex-author · arXiv (Cornell University)

Deep Mixtures of Factor Analysers

An efficient way to learn deep density models that have many layers of latent variables is to learn one layer at a time using a model that has only one layer of latent variables. After learning each layer, samples from the posterior distributions for that layer are used as training data for learning the next layer. This approach is commonly used with Restricted Boltzmann Machines, which are undirected graphical models with a single hidden layer, but it can also be used with Mixtures of Factor Analysers (MFAs) which are directed graphical models. In this paper, we present a greedy layer-wise learning algorithm for Deep Mixtures of Factor Analysers (DMFAs). Even though a DMFA can be converted to an equivalent shallow MFA by multiplying together the factor loading matrices at different levels, learning and inference are much more efficient in a DMFA and the sharing of each lower-level factor loading matrix by many different higher level MFAs prevents overfitting. We demonstrate empirically that DMFAs learn better density models than both MFAs and two types of Restricted Boltzmann Machine on a wide variety of datasets.

openalex-author · 2012 IEEE Conference on Computer Vision and Pattern Recognition

Robust Boltzmann Machines for recognition and denoising

While Boltzmann Machines have been successful at unsupervised learning and density modeling of images and speech data, they can be very sensitive to noise in the data. In this paper, we introduce a novel model, the Robust Boltzmann Machine (RoBM), which allows Boltzmann Machines to be robust to corruptions. In the domain of visual recognition, the RoBM is able to accurately deal with occlusions and noise by using multiplicative gating to induce a scale mixture of Gaussians over pixels. Image denoising and in-painting correspond to posterior inference in the RoBM. Our model is trained in an unsupervised fashion with unlabeled noisy data and can learn the spatial structure of the occluders. Compared to standard algorithms, the RoBM is significantly better at recognition and denoising on several face databases.

openalex-author · arXiv (Cornell University)

Products of Hidden Markov Models: It Takes N>1 to Tango

Products of Hidden Markov Models (PoHMMs) are an interesting class of generative models which have received little attention since their introduction. This may be in part due to their more computationally expensive gradient-based learning algorithm, and the intractability of computing the log likelihood of sequences under the model. In this paper, we demonstrate how the partition function can be estimated reliably via Annealed Importance Sampling. We perform experiments using contrastive divergence learning on rainfall data and data captured from pairs of people dancing. Our results suggest that advances in learning and evaluation for undirected graphical models and recent increases in available computing power make PoHMMs worth considering for complex time-series modeling tasks.

openalex-author · Neural Computation

An Efficient Learning Procedure for Deep Boltzmann Machines

We present a new learning algorithm for Boltzmann machines that contain many layers of hidden variables. Data-dependent statistics are estimated using a variational approximation that tends to focus on a single mode, and data-independent statistics are estimated using persistent Markov chains. The use of two quite different techniques for estimating the two types of statistic that enter into the gradient of the log likelihood makes it practical to learn Boltzmann machines with multiple hidden layers and millions of parameters. The learning can be made more efficient by using a layer-by-layer pretraining phase that initializes the weights sensibly. The pretraining also allows the variational inference to be initialized sensibly with a single bottom-up pass. We present results on the MNIST and NORB data sets showing that deep Boltzmann machines learn very good generative models of handwritten digits and 3D objects. We also show that the features discovered by deep Boltzmann machines are a very effective way to initialize the hidden layers of feedforward neural nets, which are then discriminatively fine-tuned.

openalex-author · 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Understanding how Deep Belief Networks perform acoustic modelling

Deep Belief Networks (DBNs) are a very competitive alternative to Gaussian mixture models for relating states of a hidden Markov model to frames of coefficients derived from the acoustic input. They are competitive for three reasons: DBNs can be fine-tuned as neural networks; DBNs have many non-linear hidden layers; and DBNs are generatively pre-trained. This paper illustrates how each of these three aspects contributes to the DBN's good recognition performance using both phone recognition performance on the TIMIT corpus and a dimensionally reduced visualization of the relationships between the feature vectors learned by the DBNs that preserves the similarity structure of the feature vectors at multiple scales. The same two methods are also used to investigate the most suitable type of input representation for a DBN.

openalex-author · arXiv (Cornell University)

Conditional Restricted Boltzmann Machines for Structured Output Prediction

Conditional Restricted Boltzmann Machines (CRBMs) are rich probabilistic models that have recently been applied to a wide range of problems, including collaborative filtering, classification, and modeling motion capture data. While much progress has been made in training non-conditional RBMs, these algorithms are not applicable to conditional models and there has been almost no work on training and generating predictions from conditional RBMs for structured output problems. We first argue that standard Contrastive Divergence-based learning may not be suitable for training CRBMs. We then identify two distinct types of structured output prediction problems and propose an improved learning algorithm for each. The first problem type is one where the output space has arbitrary structure but the set of likely output configurations is relatively small, such as in multi-label classification. The second problem is one where the output space is arbitrarily structured but where the output space variability is much greater, such as in image denoising or pixel labeling. We show that the new learning algorithms can work much better than Contrastive Divergence on both types of problems.

openalex-author · Lecture Notes in Computer Science

A Practical Guide to Training Restricted Boltzmann Machines

No abstract available from the OpenAlex source record.

openalex-author · Paper

The shared views of four research groups

No abstract available from the OpenAlex source record.

openalex-author · Machine Learning

Visualizing non-metric similarities in multiple maps

Techniques for multidimensional scaling visualize objects as points in a low-dimensional metric map. As a result, the visualizations are subject to the fundamental limitations of metric spaces. These limitations prevent multidimensional scaling from faithfully representing non-metric similarity data such as word associations or event co-occurrences. In particular, multidimensional scaling cannot faithfully represent intransitive pairwise similarities in a visualization, and it cannot faithfully visualize “central” objects. In this paper, we present an extension of a recently proposed multidimensional scaling technique called t-SNE. The extension aims to address the problems of traditional multidimensional scaling techniques when these techniques are used to visualize non-metric similarities. The new technique, called multiple maps t-SNE, alleviates these problems by constructing a collection of maps that reveal complementary structure in the similarity data. We apply multiple maps t-SNE to a large data set of word association data and to a data set of NIPS co-authorships, demonstrating its ability to successfully visualize non-metric similarities.

openalex-author · IEEE Transactions on Audio, Speech, and Language Processing

Introduction to the Special Section on Deep Learning for Speech and Language Processing

Current speech recognition systems, for example, typically use Gaussian mixture models (GMMs) to estimate the observation (or emission) probabilities of hidden Markov models (HMMs), and GMMs are generative models that have only one layer of latent variables. Instead of developing more powerful models, most of the research effort has gone into finding better ways of estimating the GMM parameters so that error rates are decreased or the margin between different classes is increased. The same observation holds for natural language processing (NLP) in which maximum entropy (MaxEnt) models and conditional random fields (CRFs) have been popular for the last decade. Both of these approaches use shallow models whose success largely depends on the use of carefully handcrafted features.

openalex-author · Communications of the ACM

A better way to learn features

No abstract available.

openalex-author · Neural Systems & Circuits

Machine learning for neuroscience

What is machine learning? Machine learning is a type of statistics that places particular emphasis on the use of advanced computational algorithms. As computers become more powerful, and modern experimental methods in areas such as imaging generate vast bodies of data, machine learning is becoming ever more important for extracting reliable and meaningful relationships and for making accurate predictions. Key strands of modern machine learning grew out of attempts to understand how large numbers of interconnected, more or less neuron-like elements could learn to achieve behaviourally meaningful computations and to extract useful features from images or sound waves. By the 1990s, key approaches had converged on an elegant framework called ‘graphical models’, explained in Koller and Friedman, in which the nodes of a graph represent variables such as edges and corners in an image, or phonemes and words in speech. The probabilistic relationships between nodes are represented by conditional probability tables or simple functions whose parameters are learned from the data. There are three main problems in fitting graphical models to data: inference, parameter learning and structure learning. The inference problem is how to infer the probable values of unobserved variables when the values of a subset of the variables have been observed, and is a problem that perceptual systems need to solve if they are to infer the hidden causes of their sensory input. The parameter-learning problem is how to adjust the parameters governing the way in which one variable influences another, so that the graphical model is a better fit to some observed data. In the brain, this is presumably done by changing synapse strengths. The structure-learning problem is how to decide which unobserved variables are needed and how they must be connected to model the correlations between observed variables. In the brain, evolution and early pruning of connections presumably have a large role to play in determining the structure. Could you provide a brief description of the methods of machine learning? Machine learning can be divided into three parts: 1) in supervised learning, the aim is to predict a class label or a real value from an input (classifying objects in images or predicting the future value of a stock are examples of this type of learning); 2) in unsupervised learning, the aim is to discover good features for representing the input data; and 3) in reinforcement learning, the aim is to discover what action should be performed next in order to maximize the eventual payoff.

openalex-author · http://www.cs.toronto.edu/%7Ejmartens/docs/RNN_Language.pdf

Generating Text with Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are very powerful sequence models that do not enjoy widespread use because it is extremely difficult to train them properly. Fortunately, recent advances in Hessian-free optimization have been able to overcome the difficulties associated with training RNNs, making it possible to apply them successfully to challenging sequence problems. In this paper we demonstrate the power of RNNs trained with the new Hessian-Free optimizer (HF) by applying them to character-level language modeling tasks. The standard RNN architecture, while effective, is not ideally suited for such tasks, so we introduce a new RNN variant that uses multiplicative (or “gated”) connections which allow the current input character to determine the transition matrix from one hidden state vector to the next. After training the multiplicative RNN with the HF optimizer for five days on 8 high-end Graphics Processing Units, we were able to surpass the performance of the best previous single method for character-level language modeling – a hierarchical nonparametric sequence model. To our knowledge this represents the largest recurrent neural network application to date.
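
The idea that the current character selects the hidden-to-hidden transition can be illustrated with a toy recurrence. This is only a sketch under stated assumptions: it uses the unfactored form with one full matrix per character, a tanh nonlinearity, and hand-picked 2x2 matrices. The names `step`, `run` and `W_by_char` are hypothetical, and neither the paper's actual parameterization nor the Hessian-free training is shown.

```python
import math

def step(h_prev, char, W_by_char, bias):
    # The current character selects which hidden-to-hidden matrix is applied.
    W = W_by_char[char]
    return [
        math.tanh(sum(W[i][j] * h_prev[j] for j in range(len(h_prev))) + bias[i])
        for i in range(len(W))
    ]

def run(text, W_by_char, bias, h0):
    # Fold the recurrence over the character sequence.
    h = h0
    for ch in text:
        h = step(h, ch, W_by_char, bias)
    return h

if __name__ == "__main__":
    # Hypothetical toy matrices: two characters, two hidden units.
    W_by_char = {
        "a": [[1.0, 1.0], [0.0, 1.0]],
        "b": [[1.0, 0.0], [1.0, 1.0]],
    }
    bias = [0.0, 0.0]
    h0 = [0.5, -0.5]
    print(run("ab", W_by_char, bias, h0))
    print(run("ba", W_by_char, bias, h0))
```

Because the two matrices do not commute, the same characters in a different order lead to different hidden states, which is the behaviour a character-dependent transition is meant to provide.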

openalex-author · CVPR 2011

On deep generative models with applications to recognition

The most popular way to use probabilistic models in vision is first to extract some descriptors of small image patches or object parts using well-engineered features, and then to use statistical learning tools to model the dependencies among these features and eventual labels. Learning probabilistic models directly on the raw pixel values has proved to be much more difficult and is typically only used for regularizing discriminative methods. In this work, we use one of the best pixel-level generative models of natural images, a gated MRF, as the lowest level of a deep belief network (DBN) that has several hidden layers. We show that the resulting DBN is very good at coping with occlusion when predicting expression categories from face images, and it can produce features that perform comparably to SIFT descriptors for discriminating different types of scene. The generative ability of the model also makes it easy to see what information is captured and what is lost at each level of representation.

openalex-author · CVPR 2011

Modeling the joint density of two images under a variety of transformations

We describe a generative model of the relationship between two images. The model is defined as a factored three-way Boltzmann machine, in which hidden variables collaborate to define the joint correlation matrix for image pairs. Modeling the joint distribution over pairs makes it possible to efficiently match images that are the same according to a learned measure of similarity. We apply the model to several face matching tasks, and show that it learns to represent the input images using task-specific basis functions. Matching performance is superior to previous similar generative models, including recent conditional models of transformations. We also show that the model can be used as a plug-in matching score to perform invariant classification.

openalex-author · 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Deep Belief Networks using discriminative features for phone recognition

Deep Belief Networks (DBNs) are multi-layer generative models. They can be trained to model windows of coefficients extracted from speech and they discover multiple layers of features that capture the higher-order statistical structure of the data. These features can be used to initialize the hidden units of a feed-forward neural network that is then trained to predict the HMM state for the central frame of the window. Initializing with features that are good at generating speech makes the neural network perform much better than initializing with random weights. DBNs have already been used successfully for phone recognition with input coefficients that are MFCCs or filterbank outputs. In this paper, we demonstrate that they work even better when their inputs are speaker adaptive, discriminative features. On the standard TIMIT corpus, they give phone error rates of 19.6% using monophone HMMs and a bigram language model and 19.4% using monophone HMMs and a trigram language model.

openalex-author · 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Deep belief nets for natural language call-routing

This paper considers the application of Deep Belief Nets (DBNs) to natural language call routing. DBNs have been successfully applied to a number of tasks, including image, audio and speech classification, thanks to the recent discovery of an efficient learning technique. DBNs learn a multi-layer generative model from unlabeled data and the features discovered by this model are then used to initialize a feed-forward neural network which is fine-tuned with backpropagation. We compare a DBN-initialized neural network to three widely used text classification algorithms: Support Vector Machines (SVM), Boosting and Maximum Entropy (MaxEnt). The DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models even though it currently uses an impoverished representation of the input.

openalex-author · 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Learning a better representation of speech soundwaves using restricted Boltzmann machines

State-of-the-art speech recognition systems rely on preprocessed speech features such as Mel cepstrum or linear predictive coding coefficients that collapse high-dimensional speech sound waves into low-dimensional encodings. While these have been successfully applied in speech recognition systems, such low-dimensional encodings may lose some relevant information and express other information in a way that makes it difficult to use for discrimination. Higher-dimensional encodings could both improve performance in recognition tasks and be applied to speech synthesis by better modeling the statistical structure of the sound waves. In this paper we present a novel approach for modeling speech sound waves using a restricted Boltzmann machine (RBM) with a novel type of hidden variable and we report initial results demonstrating phoneme recognition performance better than the current state of the art for methods based on Mel cepstrum coefficients.

openalex-author · http://www.cs.toronto.edu/%7Ehinton/absps/taylorJMLR.pdf

Two Distributed-State Models For Generating High-Dimensional Time Series

In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at

openalex-author · IEEE Transactions on Audio, Speech, and Language Processing

Acoustic Modeling Using Deep Belief Networks

Gaussian mixture models are currently the dominant technique for modeling the emission distribution of hidden Markov models for speech recognition. We show that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters. These networks are first pre-trained as a multi-layer generative model of a window of spectral feature vectors without making use of any discriminative information. Once the generative pre-training has designed the features, we perform discriminative fine-tuning using backpropagation to adjust the features slightly to make them better at predicting a probability distribution over the states of monophone hidden Markov models.

openalex-author · http://www.cs.toronto.edu/%7Ehinton/absps/esann-deep-final.pdf

Using very deep autoencoders for content-based image retrieval.

We show how to learn many layers of features on color images and we use these features to initialize deep autoencoders. We then use the autoencoders to map images to short binary codes. Using semantic hashing [6], 28-bit codes can be used to retrieve images that are similar to a query image in a time that is independent of the size of the database. This extremely fast retrieval makes it possible to search using multiple different transformations of the query image. 256-bit binary codes allow much more accurate matching and can be used to prune the set of images found using the 28-bit codes.

openalex-author · Lecture Notes in Computer Science

Transforming Auto-Encoders

No abstract available from the OpenAlex source record.

openalex-author · http://learning.cs.toronto.edu/%7Ehinton/absps/gatedsoftmax.pdf

Gated Softmax Classification

We describe a “log-bilinear” model that computes class probabilities by combining an input vector multiplicatively with a vector of binary latent variables. Even though the latent variables can take on exponentially many possible combinations of values, we can efficiently compute the exact probability of each class by marginalizing over the latent variables. This makes it possible to get the exact gradient of the log likelihood. The bilinear score-functions are defined using a three-dimensional weight tensor, and we show that factorizing this tensor allows the model to encode invariances inherent in a task by learning a dictionary of invariant basis functions. Experiments on a set of benchmark problems show that this fully probabilistic model can achieve classification performance that is competitive with (kernel) SVMs, backpropagation, and deep belief nets.

openalex-author · Neural Information Processing Systems

Generating more realistic images using gated MRF's

Probabilistic models of natural images are usually evaluated by measuring performance on rather indirect tasks, such as denoising and inpainting. A more direct way to evaluate a generative model is to draw samples from it and to check whether statistical properties of the samples match the statistics of natural images. This method is seldom used with high-resolution images, because current models produce samples that are very different from natural images, as assessed by even simple visual inspection. We investigate the reasons for this failure and we show that by augmenting existing models so that there are two sets of latent variables, one set modelling pixel intensities and the other set modelling image-specific pixel covariances, we are able to generate high-resolution images that look much more realistic than before. The overall model can be interpreted as a gated MRF where both pair-wise dependencies and mean intensities of pixels are modulated by the states of latent variables. Finally, we confirm that if we disallow weight-sharing between receptive fields that overlap each other, the gated MRF learns more efficient internal representations, as demonstrated in several recognition tasks.

openalex-author · http://learning.cs.toronto.edu/%7Ehinton/absps/nips_eyebm.pdf

Learning to combine foveal glimpses with a third-order Boltzmann machine

We describe a model based on a Boltzmann machine with third-order connections that can learn how to accumulate information about a shape over several fixations. The model uses a retina that only has enough high resolution pixels to cover a small area of the image, so it must decide on a sequence of fixations and it must combine the “glimpse” at each fixation with the location of the fixation before integrating the information with information from other glimpses of the same object. We evaluate this model on a synthetic dataset and two image classification datasets, showing that it can perform at least as well as a model trained on whole images.

openalex-author · Interspeech 2010

Binary coding of speech spectrograms using a deep auto-encoder

This paper reports our recent exploration of the layer-by-layer learning strategy for training a multi-layer generative model of patches of speech spectrograms. The top layer of the generative model learns binary codes that can be used for efficient compression of speech and could also be used for scalable speech recognition or rapid speech content retrieval. Each layer of the generative model is fully connected to the layer below and the weights on these connections are pretrained efficiently by using the contrastive divergence approximation to the log likelihood gradient. After layer-by-layer pre-training we “unroll” the generative model to form a deep auto-encoder, whose parameters are then fine-tuned using back-propagation. To reconstruct the full-length speech spectrogram, individual spectrogram segments predicted by their respective binary codes are combined using an overlap-and-add method. Experimental results on speech spectrogram coding demonstrate that the binary codes produce a log-spectral distortion that is approximately 2 dB lower than a subband vector quantization technique over the entire frequency range of wide-band speech. Index Terms: deep learning, speech feature extraction, neural networks, auto-encoder, binary codes, Boltzmann machine

openalex-author · Neural Computation

Comparing Classification Methods for Longitudinal fMRI Studies

We compare 10 methods of classifying fMRI volumes by applying them to data from a longitudinal study of stroke recovery: adaptive Fisher's linear and quadratic discriminant; gaussian naive Bayes; support vector machines with linear, quadratic, and radial basis function (RBF) kernels; logistic regression; two novel methods based on pairs of restricted Boltzmann machines (RBM); and K-nearest neighbors. All methods were tested on three binary classification tasks, and their out-of-sample classification accuracies are compared. The relative performance of the methods varies considerably across subjects and classification tasks. The best overall performers were adaptive quadratic discriminant, support vector machines with RBF kernels, and generatively trained pairs of RBMs.

openalex-author · Topics in Cognitive Science

Discovering Binary Codes for Documents by Learning Deep Generative Models

We describe a deep generative model in which the lowest layer represents the word-count vector of a document and the top layer represents a learned binary code for that document. The top two layers of the generative model form an undirected associative memory and the remaining layers form a belief net with directed, top-down connections. We present efficient learning and inference procedures for this type of generative model and show that it allows more accurate and much faster retrieval than latent semantic analysis. By using our method as a filter for a much slower method called TF-IDF we achieve higher accuracy than TF-IDF alone and save several orders of magnitude in retrieval time. By using short binary codes as addresses, we can perform retrieval on very large document sets in a time that is independent of the size of the document set using only one word of memory to describe each document.
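
Retrieval that uses binary codes as memory addresses can be sketched with an ordinary hash table. This is an illustration under stated assumptions, not the paper's implementation: the 8-bit codes are made up (a real system would obtain them from the trained generative model), and the helper names `build_index`, `hamming_ball` and `retrieve` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_index(codes):
    # Map each binary code (stored as an int) to the documents that share it.
    index = defaultdict(list)
    for doc_id, code in codes.items():
        index[code].append(doc_id)
    return index

def hamming_ball(code, n_bits, radius):
    # Yield every code within `radius` bit flips of `code`.
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p
            yield flipped

def retrieve(index, query_code, n_bits, radius=1):
    # The number of probes depends on n_bits and radius,
    # not on how many documents are indexed.
    hits = []
    for code in hamming_ball(query_code, n_bits, radius):
        hits.extend(index.get(code, []))
    return hits

if __name__ == "__main__":
    # Hypothetical 8-bit codes for three documents.
    codes = {"doc_a": 0b10110010, "doc_b": 0b10110011, "doc_c": 0b01001101}
    index = build_index(codes)
    print(sorted(retrieve(index, 0b10110010, n_bits=8, radius=1)))
```

With these toy codes the query returns `doc_a` (exact match) and `doc_b` (one bit away). The lookup probes a fixed number of addresses, which is why the retrieval time does not grow with the size of the document set.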

openalex-author · Journal of Vision

Turn that frown upside-down! Inferring facial actions from pairs of images in a neurally plausible computational model

Most approaches to image recognition focus on the problem of inferring a categorical label or action code from a static image, ignoring dynamic aspects of appearance that may be critical to perception. Even methods that examine behavior over time, such as in a video sequence, tend to label each image frame independently, ignoring frame-to-frame dynamics. This viewpoint suggests that it is time-independent categorical information that is important, and not the patterns of actions that relate stimulus configurations together across time. The current work focuses on face perception and demonstrates that there is important information that can be extracted from pairs of images by examining how the face transforms in appearance from one image to another. Using a biologically plausible neural network model called a conditional Restricted Boltzmann Machine that performs unsupervised Hebbian learning, we show that the network can infer various facial actions from a sequence of images (e.g., transforming a frown into a smile or moving the face from one location of the image frame to another). Critically, after inferring the actions relating two face images from one individual, the network can apply the transformation to a test face from an unknown individual, without any knowledge of facial identity, expressions, or muscle movements. By visualizing the factors that encode and break down facial actions into a distributed representation, we demonstrate a kind of factorial action code that the network learns in an unsupervised manner to separate identity characteristics from rigid (affine) and non-rigid expression transformations. Models of this sort suggest that neural representations of action can factor out specific information about a face or object such as its identity that remain constant from its dynamic behavior, both of which are important aspects of perceptual inference.

openalex-author · International Conference on Machine Learning

Rectified Linear Units Improve Restricted Boltzmann Machines

Restricted Boltzmann machines were developed using binary stochastic hidden units. These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases. The learning and inference rules for these Stepped Sigmoid Units are unchanged. They can be approximated efficiently by noisy, rectified linear units. Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset. Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.
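
The stepped-sigmoid construction can be checked numerically. The sketch below assumes bias offsets of -0.5, -1.5, -2.5, ... (the offsets used in the paper itself, not stated in this abstract) and truncates the infinite set of copies at a hypothetical `n_copies`. The summed activation stays close to the softplus function log(1 + e^x), for which max(0, x) is the cheap rectified-linear approximation.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def stepped_sigmoid(x, n_copies=200):
    # Sum of tied-weight binary units whose biases are offset by -0.5, -1.5, ...
    return sum(sigmoid(x - i + 0.5) for i in range(1, n_copies + 1))

def softplus(x):
    # log(1 + e^x), written to avoid overflow for large x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def relu(x):
    # Deterministic rectified linear unit (the noisy version adds Gaussian noise).
    return max(0.0, x)

if __name__ == "__main__":
    for x in (-3.0, 0.0, 2.0, 10.0):
        print(x, stepped_sigmoid(x), softplus(x), relu(x))
```

The truncation at `n_copies` only matters for inputs approaching that value; for the moderate inputs printed here the sum and the softplus agree to within about 0.01.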

openalex-author · 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Modeling pixel means and covariances using factorized third-order Boltzmann machines

Learning a generative model of natural images is a useful way of extracting features that capture interesting regularities. Previous work on learning such models has focused on methods in which the latent features are used to determine the mean and variance of each pixel independently, or on methods in which the hidden units determine the covariance matrix of a zero-mean Gaussian distribution. In this work, we propose a probabilistic model that combines these two approaches into a single framework. We represent each image using one set of binary latent features that model the image-specific covariance and a separate set that model the mean. We show that this approach provides a probabilistic framework for the widely used simple-cell complex-cell architecture, it produces very realistic samples of natural images and it extracts features that yield state-of-the-art recognition accuracy on the challenging CIFAR 10 dataset.

openalex-author · 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Dynamical binary latent variable models for 3D human pose tracking

We introduce a new class of probabilistic latent variable model called the Implicit Mixture of Conditional Restricted Boltzmann Machines (imCRBM) for use in human pose tracking. Key properties of the imCRBM are as follows: (1) learning is linear in the number of training exemplars so it can be learned from large datasets; (2) it learns coherent models of multiple activities; (3) it automatically discovers atomic “movemes” and (4) it can infer transitions between activities, even when such transitions are not present in the training set. We describe the model and how it is learned and we demonstrate its use in the context of Bayesian filtering for multi-view and monocular pose tracking. The model handles difficult scenarios including multiple activities and transitions among activities. We report state-of-the-art results on the HumanEva dataset.

openalex-author · http://learning.cs.toronto.edu/%7Ehinton/absps/ranzato_aistats2010.pdf

Factored 3-Way Restricted Boltzmann Machines For Modeling Natural Images

Deep belief nets have been successful in modeling handwritten characters, but it has proved more difficult to apply them to real images. The problem lies in the restricted Boltzmann machine (RBM) which is used as a module for learning deep belief nets one layer at a time. The Gaussian-Binary RBMs that have been used to model real-valued data are not a good way to model the covariance structure of natural images. We propose a factored 3-way RBM that uses the states of its hidden units to represent abnormalities in the local covariance structure of an image. This provides a probabilistic framework for the widely used simple/complex cell architecture. Our model learns binary features that work very well for object recognition on the “tiny images” data set. Even better features are obtained by then using standard binary RBMs to learn a deeper model.

openalex-author · Neural Computation

Learning to Represent Spatial Transformations with Factored Higher-Order Boltzmann Machines

To allow the hidden units of a restricted Boltzmann machine to model the transformation between two successive images, Memisevic and Hinton (2007) introduced three-way multiplicative interactions that use the intensity of a pixel in the first image as a multiplicative gain on a learned, symmetric weight between a pixel in the second image and a hidden unit. This creates cubically many parameters, which form a three-dimensional interaction tensor. We describe a low-rank approximation to this interaction tensor that uses a sum of factors, each of which is a three-way outer product. This approximation allows efficient learning of transformations between larger image patches. Since each factor can be viewed as an image filter, the model as a whole learns optimal filter pairs for efficiently representing transformations. We demonstrate the learning of optimal filter pairs from various synthetic and real image sequences. We also show how learning about image transformations allows the model to perform a simple visual analogy task, and we show how a completely unsupervised network trained on transformations perceives multiple motions of transparent dot patterns in the same way as humans.
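
The low-rank factorization can be made concrete with a small numerical check. In this sketch the interaction tensor is written as W[i][j][k] = sum over f of A[i][f] * B[j][f] * C[k][f]; the matrix names `A`, `B`, `C`, the toy sizes and the random values are assumptions for illustration. It shows that the input to each hidden unit can be computed from per-factor filter responses without ever forming the full tensor.

```python
import random

def full_hidden_input(W, x, y):
    # Unfactored model: input to hidden unit k is sum_ij W[i][j][k] * x[i] * y[j].
    n_x, n_y, n_h = len(W), len(W[0]), len(W[0][0])
    return [
        sum(W[i][j][k] * x[i] * y[j] for i in range(n_x) for j in range(n_y))
        for k in range(n_h)
    ]

def factored_hidden_input(A, B, C, x, y):
    # Factored model: filter each image, multiply the responses per factor,
    # then map the products to the hidden units.
    n_f = len(A[0])
    fx = [sum(A[i][f] * x[i] for i in range(len(x))) for f in range(n_f)]
    fy = [sum(B[j][f] * y[j] for j in range(len(y))) for f in range(n_f)]
    return [sum(C[k][f] * fx[f] * fy[f] for f in range(n_f)) for k in range(len(C))]

def tensor_from_factors(A, B, C):
    # Build the full tensor as a sum of three-way outer products.
    n_f = len(A[0])
    return [
        [
            [sum(A[i][f] * B[j][f] * C[k][f] for f in range(n_f)) for k in range(len(C))]
            for j in range(len(B))
        ]
        for i in range(len(A))
    ]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

if __name__ == "__main__":
    rng = random.Random(0)
    n_x, n_y, n_h, n_f = 5, 5, 4, 3  # toy sizes chosen for illustration
    A = rand_matrix(n_x, n_f, rng)
    B = rand_matrix(n_y, n_f, rng)
    C = rand_matrix(n_h, n_f, rng)
    x = [rng.uniform(-1, 1) for _ in range(n_x)]
    y = [rng.uniform(-1, 1) for _ in range(n_y)]
    W = tensor_from_factors(A, B, C)
    print(full_hidden_input(W, x, y))
    print(factored_hidden_input(A, B, C, x, y))
    print("parameters:", n_x * n_y * n_h, "full vs", n_f * (n_x + n_y + n_h), "factored")
```

The two printed vectors agree up to floating-point rounding. The full tensor needs n_x * n_y * n_h weights, while the factored form needs n_f * (n_x + n_y + n_h), which is what makes larger image patches tractable.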

openalex-author · 2010 IEEE International Conference on Acoustics, Speech and Signal Processing

Phone recognition using Restricted Boltzmann Machines

For decades, Hidden Markov Models (HMMs) have been the state-of-the-art technique for acoustic modeling despite their unrealistic independence assumptions and the very limited representational capacity of their hidden states. Conditional Restricted Boltzmann Machines (CRBMs) have recently proved to be very effective for modeling motion capture sequences and this paper investigates the application of this more powerful type of generative model to acoustic modeling. On the standard TIMIT corpus, one type of CRBM outperforms HMMs and is comparable with the best other methods, achieving a phone error rate (PER) of 26.7% on the TIMIT core test set.

openalex-author · Lecture Notes in Computer Science

Learning to Detect Roads in High-Resolution Aerial Images

No abstract available from the OpenAlex source record.

openalex-author · Paper

A Practical Guide to Training

No abstract available from the OpenAlex source record.

openalex-author · Neural Information Processing Systems

Replicated Softmax: an Undirected Topic Model

We introduce a two-layer undirected graphical model, called a Replicated Softmax, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.

openalex-author · http://learning.cs.toronto.edu/%7Ehinton/absps/vinodNORB.pdf

3D Object Recognition with Deep Belief Nets

We introduce a new type of top-level model for Deep Belief Nets and evaluate it on a 3D object recognition task. The top-level model is a third-order Boltzmann machine, trained using a hybrid algorithm that combines both generative and discriminative gradients. Performance is evaluated on the NORB database (normalized-uniform version), which contains stereo-pair images of objects under different lighting conditions and viewpoints. Our model achieves 6.5% error on the test set, which is close to the best published result for NORB (5.9%) using a convolutional neural net that has built-in knowledge of translation invariance. It substantially outperforms shallow models such as SVMs (11.6%). DBNs are especially suited for semi-supervised learning, and to demonstrate this we consider a modified version of the NORB recognition task in which additional unlabeled images are created by applying small translations to the images in the database. With the extra unlabeled data (and the same amount of labeled data as before), our model achieves 5.2% error.

openalex-author · Philosophical Transactions of the Royal Society B: Biological Sciences

Learning to represent visual input

One of the central problems in computational neuroscience is to understand how the object-recognition pathway of the cortex learns a deep hierarchy of nonlinear feature detectors. Recent progress in machine learning shows that it is possible to learn deep hierarchies without requiring any labelled data. The feature detectors are learned one layer at a time and the goal of the learning procedure is to form a good generative model of images, not to predict the class of each image. The learning procedure only requires the pairwise correlations between the activations of neuron-like processing units in adjacent layers. The original version of the learning procedure is derived from a quadratic 'energy' function but it can be extended to allow third-order, multiplicative interactions in which neurons gate the pairwise interactions between other neurons. A technique for factoring the third-order interactions leads to a learning module that again has a simple learning rule based on pairwise correlations. This module looks remarkably like modules that have been proposed by both biologists trying to explain the responses of neurons and engineers trying to create systems that can recognize objects.

openalex-author · Neural Networks

Temporal-Kernel Recurrent Neural Networks

No abstract available from the OpenAlex source record.

openalex-author · Proceedings of the 26th Annual International Conference on Machine Learning

Using fast weights to improve persistent contrastive divergence

The most commonly used learning algorithm for restricted Boltzmann machines is contrastive divergence which starts a Markov chain at a data point and runs the chain for only a few iterations to get a cheap, low variance estimate of the sufficient statistics under the model. Tieleman (2008) showed that better learning can be achieved by estimating the model's statistics using a small set of persistent "fantasy particles" that are not reinitialized to data points after each weight update. With sufficiently small weight updates, the fantasy particles represent the equilibrium distribution accurately but to explain why the method works with much larger weight updates it is necessary to consider the interaction between the weight updates and the Markov chain. We show that the weight updates force the Markov chain to mix fast, and using this insight we develop an even faster mixing chain that uses an auxiliary set of "fast weights" to implement a temporary overlay on the energy landscape. The fast weights learn rapidly but also decay rapidly and do not contribute to the normal energy landscape that defines the model.

openalex-author · Proceedings of the 26th Annual International Conference on Machine Learning

Factored conditional restricted Boltzmann Machines for modeling motion style

The Conditional Restricted Boltzmann Machine (CRBM) is a recently proposed model for time series that has a rich, distributed hidden state and permits simple, exact inference. We present a new model, based on the CRBM, that preserves its most important computational properties and includes multiplicative three-way interactions that allow the effective interaction weight between two units to be modulated by the dynamic state of a third unit. We factor the three-way weight tensor implied by the multiplicative model, reducing the number of parameters from O(N^3) to O(N^2). The result is an efficient, compact model whose effectiveness we demonstrate by modeling human motion. Like the CRBM, our model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve the model's ability to blend motion styles or to transition smoothly among them.

openalex-author · http://jmlr.org/proceedings/papers/v5/salakhutdinov09a/salakhutdinov09a.pdf

Deep Boltzmann machines

We present a new learning algorithm for Boltzmann machines that contain many layers of hidden variables. Data-dependent expectations are estimated using a variational approximation that tends to focus on a single mode, and data-independent expectations are approximated using persistent Markov chains. The use of two quite different techniques for estimating the two types of expectation that enter into the gradient of the log-likelihood makes it practical to learn Boltzmann machines with multiple hidden layers and millions of parameters. The learning can be made more efficient by using a layer-by-layer “pre-training” phase that allows variational inference to be initialized with a single bottom-up pass. We present results on the MNIST and NORB datasets showing that deep Boltzmann machines learn good generative models and perform well on handwritten digit and visual object recognition tasks.

openalex-author · 2009 IEEE International Solid-State Circuits Conference - Digest of Technical Papers

A family of 45nm IA processors

Nehalem is a family of next-generation IA processors for mobile, desktop and server segments implemented in 45nm high-kappa metal-gate CMOS. The family features a new system architecture, significantly enhanced Core architecture, innovations in power management and modular design. The 4-core 8MB-L3-cache die has 731M transistors. We introduce a coherent point-to-point link called QuickPath Interconnect that is the foundation for coherent communication between IA processors, chipsets, I/O hubs and coprocessors/accelerators for multiple generations. It features advanced power management, RAS capabilities and reduced hops/latency. At the physical level, it uses unidirectional variable-width differential signaling. The 45nm implementation features current-mode signaling with cascode driver, clock forwarding, source and sink termination, per-bit skew compensation and 2-tap driver equalization. Signaling speed is up to 6.4GT/s in 45nm but the basic approach scales to 20GT/s using conventional copper interconnect.

openalex-author · Neurocomputing

Improving a statistical language model through non-linear prediction

No abstract available from the OpenAlex source record.

openalex-author · Proceedings of the British Machine Vision Conference 2009

Learning generative texture models with extended Fields-of-Experts

We evaluate the ability of the popular Field-of-Experts (FoE) to model structure in images. As a test case we focus on modeling synthetic and natural textures. We find that even for modeling single textures, the FoE provides insufficient flexibility to learn good generative models – it does not perform any better than the much simpler Gaussian FoE. We propose an extended version of the FoE (allowing for bimodal potentials) and demonstrate that this novel formulation, when trained with a better approximation of the likelihood gradient, gives rise to a more powerful generative model of specific visual structure that produces significantly better results for the texture task.

openalex-author · The European Symposium on Artificial Neural Networks

Modeling pigeon behavior using a Conditional Restricted Boltzmann Machine.

In an effort to better understand the complex courtship behaviour of pigeons, we have built a model learned from motion capture data. We employ a Conditional Restricted Boltzmann Machine (CRBM) with binary latent features and real-valued visible units. The units are conditioned on information from previous time steps to capture dynamics. We validate a trained model by quantifying the characteristic head-bobbing present in pigeons. We also show how to predict missing data by marginalizing out the hidden variables and minimizing free energy.

openalex-author · Scholarpedia

Deep belief networks

No abstract available from the OpenAlex source record.

openalex-author · International Journal of Approximate Reasoning

Semantic hashing

No abstract available from the OpenAlex source record.

openalex-author · http://www.cs.toronto.edu/%7Eamnih/papers/hlbl_final.pdf

A Scalable Hierarchical Distributed Language Model

Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely-used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which was two orders of magnitude faster than the nonhierarchical model it was based on. However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models.

openalex-author · http://papers.nips.cc/paper/3577-generative-versus-discriminative-training-of-rbms-for-classification-of-fmri-images.pdf

Generative versus discriminative training of RBMs for classification of fMRI images

Neuroimaging datasets often have a very large number of voxels and a very small number of training cases, which means that overfitting of models for this data can become a very serious problem. Working with a set of fMRI images from a study on stroke recovery, we consider a classification task for which logistic regression performs poorly, even when L1- or L2-regularized. We show that much better discrimination can be achieved by fitting a generative model to each separate condition and then seeing which model is most likely to have generated the data. We compare discriminative training of exactly the same set of models, and we also consider convex blends of generative and discriminative training.

openalex-author · http://papers.nips.cc/paper/3482-using-matrices-to-model-symbolic-relationship.pdf

Using matrices to model symbolic relationship

We describe a way of learning matrix representations of objects and relationships. The goal of learning is to allow multiplication of matrices to represent symbolic relationships between objects and symbolic relationships between relationships, which is the main novelty of the method. We demonstrate that this leads to excellent generalization in two different domains: modular arithmetic and family relationships. We show that the same system can learn first-order propositions such as (2, 5) ∈ +3 or (Christopher, Penelope) ∈ has wife, and higher-order propositions such as (3,+3) ∈ plus and (+3,−3) ∈ inverse or (has husband, has wife) ∈ higher oppsex. We further demonstrate that the system understands how higher-order propositions are related to first-order ones by showing that it can correctly answer questions about first-order propositions involving the relations +3 or has wife even though it has not been trained on any first-order examples involving these relations.

openalex-author · http://papers.nips.cc/paper/3567-the-recurrent-temporal-restricted-boltzmann-machine.pdf

The Recurrent Temporal Restricted Boltzmann Machine

The Temporal Restricted Boltzmann Machine (TRBM) is a probabilistic model for sequences that is able to successfully model (i.e., generate nice-looking samples of) several very high dimensional sequences, such as motion capture data and the pixels of low resolution videos of balls bouncing in a box. The major disadvantage of the TRBM is that exact inference is extremely hard, since even computing a Gibbs update for a single variable of the posterior is exponentially expensive. This difficulty has necessitated the use of a heuristic inference procedure, that nonetheless was accurate enough for successful learning. In this paper we introduce the Recurrent TRBM, which is a very slight modification of the TRBM for which exact inference is very easy and exact gradient learning is almost tractable. We demonstrate that the RTRBM is better than an analogous TRBM at generating motion capture and videos of bouncing balls.

openalex-author · http://books.nips.cc/papers/files/nips21/NIPS2008_0935.pdf

Implicit Mixtures of Restricted Boltzmann Machines

We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures three-way interactions among visible units, hidden units, and a single hidden discrete variable that represents the cluster label. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data.

openalex-author · Lecture Notes in Computer Science

Analysis-by-Synthesis by Learning to Invert Generative Black Boxes

No abstract available from the OpenAlex source record.

openalex-author · Neural Computation

Deep, Narrow Sigmoid Belief Networks Are Universal Approximators

In this note, we show that exponentially deep belief networks can approximate any distribution over binary vectors to arbitrary accuracy, even when the width of each layer is limited to the dimensionality of the data. We further show that such networks can be greedily learned in an easy yet impractical way.

openalex-author · Journal of Machine Learning Research

Visualizing Data using t-SNE

We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large datasets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of datasets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the datasets.
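As a rough illustration of how map points are compared in t-SNE, the sketch below computes pairwise similarities with a heavy-tailed Student-t kernel (one degree of freedom), which is the kernel the published method uses in the low-dimensional map. The abstract itself does not spell this out, and the function name and toy coordinates are illustrative assumptions. The sketch covers only the map-side similarities, not the full optimization.

```python
def student_t_similarities(points):
    """Normalized pairwise similarities q[i][j] between map points.

    Each unnormalized weight is 1 / (1 + squared distance); the weights
    are then divided by their sum over all pairs with i != j.
    """
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights[i][j] = 1.0 / (1.0 + d2)
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


# Hypothetical 2-D map positions for three datapoints.
q = student_t_similarities([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)])
```

Because the kernel decays polynomially rather than exponentially, moderately distant pairs keep a non-negligible similarity, which is one way to reduce crowding in the center of the map.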

openalex-author · In: (pp. 493-498). (2008)

Improving a statistical language model by modulating the effects of context words

We show how to improve a state-of-the-art neural network language model that converts the previous context words into feature vectors and combines these feature vectors to predict the feature vector of the next word. Significant improvements in predictive accuracy are achieved by using higher-level features to modulate the effects of the context words. This is more effective than using the higher-level features to directly predict the feature vector of the next word, but it is also possible to combine both methods.

openalex-author · http://www.cs.toronto.edu/~hinton/absps/lateral.pdf

Modeling image patches with a directed hierarchy of Markov random fields

We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images.

openalex-author · Neural Information Processing Systems

Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.

openalex-author · Trends in Cognitive Sciences

Learning multiple layers of representation

No abstract available from the OpenAlex source record.

openalex-author · Advances in Neural Information Processing Systems 19

Modeling Human Motion Using Binary Latent Variables

We propose a non-linear generative model for human motion data that uses an undirected model with binary latent variables and real-valued “visible” variables that represent joint angles. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. Such an architecture makes on-line inference efficient and allows us to use a simple approximate learning procedure. After training, the model finds a single set of parameters that simultaneously capture several different kinds of motion. We demonstrate the power of our approach by synthesizing various motion sequences and by performing on-line filling in of data lost during motion capture.

openalex-author · Proceedings of the 24th international conference on Machine learning

Restricted Boltzmann machines for collaborative filtering

Most of the existing approaches to collaborative filtering cannot handle very large data sets. In this paper we show how a class of two-layer undirected graphical models, called Restricted Boltzmann Machines (RBM's), can be used to model tabular data, such as user's ratings of movies. We present efficient learning and inference procedures for this class of models and demonstrate that RBM's can be successfully applied to the Netflix data set, containing over 100 million user/movie ratings. We also show that RBM's slightly outperform carefully-tuned SVD models. When the predictions of multiple RBM models and multiple SVD models are linearly combined, we achieve an error rate that is well over 6% better than the score of Netflix's own system.

openalex-author · Proceedings of the 24th international conference on Machine learning

Three new graphical models for statistical language modelling

The supremacy of n-gram models in statistical language modelling has recently been challenged by parametric models that use distributed representations to counteract the difficulties caused by data sparsity. We propose three new probabilistic language models that define the distribution of the next word in a sequence given several preceding words by using distributed representations of those words. We show how real-valued distributed representations for words can be learned at the same time as learning a large set of stochastic binary hidden features that are used to predict the distributed representation of the next word from previous distributed representations. Adding connections from the previous states of the binary hidden features improves performance as does adding direct connections between the real-valued distributed representations. One of our models significantly outperforms the very best n-gram models.

openalex-author · 2007 IEEE Conference on Computer Vision and Pattern Recognition

Unsupervised Learning of Image Transformations

We describe a probabilistic model for learning rich, distributed representations of image transformations. The basic model is defined as a gated conditional random field that is trained to predict transformations of its inputs using a factorial set of latent variables. Inference in the model consists in extracting the transformation, given a pair of images, and can be performed exactly and efficiently. We show that, when trained on natural videos, the model develops domain specific motion features, in the form of fields of locally transformed edge filters. When trained on affine, or more general, transformations of still images, the model develops codes for these transformations, and can subsequently perform recognition tasks that are invariant under these transformations. It can also fantasize new transformations on previously unseen images. We describe several variations of the basic model and provide experimental results that demonstrate its applicability to a variety of tasks.

openalex-author · http://www.cs.toronto.edu/~rsalakhu/papers/nonlinnca.ps.gz

Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure

We show how to pretrain and fine-tune a multilayer neural network to learn a nonlinear transformation from the input space to a low-dimensional feature space in which K-nearest neighbour classification performs well. We also show how the non-linear transformation can be improved using unlabeled data. Our method achieves a much lower error rate than Support Vector Machines or standard backpropagation on a widely used version of the MNIST handwritten digit recognition task. If some of the dimensions of the low-dimensional feature space are not used for nearest neighbor classification, our method uses these dimensions to explicitly represent transformations of the digits that do not affect their identity.

openalex-author · http://learning.cs.toronto.edu/~hinton/absps/trbmTR.pdf

Learning Multilevel Distributed Representations for High-Dimensional Sequences

We describe a family of non-linear sequence models that is substantially more powerful than hidden Markov models or linear dynamical systems. Our models have simple approximate inference and learning procedures that work well in practice. Multilevel representations of sequential data can be learned one hidden layer at a time, and adding extra hidden layers improves the resulting generative models. The models can be trained with very high-dimensional, very non-linear data such as raw pixel sequences. Their performance is demonstrated using synthetic video sequences of two balls bouncing in a box.

openalex-author · http://learning.cs.toronto.edu/~hinton/absps/ampaper.pdf

Visualizing Similarity Data with a Mixture of Maps

We show how to visualize a set of pairwise similarities between objects by using several different two-dimensional maps, each of which captures different aspects of the similarity structure. When the objects are ambiguous words, for example, different senses of a word occur in different maps, so “river” and “loan” can both be close to “bank” without being at all close to each other. Aspect maps resemble clustering because they model pairwise similarities as a mixture of different types of similarity, but they also resemble local multi-dimensional scaling because they model each type of similarity by a two-dimensional map. We demonstrate our method on a toy example, a database of human word-association data, a large set of images of handwritten digits, and a set of feature vectors that represent words.

openalex-author · Scholarpedia

Boltzmann machine

No abstract available from the OpenAlex source record.

openalex-author · Progress in Brain Research

To recognize shapes, first learn to generate images

No abstract available from the OpenAlex source record.

openalex-author · Science

Reducing the Dimensionality of Data with Neural Networks

High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

openalex-author · Cognitive Science

Unsupervised Discovery of Nonlinear Structure Using Contrastive Backpropagation

We describe a way of modeling high-dimensional data vectors by using an unsupervised, nonlinear, multilayer neural network in which the activity of each neuron-like unit makes an additive contribution to a global energy score that indicates how surprised the network is by the data vector. The connection weights that determine how the activity of each unit depends on the activities in earlier layers are learned by minimizing the energy assigned to data vectors that are actually observed and maximizing the energy assigned to "confabulations" that are generated by perturbing an observed data vector in a direction that decreases its energy under the current model.

openalex-author · Neural Computation

A Fast Learning Algorithm for Deep Belief Nets

We show how to use "complementary priors" to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.

openalex-author · Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005.

Embedding via clustering: using spectral information to guide dimensionality reduction

We describe an approach to improve iterative dimensionality reduction methods by using information contained in the leading eigenvectors of a data affinity matrix. Using an insight from the area of spectral clustering, we suggest modifying the gradient of an iterative method, so that latent space elements belonging to the same cluster are encouraged to move in similar directions during optimization. We also describe a way to achieve this without actually having to explicitly perform an eigendecomposition. Preliminary experiments show that our approach makes it possible to speed up iterative methods and helps them to find better local minima of their objective function.

openalex-author · Neural Computation

Topographic Product Models Applied to Natural Scene Statistics

We present an energy-based model that uses a product of generalized Student-t distributions to capture the statistical structure in data sets. This model is inspired by and particularly applicable to "natural" data sets such as images. We begin by providing the mathematical framework, where we discuss complete and overcomplete models and provide algorithms for training these models from data. Using patches of natural scenes, we demonstrate that our approach represents a viable alternative to independent component analysis as an interpretive model of biological visual systems. Although the two approaches are similar in flavor, there are also important differences, particularly when the representations are overcomplete. By constraining the interactions within our model, we are also able to study the topographic organization of Gabor-like receptive fields that our model learns. Finally, we discuss the relation of our new approach to previous work, in particular Gaussian scale mixture models and variants of independent components analysis.

openalex-author · Neural Information Processing Systems

Inferring Motor Programs from Images of Handwritten Digits

We describe a generative model for handwritten digits that uses two pairs of opposing springs whose stiffnesses are controlled by a motor program. We show how neural networks can be trained to infer the motor programs required to accurately reconstruct the MNIST digits. The inferred motor programs can be used directly for digit classification, but they can also be used in other ways. By adding noise to the motor program inferred from an MNIST image we can generate a large set of very different images of the same class, thus enlarging the training set available to other methods. We can also use the motor programs as additional, highly informative outputs which reduce overfitting when training a feed-forward classifier.

openalex-author · Neural Networks

Improving dimensionality reduction with spectral gradient descent

No abstract available from the OpenAlex source record.

openalex-author · http://www.gatsby.ucl.ac.uk/aistats/fullpapers/217.pdf

On Contrastive Divergence Learning.

Maximum-likelihood (ML) learning of Markov random fields is challenging because it requires estimates of averages that have an exponential number of terms. Markov chain Monte Carlo methods typically take a long time to converge on unbiased estimates, but Hinton (2002) showed that if the Markov chain is only run for a few steps, the learning can still work well and it approximately minimizes a different function called "contrastive divergence" (CD). CD learning has been successfully applied to various types of random fields. Here, we study the properties of CD learning and show that it provides biased estimates in general, but that the bias is typically very small. Fast CD learning can therefore be used to get close to an ML solution and slow ML learning can then be used to fine-tune the CD solution.
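The idea of running the Markov chain for only a few steps can be sketched for the simplest common case: a binary restricted Boltzmann machine with a single Gibbs step (often called CD-1). This is a minimal illustration rather than the procedure analysed in the paper; the function names, the nested-list weight layout `W[i][j]`, and the use of hidden probabilities rather than sampled states in the statistics are assumptions made here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hidden(v, W, b_h, rng):
    """Hidden unit probabilities and sampled binary states given visibles v."""
    probs = [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(b_h))]
    return probs, [1 if rng.random() < p else 0 for p in probs]


def sample_visible(h, W, b_v, rng):
    """Visible unit probabilities and sampled binary states given hiddens h."""
    probs = [sigmoid(b_v[i] + sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(b_v))]
    return probs, [1 if rng.random() < p else 0 for p in probs]


def cd1_weight_gradient(v0, W, b_v, b_h, rng):
    """Approximate weight gradient: data statistics minus one-step statistics."""
    p_h0, h0 = sample_hidden(v0, W, b_h, rng)   # chain starts at the data
    _, v1 = sample_visible(h0, W, b_v, rng)     # one reconstruction step
    p_h1, _ = sample_hidden(v1, W, b_h, rng)
    return [[v0[i] * p_h0[j] - v1[i] * p_h1[j] for j in range(len(b_h))]
            for i in range(len(v0))]
```

A learning step would add a small multiple of this matrix to `W`. Exact ML learning would instead need the second term to come from the chain's equilibrium distribution, which is what makes it slow.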

openalex-author · http://learning.cs.toronto.edu/~hinton/absps/chmrf9.ps.gz

Learning Causally Linked Markov Random Fields.

Generative models are widely used within machine learning. However, in many applications the graphical models involve exclusively causal, or exclusively undirected edges. In this paper we consider models that contain both types of edge, and suggest approximate learning methods for such models. The main contribution of this paper is the proposal of combining variational inference with the contrastive divergence algorithm to facilitate learning in systems involving causally linked Markov Random Fields (MRF's). We support our proposal with examples of learning in several domains. One way to make generative models with stochastic hidden variables is to use a directed acyclic graph as shown in Figure 1 (a). The difficulty in learning such "causal" models is that the posterior distribution over the hidden variables is intractable (except in certain special cases such as factor analysis, mixture models, square ICA or graphs that are very sparsely connected). Despite the intractability of the posterior, it is possible to optimize a bound on the log probability of the data by using a simple factorial distribution, Q(h|x), as an approximation to the true posterior,

openalex-author · Ninth International Workshop on Frontiers in Handwriting Recognition

Distinguishing Text from Graphics in On-Line Handwritten Ink

We present a system that separates text from graphics strokes in handwritten digital ink. It utilizes not just the characteristics of the strokes, but also the information provided by the gaps between the strokes, as well as the temporal characteristics of the stroke sequence. It is built using machine learning techniques that infer the internal parameters of the system from real digital ink, collected using a tablet PC.

openalex-author · http://jmlr.csail.mit.edu/papers/volume5/sallans04a/sallans04a.pdf

Reinforcement Learning with Factored States and Actions

A novel approximation method is presented for approximating the value function and selecting good actions for Markov decision processes with large state and action spaces. The method approximates state-action values as negative free energies in an undirected graphical model called a product of experts. The model parameters can be learned efficiently because values and derivatives can be efficiently computed for a product of experts. Actions can be found even in large factored action spaces by the use of Markov chain Monte Carlo sampling. Simulation results show that the product of experts approximation can be used to solve large problems. In one simulation it is used to find actions in action spaces of size 2^40.

openalex-author · http://www.eng.biu.ac.il/~goldbej/papers/ncanips.pdf

Neighbourhood Components Analysis

In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction.
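The "stochastic variant of the leave-one-out KNN score" can be made concrete with a small sketch. It assumes the usual formulation, in which each point picks a neighbour with softmax probability over negative squared distances in the projected space and the score is the expected number of points whose chosen neighbour shares their label. The function names and the toy data are illustrative assumptions, and the sketch only evaluates the objective without optimizing the matrix `A`.

```python
import math


def project(A, x):
    """Linear map A (list of rows) applied to vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def nca_objective(A, X, labels):
    """Expected number of points whose stochastic neighbour has the same label."""
    Z = [project(A, x) for x in X]
    total = 0.0
    for i in range(len(Z)):
        # Unnormalized neighbour weights; a point never selects itself.
        weights = [0.0 if k == i else
                   math.exp(-sum((a - b) ** 2 for a, b in zip(Z[i], Z[k])))
                   for k in range(len(Z))]
        norm = sum(weights)
        same = sum(w for w, lab in zip(weights, labels) if lab == labels[i])
        total += same / norm
    return total


# Toy data: two classes separated along the first coordinate.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
labels = [0, 0, 1, 1]
good = nca_objective([[1.0, 0.0], [0.0, 1.0]], X, labels)  # keeps both axes
poor = nca_objective([[0.0, 1.0]], X, labels)              # keeps only axis 2
```

Here `good` is larger than `poor`, because projecting onto the second coordinate alone places points from different classes on top of each other.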

openalex-author · http://papers.nips.cc/paper/2651-multiple-relational-embedding.pdf

Multiple Relational Embedding

We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples.

openalex-author · http://papers.nips.cc/paper/2672-exponential-family-harmoniums-with-an-application-to-information-retrieval.pdf

Exponential Family Harmoniums with an Application to Information Retrieval

Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.

openalex-author · Neural Information Processing Systems

Wormholes Improve Contrastive Divergence

In models that define probabilities via energies, maximum likelihood learning typically involves using Markov Chain Monte Carlo to sample from the model's distribution. If the Markov chain is started at the data distribution, learning often works well even if the chain is only run for a few time steps [3]. But if the data distribution contains modes separated by regions of very low density, brief MCMC will not ensure that different modes have the correct relative energies because it cannot move particles from one mode to another. We show how to improve brief MCMC by allowing long-range moves that are suggested by the data distribution. If the model is approximately correct, these long-range moves have a reasonable acceptance rate.

openalex-author · Canadian Psychology / Psychologie canadienne

The ups and downs of Hebb synapses.

Modelers have come up with many different learning rules for neural networks. When a teacher specifies the correct output, error-driven rules work better than pure Hebb rules in which the changes in synapse strength depend on the correlation between pre and postsynaptic activities. But for unsupervised learning, Hebb rules can be very effective if they are combined with suitable normalization or unlearning terms to prevent the synapses growing without bound. Hebb rules that use rates of change of activity instead of activity itself are useful for discovering perceptual invariants and may also provide a way of implementing error-driven learning. It would be truly wonderful if randomly connected neural networks could turn themselves into useful computing devices by using some simple rule to modify the strengths of synapses. This was the hope that lay behind the original Hebb learning rule and it is the vision that has driven neural network modelers for half a century. Initially, researchers tried simulating various rules to see what would happen. After a decade or two of messing around, researchers realized that there was a much better way to explore the space of possible learning rules: First write down an objective function (a quantitative definition of how well the network is performing) and then use elementary calculus to derive a learning rule that will improve the objective function. For the last few decades, the big theoretical advances in learning rules for neural networks have been associated with new optimization methods and new ideas about what objective function should be optimized. If we think of a neural network as a device for converting input vectors into output vectors, it is obvious that one sensible objective is to minimize some measure of the difference between the output the network actually produces and the output it ought to produce.
This approach led to effective error-driven learning rules such as the Widrow-Hoff rule (Widrow & Hoff, 1960) and the perceptron convergence procedure (Rosenblatt, 1961) and it was later generalized to multilayer networks by using backpropagation of the errors to get training signals for intermediate hidden layers (Rumelhart, Hinton, & Williams, 1986). Within the neural network community, the approach of using the product of pre and postsynaptic activities to drive learning was seen as inferior to error-driven methods that use the product of the presynaptic activity and the postsynaptic activity derivative - the rate at which the objective function changes as the postsynaptic activity is changed. Even when the task was merely to associate random input vectors with random output vectors, it was shown that an error-driven rule worked much better than a Hebbian rule. Unfortunately, error-driven learning has some serious drawbacks. It requires a teacher to specify the right answer and it is hard to see how neurons could implement the backpropagation required by multilayer versions. It is possible to get the teaching signal from the data itself by trying to predict the next term in a temporal sequence (Elman, 1991) or by trying to reconstruct the input data at the output (Hinton, 1989) but it is also possible to use quite different objective functions for learning. Some of these alternative objective functions lead to learning rules that are far more Hebbian in flavour. A common objective in processing high-dimensional data is to reduce the dimensionality without losing the ability to reconstruct the raw data from the reduced representation. If we measure the accuracy of the reconstruction by the squared error, the optimal strategy is to extract the principal components - the dominant directions of variation in the data.
Oja (1982) showed how to extract the first principal component using Hebbian learning to maximize the squared output of a neuron combined with normalization of the synapse strengths to prevent them growing without bound. …

openalex-author · http://www.cs.toronto.edu/~crismin/PAPERS/mode-hop.pdf

A Mode-Hopping MCMC sampler

One of the main shortcomings of Markov chain Monte Carlo samplers is their inability to mix between modes of the target distribution. In this paper we show that advance knowledge of the location of these modes can be incorporated into the MCMC sampler by introducing mode-hopping moves that satisfy detailed balance. The proposed sampling algorithm explores local mode structure through local MCMC moves (e.g. diffusion or Hybrid Monte Carlo) but in addition also represents the relative strengths of the different modes correctly using a set of global moves.

openalex-author · UNSPECIFIED (2003)

Bethe free energy and contrastive divergence approximations for undirected graphical models

As the machine learning community tackles more complex and harder problems, the graphical models needed to solve such problems become larger and more complicated. As a result, performing inference and learning exactly for such graphical models becomes ever more expensive, and approximate inference and learning techniques become ever more prominent. There are a variety of techniques for approximate inference and learning in the literature. This thesis contributes some new ideas in the products of experts (PoEs) class of models (Hinton, 2002), and the Bethe free energy approximations (Yedidia et al., 2001). For PoEs, our contribution is in developing new PoE models for continuous-valued domains. We developed RBMrate, a model for discretized continuous-valued data. We applied it to face recognition to demonstrate its abilities. We also developed energy-based models (EBMs), flexible probabilistic models where the building blocks consist of energy terms computed using a feed-forward network. We show that standard square noiseless independent components analysis (ICA) (Bell and Sejnowski, 1995) can be viewed as a restricted form of EBMs. Extending this relationship with ICA, we describe sparse and over-complete representations of data where the inference process is trivial since it is simply an EBM. For Bethe free energy approximations, our contribution is a theory relating belief propagation and iterative scaling. We show that both belief propagation and iterative scaling updates can be derived as fixed point equations for constrained minimization of the Bethe free energy. This allows us to develop a new algorithm to directly minimize the Bethe free energy, and to apply the Bethe free energy to learning in addition to inference. We also describe improvements to the efficiency of standard learning algorithms for undirected graphical models (Jirousek and Preucil, 1995).

openalex-author · Proceedings of Data Compression Conference - DCC '96

Free energy coding

We introduce a new approach to the problem of optimal compression when a source code produces multiple codewords for a given symbol. It may seem that the most sensible codeword to use in this case is the shortest one. However, in the proposed free energy approach, random codeword selection yields an effective codeword length that can be less than the shortest codeword length. If the random choices are Boltzmann distributed, the effective length is optimal for the given source code. The expectation-maximization parameter estimation algorithms minimize this effective codeword length. We illustrate the performance of free energy coding on a simple problem where a compression factor of two is gained by using the new method.
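The claim that random codeword selection can beat the shortest codeword can be illustrated with a small sketch. It assumes the effective length is the expected codeword length minus the entropy of the selection distribution (the usual bits-back accounting, which the abstract does not spell out), and the two codeword lengths are hypothetical.

```python
import math

def effective_length(lengths, probs):
    """Expected codeword length minus the entropy (in bits) of the random choice.

    Assumption: this is the effective-length definition; the abstract only
    names the quantity and does not give the formula.
    """
    expected = sum(q * l for q, l in zip(probs, lengths))
    entropy = -sum(q * math.log2(q) for q in probs if q > 0)
    return expected - entropy

def boltzmann(lengths):
    """Selection probabilities proportional to 2 ** (-length)."""
    weights = [2.0 ** -l for l in lengths]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical source code: one symbol has two codewords, each 2 bits long.
lengths = [2, 2]

shortest = effective_length(lengths, [1.0, 0.0])       # always pick one codeword: 2.0 bits
free = effective_length(lengths, boltzmann(lengths))   # Boltzmann choice: 1.0 bit
```

Under this accounting the Boltzmann choice gives an effective length of `-log2(sum(2 ** -l))`, which here is one bit shorter than either codeword.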

openalex-author · Cognitive Modeling

Learning Representations by Back-Propagating Errors

No abstract available from the OpenAlex source record.

openalex-author · Cognitive Modeling

How Neural Networks Learn from Experience

No abstract available from the OpenAlex source record.

openalex-author · Computer Graphics Forum

Local Physical Models for Interactive Character Animation

Our goal is to design and build a tool for the creation of expressive character animation. Virtual puppetry, also known as performance animation, is a technique in which the user interactively controls a character's motion. In this paper we introduce local physical models for performance animation and describe how they can augment an existing kinematic method to achieve very effective animation control. These models approximate specific physically-generated aspects of a character's motion. They automate certain behaviours, while still letting the user override such motion via a PD-controller if he so desires. Furthermore, they can be tuned to ignore certain undesirable effects, such as the risk of having a character fall over, by ignoring corresponding components of the force. Although local physical models are a quite simple approximation to real physical behaviour, we show that they are extremely useful for interactive character control, and contribute positively to the expressiveness of the character's motion. In this paper, we develop such models at the knees and ankles of an interactively-animated 3D anthropomorphic character, and demonstrate a resulting animation. This approach can be applied in a straightforward way to other joints. Categories and Subject Descriptors (according to ACM CCS): I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism, Interaction Techniques

openalex-author · NeuroImage

Classical and Bayesian Inference in Neuroimaging: Theory

No abstract available from the OpenAlex source record.

openalex-author · Probabilistic Models of the Brain

Learning to Use Spike Timing in a Restricted Boltzmann Machine

No abstract available from the OpenAlex source record.

openalex-author · AI Magazine

In Memory of Ray Reiter (1939-2002)

Ray dedicated his life to his research with the wonder of a child, the fearlessness of an explorer, the precision of a mathematician, and the tirelessness of a researcher who found shallowness and confusion intolerable. He leaves a legacy of groundbreaking, deep insights that have changed the course of AI.

openalex-author · http://www.graphicsinterface.org/cgi-bin/DownloadPaper?name=2002/219/paper219.pdf

A Desktop Input Device and Interface for Interactive 3D Character Animation

We present a novel input device and interface for interactively controlling the animation of a graphical human character from a desktop environment. The trackers are embedded in a new physical design, which is simple yet provides significant benefits, and establishes a tangible interface with coordinate frames inherent to the character. A layered kinematic motion recording strategy accesses subsets of the total degrees of freedom of the character. We present the experiences of three novice users with the system, and that of a long-term user who has prior experience with other complex continuous interfaces.

openalex-author · IEEE Transactions on Pattern Analysis and Machine Intelligence

Recognizing handwritten digits using hierarchical products of experts

The product of experts learning procedure can discover a set of stochastic binary features that constitute a nonlinear generative model of handwritten images of digits. The quality of generative models learned in this way can be assessed by learning a separate model for each class of digit and then comparing the unnormalized probabilities of test images under the 10 different class-specific models. To improve discriminative performance, a hierarchy of separate models can be learned, for each digit class. Each model in the hierarchy learns a layer of binary feature detectors that model the probability distribution of vectors of activity of feature detectors in the layer below. The models in the hierarchy are trained sequentially and each model uses a layer of binary feature detectors to learn a generative model of the patterns of feature activities in the preceding layer. After training, each layer of feature detectors produces a separate, unnormalized log probability score. With three layers of feature detectors for each of the 10 digit classes, a test image produces 30 scores which can be used as inputs to a supervised, logistic classification network that is trained on separate data.

openalex-author · Neural Information Processing Systems

Learning Sparse Topographic Representations with Products of Student-t Distributions

We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some outputs. We encourage the system to find sparse features by using a Student-t distribution to model each output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the iterated Wiener filter for the purpose of denoising images.

openalex-author · http://www.cs.toronto.edu/~roweis/papers/sne_final.ps.gz

Stochastic Neighbor Embedding

We describe a probabilistic approach to the task of embedding high-dimensional objects into a low-dimensional space in a way that preserves neighbor identities. A Gaussian is centered on each object in the high-dimensional space and the densities under this Gaussian are used to define a probability distribution over all the potential neighbors of the object.
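The neighbor distribution described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a single shared Gaussian width `sigma`, assigns zero probability to an object being its own neighbor, and uses made-up 2-D points.

```python
import math

def neighbor_probabilities(points, i, sigma=1.0):
    """Distribution over potential neighbors j of point i.

    Each weight is the (unnormalized) density of point j under a Gaussian
    centered on point i; weights are normalized to sum to one.
    Assumptions: one shared sigma, and a point is never its own neighbor.
    """
    weights = []
    for j, x in enumerate(points):
        if j == i:
            weights.append(0.0)
            continue
        d2 = sum((a - b) ** 2 for a, b in zip(points[i], x))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical data: two nearby points and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
p = neighbor_probabilities(points, 0)  # nearly all mass falls on the nearby point
```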

openalex-author · http://www.cs.toronto.edu/pub/zemel/Papers/UboostNips.ps.gz

Self Supervised Boosting

Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of "negative examples" generated from the model's current estimate of the data density. Training in each boosting round proceeds in three stages: first we sample negative examples from the model's current Boltzmann distribution. Next, a feature is trained to improve classification performance between data and negative examples. Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. Negative examples only need to be generated once to learn each new feature. The validity of the approach is demonstrated on binary digits and continuous synthetic data.

openalex-author · Lecture Notes in Computer Science

A New Learning Algorithm for Mean Field Boltzmann Machines

No abstract available from the OpenAlex source record.

openalex-author · Paper

Digital marionette: augmenting kinematics with physics for multi-track desktop performance animation

No abstract available from the OpenAlex source record.

openalex-author · Graphical Models

Learning and Relearning in Boltzmann Machines

This chapter contains sections titled: Relaxation Searches, Easy and Hard Learning, The Boltzmann Machine Learning Algorithm, An Example of Hard Learning, Achieving Reliable Computation with Unreliable Hardware, An Example of the Effects of Damage, Conclusion, Acknowledgments, Appendix: Derivation of the Learning Algorithm, References

openalex-author · Graphical Models

Deterministic Boltzmann Learning Performs Steepest Descent in Weight-Space

The Boltzmann machine learning procedure has been successfully applied in deterministic networks of analog units that use a mean field approximation to efficiently simulate a truly stochastic system (Peterson and Anderson 1987). This type of “deterministic Boltzmann machine” (DBM) learns much faster than the equivalent “stochastic Boltzmann machine” (SBM), but since the learning procedure for DBM's is only based on an analogy with SBM's, there is no existing proof that it performs gradient descent in any function, and it has only been justified by simulations. By using the appropriate interpretation for the way in which a DBM represents the probability of an output vector given an input vector, it is shown that the DBM performs steepest descent in the same function as the original SBM, except at rare discontinuities. A very simple way of forcing the weights to become symmetrical is also described, and this makes the DBM more biologically plausible than back-propagation (Werbos 1974; Parker 1985; Rumelhart...

openalex-author · Graphical Models

Variational Learning for Switching State-Space Models

This chapter contains sections titled: Introduction, Background, The Generative Model, Learning, Simulations, Discussion, Appendix A: Notation, Appendix B: Derivation of the Variational Fixed-Point Equations, References

openalex-author · published

Global Coordination of Local Linear Models

High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold, arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the "global coordination" of these models is achieved by adding a regularizing term to their objective function. The regularizer breaks a degeneracy in the mixture model's parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold, even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones.

openalex-author · http://learning.cs.toronto.edu/~hinton/absps/nipsLRE.pdf

Learning Hierarchical Structures with Linear Relational Embedding

We present Linear Relational Embedding (LRE), a new method of learning a distributed representation of concepts from data consisting of instances of relations between given concepts. Its final goal is to be able to generalize, i.e. infer new instances of these relations among the concepts. On a task involving family relationships we show that LRE can generalize better than any previously published method. We then show how LRE can be used effectively to find compact distributed representations for variable-sized recursive data structures, such as trees and lists. 1 Linear Relational Embedding Our aim is to take a large set of facts about a domain expressed as tuples of arbitrary symbols in a simple and rigid syntactic format and to be able to infer other “common-sense” facts without having any prior knowledge about the domain. Let us imagine a situation in which we have a set of concepts and a set of relations among these concepts, and that our data consists of few instances of these relations that hold among the concepts. We want

openalex-author · IEEE Transactions on Knowledge and Data Engineering

Learning distributed representations of concepts using linear relational embedding

We introduce linear relational embedding as a means of learning a distributed representation of concepts from data consisting of binary relations between these concepts. The key idea is to represent concepts as vectors, binary relations as matrices, and the operation of applying a relation to a concept as a matrix-vector multiplication that produces an approximation to the related concept. A representation for concepts and relations is learned by maximizing an appropriate discriminative goodness function using gradient ascent. On a task involving family relationships, learning is fast and leads to good generalization.
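The representation described above (concepts as vectors, relations as matrices, applying a relation as a matrix-vector product) can be sketched as follows. The embeddings and the relation matrix are hand-set, hypothetical values chosen for illustration; the learning step that maximizes a discriminative goodness function by gradient ascent is not shown.

```python
def apply_relation(matrix, vector):
    """Apply a relation (matrix) to a concept (vector) by matrix-vector multiplication."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def nearest_concept(target, concepts):
    """Name of the concept whose vector is closest (squared Euclidean distance) to target."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(concepts, key=lambda name: sqdist(concepts[name], target))

# Hypothetical 2-D concept embeddings (not learned from data).
concepts = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (-1.0, 0.0)}

# Hypothetical relation: a 90-degree rotation that maps a -> b and b -> c.
rotate = [[0.0, -1.0],
          [1.0, 0.0]]

answer = nearest_concept(apply_relation(rotate, concepts["a"]), concepts)  # "b"
```

Taking the nearest concept to the product is one simple way to read out an answer from the approximation; it is an assumption of this sketch rather than a detail given in the abstract.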

openalex-author · Unfallchirurgie

Approximate Contrastive Free Energies for Learning in Undirected Graphical Models

The feasibility of inspecting the spinal canal with a needle-arthroscope of 2.7 mm diameter was tested in the spine of fresh autopsy specimens. It was possible to identify all the structures within the spinal canal, which was also photographically documented. The clinical application of spinaloscopy for diagnostic purposes and for providing prognostic signs respectively in paraplegic spinal injuries is discussed.

openalex-author · http://www.gatsby.ucl.ac.uk/~andy/papers/aistats_2001.ps.gz

Products of Hidden Markov Models.

We present products of hidden Markov models (PoHMM's), a way of combining HMM's to form a distributed state time series model. Inference in a PoHMM is tractable and efficient. Learning of the parameters, although intractable, can be effectively done using the Product of Experts learning rule. The distributed state helps the model to explain data which has multiple causes, and the fact that each model need only explain part of the data means a PoHMM can capture longer range structure than an HMM is capable of. We show some results on modelling character strings, a simple language task and the symbolic family trees problem, which highlight these advantages.

openalex-author · Neural Computation

SMEM Algorithm for Mixture Models

We present a split-and-merge expectation-maximization (SMEM) algorithm to overcome the local maxima problem in parameter estimation of finite mixture models. In the case of mixture models, local maxima often involve having too many components of a mixture model in one part of the space and too few in another, widely separated part of the space. To escape from such configurations, we repeatedly perform simultaneous split-and-merge operations using a new criterion for efficiently selecting the split-and-merge candidates. We apply the proposed algorithm to the training of gaussian mixtures and mixtures of factor analyzers using synthetic and real data and show the effectiveness of using the split-and-merge operations to improve the likelihood of both the training data and of held-out test data. We also show the practical usefulness of the proposed algorithm by applying it to image compression and pattern recognition problems.

openalex-author · National Conference on Artificial Intelligence

Modeling High-Dimensional Data by Combining Simple Experts

It is possible to combine multiple non-linear probabilistic models of the same data by multiplying the probability distributions together and then renormalizing. A “product of experts” is a very efficient way to model data that simultaneously satisfies many different constraints. It is difficult to fit a product of experts to data using maximum likelihood because the gradient of the log likelihood is intractable, but there is an efficient way of optimizing a different objective function and this produces good models of high-dimensional data.

openalex-author · http://www.cs.toronto.edu/%7Ehinton/absps/icml-lre.pdf

Learning Distributed Representations by Mapping Concepts and Relations into a Linear Space

Linear Relational Embedding is a method of learning a distributed representation of concepts from data consisting of binary relations between concepts. Concepts are represented as vectors, binary relations as matrices, and the operation of applying a relation to a concept as a matrix-vector multiplication that produces an approximation to the related concept. A representation for concepts and relations is learned by maximizing an appropriate discriminative goodness function using gradient ascent. On a task involving family relationships, learning is fast and leads to good generalization.

openalex-author · Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New M

Extracting distributed representations of concepts and relations from positive and negative propositions

Linear relational embedding (LRE) was introduced previously by the authors (1999) as a means of extracting a distributed representation of concepts from relational data. The original formulation cannot use negative information and cannot properly handle data in which there are multiple correct answers. In this paper we propose an extended formulation of LRE that solves both these problems. We present results in two simple domains, which show that learning leads to good generalization.

openalex-author · http://www.cs.cmu.edu/Web/Groups/NIPS/00papers-pub-on-web/SallansHinton.ps.gz

Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-learning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation. 1 Introduction Online Reinforcement Learning (RL) algorithms try to find a policy which maximizes the expected time-discounted reward provided by the environment. They do this by performing sample backups to lea...

openalex-author · http://www.gatsby.ucl.ac.uk/~ywteh/research/faces/nips2000.pdf

Rate-coded Restricted Boltzmann Machines for Face Recognition

We describe a neurally-inspired, unsupervised learning algorithm that builds a non-linear generative model for pairs of face images from the same individual. Individuals are then recognized by finding the highest relative probability pair among all pairs that consist of a test image and an image whose identity is known. Our method compares favorably with other methods in the literature. The generative model consists of a single layer of rate-coded, non-linear feature detectors and it has the property that, given a data vector, the true posterior probability distribution over the feature detector activities can be inferred rapidly without iteration or approximation. The weights of the feature detectors are learned by comparing the correlations of pixel intensities and feature activations in two phases: When the network is observing real data and when it is observing reconstructions of real data generated from the feature activations. 1 Introduction Face recognition is diff...

openalex-author · http://learning.cs.toronto.edu/~hinton/absps/nips99.pdf

Spiking Boltzmann Machines

We first show how to represent sharp posterior probability distributions using real valued coefficients on broadly-tuned basis functions. Then we show how the precise times of spikes can be used to convey the real-valued coefficients on the basis functions quickly and accurately. Finally we describe a simple simulation in which spiking neurons learn to model an image sequence by fitting a dynamic generative model.

openalex-author · Paper

Learning population codes by minimizing description length

No abstract available from the OpenAlex source record.

openalex-author · Paper

Unsupervised learning

No abstract available from the OpenAlex source record.

openalex-author · Unsupervised Learning

The Helmholtz Machine

Discovering the structure inherent in a set of patterns is a fundamental aim of statistical inference or learning. One fruitful approach is to build a parameterised stochastic generative model, independent draws from which are likely to produce the patterns. For all but the simplest generative models, each pattern can be generated in exponentially many ways. It is thus intractable to adjust the parameters to maximize the probability of the observed patterns. We describe a way of finessing this combinatorial explosion by maximising an easily computed lower bound on the probability of the observations. Our method can be viewed as a form of hierarchical self-supervised learning that may relate to the function of bottom-up and top-down cortical processing pathways. 1 Introduction Following Helmholtz, we view the human perceptual system as a statistical inference engine whose function is to infer the probable causes of sensory input. We show that a device of this kind can learn h...

openalex-author · Unsupervised Learning

Learning Population Codes by Minimizing Description Length

The Minimum Description Length principle (MDL) can be used to train the hidden units of a neural network to extract a representation that is cheap to describe but nonetheless allows the input to be reconstructed accurately. We show how MDL can be used to develop highly redundant population codes. Each hidden unit has a location in a low-dimensional implicit space. If the hidden unit activities form a bump of a standard shape in this space, they can be cheaply encoded by the center of this bump. So the weights from the input units to the hidden units in an autoencoder are trained to make the activities form a standard bump. The coordinates of the hidden units in the implicit space are also learned, thus allowing flexibility, as the network develops a discontinuous topography when presented with different input classes.

openalex-author · 9th International Conference on Artificial Neural Networks: ICANN '99

Products of experts

It is possible to combine multiple probabilistic models of the same data by multiplying the probabilities together and then renormalizing. This is a very efficient way to model high-dimensional data which simultaneously satisfies many different low dimensional constraints. Each individual expert model can focus on giving high probability to data vectors that satisfy just one of the constraints. Data vectors that satisfy this one constraint but violate other constraints will be ruled out by their low probability under the other expert models. Training a product of models appears difficult because, in addition to maximizing the probabilities that the individual models assign to the observed data, it is necessary to make the models disagree on unobserved regions of the data space. However, if the individual models are tractable there is a fairly efficient way to train a product of models. This training algorithm suggests a biologically plausible way of learning neural population codes.
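The combination rule in the first sentence (multiply the probabilities, then renormalize) can be sketched for discrete distributions as follows. The two expert distributions are hypothetical, and the sketch covers only the combination step, not the training procedure.

```python
def product_of_experts(experts):
    """Multiply the experts' probabilities elementwise and renormalize.

    Each expert is a list of probabilities over the same discrete outcomes.
    """
    n = len(experts[0])
    unnormalized = [1.0] * n
    for expert in experts:
        for k in range(n):
            unnormalized[k] *= expert[k]
    z = sum(unnormalized)  # normalizing constant
    return [u / z for u in unnormalized]

# Hypothetical experts over four outcomes; each one alone is fairly vague.
expert_a = [0.4, 0.4, 0.1, 0.1]  # favors outcomes 0 and 1
expert_b = [0.1, 0.4, 0.4, 0.1]  # favors outcomes 1 and 2

combined = product_of_experts([expert_a, expert_b])
# Only outcome 1 satisfies both experts, so it receives 0.64 of the mass.
```

The product is sharper than either expert: an outcome that one expert rates as unlikely is suppressed no matter how the other rates it.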

openalex-author · Paper

Unsupervised learning : foundations of neural computation

Since its founding in 1989 by Terrence Sejnowski, Neural Computation has become the leading journal in the field. Foundations of Neural Computation collects, by topic, the most significant papers that have appeared in the journal over the past nine years. This volume of Foundations of Neural Computation, on unsupervised learning algorithms, focuses on neural network learning algorithms that do not require an explicit teacher. The goal of unsupervised learning is to extract an efficient internal representation of the statistical structure implicit in the inputs. These algorithms provide insights into the development of the cerebral cortex and implicit learning in humans. They are also of interest to engineers working in areas such as computer vision and speech recognition who seek efficient representations of raw input data.

openalex-author · Paper

Fast Neural Network Emulation and Control of Dynamical Systems

Computer animation through the numerical simulation of physics-based graphics models offers unsurpassed realism, but it can be computationally demanding. This paper demonstrates the possibility of replacing the numerical simulation of nontrivial dynamic models with a dramatically more efficient NeuroAnimator that exploits neural networks. NeuroAnimators are automatically trained off-line to emulate physical dynamics through the observation of physics-based models in action. Depending on the model, its neural network emulator can yield physically realistic animation one or two orders of magnitude faster than conventional numerical simulation. We demonstrate NeuroAnimators for a variety of physics-based models. By exploiting the network structure of the NeuroAnimator, we also introduce a remarkably fast algorithm for learning controllers that enables either complex physics-based models or their neural network emulators to synthesize motions satisfying prescribed animation goals.

openalex-author · http://papers.nips.cc/paper/1562-fast-neural-network-emulation-of-dynamical-systems-for-computer-animation.pdf

Fast Neural Network Emulation of Dynamical Systems for Computer Animation

Computer animation through the numerical simulation of physics-based graphics models offers unsurpassed realism, but it can be computationally demanding. This paper demonstrates the possibility of replacing the numerical simulation of nontrivial dynamic models with a dramatically more efficient "NeuroAnimator" that exploits neural networks. NeuroAnimators are automatically trained off-line to emulate physical dynamics through the observation of physics-based models in action. Depending on the model, its neural network emulator can yield physically realistic animation one or two orders of magnitude faster than conventional numerical simulation. We demonstrate NeuroAnimators for a variety of physics-based models.

openalex-author · Statistics in Medicine

A comparison of statistical learning methods on the GUSTO database

We apply a battery of modern, adaptive non-linear learning methods to a large real database of cardiac patient data. We use each method to predict 30 day mortality from a large number of potential risk factors, and we compare their performances. We find that none of the methods could outperform a relatively simple logistic regression model previously developed for this problem.

openalex-author · Statistics and Computing

Coaching variables for regression and classification

No abstract available from the OpenAlex source record.

openalex-author · Network: Computation in Neural Systems

Cascaded redundancy reduction

We describe a method for incrementally constructing a hierarchical generative model of an ensemble of binary data vectors. The model is composed of stochastic, binary, logistic units. Hidden units are added to the model one at a time with the goal of minimizing the information required to describe the data vectors using the model. In addition to the top-down generative weights that define the model, there are bottom-up recognition weights that determine the binary states of the hidden units given a data vector. Even though the stochastic generative model can produce each data vector in many ways, the recognition model is forced to pick just one of these ways. The recognition model therefore underestimates the ability of the generative model to predict the data, but this underestimation greatly simplifies the process of searching for the generative and recognition weights of a new hidden unit.

openalex-author · Proceedings of the 25th annual conference on Computer graphics and interactive techniques - SIGGRAPH '98

NeuroAnimator

Animation through the numerical simulation of physics-based graphics models offers unsurpassed realism, but it can be computationally demanding. Likewise, the search for controllers that enable physics-based models to produce desired animations usually entails formidable computational cost. This paper demonstrates the possibility of replacing the numerical simulation and control of dynamic models with a dramatically more efficient alternative. In particular, we propose the NeuroAnimator, a novel approach to creating physically realistic animation that exploits neural networks. NeuroAnimators are automatically trained off-line to emulate physical dynamics through the observation of physics-based models in action. Depending on the model, its neural network emulator can yield physically realistic animation one or two orders of magnitude faster than conventional numerical simulation. Furthermore, by exploiting the network structure of the NeuroAnimator, we introduce a fast algorithm for learning controllers that enables either physics-based models or their neural network emulators to synthesize motions satisfying prescribed animation goals. We demonstrate NeuroAnimators for a variety of physics-based models.

openalex-author · IEEE Transactions on Neural Networks

Glove-TalkII-a neural-network interface which maps gestures to parallel formant speech synthesizer controls

Glove-Talk II is a system which translates hand gestures to speech through an adaptive interface. Hand gestures are mapped continuously to ten control parameters of a parallel formant speech synthesizer. The mapping allows the hand to act as an artificial vocal tract that produces speech in real time. This gives an unlimited vocabulary in addition to direct control of fundamental frequency and volume. Currently, the best version of Glove-Talk II uses several input devices, a parallel formant speech synthesizer, and three neural networks. The gesture-to-speech task is divided into vowel and consonant production by using a gating network to weight the outputs of a vowel and a consonant neural network. The gating network and the consonant network are trained with examples from the user. The vowel network implements a fixed user-defined relationship between hand position and vowel sound and does not require any training examples from the user. Volume, fundamental frequency, and stop consonants are produced with a fixed mapping from the input devices. With Glove-Talk II, the subject can speak slowly but with far more natural sounding pitch variations than a text-to-speech synthesizer.

openalex-author · Paper

Fast Neural Network Emulation and Control of Physics-Based Models

Animation through the numerical simulation of physics-based graphics models offers unsurpassed realism, but it can be computationally demanding. Likewise, the search for controllers that enable physics-based models to produce desired animations usually entails formidable computational cost. This paper demonstrates the possibility of replacing the numerical simulation and control of dynamic models with a dramatically more efficient alternative. In particular, we propose the NeuroAnimator, a novel approach to creating physically realistic animation that exploits neural networks. NeuroAnimators are automatically trained off-line to emulate physical dynamics through the observation of physics-based models in action. Depending on the model, its neural network emulator can yield physically realistic animation one or two orders of magnitude faster than conventional numerical simulation. Furthermore, by exploiting the network structure of the NeuroAnimator, we introduce a fast algorithm for learning controllers that enables either physics-based models or their neural network emulators to synthesize motions satisfying prescribed animation goals. We demonstrate NeuroAnimators for a variety of physics-based models. CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Animation; I.6.8 [Simulation and Modeling]: Types of Simulation—Animation

openalex-author · Learning in Graphical Models

A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants

No abstract available from the OpenAlex source record.

openalex-author · Learning in Graphical Models

A Hierarchical Community of Experts

No abstract available from the OpenAlex source record.

openalex-author · http://learning.cs.toronto.edu/~hinton/absps/nips97.pdf

Hierarchical Non-linear Factor Analysis and Topographic Maps

We first describe a hierarchical, generative model that can be viewed as a non-linear generalisation of factor analysis and can be implemented in a neural network. The model performs perceptual inference in a probabilistically consistent manner by using top-down, bottom-up and lateral connections. These connections can be learned using simple rules that require only locally available information. We then show how to incorporate lateral connections into the generative model. The model extracts a sparse, distributed, hierarchical representation of depth from simplified random-dot stereograms and the localised disparity detectors in the first hidden layer form a topographic map. When presented with image patches from natural scenes, the model develops topographically organised local feature detectors. 1 Introduction Factor analysis is a probabilistic model for real-valued data which assumes that the data is a linear combination of real-valued uncorrelated Gaussian sources (the factors). ...

openalex-author · Computer Vision and Image Understanding

Instantiating Deformable Models with a Neural Net

No abstract available from the OpenAlex source record.

openalex-author · IEEE Transactions on Neural Networks

Glove-talk II - a neural-network interface which maps gestures to parallel formant speech synthesizer controls

openalex-author · Computational and Psychophysical Mechanisms of Visual Coding

A simple algorithm that discovers efficient perceptual codes

We describe the "wake-sleep" algorithm that allows a multilayer, unsupervised, neural network to build a hierarchy of representations of sensory input. The network has bottom-up "recognition" connections that are used to convert sensory input into underlying representations. Unlike most artificial neural networks, it also has top-down "generative" connections that can be used to reconstruct the sensory input from the representations. In the "wake" phase of the learning algorithm, the network is driven by the bottom-up recognition connections and the top-down generative connections are trained to be better at reconstructing the sensory input from the representation chosen by the recognition process. In the "sleep" phase, the network is driven top-down by the generative connections to produce a fantasized representation and a fantasized sensory input. The recognition connections are then trained to be better at recovering the fantasized representation from the fantasized sensory input. In both phases, the synaptic learning rule is simple and local. The combined effect of the two phases is to create representations of the sensory input that are efficient in the following sense: On average, it takes more bits to describe each sensory input vector directly than to first describe the representation of the sensory input chosen by the recognition process and then describe the difference between the sensory input and its reconstruction from the chosen representation.

openalex-author · Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences

Generative models for discovering sparse distributed representations

We describe a hierarchical, generative model that can be viewed as a nonlinear generalization of factor analysis and can be implemented in a neural network. The model uses bottom-up, top-down and lateral connections to perform Bayesian perceptual inference correctly. Once perceptual inference has been performed the connection strengths can be updated using a very simple learning rule that only requires locally available information. We demonstrate that the network learns to extract sparse, distributed, hierarchical representations.

openalex-author · Neural Computation

A Mobile Robot That Learns Its Place

We show how a neural network can be used to allow a mobile robot to derive an accurate estimate of its location from noisy sonar sensors and noisy motion information. The robot's model of its location is in the form of a probability distribution across a grid of possible locations. This distribution is updated using both the motion information and the predictions of a neural network that maps locations into likelihood distributions across possible sonar readings. By predicting sonar readings from locations, rather than vice versa, the robot can handle the very nongaussian noise in the sonar sensors. By using the constraint provided by the noisy motion information, the robot can use previous readings to improve its estimate of its current location. By treating the resulting estimates as if they were correct, the robot can learn the relationship between location and sonar readings without requiring an external supervision signal that specifies the actual location of the robot. It can learn to locate itself in a new environment with almost no supervision, and it can maintain its location ability even when the environment is nonstationary.
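
The grid-based update described in this abstract can be sketched as a two-step Bayes filter: a motion step that spreads the belief according to noisy odometry, and a sensor step that reweights each cell by the likelihood of the observed sonar reading. This is only a minimal illustration under stated assumptions, not the authors' implementation: the grid is a one-dimensional circular corridor, and `sonar_likelihood` is a hypothetical fixed list standing in for the output of the neural network that maps locations to likelihoods of sonar readings.

```python
def predict(belief, motion_kernel):
    """Motion step: move belief mass by each possible offset, weighted by its probability."""
    n = len(belief)
    predicted = [0.0] * n
    for cell, p in enumerate(belief):
        for offset, q in motion_kernel.items():
            # Circular corridor (assumption) so no probability mass is lost at the ends.
            predicted[(cell + offset) % n] += p * q
    return predicted


def correct(belief, likelihood):
    """Sensor step: multiply by P(reading | location) per cell, then renormalize."""
    posterior = [b * l for b, l in zip(belief, likelihood)]
    total = sum(posterior)
    return [p / total for p in posterior]


# Hypothetical numbers for illustration only.
belief = [0.25, 0.25, 0.25, 0.25]          # robot starts with no idea where it is
motion_kernel = {0: 0.1, 1: 0.8, 2: 0.1}   # commanded one cell forward, with slip
sonar_likelihood = [0.1, 0.7, 0.1, 0.1]    # stand-in for the network's prediction
belief = correct(predict(belief, motion_kernel), sonar_likelihood)
```

Predicting readings from locations, as the abstract notes, means the sensor step only needs a per-cell likelihood, so arbitrarily shaped noise distributions can be plugged in without changing the update.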

openalex-author · Neural Computation

Using Expectation-Maximization for Reinforcement Learning

We discuss Hinton's (1989) relative payoff procedure (RPP), a static reinforcement learning algorithm whose foundation is not stochastic gradient ascent. We show circumstances under which applying the RPP is guaranteed to increase the mean return, even though it can make large changes in the values of the parameters. The proof is based on a mapping between the RPP and a form of the expectation-maximization procedure of Dempster, Laird, and Rubin (1977).
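
A small sketch may help make the "large changes in the values of the parameters" concrete. It assumes the commonly cited form of the relative payoff update for independent binary output units with a non-negative reward: each unit's firing probability is replaced by the reward-weighted frequency of that unit being on, E[r s_i] / E[r]. That form is an assumption here, as the abstract does not state it, and the expectations are computed by exact enumeration rather than by sampling. The function `reward` and the starting probabilities are hypothetical.

```python
from itertools import product


def rpp_step(probs, reward):
    """One relative-payoff update; returns (new probabilities, mean reward before the update)."""
    n = len(probs)
    expected_reward = 0.0
    weighted_on = [0.0] * n
    for state in product((0, 1), repeat=n):
        # Probability of this joint binary state under independent units.
        p_state = 1.0
        for s, p in zip(state, probs):
            p_state *= p if s else 1.0 - p
        r = reward(state)  # assumed non-negative, with a positive mean
        expected_reward += p_state * r
        for i, s in enumerate(state):
            weighted_on[i] += p_state * r * s
    return [w / expected_reward for w in weighted_on], expected_reward


def reward(state):
    """Hypothetical payoff: unit 0 matters, unit 1 is irrelevant."""
    return 1.0 + state[0]


probs = [0.5, 0.5]
new_probs, mean_before = rpp_step(probs, reward)
_, mean_after = rpp_step(new_probs, reward)
```

In this toy case the probability of the rewarded unit moves from 1/2 to 2/3 in a single step with no learning rate to tune, while the irrelevant unit stays at 1/2, and the mean reward rises.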

openalex-author · IEEE Transactions on Neural Networks

Modeling the manifolds of images of handwritten digits

This paper describes two new methods for modeling the manifolds of digitized images of handwritten digits. The models allow a priori information about the structure of the manifolds to be combined with empirical data. Accurate modeling of the manifolds allows digits to be discriminated using the relative probability densities under the alternative models. One of the methods is grounded in principal components analysis, the other in factor analysis. Both methods are based on locally linear low-dimensional approximations to the underlying data manifold. Links with other methods that model the manifold are discussed.

openalex-author · ACM SIGGRAPH 97 Visual Proceedings: The art and interdisciplinary programs of SIGGRAPH '97 on - SIGGRAPH '97

Learning fast neural network emulators for physics-based models

No abstract available.

openalex-author · Paper

Automated motif discovery in protein structure prediction

The protein structure prediction problem (PSP) is one of the central problems in molecular and structural biology. A computational method that could produce a correct detailed three-dimensional structural model for a protein, given its linear sequence of amino acids, would greatly accelerate progress in the biomedical sciences and industries. This thesis presents PSP as a combinatorial optimization problem, the straightforward formulations of which require search of an exponentially-large conformation space and are known to be NP-Hard. This otherwise intractable search can in practice be reduced or eliminated through the discovery and use of motifs. Motifs are abstractions of observed patterns that encode structurally important relationships among constituent parts of a complex object like a protein tertiary structure. Motif discovery is accomplished by particular combinatorial search and statistical estimation methods. This thesis explores in detail two particular motif discovery subproblems, and discusses how their solutions can be applied to the overall structure prediction problem: (1) For a complex multi-stage prediction task, what makes a good intermediate representation language? We address this question by presenting and analyzing methods for the discovery of protein secondary structure classes that are more predictable from amino acid sequence than the standard classes of $\alpha$-helix, $\beta$-sheet, and random coil. (2) Given a database of M objects, each characterized by values $a_{ij} \in \mathcal{A}_j$ for each of N discrete variables $\{c_j\}_{j=1}^{N}$, return the list of most interesting higher-order features $\gamma_l$, i.e., sets of $k_l$ variables with highest estimated correlation, for any $2 \le k_l \le N$. In the PSP context, the problem is the detection of correlations between amino acid residues in an aligned set of evolutionarily-related protein sequences. 
We present and analyze a fast procedure, based on multinomial sampling and a novel coding scheme, that avoids the exhaustive search, prior limits on the order k, and exponentially large parameter space of other methods. The focus of this thesis is PSP, but the techniques and analysis are also aimed at wider application to other hard, multi-stage prediction problems.

openalex-author · Paper

Bayesian networks for pattern classification, data compression, and channel coding

Pattern classification, data compression, and channel coding are tasks that usually must deal with complex but structured natural or artificial systems. Patterns that we wish to classify are a consequence of a causal physical process. Images that we wish to compress are also a consequence of a causal physical process. Noisy outputs from a telephone line are corrupted versions of a signal produced by a structured man-made telephone modem. Not only are these tasks characterized by complex structure, but they also contain random elements. Graphical models such as Bayesian networks provide a way to describe the relationships between random variables in a stochastic system. In this thesis, I use Bayesian networks as an overarching framework to describe and solve problems in the areas of pattern classification, data compression, and channel coding. Results on the classification of handwritten digits show that Bayesian network pattern classifiers outperform other standard methods, such as the k-nearest neighbor method. When Bayesian networks are used as source models for data compression, an exponentially large number of codewords are associated with each input pattern. It turns out that the code can still be used efficiently, if a new technique called bits-back coding is used. Several new error-correcting decoding algorithms are instances of probability propagation in various Bayesian networks. These new schemes are rapidly closing the gap between the performances of practical channel systems and Shannon's 50-year-old channel limit. The Bayesian network framework exposes the similarities between these and leads the way to a new class of trellis-constraint codes which also operate close to Shannon's limit.

openalex-author · Paper

Evaluation of Gaussian processes and other methods for non-linear regression

This thesis develops two Bayesian learning methods relying on Gaussian processes and a rigorous statistical approach for evaluating such methods. In these experimental designs the sources of uncertainty in the estimated generalisation performances due to both variation in training and test sets are accounted for. The framework allows for estimation of generalisation performance as well as statistical tests of significance for pairwise comparisons. Two experimental designs are recommended and supported by the DELVE software environment. Two new non-parametric Bayesian learning methods relying on Gaussian process priors over functions are developed. These priors are controlled by hyperparameters which set the characteristic length scale for each input dimension. In the simplest method, these parameters are fit from the data using optimization. In the second, fully Bayesian method, a Markov chain Monte Carlo technique is used to integrate over the hyperparameters. One advantage of these Gaussian process methods is that the priors and hyperparameters of the trained models are easy to interpret. The Gaussian process methods are benchmarked against several other methods, on regression tasks using both real data and data generated from realistic simulations. The experiments show that small datasets are unsuitable for benchmarking purposes because the uncertainties in performance measurements are large. A second set of experiments provides strong evidence that the bagging procedure is advantageous for the Multivariate Adaptive Regression Splines (MARS) method. The simulated datasets have controlled characteristics which make them useful for understanding the relationship between properties of the dataset and the performance of different methods. The dependency of the performance on available computation time is also investigated. 
It is shown that a Bayesian approach to learning in multi-layer perceptron neural networks achieves better performance than the commonly used early stopping procedure, even for reasonably short amounts of computation time. The Gaussian process methods are shown to consistently outperform the more conventional methods.

openalex-author · Neural Networks

Varieties of Helmholtz Machine

No abstract available from the OpenAlex source record.

openalex-author · IEEE Transactions on Pattern Analysis and Machine Intelligence

Using generative models for handwritten digit recognition

We describe a method of recognizing handwritten digits by fitting generative models that are built from deformable B-splines with Gaussian "ink generators" spaced along the length of the spline. The splines are adjusted using a novel elastic matching procedure based on the expectation maximization algorithm that maximizes the likelihood of the model generating the data. This approach has many advantages: 1) the system not only produces a classification of the digit but also a rich description of the instantiation parameters which can yield information such as the writing style; 2) the generative models can perform recognition driven segmentation; 3) the method involves a relatively small number of parameters and hence training is relatively easy and fast; and 4) unlike many other recognition schemes, it does not rely on some form of pre-normalization of input images, but can handle arbitrary scalings, translations and a limited degree of image rotation. We have demonstrated that our method of fitting models to images does not get trapped in poor local minima. The main disadvantage of the method is that it requires much more computation than more standard OCR techniques.

openalex-author · http://eprints.aston.ac.uk/664/1/getPDF.pdf

Generative Models for Handwritten Digit Recognition

We describe a method of recognizing handwritten digits by fitting generative models that are built from deformable B-splines with Gaussian "ink generators" spaced along the length of the spline. The splines are adjusted using a novel elastic matching procedure based on the Expectation Maximization (EM) algorithm that maximizes the likelihood of the model generating the data. This approach has many advantages. 1) After identifying the model most likely to have generated the data, the system not only produces a classification of the digit but also a rich description of the instantiation parameters which can yield information such as the writing style. 2) During the process of explaining the image, generative models can perform recognition driven segmentation. 3) The method involves a relatively small number of parameters and hence training is relatively easy and fast. 4) Unlike many other recognition schemes, it does not rely on some form of pre-normalization of input images, but can handle arbitrary scalings, translations and a limited degree of image rotation. We have demonstrated that our method of fitting models to images does not get trapped in poor local minima. The main disadvantage of the method is that it requires much more computation than more standard OCR techniques. Index Terms: deformable model, elastic net, optical character recognition, generative model, probabilistic model, mixture model.

openalex-author · http://www.cs.utoronto.ca/~revow/papers/bCart.ps.Z

Using Pairs of Data-Points to Define Splits for Decision Trees

Conventional binary classification trees such as CART either split the data using axis-aligned hyperplanes or they perform a computationally expensive search in the continuous space of hyperplanes with unrestricted orientations. We show that the limitations of the former can be overcome without resorting to the latter. For every pair of training data-points, there is one hyperplane that is orthogonal to the line joining the data-points and bisects this line. Such hyperplanes are plausible candidates for splits. In a comparison on a suite of 12 datasets we found that this method of generating candidate splits outperformed the standard methods, particularly when the training sets were small. 1 Introduction Binary decision trees come in many flavours, but they all rely on splitting the set of k-dimensional data-points at each internal node into two disjoint sets. Each split is usually performed by projecting the data onto some direction in the k-dimensional space and then thresholding th...
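
The candidate splits described in this abstract can be sketched in a few lines: for a pair of points a and b, the hyperplane orthogonal to the line joining them and bisecting it has normal b - a and passes through the midpoint (a + b) / 2. The code below is a minimal sketch of that geometry only. The function names are hypothetical, and the scoring of candidates (for example by an impurity measure) is not shown because the abstract does not specify it.

```python
from itertools import combinations


def bisecting_hyperplane(a, b):
    """Return (normal, threshold) of the hyperplane that perpendicularly bisects segment a-b."""
    normal = [bi - ai for ai, bi in zip(a, b)]
    midpoint = [(ai + bi) / 2.0 for ai, bi in zip(a, b)]
    # A point x lies on the hyperplane when normal . x == normal . midpoint.
    threshold = sum(n * m for n, m in zip(normal, midpoint))
    return normal, threshold


def goes_right(x, normal, threshold):
    """Project x onto the normal and threshold it; True means x is on b's side."""
    return sum(n * xi for n, xi in zip(normal, x)) > threshold


def candidate_splits(points):
    """One candidate hyperplane per unordered pair of training points."""
    return [bisecting_hyperplane(a, b) for a, b in combinations(points, 2)]


normal, threshold = bisecting_hyperplane((0.0, 0.0), (2.0, 0.0))
```

With n training points this yields n(n - 1)/2 candidates, a finite set of oblique splits that can be evaluated exhaustively instead of searching the continuous space of hyperplane orientations.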

openalex-author · Advances in Neural Information Processing Systems 8

Does the Wake-sleep Algorithm Produce Good Density Estimators?

The wake-sleep algorithm (Hinton, Dayan, Frey and Neal 1995) is a relatively efficient method of fitting a multilayer stochastic generative model to high-dimensional data. In addition to the top-down connections in the generative model, it makes use of bottom-up connections for approximating the probability distribution over the hidden units given the data, and it trains these bottom-up connections using a simple delta rule. We use a variety of synthetic and real data sets to compare the performance of the wake-sleep algorithm with Monte Carlo and mean field methods for fitting the same generative model and also compare it with other models that are less powerful but easier to fit. COMPETITORS We compare the wake-sleep algorithm with six other density estimation methods. All data units are binary and can take on values d_k = 1 (on) and d_k = 0 (off). Gzip. Gzip (Gailly, 1993) is a practical compression method based on Lempel-Ziv coding. This sequential data compression technique encodes future segments of data by transmit-

openalex-author · Science

The "Wake-Sleep" Algorithm for Unsupervised Neural Networks

An unsupervised learning algorithm for a multilayer network of stochastic neurons is described. Bottom-up "recognition" connections convert the input into representations in successive hidden layers, and top-down "generative" connections reconstruct the representation in one layer from the representation in the layer above. In the "wake" phase, neurons are driven by recognition connections, and generative connections are adapted to increase the probability that they would reconstruct the correct activity vector in the layer below. In the "sleep" phase, neurons are driven by generative connections, and recognition connections are adapted to increase the probability that they would produce the correct activity vector in the layer above.

openalex-author · Proceedings of the SIGCHI conference on Human factors in computing systems - CHI '95

Glove-TalkII

Glove-TalkII: an adaptive gesture-to-formant interface. Authors: Sidney Fels and Geoffrey Hinton, Department of Computer Science, University of Toronto, Toronto, ON, Canada, M5S 1A4. CHI '95: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, May 1995, pages 456–463. https://doi.org/10.1145/223904.223966

openalex-author · http://www.science.mcmaster.ca/Psychology/becker/papers/becker-hinton-chapter95.ps.Z

Spatial coherence as an internal teacher for a neural network

Supervised learning procedures for neural networks have recently met with considerable success in learning difficult mappings. So far, however, they have been limited by their poor scaling behaviour, particularly for networks with many hidden layers. A promising alternative is to develop unsupervised learning algorithms by defining objective functions that characterize the quality of an internal representation without requiring knowledge of the desired outputs of the system. Our major goal is to build self-organizing network modules which capture important regularities in the environment in a simple form. A layered hierarchy of such modules should be able to learn in a time roughly linear in the number of layers. We propose that a good objective for perceptual learning is to extract higher-order features that exhibit simple coherence across time or space. This can be done by transforming the input representation into an underlying representation in which the mutual information between ...

openalex-author · Williams, C KI & Hinton, G E 1993, Hand-printed digit recognition using deformable models. in Proceedings of the 1991 York conference on Spacial vision in human

Hand-printed digit recognition using deformable models

Deformable models are an attractive way for characterizing handwritten digits since they have relatively few parameters, are able to capture many topological variations, and incorporate much prior knowledge. We have described a system [8] that uses learned digit models consisting of splines whose shape is governed by a small number of control points. Images can be classified by separately fitting each digit model to the image, and using a simple neural network to decide which model fits best. We use an elastic matching algorithm to minimize an energy function that includes both the deformation energy of the digit model and the log probability that the model would generate the inked pixels in the image. The use of multiple models for each digit can characterize the population of handwritten digits better. We show how multiple models may be used without increasing the time required for elastic matching.

openalex-author · http://www.cse.cuhk.edu.hk/~lxu/papers/nips94-ngate.pdf

An Alternative Model for Mixtures of Experts

An alternative model is proposed for mixtures-of-experts, by utilizing a different parametric form for the gating network. The modified model is trained by an EM algorithm. In comparison with earlier models, trained by either EM or gradient ascent, there is no need to select a learning stepsize to guarantee the convergence of the learning procedure. We report simulation experiments which show that the new architecture yields significantly faster convergence. We also apply the new model to two problem domains: piecewise nonlinear function approximation and combining multiple previously trained classifiers.

openalex-author · Paper

Combining deformable models and neural networks for handprinted digit recognition

In this thesis I develop a method for recognizing isolated handprinted digits using trainable deformable models. Each digit is modelled by a cubic B-spline whose basic shape is defined by the positions of the control points. A Gaussian distribution over displacements of the control points away from their home locations defines a probability distribution over shapes. The quality of the match of a spline model to an image is calculated as the likelihood of the data under a mixture of Gaussian ink generators placed along the length of the spline. Each spline model is adjusted to minimize an energy function that includes both the deformation energy of the model and the likelihood of the data, using an elastic matching procedure which is a generalization of the Expectation Maximization (EM) algorithm. I show that the matching procedure can be significantly speeded up by using a neural net to provide better starting points for the search. The use of deformable models has a number of advantages. (1) After identifying the model most likely to have generated the data, the system not only produces a classification of the digit but also a rich description of the instantiation parameters. I have shown that these can be used to detect writing style consistency within a string of digits. (2) During the process of explaining the image, generative models can perform recognition-driven segmentation. (3) Unlike many other recognition schemes, the method does not rely on some form of pre-normalization of input images, but can handle arbitrary scalings, translations and a limited degree of image rotation. The main disadvantage of the method is that it requires much more computation than more standard optical character recognition techniques.

openalex-author · Neural Information Processing Systems

Glove-TalkII: Mapping Hand Gestures to Speech Using Neural Networks

Glove-TalkII is a system which translates hand gestures to speech through an adaptive interface. Hand gestures are mapped continuously to 10 control parameters of a parallel formant speech synthesizer. The mapping allows the hand to act as an artificial vocal tract that produces speech in real time. This gives an unlimited vocabulary in addition to direct control of fundamental frequency and volume. Currently, the best version of Glove-TalkII uses several input devices (including a CyberGlove, a ContactGlove, a 3-space tracker, and a foot-pedal), a parallel formant speech synthesizer and 3 neural networks. The gesture-to-speech task is divided into vowel and consonant production by using a gating network to weight the outputs of a vowel and a consonant neural network. The gating network and the consonant network are trained with examples from the user. The vowel network implements a fixed, user-defined relationship between hand-position and vowel sound and does not require any training examples from the user. Volume, fundamental frequency and stop consonants are produced with a fixed mapping from the input devices. One subject has trained to speak intelligibly with Glove-TalkII. He speaks slowly with speech quality similar to a text-to-speech synthesizer but with far more natural-sounding pitch variations.

openalex-author · http://learning.cs.toronto.edu/~hinton/absps/nips-ckiw.pdf

Using a neural net to instantiate a deformable model

Deformable models are an attractive approach to recognizing nonrigid objects which have considerable within class variability. However, there are severe search problems associated with fitting the models to data. We show that by using neural networks to provide better starting points, the search time can be significantly reduced. The method is demonstrated on a character recognition task.

openalex-author · http://www.gatsby.ucl.ac.uk/~dayan/papers/hrd95.pdf

Recognizing Handwritten Digits Using Mixtures of Linear Models

We construct a mixture of locally linear generative models of a collection of pixel-based images of digits, and use them for recognition. Different models of a given digit are used to capture different styles of writing, and new images are classified by evaluating their log-likelihoods under each model. We use an EM-based algorithm in which the M-step is computationally straightforward principal components analysis (PCA). Incorporating tangent-plane information [12] about expected local deformations only requires adding tangent vectors into the sample covariance matrices for the PCA, and it demonstrably improves performance.

openalex-author · Paper

Glove-talkii: mapping hand gestures to speech using neural networks. an approach to building adaptive interfaces

No abstract available from the OpenAlex source record.

openalex-author · http://www.gatsby.ucl.ac.uk/hinton/absps/cvq.ps

Autoencoders, Minimum Description Length and Helmholtz Free Energy

An autoencoder network uses a set of recognition weights to convert an input vector into a code vector. It then uses a set of generative weights to convert the code vector into an approximate reconstruction of the input vector. We derive an objective function for training autoencoders based on the Minimum Description Length (MDL) principle. The aim is to minimize the information required to describe both the code vector and the reconstruction error. We show that this information is minimized by choosing code vectors stochastically according to a Boltzmann distribution, where the generative weights define the energy of each possible code vector given the input vector. Unfortunately, if the code vectors use distributed representations, it is exponentially expensive to compute this Boltzmann distribution because it involves all possible code vectors. We show that the recognition weights of an autoencoder can be used to compute an approximation to the Boltzmann distribution and that this approximation gives an upper bound on the description length. Even when this bound is poor, it can be used as a Lyapunov function for learning both the generative and the recognition weights. We demonstrate that this approach can be used to learn factorial codes.

openalex-author · Scientific American

Simulating Brain Damage

No abstract available from the OpenAlex source record.

openalex-author · Neural Computation

Learning Mixture Models of Spatial Coherence

We have previously described an unsupervised learning procedure that discovers spatially coherent properties of the world by maximizing the information that parameters extracted from different parts of the sensory input convey about some common underlying cause. When given random dot stereograms of curved surfaces, this procedure learns to extract surface depth because that is the property that is coherent across space. It also learns how to interpolate the depth at one location from the depths at nearby locations (Becker and Hinton 1992b). In this paper, we propose two new models that handle surfaces with discontinuities. The first model attempts to detect cases of discontinuities and reject them. The second model develops a mixture of expert interpolators. It learns to detect the locations of discontinuities and to invoke specialized, asymmetric interpolators that do not cross the discontinuities.

openalex-author · Proceedings of the sixth annual conference on Computational learning theory - COLT '93

Keeping the neural networks simple by minimizing the description length of the weights

Supervised neural networks generalize well if there is much less information in the weights than there is in the output vectors of the training cases. So during learning, it is important to keep the weights simple by penalizing the amount of information they contain. The amount of information in a weight can be controlled by adding Gaussian noise and the noise level can be adapted during learning to optimize the trade-off between the expected squared error of the network and the amount of information in the weights. We describe a method of computing the derivatives of the expected squared error and of the amount of information in the noisy weights in a network that contains a layer of non-linear hidden units. Provided the output units are linear, the exact derivatives can be computed efficiently without time-consuming Monte Carlo simulations. The idea of minimizing the amount of information that is required to communicate the weights of a neural network leads to a number of interesting...

openalex-author · IEEE Transactions on Neural Networks

Glove-Talk: a neural network interface between a data-glove and a speech synthesizer

To illustrate the potential of multilayer neural networks for adaptive interfaces, a VPL Data-Glove connected to a DECtalk speech synthesizer via five neural networks was used to implement a hand-gesture to speech system. Using minor variations of the standard backpropagation learning procedure, the complex mapping of hand movements to speech is learned using data obtained from a single 'speaker' in a simple training phase. With a 203 gesture-to-word vocabulary, the wrong word is produced less than 1% of the time, and no word is produced about 5% of the time. Adaptive control of the speaking rate and word stress is also available. The training times and final performance speed are improved by using small, separate networks for each naturally defined subtask. The system demonstrates that neural networks can be used to develop the complex mappings required in a high bandwidth interface that adapts to the individual user.

openalex-author · IEEE Transactions on Communications

A soft decision-directed LMS algorithm for blind equalization

An adaptation algorithm for equalizers operating on very distorted channels is presented. The algorithm is based on the idea of adjusting the equalizer tap gains to maximize the likelihood that the equalizer outputs would be generated by a mixture of two Gaussians with known means. The decision-directed least-mean-square algorithm is shown to be an approximation to maximizing the likelihood that the equalizer outputs come from such an independently and identically distributed source. The algorithm is developed in the context of a binary pulse-amplitude-modulation channel, and simulations demonstrate that the algorithm converges in channels for which the decision-directed LMS algorithm does not converge.

openalex-author · Paper

Developing Population Codes For

An efficient and useful representation for an object viewed from different positions is in terms of its instantiation parameters. We show how the Minimum Description Length principle (MDL) can be used to train the hidden units of a neural network to develop a population code for the instantiation parameters of an object in an image. Each hidden unit has a location in a low-dimensional implicit space. If the hidden unit activities form a standard shape (a bump) in this space, they can be cheaply encoded by the center of this bump. So the weights from the input units to the hidden units in a self-supervised network are trained to make the activities form a bump. The coordinates of the hidden units in the implicit space are also learned, thus allowing flexibility, as the network develops separate population codes when presented with different objects.

openalex-author · ICANN ’93

Keeping Neural Networks Simple

No abstract available from the OpenAlex source record.

openalex-author · Investigación y ciencia

Simulación de lesiones cerebrales

No abstract available from the OpenAlex source record.

openalex-author · http://vocal.gatsby.ucl.ac.uk/hinton/absps/dh93.ps.gz

Feudal Reinforcement Learning

One way to speed up reinforcement learning is to enable learning to happen simultaneously at multiple resolutions in space and time. This paper shows how to create a Q-learning managerial hierarchy in which high level managers learn how to set tasks to their sub-managers who, in turn, learn how to satisfy them. Sub-managers need not initially understand their managers' commands. They simply learn to maximise their reinforcement in the context of the current command. We illustrate the system using a simple maze task. As the system learns how to get around, satisfying commands at the multiple levels, it explores more efficiently than standard, flat, Q-learning and builds a more comprehensive map.
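
As a point of reference for the "standard, flat, Q-learning" baseline mentioned above, the following is a minimal sketch of one tabular Q-learning update. It is not the feudal hierarchy itself. The function name `q_update`, the state and action labels, and the values of `alpha` and `gamma` are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical two-action example; unseen state-action pairs default to 0.0.
q = defaultdict(float)
actions = ["left", "right"]
new_value = q_update(q, state=0, action="right", reward=1.0, next_state=1, actions=actions)
```

In a hierarchical variant, one such table could be kept per manager and conditioned on the current command, but that arrangement is only a reading of the abstract and is not shown here.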

openalex-author · Connectionism Theory and Practice

Using Coherence Assumptions to Discover the Underlying Causes of the Sensory Input

This chapter is based on a conference talk given by the first author. In order to make the ideas as intelligible as possible we have attempted to preserve the informal style of the talk. Some of the more technical details can be found in Hinton and Becker (1990), and the full details are given in Becker and Hinton (1989). During the last decade people have discovered new learning procedures for multi-layer networks of simple neuron-like units. Using these new procedures they have succeeded in getting neural networks to solve much more complicated tasks than was previously possible.

openalex-author · Neural Computation

Simplifying Neural Networks by Soft Weight-Sharing

One way of simplifying neural networks so they generalize better is to add an extra term to the error function that will penalize complexity. Simple versions of this approach include penalizing the sum of the squares of the weights or penalizing the number of nonzero weights. We propose a more complicated penalty term in which the distribution of weight values is modeled as a mixture of multiple gaussians. A set of weights is simple if the weights have high probability density under the mixture model. This can be achieved by clustering the weights into subsets with the weights in each cluster having very similar values. Since we do not know the appropriate means or variances of the clusters in advance, we allow the parameters of the mixture model to adapt at the same time as the network learns. Simulations on two different problems demonstrate that this complexity term is more effective than previous complexity terms.
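
The penalty described above can be sketched as the negative log-likelihood of the weights under a one-dimensional mixture of Gaussians. This is a minimal illustration under stated assumptions: the mixing proportions, means, and variances are fixed here for clarity (the abstract says they adapt during learning), and the names `gaussian_pdf` and `mixture_penalty` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_penalty(weights, mix, means, variances):
    """Negative log-likelihood of the weights under a 1-D Gaussian mixture."""
    total = 0.0
    for w in weights:
        density = sum(p * gaussian_pdf(w, m, v) for p, m, v in zip(mix, means, variances))
        total -= math.log(density)
    return total


# Illustrative values only: two narrow clusters centred at 0 and 1.
mix, means, variances = [0.5, 0.5], [0.0, 1.0], [0.01, 0.01]
clustered = [0.01, -0.02, 0.99, 1.03]   # weights close to the cluster centres
scattered = [0.4, -0.5, 0.6, 1.5]       # weights far from both centres

penalty_clustered = mixture_penalty(clustered, mix, means, variances)
penalty_scattered = mixture_penalty(scattered, mix, means, variances)
```

Weights that sit near a cluster centre have high density under the mixture and so contribute a small penalty, which is the sense in which clustered weights count as "simple".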

openalex-author · Nature

Self-organizing neural network that discovers surfaces in random-dot stereograms

No abstract available from the OpenAlex source record.

openalex-author · Investigación y ciencia

Redes neuronales que aprenden de la experiencia

No abstract available from the OpenAlex source record.

openalex-author · Hinton, G E, Williams, C K I & Revow, M D 1991, Adaptive elastic models for hand-printed character recognition. in Advances in Neural Information Processing Sys

Adaptive Elastic Models for Hand-Printed Character Recognition

Hand-printed digits can be modeled as splines that are governed by about 8 control points. For each known digit, the control points have preferred "home" locations, and deformations of the digit are generated by moving the control points away from their home locations. Images of digits can be produced by placing Gaussian ink generators uniformly along the spline. Real images can be recognized by finding the digit model most likely to have generated the data. For each digit model we use an elastic matching algorithm to minimize an energy function that includes both the deformation energy of the digit model and the log probability that the model would generate the inked pixels in the image. The model with the lowest total energy wins. If a uniform noise process is included in the model of image generation, some of the inked pixels can be rejected as noise as a digit model is fitting a poorly segmented image. The digit models learn by modifying the home locations of the control points.

openalex-author · Neural Information Processing Systems

Learning to Make Coherent Predictions in Domains with Discontinuities

We have previously described an unsupervised learning procedure that discovers spatially coherent properties of the world by maximizing the information that parameters extracted from different parts of the sensory input convey about some common underlying cause. When given random dot stereograms of curved surfaces, this procedure learns to extract surface depth because that is the property that is coherent across space. It also learns how to interpolate the depth at one location from the depths at nearby locations (Becker and Hinton, 1992). In this paper, we propose two new models which handle surfaces with discontinuities. The first model attempts to detect cases of discontinuities and reject them. The second model develops a mixture of expert interpolators. It learns to detect the locations of discontinuities and to invoke specialized, asymmetric interpolators that do not cross the discontinuities.

openalex-author · Connectionist Symbol Processing

Mapping Part-Whole Hierarchies Into Connectionist Networks

No abstract available from the OpenAlex source record.

openalex-author · Connectionist Symbol Processing

Preface to the Special Issue on Connectionist Symbol Processing

No abstract available from the OpenAlex source record.

openalex-author · SPIE Proceedings

Learning spatially coherent properties of the visual world in connectionist networks

In the unsupervised learning paradigm, a network of neuron-like units is presented with an ensemble of input patterns from a structured environment, such as the visual world, and learns to represent the regularities in that input. The major goal in developing unsupervised learning algorithms is to find objective functions that characterize the quality of the network's representation without explicitly specifying the desired outputs of any of the units. The sort of objective functions considered cause a unit to become tuned to spatially coherent features of visual images (such as texture, depth, shading, and surface orientation), by learning to predict the outputs of other units which have spatially adjacent receptive fields. Simulations show that using an information-theoretic algorithm called IMAX, a network can be trained to represent depth by observing random dot stereograms of surfaces with continuously varying disparities. Once a layer of depth-tuned units has developed, subsequent layers are trained to perform surface interpolation of curved surfaces, by learning to predict the depth of one image region based on depth measurements in surrounding regions. An extension of the basic model allows a population of competing neurons to learn a distributed code for disparity, which naturally gives rise to a representation of discontinuities.

openalex-author · Neural Computation

Adaptive Mixtures of Local Experts

We present a new supervised learning procedure for systems composed of many separate networks, each of which learns to handle a subset of the complete set of training cases. The new procedure can be viewed either as a modular version of a multilayer supervised network, or as an associative version of competitive learning. It therefore provides a new link between these two apparently different approaches. We demonstrate that the learning procedure divides up a vowel discrimination task into appropriate subtasks, each of which can be solved by a very simple expert network.
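
The way a gating network can divide training cases among experts may be easier to see in code. The sketch below uses a softmax gate and a mixture-style error of the form -log Σ_i p_i exp(-½‖d - o_i‖²), which is a commonly cited objective for this architecture. The abstract itself does not state the objective, so treat the exact form, and the names `softmax`, `moe_loss`, and `responsibilities`, as assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def expert_likelihoods(gate_scores, expert_outputs, target):
    """Gate-weighted, unnormalised Gaussian likelihood of the target under each expert."""
    gates = softmax(gate_scores)
    terms = []
    for g, out in zip(gates, expert_outputs):
        sq_err = sum((t - o) ** 2 for t, o in zip(target, out))
        terms.append(g * math.exp(-0.5 * sq_err))
    return terms


def moe_loss(gate_scores, expert_outputs, target):
    """Negative log of the gate-weighted mixture likelihood (assumed objective)."""
    return -math.log(sum(expert_likelihoods(gate_scores, expert_outputs, target)))


def responsibilities(gate_scores, expert_outputs, target):
    """Posterior share of each expert for this case; the shares sum to 1."""
    terms = expert_likelihoods(gate_scores, expert_outputs, target)
    z = sum(terms)
    return [t / z for t in terms]


# Illustrative case: expert 0 is exactly right, expert 1 is wrong.
target = [1.0, 0.0]
outputs = [[1.0, 0.0], [0.0, 1.0]]
loss_good_gate = moe_loss([2.0, 0.0], outputs, target)   # gate prefers expert 0
loss_bad_gate = moe_loss([0.0, 2.0], outputs, target)    # gate prefers expert 1
shares = responsibilities([0.0, 0.0], outputs, target)
```

Under this objective the expert that already fits a case best receives the largest share of the error signal for it, which encourages experts to specialise on subsets of the cases.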

openalex-author · Paper

Proceedings of the 1990 Summer School on Connectionist models

No abstract available from the OpenAlex source record.

openalex-author · Paper

Tensor Product Variable Binding and the Representation of Symbolic Structures in Connectionist Systems

This chapter contains sections titled: 1. Introduction, 2. Connectionist Representation and Tensor Product Binding: Definition and Examples, 3. Tensor Product Representation: Properties, 4. Conclusion

openalex-author · Connectionist Models

Deterministic Boltzmann Learning in Networks with Asymmetric Connectivity

No abstract available from the OpenAlex source record.

openalex-author · Psychological Review

Lesioning an attractor network: Investigations of acquired dyslexia.

No abstract available from the OpenAlex source record.

openalex-author · Connectionist Models

Mean field networks that learn to discriminate temporally distorted strings

No abstract available from the OpenAlex source record.

openalex-author · Neural Information Processing Systems

Adaptive Soft Weight Tying using Gaussian Mixtures

No abstract available from the OpenAlex source record.

openalex-author · MIT Press eBooks

Advances in Neural Information Processing Systems 4 (NIPS 1991)

No abstract available from the OpenAlex source record.

openalex-author · Neural Information Processing Systems

Evaluation of Adaptive Mixtures of Competing Experts

We compare the performance of the modular architecture, composed of competing expert networks, suggested by Jacobs, Jordan, Nowlan and Hinton (1991) to the performance of a single back-propagation network on a complex, but low-dimensional, vowel recognition task. Simulations reveal that this system is capable of uncovering interesting decompositions in a complex task. The type of decomposition is strongly influenced by the nature of the input to the gating network that decides which expert to use for each case. The modular architecture also exhibits consistently better generalization on many variations of the task.

openalex-author · http://papers.nips.cc/paper/329-discovering-viewpoint-invariant-relationships-that-characterize-objects.pdf

Discovering Viewpoint-Invariant Relationships That Characterize Objects

Using an unsupervised learning procedure, a network is trained on an ensemble of images of the same two-dimensional object at different positions, orientations and sizes. Each half of the network "sees" one fragment of the object, and tries to produce as output a set of 4 parameters that have high mutual information with the 4 parameters output by the other half of the network. Given the ensemble of training patterns, the 4 parameters on which the two halves of the network can agree are the position, orientation, and size of the whole object, or some recoding of them. After training, the network can reject instances of other shapes by using the fact that the predictions made by its two halves disagree. If two competing networks are trained on an unlabelled mixture of images of two objects, they cluster the training cases on the basis of the objects' shapes, independently of the position, orientation, and size.

openalex-author · Nature

Mental simulation

No abstract available from the OpenAlex source record.

openalex-author · Wiley-Interscience eBooks

Connectionist architectures for artificial intelligence

No abstract available from the OpenAlex source record.

openalex-author · Neural Computation

The Bootstrap Widrow-Hoff Rule as a Cluster-Formation Algorithm

An algorithm that is widely used for adaptive equalization in current modems is the “bootstrap” or “decision-directed” version of the Widrow-Hoff rule. We show that this algorithm can be viewed as an unsupervised clustering algorithm in which the data points are transformed so that they form two clusters that are as tight as possible. The standard algorithm performs gradient ascent in a crude model of the log likelihood of generating the transformed data points from two gaussian distributions with fixed centers. Better convergence is achieved by using the exact gradient of the log likelihood.
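
The contrast between the decision-directed rule and the exact likelihood gradient can be sketched as follows. The sketch assumes two Gaussians with centres at +1 and -1, equal mixing proportions, and a shared fixed variance. Under those assumptions the gradient of the log likelihood with respect to the equalizer output y is proportional to tanh(y / variance) - y, so the hard sign target is replaced by a soft one. The function names and the learning rate are illustrative.

```python
import math


def equalizer_output(weights, inputs):
    """Linear equalizer: weighted sum of the tap inputs."""
    return sum(w * x for w, x in zip(weights, inputs))


def hard_target(y):
    """Decision-directed target: the nearest of the two centres, +1 or -1."""
    return 1.0 if y >= 0 else -1.0


def soft_target(y, var):
    """Posterior-weighted target under two equal-prior Gaussians centred at +1 and -1."""
    return math.tanh(y / var)


def lms_step(weights, inputs, target_fn, lr=0.05):
    """One Widrow-Hoff style update toward whatever target_fn returns for the output."""
    y = equalizer_output(weights, inputs)
    err = target_fn(y) - y
    return [w + lr * err * x for w, x in zip(weights, inputs)]


# Illustrative single step with each target.
w0 = [0.0, 0.0]
x = [1.0, 2.0]
w_hard = lms_step(w0, x, hard_target, lr=0.1)
w_soft = lms_step(w0, x, lambda y: soft_target(y, 0.5), lr=0.1)
```

When the output is far from zero relative to the variance, the two targets nearly coincide. Near zero the soft target stays small, so ambiguous outputs produce smaller updates than a hard decision would.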

openalex-author · International Conference on Human-Computer Interaction

Building adaptive interfaces with neural networks: The glove-talk pilot study

No abstract available from the OpenAlex source record.

openalex-author · US Dept of the Navy

Connectionist Models: Proceedings of the Summer School Held in San Diego, California on 1990

The simplicity and locality of the contrastive Hebb synapse (CHS) used in Boltzmann machine learning makes it an attractive model for real biological synapses. The slow learning exhibited by the stochastic Boltzmann machine can be greatly improved by using a mean field approximation and it has been shown (Hinton, 1989) that the CHS also performs steepest descent in these deterministic mean field networks. A major weakness of the learning procedure, from a biological perspective, is that the derivation assumes detailed symmetry of the connectivity. Using networks with purely asymmetric connectivity, we show that the CHS still works in practice provided the connectivity is grossly symmetrical so that if unit i sends a connection to unit j, there are numerous indirect feedback paths from j to i.

openalex-author · Medical Entomology and Zoology

Mundane reasoning by parallel constraint satisfaction

This thesis describes a frame system similar to KL-ONE, called micro-KLONE, for representing and reasoning about knowledge which may be incomplete or inconsistent. An unusual semantics appropriate to familiar situations is proposed. It is based on probabilistic sampling to find a single plausible model of the domain in order to answer a query. Correct answering of queries is intractable, so the implementation makes two approximations in order to run quickly: (1) The underlying connectionist architecture is only large enough to represent partial models of the domain, and (2) the system is only allowed to search for a limited time, so it may not even find the best partial interpretation. Lacking a provably correct implementation, the usefulness of the system becomes an empirical question. The Ted Turner problem is presented as an example in which the system draws an interesting common sense conclusion to a counterfactual query.

openalex-author · Machine Learning

CONNECTIONIST LEARNING PROCEDURES

Note: This chapter appeared in Volume 40 of Artificial Intelligence in 1989, reprinted with permission of North-Holland Publishing. It is a revised version of Technical Report CMU-CS-87-115, which has the same title and was prepared in June 1987 while the author was at Carnegie Mellon University. The research was supported by contract N00014-86-K-00167 from the Office of Naval Research and by grant IST-8520359 from the National Science Foundation.

No abstract available from the OpenAlex source record.

openalex-author · Neural Networks

A time-delay neural network architecture for isolated word recognition

No abstract available from the OpenAlex source record.

openalex-author · Artificial Intelligence

Connectionist learning procedures

No abstract available from the OpenAlex source record.

openalex-author · IEEE Transactions on Acoustics, Speech, and Signal Processing

Phoneme recognition using time-delay neural networks

The authors present a time-delay neural network (TDNN) approach to phoneme recognition which is characterized by two important properties: (1) using a three-layer arrangement of simple computing units, a hierarchy can be constructed that allows for the formation of arbitrary nonlinear decision surfaces, which the TDNN learns automatically using error backpropagation; and (2) the time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independently of position in time and therefore not blurred by temporal shifts in the input. As a recognition task, the speaker-dependent recognition of the phonemes B, D, and G in varying phonetic contexts was chosen. For comparison, several discrete hidden Markov models (HMM) were trained to perform the same task. Performance evaluation over 1946 testing tokens from three speakers showed that the TDNN achieves a recognition rate of 98.5% correct while the rate obtained by the best of the HMMs was only 93.7%.
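
Property (2) above, detection of a feature independently of its position in time, can be illustrated with a single shared-weight unit slid across the input frames. This is a minimal sketch and not the architecture from the paper: the kernel, bias, two-feature frames, and the function name `time_delay_layer` are all made up for illustration.

```python
import math


def time_delay_layer(frames, kernel, bias=0.0):
    """Apply one shared-weight sigmoid unit to every window of len(kernel) consecutive frames."""
    width = len(kernel)
    outputs = []
    for start in range(len(frames) - width + 1):
        window = frames[start:start + width]
        total = bias
        for frame, weights in zip(window, kernel):
            total += sum(f * w for f, w in zip(frame, weights))
        outputs.append(1.0 / (1.0 + math.exp(-total)))
    return outputs


# Hypothetical 3-frame pattern with 2 features per frame, and a kernel tuned to it.
pattern = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
silence = [0.0, 0.0]

early = [silence] + pattern + [silence] * 4       # pattern starts at frame 1
late = [silence] * 3 + pattern + [silence] * 2    # same pattern, starts at frame 3

early_out = time_delay_layer(early, kernel, bias=-2.0)
late_out = time_delay_layer(late, kernel, bias=-2.0)
```

Because the same weights are applied at every time step, shifting the input shifts the unit's response by the same amount, and the peak response is unchanged.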

openalex-author · Proceedings of the Annual Meeting of the Cognitive Science Society, vol 8, iss 0

Learning distributed representations of concepts.

Concepts can be represented by distributed patterns of activity in networks of neuron-like units. One advantage of this kind of representation is that it leads to automatic generalization. When the weights in the network are changed to incorporate new knowledge about one concept, the changes affect the knowledge associated with other concepts that are represented by similar activity patterns. There have been numerous demonstrations of sensible generalization which have depended on the experimenter choosing appropriately similar patterns for different concepts. This paper shows how the network can be made to choose the patterns itself when shown a set of propositions that use the concepts. It chooses patterns which make explicit the underlying features that are only implicit in the propositions it is shown.

openalex-author · Neural Information Processing Systems

Discovering High Order Features with Mean Field Modules

A new form of the deterministic Boltzmann machine (DBM) learning procedure is presented which can efficiently train network modules to discriminate between input vectors according to some criterion. The new technique directly utilizes the free energy of these "mean field modules" to represent the probability that the criterion is met, the free energy being readily manipulated by the learning procedure. Although conventional deterministic Boltzmann learning fails to extract the higher order feature of shift at a network bottleneck, combining the new mean field modules with the mutual information objective function rapidly produces modules that perfectly extract this important higher order feature without direct external supervision.

openalex-author · Neural Information Processing Systems

Dimensionality Reduction and Prior Knowledge in E-Set Recognition

It is well known that when an automatic learning algorithm is applied to a fixed corpus of data, the size of the corpus places an upper bound on the number of degrees of freedom that the model can contain if it is to generalize well. Because the amount of hardware in a neural network typically increases with the dimensionality of its inputs, it can be challenging to build a high-performance network for classifying large input patterns. In this paper, several techniques for addressing this problem are discussed in the context of an isolated word recognition task.

openalex-author · M. Kaufmann eBooks

Proceedings of the 1988 Connectionist Models Summer School

No abstract available from the OpenAlex source record.

openalex-author · Cognition

Scene-based and viewer-centered representations for comparing shapes

No abstract available from the OpenAlex source record.

openalex-author · Cognitive Science

A Distributed Connectionist Production System

DCPS is a connectionist production system interpreter that uses distributed representations. As a connectionist model it consists of many simple, richly interconnected neuron‐like computing units that cooperate to solve problems in parallel. One motivation for constructing DCPS was to demonstrate that connectionist models are capable of representing and using explicit rules. A second motivation was to show how “coarse coding” or “distributed representations” can be used to construct a working memory that requires far fewer units than the number of different facts that can potentially be stored. The simulation we present is intended as a detailed demonstration of the feasibility of certain ideas and should not be viewed as a full implementation of production systems. Our current model only has a few of the many interesting emergent properties that we eventually hope to demonstrate: It is damage‐resistant, it performs matching and variable binding by massively parallel constraint satisfaction, and the capacity of its working memory is dependent on the similarity of the items being stored.

openalex-author · The Journal of the Acoustical Society of America

Speech recognition using time-delay neural networks

A time-delay neural network (TDNN) approach to speech recognition is presented that is characterized by two important properties: (1) Using multilayer arrangements of simple computing units, a TDNN can represent arbitrary nonlinear classification decision surfaces that are learned automatically using error back propagation. (2) The time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independent of position in time and, hence, not blurred by temporal shifts in the input. The TDNNs are compared with the currently most popular technique in speech recognition, hidden Markov models (HMM). Extensive performance evaluation shows that the TDNN recognizes voiced stops extracted from varying phonetic contexts at an error rate four times lower (1.5% vs 6.3%) than the best of our HMMs. To perform this task, the TDNN “invented” well-known acoustic-phonetic features (e.g., F2 rise, F2 fall, vowel onset) as useful abstractions. It also developed alternate internal representations to link different acoustic realizations to the same concept. The TDNNs trained for other phonetic classes achieve similar high levels of performance. The integration of such smaller networks into large phonetic nets is discussed, and strategies are proposed for the design of neural network based large vocabulary speech recognition systems.

openalex-author · Neurocomputing, Volume 1

(1985) David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive Science 9: 147-169

No abstract available from the OpenAlex source record.

openalex-author · Neurocomputing, Volume 1

(1986) D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Vol. I, D. E. Rumelhart and J. L. McClelland (Eds.) Cambridge, MA: MIT Press, pp. 318-362

No abstract available from the OpenAlex source record.

openalex-author · Neurocomputing, Volume 1

(1986) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, "Learning representations by back-propagating errors," Nature 323: 533-536

No abstract available from the OpenAlex source record.

openalex-author · http://www.research.att.com/~yann/exdb/publis/psgz/lecun-89b.ps.gz

GEMINI: Gradient Estimation Through Matrix Inversion After Noise Injection

Learning procedures that measure how random perturbations of unit activities correlate with changes in reinforcement are inefficient but simple to implement in hardware. Procedures like back-propagation (Rumelhart, Hinton and Williams, 1986) which compute how changes in activities affect the output error are much more efficient, but require more complex hardware. GEMINI is a hybrid procedure for multilayer networks, which shares many of the implementation advantages of correlational reinforcement procedures but is more efficient. GEMINI injects noise only at the first hidden layer and measures the resultant effect on the output error. A linear network associated with each hidden layer iteratively inverts the matrix which relates the noise to the error change, thereby obtaining the error-derivatives. No back-propagation is involved, thus allowing unknown non-linearities in the system. Two simulations demonstrate the effectiveness of GEMINI.
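The idea of recovering error derivatives from injected noise can be sketched as follows. This is a simplified illustration, not the paper's procedure: it assumes a toy quadratic error, treats the first hidden layer's activities as the only place noise enters, and uses a normalized least-mean-squares update as a stand-in for the iterative linear network. All names and constants are hypothetical.

```python
import random

def error(hidden):
    """Toy black-box output error as a function of hidden activities (hypothetical)."""
    target = [0.5, -1.0, 2.0]
    return sum((h - t) ** 2 for h, t in zip(hidden, target))

def estimate_gradient(hidden, sigma=0.01, steps=2000, lr=0.5, seed=0):
    """Estimate dE/dhidden from noise/error-change pairs, without back-propagation."""
    rng = random.Random(seed)
    base = error(hidden)
    g_hat = [0.0] * len(hidden)
    for _ in range(steps):
        # Inject small noise and measure the resulting change in error.
        noise = [rng.gauss(0.0, sigma) for _ in hidden]
        delta_e = error([h + n for h, n in zip(hidden, noise)]) - base
        # A linear model predicts the error change from the noise ...
        predicted = sum(g * n for g, n in zip(g_hat, noise))
        residual = delta_e - predicted
        # ... and is corrected iteratively (normalized LMS step).
        norm = sum(n * n for n in noise) or 1e-12
        g_hat = [g + lr * residual * n / norm for g, n in zip(g_hat, noise)]
    return g_hat

estimate = estimate_gradient([0.0, 0.0, 0.0])
true_gradient = [-1.0, 2.0, -4.0]  # analytic 2 * (h - target) at h = 0
```

The estimate only approximates the analytic gradient, since the noise has finite size and the error is not exactly linear in it.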

openalex-author · Readings in Cognitive Science

The Appeal of Parallel Distributed Processing

No abstract available from the OpenAlex source record.

openalex-author · Readings in Cognitive Science

Schemata and Sequential Thought Processes in PDP Models

No abstract available from the OpenAlex source record.

openalex-author · Paper

Neural network architectures for artificial intelligence

No abstract available from the OpenAlex source record.

openalex-author · Le Débat

Une nouvelle approche de la cognition : le connexionnisme

No abstract available from the OpenAlex source record.

openalex-author · Perception

The Horizontal—Vertical Delusion

Most people can correctly apply the concepts of horizontal and vertical in describing objects, but a simple demonstration shows that they are confused about how these concepts work. The nature of the confusion and its possible causes are briefly discussed.

openalex-author · Vision, Brain, and Cooperative Computation

Separating Figure from Ground with a Boltzmann Machine

No abstract available from the OpenAlex source record.

openalex-author · Computer Speech & Language

Learning sets of filters using back-propagation

No abstract available from the OpenAlex source record.

openalex-author · Computational Intelligence

Models of human inference

No abstract available from the OpenAlex source record.

openalex-author · http://papers.nips.cc/paper/78-learning-representations-by-recirculation.pdf

Learning Representations by Recirculation

A new learning procedure for networks that contain several groups of nonlinear units arranged in a closed loop is described. The procedure modifies the weights on the connections between groups so that the training patterns over the input group return unaltered after passing around the loop. The learning rule amounts to changing each weight by an amount proportional to the product of the presynaptic activity and the rate of change of the postsynaptic activity. It is much simpler to implement in hardware than methods like back-propagation. Simulations show that it usually converges rapidly, and analysis shows that in certain restricted cases it performs gradient descent in a measure of how much the training patterns are altered by passing around the loop.
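The learning rule stated in this abstract can be sketched as a single function. The function name, the argument layout, and the sign convention (earlier activity minus later activity, so that activities are pulled back toward their values on the first pass) are assumptions for illustration.

```python
def recirculation_delta(pre, post_before, post_after, lr=0.1):
    """Weight changes for one pass around the loop.

    pre:         presynaptic activities (list of floats)
    post_before: postsynaptic activities on the earlier pass
    post_after:  postsynaptic activities on the later pass
    Returns a matrix delta[i][j] = lr * pre[i] * (post_before[j] - post_after[j]).
    """
    return [[lr * p * (b - a) for b, a in zip(post_before, post_after)]
            for p in pre]

# Only connections from active presynaptic units to units whose activity changed are modified.
delta = recirculation_delta([1.0, 0.0], [0.8, 0.2], [0.6, 0.2], lr=0.5)
```

Each weight change needs only quantities available at the two ends of the connection, which is consistent with the abstract's remark about hardware simplicity.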

openalex-author · Lecture Notes in Computer Science

Learning translation invariant recognition in massively parallel networks

No abstract available from the OpenAlex source record.

openalex-author · Readings in Computer Vision

A Learning Algorithm for Boltzmann Machines

The research reported here was supported by grants from the System Development Foundation. We thank Peter Brown, Francis Crick, Mark Derthick, Scott Fahlman, Jerry Feldman, Stuart Geman, Gail Gong, John Hopfield, Jay McClelland, Barak Pearlmutter, Harry Printz, Dave Rumelhart, Tim Shallice, Paul Smolensky, Rick Szeliski, and Venkataraman Venkatasubramanian for helpful discussions. Reprint requests should be addressed to David Ackley, Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA 15213.

No abstract available from the OpenAlex source record.

openalex-author · Physica D: Nonlinear Phenomena

Learning symmetry groups with hidden units: Beyond the perceptron

Learning to recognize mirror, rotational and translational symmetries is a difficult problem for massively-parallel network models. These symmetries cannot be learned by first-order perceptrons or Hopfield networks, which have no means for incorporating additional adaptive units that are hidden from the input and output layers. We demonstrate that the Boltzmann learning algorithm is capable of finding sets of weights which turn hidden units into useful higher-order feature detectors capable of solving symmetry problems.

openalex-author · National Conference on Artificial Intelligence

Learning in massively parallel nets

The human brain is very different from a conventional digital computer. It relies on massive parallelism rather than raw speed and it stores long-term knowledge by modifying the way its processing elements interact rather than by setting bits in a passive, general purpose memory. It is robust against minor physical damage and it learns from experience instead of being explicitly programmed. We do not yet know how the brain uses the activities of neurons to represent complex, articulated structures, or how the perceptual system turns the raw input into useful internal representations so rapidly. Nor do we know how the brain learns new representational schemes. But over the past few years there have been a lot of new and interesting theories about these issues. Much of the theorizing has been motivated by the belief that the brain is using computational principles which could also be applied to massively parallel artificial systems, if only we knew what the principles were. In the talk, I shall focus on the issue of learning. Early research on perceptrons and associative nets (or matrix memories) showed how to set the weights of the connections between input units and output units so that a pattern of activity on the input units would cause the desired pattern of activity on the output units. A variant, called the auto-associative net, did not distinguish between input and output units. It modified the weights of pairwise inter-connections among the units to ensure that any sufficiently large part of a stored pattern could recreate the rest. Recently, Hopfield has developed an interesting way of analyzing the behavior of iterative, auto-associative nets, but research on simple associative networks is generally of limited interest because most interesting tasks are too complex to be performed by auto-association or by direct connections from the input units to the output units. 
Many intervening layers of units are generally required and the tough learning problem is to decide how to use these hidden units. The reason this is so difficult is that we are requiring the network to invent its own representational scheme, and the space of possible schemes is immense, even if we restrict ourselves to those that can be implemented conveniently in networks of neuron-like units.

openalex-author · http://www.cnbc.cmu.edu/~plaut/papers/pdf/PlautNowlanHinton86TR.backprop.pdf

Experiments on Learning by Back Propagation.

Rumelhart, Hinton and Williams [Rumelhart et al. 86] describe a learning procedure for layered networks of deterministic, neuron-like units. This paper describes further research on the learning procedure. We start by describing the units, the way they are connected, the learning procedure, and the extension to iterative nets. We then give an example in which a network learns a set of filters that enable it to discriminate formant-like patterns in the presence of noise. The speed of learning is strongly dependent on the shape of the surface formed by the error measure in "weight space." We give examples of the shape of the error surface for a typical task and illustrate how an acceleration method speeds up descent in weight space. The main drawback of the learning procedure is the way it scales as the size of the task and the network increases. We give some preliminary results on scaling and show how the magnitude of the optimal weight changes depends on the fan-in of the units. Additional results illustrate the effects on learning speed of the amount of interaction between the weights. A variation of the learning procedure that back-propagates desired state information rather than error gradients is developed and compared with the standard procedure. Finally, we discuss the relationship between our iterative networks and the "analog" networks described by Hopfield and Tank [Hopfield and Tank 85]. The learning procedure can discover appropriate weights in their kind of network, as well as determine an optimal schedule for varying the nonlinearity of the units during a search.
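The abstract does not name the acceleration method. The sketch below assumes it is a momentum term, in which each weight change adds a fraction of the previous change, and applies it to a hypothetical ravine-shaped quadratic error surface. The surface, the step size, and the momentum value are illustrative choices only.

```python
def descend(grad, w, lr, momentum, steps):
    """Gradient descent where each step adds a fraction of the previous weight change."""
    velocity = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        velocity = [momentum * v - lr * gi for v, gi in zip(velocity, g)]
        w = [wi + vi for wi, vi in zip(w, velocity)]
    return w

def ravine_error(w):
    """Elongated error surface: shallow along w[0], steep along w[1]."""
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)

def ravine_grad(w):
    return [w[0], 100.0 * w[1]]

start = [10.0, 1.0]
plain = descend(ravine_grad, start, lr=0.01, momentum=0.0, steps=100)
accelerated = descend(ravine_grad, start, lr=0.01, momentum=0.9, steps=100)
```

On this surface the step size is limited by the steep direction, so plain descent creeps along the shallow one. The momentum run accumulates speed along the floor of the ravine and ends with a much lower error after the same number of steps.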

openalex-author · Perception

Separating Figure from Ground with a Parallel Network

The differentiation of figure from ground plays an important role in the perceptual organization of visual stimuli. The rapidity with which we can discriminate the inside from the outside of a figure suggests that at least this step in the process may be performed in visual cortex by a large number of neurons in several different areas working together in parallel. We have attempted to simulate this collective computation by designing a network of simple processing units that receives two types of information: bottom-up input from the image containing the outlines of a figure, which may be incomplete, and a top-down attentional input that biases one part of the image to be the inside of the figure. No presegmentation of the image was assumed. Two methods for performing the computation were explored: gradient descent, which seeks locally optimal states, and simulated annealing, which attempts to find globally optimal states by introducing noise into the computation. For complete outlines, gradient descent was faster, but the range of input parameters leading to successful performance was very narrow. In contrast, simulated annealing was more robust: it worked over a wider range of attention parameters and a wider range of outlines, including incomplete ones. Our network model is too simplified to serve as a model of human performance, but it does demonstrate that one global property of outlines can be computed through local interactions in a parallel network. Some features of the model, such as the role of noise in escaping from nonglobal optima, may generalize to more realistic models.
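The contrast between the two search methods can be sketched on a generic network of binary units with symmetric weights. This is not the figure-ground network from the paper: the three-unit weights, the biases, and the temperature schedule are hypothetical and were chosen so that greedy descent gets stuck in a local optimum.

```python
import math
import random

def energy(state, weights, bias):
    """E = -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i for binary states in {0, 1}."""
    n = len(state)
    pair = sum(weights[i][j] * state[i] * state[j]
               for i in range(n) for j in range(i + 1, n))
    return -pair - sum(b * s for b, s in zip(bias, state))

def energy_gap(i, state, weights, bias):
    """Energy with unit i off minus energy with unit i on."""
    return bias[i] + sum(weights[i][j] * state[j]
                         for j in range(len(state)) if j != i)

def greedy_descent(state, weights, bias, sweeps=10):
    """Deterministic search: each unit takes whichever state does not raise the energy."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            state[i] = 1 if energy_gap(i, state, weights, bias) > 0 else 0
    return state

def anneal(state, weights, bias, temps, rng):
    """Stochastic search with a falling temperature; returns the best state visited."""
    state = list(state)
    best = list(state)
    for t in temps:
        for i in range(len(state)):
            p_on = 1.0 / (1.0 + math.exp(-energy_gap(i, state, weights, bias) / t))
            state[i] = 1 if rng.random() < p_on else 0
            if energy(state, weights, bias) < energy(best, weights, bias):
                best = list(state)
    return best

weights = [[0, 2, -1], [2, 0, -1], [-1, -1, 0]]  # units 0 and 1 support each other
bias = [-0.5, -0.5, 0.2]
start = [0, 0, 0]

stuck = greedy_descent(start, weights, bias)
temps = [2.0 * (0.9 ** k) for k in range(30)]
annealed = anneal(start, weights, bias, temps, random.Random(0))
```

Greedy descent switches on only unit 2 and stops there, because turning on unit 0 or unit 1 alone raises the energy. The lower-energy state with units 0 and 1 both on is reachable only by passing through a worse state, which the noisy updates in `anneal` permit.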

openalex-author · MIT Press eBooks

A general framework for parallel distributed processing

This chapter contains sections titled: Classes of PDP Models, Specific Versions of the General Parallel Activation Model, Sigma-Pi Units, Conclusion, Acknowledgments

openalex-author · AIP Conference Proceedings

G-maximization: an unsupervised learning procedure for discovering regularities

Hill climbing is used to maximize an information theoretic measure of the difference between the actual behavior of a unit and the behavior that would be predicted by a statistician who knew the first order statistics of the inputs but believed them to be independent. This causes the unit to detect higher order correlations among its inputs. Initial simulations are presented, and seem encouraging. We describe an extension of the basic idea which makes it resemble competitive learning and which causes members of a population of these units to differentiate, each extracting different structure from the input.

openalex-author · Paper

APPRENTISSAGE DANS LES MACHINES DE BOLTZMANN

No legible abstract available from the OpenAlex source record.

openalex-author · National Conference on Artificial Intelligence

Learning in Massively Parallel Nets (Panel).

No abstract available from the OpenAlex source record.

openalex-author · International Joint Conference on Artificial Intelligence

Shape recognition and illusory conjunctions

No abstract matching this title is available from the OpenAlex source record.

openalex-author · Behavioral and Brain Sciences

Three frames suffice

No abstract available from the OpenAlex source record.

openalex-author · Cognitive Science

A learning algorithm for boltzmann machines

The computational power of massively parallel networks of simple processing elements resides in the communication bandwidth provided by the hardware connections between elements. These connections can allow a significant fraction of the knowledge of the system to be applied to an instance of a problem in a very short time. One kind of computation for which massively parallel networks appear to be well suited is large constraint satisfaction searches, but to use the connections efficiently two conditions must be met: First, a search technique that is suitable for parallel networks must be found. Second, there must be some way of choosing internal representations which allow the preexisting hardware connections to be used efficiently for encoding the constraints in the domain being searched. We describe a general parallel search method, based on statistical mechanics, and we show how it leads to a general learning rule for modifying the connection strengths so as to incorporate knowledge about a task domain in an efficient way. We describe some simple examples in which the learning algorithm creates internal representations that are demonstrably the most efficient way of using the preexisting connectivity structure.
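The search method and learning rule described in this abstract are usually written as follows. The symbols are not defined in the abstract itself and are introduced here as assumptions: binary unit states $s_i$, symmetric weights $w_{ij}$, thresholds $\theta_i$, temperature $T$, learning rate $\varepsilon$, and co-activation probabilities $p_{ij}$ (environment clamped) and $p'_{ij}$ (free-running), both measured at thermal equilibrium.

```latex
% Energy of a global state of the network
E = -\sum_{i<j} w_{ij}\, s_i s_j + \sum_i \theta_i s_i

% Stochastic decision rule: unit k turns on with probability p_k,
% where \Delta E_k is the energy gap between its off and on states
p_k = \frac{1}{1 + e^{-\Delta E_k / T}}

% Learning rule: difference of co-activation statistics in the two phases
\Delta w_{ij} = \varepsilon \left( p_{ij} - p'_{ij} \right)
```

The learning rule uses only statistics local to each connection, which is what allows the preexisting hardware connections to encode the constraints of the domain.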

openalex-author · Perception

Why the Islands Move

Micronesian navigators routinely make voyages across large expanses of open ocean. To do this, a navigator must judge both the direction in which he is sailing and the distance he has travelled. The rising and setting points of the stars (and other cues) provide instantaneous information about direction, but distance can only be judged by integrating velocity-related information over time. Micronesian navigators judge distance in a way that seems odd. When they are out of sight of land, they imagine that the canoe is stationary and that the islands move back past them. For each voyage, they 'attend' to an island off to the side of the course which is out of sight over the horizon. As they sail, they imagine the island moving back along the horizon changing in bearing until it is imagined to be under the bearing it is known to have from the destination island. Then they know they are near their destination. There is good reason for using a frame of reference whose origin is defined by the boat. We show how it finesses a perceptual paradox--the rising and setting points of the stars do not exhibit motion parallax.

openalex-author · Journal of Motor Behavior

Parallel Computations for Controlling an Arm

In order to control a reaching movement of the arm and body, several different computational problems must be solved. Some parallel methods that could be implemented in networks of neuron-like processors are described. Each method solves a different part of the overall task. First, a method is described for finding the torques necessary to follow a desired trajectory. The method is more economical and more versatile than table look-up and requires very few sequential steps. Then a way of generating an internal representation of a desired trajectory is described. This method shows the trajectory one piece at a time by applying a large set of heuristic rules to a "motion blackboard" that represents the static and dynamic parameters of the state of the body at the current point in the trajectory. The computations are simplified by expressing the positions, orientations, and motions of parts of the body in terms of a single, non-accelerating, world-based frame of reference, rather than in terms of the joint-angles or an egocentric frame based on the body itself.

openalex-author · Advances in Psychology

Chapter IVb Some Computational Solutions to Bernstein's Problems

No abstract available from the OpenAlex source record.

openalex-author · Proceedings of the Annual Meeting of the Cognitive Science Society, vol 6, iss 0

LEARNING SEMANTIC FEATURES

No abstract available from the OpenAlex source record.

openalex-author · Nature

Parallel visual computation

No abstract available from the OpenAlex source record.

openalex-author · Biotechnology reports (Amsterdam, Netherlands)

Massively parallel architectures for AI: netl, thistle, and boltzmann machines

No abstract matching this title is available from the OpenAlex source record.

openalex-author · Journal of rehabilitation

Preventive psychotherapeutic measures for use with non-vocal clients. A case study.

No abstract available from the OpenAlex source record.

openalex-author · http://bi.snu.ac.kr/Courses/4ai10f/Papers/Fahlman 1983 - Massively parallel architectures for AI NETL. Thistle and Boltzmann machines.pdf

MASSIVELY PARALLEL ARCHITECTURES FOR AI: NETL, THISTLE, AND BOLTZMANN MACHINES

It is becoming increasingly apparent that some aspects of intelligent behavior require enormous computational power and that some sort of massively parallel computing architecture is the most plausible way to deliver such power. Parallelism, rather than raw speed of the computing elements, seems to be the way that the brain gets such jobs done. But even if the need for massive parallelism is admitted, there is still the question of what kind of parallel architecture best fits the needs of various AI tasks. In this paper we will attempt to isolate a number of basic computational tasks that an intelligent system must perform. We will describe several families of massively parallel computing architectures, and we will see which of these computational tasks can be handled by each of these families. In particular, we will describe a new architecture, which we call the Boltzmann machine, whose abilities appear to include a number of tasks that are inefficient or impossible on the other architectures. FAMILIES OF PARALLEL ARCHITECTURES. By “massively parallel” architectures, we mean machines with a very large number of processing elements (perhaps very simple ones) working on a single task. A massively parallel system may be complete and self-contained or it may be a special-purpose device, performing some particular task as part of a larger system that contains other modules of a different character. In this paper we will focus on the computation performed by a single parallel module, ignoring the issue of how to integrate a collection of modules into a complete system.

openalex-author · http://www.cs.toronto.edu/~hinton/absps/optimal.pdf

OPTIMAL PERCEPTUAL INFERENCE

When a vision system creates an interpretation of some input data, it assigns truth values or probabilities to internal hypotheses about the world. We present a non-deterministic method for assigning truth values that avoids many of the problems encountered by existing relaxation methods. Instead of representing probabilities with real numbers, we use a more direct encoding in which the probability associated with a hypothesis is represented by the probability that it is in one of two states, true or false. We give a particular non-deterministic operator, based on statistical mechanics, for updating the truth values of hypotheses. The operator ensures that the probability of discovering a particular combination of hypotheses is a simple function of how good that combination is. We show that there is a simple relationship between this operator and Bayesian inference, and we describe a learning rule which allows a parallel system to converge on a set of weights that optimizes its perceptual inferences. Introduction. One way of interpreting images is to formulate hypotheses about parts or aspects of the image and then decide which of these hypotheses are likely to be correct. The probability that each hypothesis is correct is determined partly by its fit to the image and partly by its fit to other hypotheses that are taken to be correct, so the truth value of an individual hypothesis cannot be decided in isolation. One method of searching for the most plausible combination of hypotheses is to use a relaxation process in which a probability is associated with each hypothesis, and the probabilities are then iteratively modified on the basis of the fit to the image and the known relationships between hypotheses.
An attractive property of relaxation methods is that they can be implemented in parallel hardware where one computational unit is used for each possible hypothesis, and the interactions between hypotheses are implemented by direct hardware connections between the units. Many variations of the basic relaxation idea have been proposed. However, all the current methods suffer from one or more of the following problems:

openalex-author · Paper

Parallel models of associative memory

This update of the 1981 classic on neural networks includes new commentaries by the authors that show how the original ideas are related to subsequent developments. As researchers continue to uncover ways of applying the complex information processing abilities of neural networks, they give these models an exciting future which may well involve revolutionary developments in understanding the brain and the mind -- developments that may allow researchers to build adaptive intelligent machines. The original chapters show where the ideas came from and the new commentaries show where they are going.

openalex-author · Behavioral and Brain Sciences

Inferring the meaning of direct perception

No abstract available from the OpenAlex source record.

openalex-author · Behavioral and Brain Sciences

Imagery without arrays

No abstract available from the OpenAlex source record.

openalex-author · Cognitive Science

Some Demonstrations of the Effects of Structural Descriptions in Mental Imagery

A visual imagery task is presented which is beyond the limits of normal human ability, and some of the factors contributing to its difficulty are isolated by comparing the difficulty of related tasks. It is argued that complex objects are assigned hierarchical structural descriptions by being parsed into parts, each of which has its own local system of significant directions. Two quite different schemas for a wire‐frame cube are used to illustrate this theory, and some striking perceptual differences to which they give rise are described. The difficulty of certain mental imagery tasks is shown to depend on which of the alternative structural descriptions of an object is used, and this is interpreted as evidence that structural descriptions are an important component of mental images. Finally, it is argued that analog transformations like mental folding involve changing the values of continuous variables in a structural description.

openalex-author · Paper

Representation and control in vision

No abstract available from the OpenAlex source record.

openalex-author · Perception

Reviews: A Primer of Infant Development, Systems Neuroscience

No abstract available from the OpenAlex source record.

openalex-author · ERA

Relaxation and its Role in Vision

It is argued that a visual system, especially one which handles imperfect data, needs a way of selecting the best consistent combination from among the many interrelated, locally plausible hypotheses about how parts or aspects of the visual input may be interpreted. A method is presented in which each hypothesis is given a supposition value between 0 and 1. A parallel relaxation operator, based on the plausibilities of hypotheses and the logical relations between them, is then used to modify the supposition values, and the process is repeated until the best consistent set of hypotheses have supposition values of approximately 1, and the rest have values of approximately 0. The method is incorporated in a program which can interpret configurations of overlapping rectangles as puppets. For this task it is possible to formulate all the potentially relevant hypotheses before using relaxation to select the best consistent set. For more complex tasks, it is necessary to use relaxation on the locally plausible interpretations to guide the search for locally less obvious ones. Ways of doing this are discussed. Finally, an implemented system is presented which allows the user to specify schemas and inference rules, and uses relaxation to control the building of a network of instances of the schemas, when presented with data about some instances and relations between them.

openalex-author · Artificial Intelligence and the Simulation of Behaviour

Using relaxation to find a puppet

The problem of finding a puppet in a configuration of overlapping, transparent rectangles is used to show how a relaxation algorithm can extract the globally best figure from a network of conflicting local interpretations.